Here I want to present my program TextOverlapFinder (TOF), which finds textual matches between two text files. As opposed to the Linux diff command, which finds the differences between texts, TOF finds the matches between them.
https://github.com/some-avail/TextOverlapFinder
My primary use-case was to compare two journalistic stories on the same subject to see which parts overlap (are identical) and which are unique to each story.
An AI generator provided the initial algorithm, which was file-based, but it used a huge amount of memory and OOM-ed on texts of 40K+ (on my modest 8GB laptop). I have since converted the algorithm to a line-based one, and it can now handle much larger texts.
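TOF's actual Nim implementation isn't shown here; as a rough illustration of the line-based idea (comparing sequences of lines rather than whole files), here is a minimal Python sketch using the standard library's difflib. The sample stories and the `min_len` parameter are my own invention.

```python
import difflib

def overlapping_lines(a: str, b: str, min_len: int = 1) -> list[tuple[int, int, int]]:
    """Return (index_in_a, index_in_b, run_length) for runs of identical lines."""
    a_lines = a.splitlines()
    b_lines = b.splitlines()
    # Comparing line lists keeps memory proportional to the line count,
    # not to every character-level combination.
    sm = difflib.SequenceMatcher(a=a_lines, b=b_lines, autojunk=False)
    return [(m.a, m.b, m.size) for m in sm.get_matching_blocks() if m.size >= min_len]

story1 = "The fire started at dawn.\nWitnesses saw smoke.\nPolice closed the road."
story2 = "Residents were evacuated.\nWitnesses saw smoke.\nPolice closed the road."

for ia, ib, size in overlapping_lines(story1, story2):
    # Print each overlapping run of lines
    print(story1.splitlines()[ia:ia + size])
```

This reports the two shared lines as one matching block; everything outside the reported blocks is unique to its story.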
By the way, I still think sciency apps can be a strong field for Nim (besides other fields), also because many sciency apps are written in Python, from which the switch to Nim seems natural.
This reminds me of work I did in my PhD around document similarity and clustering¹˒². It was a fun topic :)
¹ https://ieeexplore.ieee.org/document/1324634
² https://link.springer.com/article/10.1007/s10115-003-0118-5
@ cblake / khaled
Apparently you have been active in bulk file comparison and clustering (for different reasons, I guess), which is also a very interesting and useful activity, while my app focuses on detailed comparison of two specific docs... As a researcher one could start with the first and end with the second.
Larger exact matches may indicate a common source, while fuzzy matches indicate a common subject or pattern.
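To make the exact-versus-fuzzy distinction concrete, here is a small sketch (not from TOF) using difflib's similarity ratio; the sample sentences are made up. An identical string scores 1.0, a paraphrase scores high, and an unrelated sentence scores low.

```python
import difflib

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means the strings are identical
    return difflib.SequenceMatcher(a=a, b=b).ratio()

exact = "Police closed the road at noon."
near  = "Police shut the road around noon."   # paraphrase: same subject
other = "The market reopened on Monday."      # unrelated

print(similarity(exact, exact))  # 1.0: exact match, suggests a common source
print(similarity(exact, near))   # high: fuzzy match, suggests a common subject
print(similarity(exact, other))  # low: unrelated
```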
@ kobi
It might have linguistic uses. I still want to make it Unicode-enabled for additional languages / writing systems.
By the way I wonder why document-clustering is not used more at the large search-engines.
I found a website:
which shows a pretty cool clustering interface as well. Previously though it had only limited access to a search-engine database (more for showcase purposes), but I don't know if that has changed.
> By the way I wonder why document-clustering is not used more at the large search-engines.
For some time this was a hot research topic, but like most NLP research it didn't take hold outside of academia. Now that LLMs have won, there's almost no need for it. An LLM can look at a bunch of documents and answer questions about them directly (or summarize the topics they cover), without the need to pre-categorize/cluster them.
I have actually tried it with a major LLM/AI, and when I ask:
cluster-analyse top N search-results: searchterm1 searchterm2
it carries out the request. Pretty cool. I assume pre-compiled specialized programs are used for that, but I am not sure (otherwise it would take too much compute time?)...