Here I want to present my program TextOverlapFinder (TOF), which finds textual matches between two text files. As opposed to the Linux diff command, which finds the differences between texts, TOF finds the matches between them.
https://github.com/some-avail/TextOverlapFinder
My primary use-case was to compare two journalistic stories on the same subject to see which parts overlap (are identical) and which are unique to each story.
An AI generator provided the initial algorithm, which was file-based, but it used a huge amount of memory and OOM-ed on texts of 40K+ (on my modest 8GB laptop). I have since converted the algorithm to a line-based one, and it can now handle much larger texts.
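TOF's actual Nim implementation isn't shown here; as a rough illustration of the line-based idea (comparing sequences of lines rather than whole files), here is a minimal Python sketch using the standard library's difflib. The sample stories and the `min_len` parameter are my own invention.

```python
import difflib

def overlapping_lines(a: str, b: str, min_len: int = 1) -> list[tuple[int, int, int]]:
    """Return (index_in_a, index_in_b, run_length) for runs of identical lines."""
    a_lines = a.splitlines()
    b_lines = b.splitlines()
    # Comparing line lists keeps memory proportional to the line count,
    # not to every character-level combination.
    sm = difflib.SequenceMatcher(a=a_lines, b=b_lines, autojunk=False)
    return [(m.a, m.b, m.size) for m in sm.get_matching_blocks() if m.size >= min_len]

story1 = "The fire started at dawn.\nWitnesses saw smoke.\nPolice closed the road."
story2 = "Residents were evacuated.\nWitnesses saw smoke.\nPolice closed the road."

for ia, ib, size in overlapping_lines(story1, story2):
    # Print each overlapping run of lines
    print(story1.splitlines()[ia:ia + size])
```

This reports the two shared lines as one matching block; everything outside the reported blocks is unique to its story.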
By the way, I still think sciency apps can be a strong field for Nim (besides other fields), also because many sciency apps are written in Python, from which the switch to Nim seems natural.
This reminds me of work I did in my PhD around document similarity and clustering¹˒². It was a fun topic :)
¹ https://ieeexplore.ieee.org/document/1324634
² https://link.springer.com/article/10.1007/s10115-003-0118-5
@ cblake / khaled
Apparently you have been active in bulk file comparison and clustering (for different reasons, I guess), which is also a very interesting and useful activity, while my app focuses on detailed comparison of two specific docs... As a researcher one could start with the first and end with the second.
Larger exact matches may indicate a common source, while fuzzy matches indicate a common subject or pattern.
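To make the exact-versus-fuzzy distinction concrete, here is a small sketch (not from TOF) using difflib's similarity ratio; the sample sentences are made up. An identical string scores 1.0, a paraphrase scores high, and an unrelated sentence scores low.

```python
import difflib

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; 1.0 means the strings are identical
    return difflib.SequenceMatcher(a=a, b=b).ratio()

exact = "Police closed the road at noon."
near  = "Police shut the road around noon."   # paraphrase: same subject
other = "The market reopened on Monday."      # unrelated

print(similarity(exact, exact))  # 1.0: exact match, suggests a common source
print(similarity(exact, near))   # high: fuzzy match, suggests a common subject
print(similarity(exact, other))  # low: unrelated
```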
@ kobi
It might have linguistic uses. I still want to make it Unicode-enabled for additional languages / writing systems.
By the way I wonder why document-clustering is not used more at the large search-engines.
I found a website:
which shows a pretty cool clustering interface as well. Previously though it had only limited access to a search-engine database (more for showcase purposes), but I don't know if that has changed.
> By the way I wonder why document-clustering is not used more at the large search-engines.
For some time this was a hot research topic, but like most NLP research it didn't take hold outside of academia. Now that LLMs have won, there's almost no need for it. An LLM can look at a bunch of documents and answer questions about them directly (or summarize the topics they cover), without the need to pre-categorize/cluster them.
I have actually tried it with a major LLM/AI, and when I ask:
cluster-analyse top N search-results: searchterm1 searchterm2
it carries out the request. Pretty cool. I assume pre-compiled specialized programs are used for that, but I am not sure (otherwise it would take too much compute time?)...