I am releasing Cello, a library of string algoritms using succinct data structures.
It should appear shortly on Nimble. For now it implements walevet trees, FM indices, the Burrow-Wheeler transform, suffix arrays and a little more.
Check out the README for more.
Notice that it is not Unicode-aware: think more of searching large genomic strings or symbolized time series, rather then using it for text.
Hi Krux02, I have updated the README to provide more explanation of the role of succinct data structures.
Notice that the library is not really about text. Think of strings as long sequences of symbols from a finite (small) alphabet. The prototypical example would be the genome as a string of A, C, T, G. Other examples are time series, when they are restricted to a finite alphabet by rounding to a few values (such as low, medium or high)
Thanks a lot for the improvement, but I am still a bit puzzled when I get the the part where you start talking about rank and select. By rereading it several times I think I know now what this is all about. But the intro could be much smoother (if you care about that). For example here you instantly start to talk about bit sequences.
For bit sequences, rank(i) counts the number of 1 bits in the first i places.
Because normally I never care about bit sequences, my brain completely skipped the word "bit" and I was puzzled why you were now talking so much about bits. I think a sentence that you use a simplification for the introduction, a sequence with only 0 and 1 values, a bit sequence.
Thanks for the link to the theory, I think this will make everything more clear as I read it. And I think it is not all that unfamiliar to me, I just never called it by that name. Now at least I know it's about compression.
Wow, this is really great! Do you support the 2 bit representation for DNA/RNA sequences? I looked and didn't see it.
I think writing tools for genome assembly in Nim would be worthwhile. Bioinformaticians use Python a lot but of course need to write fast code in C and C++. Nim's Pythonesque syntax is a big win.
@cdunn2001 Thank you. I am no bioinformatician myself, but I am trying to learn some of the standard algorithms as they may have applications elsewhere. Hope Cello turns out to be useful, though!
@bpr If you see here, one can turn a string into a wavelet tree, which under the hood is represented with indexed bit vectors. This means that, when turning a DNA string into this representation, it will take 2 bits per base. I am not sure if this is what you mean. Cello does not support natively reading a format which stores 4 bases per byte, but it could easily do that once it becomes generic (right now I support only strings or char sequences, possibly disk backed, but it would make sense to implement all this for any collection over some alphabet)
@andrea Yes, I think you understand me, the alphabet of DNA and RNA can and should be represented with two bits per character. There's a 'standard' for storing FASTA files as .2bit files for compression, but I am befuddled as to why they chose T-00, C-01, A-10, G-11. If they chose A-T and C-G to be bitwise complements of each other then certain operations become much simpler (e.g., you can reverse complement a kmer stored in a 32 or 64 bit value looplessly with bitops) and just makes more sense. I use A-00, C-01, G-10, T-11 which is easy to remember because of order.
I'll also note that while the thesis you point to is good and interesting, some things have changed since 2012 and I suggest you look at this for some other perspective on where genome assembly is headed, with much longer and noisier reads.
You are of course free to release your software under any license you want, and I really don't want to hijack this thread, but your summation of the Apache License is not accurate.
Your summation ("it is as free as you can get, with the only requirement that if you actually do modifications to the library itself, you have to give credit to the original authors") actually describes the licenses that I encourage the Nim community to stick to, like MIT.
AL2's three pages of legalese contain other restrictions and side-effects, and mixing them into an MIT-licensed software ecosystem creates complications. As some other copyfree advocates had put it:
The current prevalence of copyfree licenses in Nim's ecosystem (85% of Nimble's package records are MIT, BSD, ISC, or public domain) makes it the ideal "safer than C" language for various userland projects. It's a category in which Nim is far ahead of its competitors, and it's a noteworthy niche for Nim to fill. Those pure-copyfree projects would have to filter out any restrictive licenses (including AL2, LGPL, MPL, OpenSSL, Zlib) and anything that depends on them.
I know it's annoying, but it's easier to make the right licensing decisions early on than later. Just stick to copyfree licenses, and all this legalese annoyance goes away.
Well it is for genetics. No wonder that I initially didn't get what it was about. I am not a genetic engineer. I saw that you had a spill datatype. I think it is related to this issue, it would be nice, if you could also give a comment on it, because I think not all libraries should define this with their own name for it.
Well it is for genetics. No wonder that I initially didn't get what it was about. I am not a genetic engineer.
Genetics is one of the applications, but these algorithms are generally applicable in text analysis. In the C++ world, SDSL is a comparable library. One of the reasons bioinformaticians are interested in this is that genomic data is large: > 3GB for an H. Sapiens genome and you typically want coverage 20-50X so it starts to get really tough to keep all of that in RAM which is where you'd like it.
FYI, the sort of people who'd use this in genome assembly (which Andrea said he's interested in in the Cello docs) are bioinformaticians, not genetic engineers.
... and I really don't want to hijack this thread
I think a better solution is for you to post in another thread which refers to this one, where you can put forth your arguments. Be careful though, as you nag you run the risk of alienating people who might otherwise be sympathetic. Please consider Andrea's very polite request, which I violated in replying here :-(
@ Krux02 As bpr remarked, the library is really about strings - especially large ones. One key application is searching, both exact and approximate, but I have also started adding similarity measures for strings, and data structures such as suffix arrays have many other applications (for example the Burrows-Wheeler transform can be used in string compression).
Not everything is there yet, but I hope to cover a few more topics and be able to make it work on data on disk for the cases where the datasets are too big with respect to the available memory.
It just happens that bioinformatics is a sector where large strings are particularly prevalent, but it is not the only one. The library could as well be used to, well, do actual searching - that is finding stuff in large text corpora - or finding recurrent patterns in time series - e.g. the latest 100 data points follow a pattern of UP UP DOWN ... has a similar pattern ever appeared before?
About the issue you linked, I am not sure what to comment on. I started writing spills because I found it convenient to have some kind of disk based sequence, and it was pretty easy to do by memory mapping files. I called it like this because when memory is scarce you can just spill the sequence to disk. The topic discussed there does not seem to have anything to do with memory-mapped data structures, but maybe I am misunderstanding your remark
@andrea
About the issue. You are right I should explain a bit more. I try to get something as std::span from c++ into the standard library of Nim. That document explains it much better that I could do it in this post. The idea is that you could easily transform a memory mapped files, strings and seq types into a span, and then use this as an interface for your API. There is no need to know that the data is actually from a memory mapped file in your library.
I should actually put that link to the pdf to the original issue, too. I was not able to find this originally when I wrote the issue.