nimforum mirror - Cello: a library of string algoritms using succinct data structures

andrea (orginal) [2017-03-31T18:39:26+02:00] view original

I am releasing Cello, a library of string algoritms using succinct data structures.

It should appear shortly on Nimble. For now it implements walevet trees, FM indices, the Burrow-Wheeler transform, suffix arrays and a little more.

Check out the README for more.

Notice that it is not Unicode-aware: think more of searching large genomic strings or symbolized time series, rather then using it for text.

Krux02 (orginal) [2017-03-31T18:58:13+02:00] view original

I just looked over the introduction you just posted. I am not really very deep into string algorithms, I rather avoid all string operations in my applications except string format for output. Therefore I can already see that this library is rather not for me, but still it would be nice to read one or two sentences about what the “succinct data structure” actually is. Be aware I already know I am not your target audience, so take my advice with a grain of salt.

mashingan (orginal) [2017-03-31T19:53:26+02:00] view original

I thought of this after glance the thread title :)

andrea (orginal) [2017-04-01T20:00:16+02:00] view original

Hi Krux02, I have updated the README to provide more explanation of the role of succinct data structures.

Notice that the library is not really about text. Think of strings as long sequences of symbols from a finite (small) alphabet. The prototypical example would be the genome as a string of A, C, T, G. Other examples are time series, when they are restricted to a finite alphabet by rounding to a few values (such as low, medium or high)

Krux02 (orginal) [2017-04-03T00:35:16+02:00] view original

Thanks a lot for the improvement, but I am still a bit puzzled when I get the the part where you start talking about rank and select. By rereading it several times I think I know now what this is all about. But the intro could be much smoother (if you care about that). For example here you instantly start to talk about bit sequences.

For bit sequences, rank(i) counts the number of 1 bits in the first i places.

Because normally I never care about bit sequences, my brain completely skipped the word "bit" and I was puzzled why you were now talking so much about bits. I think a sentence that you use a simplification for the introduction, a sequence with only 0 and 1 values, a bit sequence.

Thanks for the link to the theory, I think this will make everything more clear as I read it. And I think it is not all that unfamiliar to me, I just never called it by that name. Now at least I know it's about compression.

andrea (orginal) [2017-04-03T08:13:11+02:00] view original

Thank you for your feedback! I will try to improve the wording in the next days. The terminology is quite standard in the field, but I agree that there is room for improvement!

cdunn2001 (orginal) [2017-04-06T02:37:51+02:00] view original

This is really good. Forked into bio-nim. You're welcome to join that GitHub org with me and bpr, if you want.

https://github.com/bio-nim

bpr (orginal) [2017-04-06T06:39:52+02:00] view original

Wow, this is really great! Do you support the 2 bit representation for DNA/RNA sequences? I looked and didn't see it.

I think writing tools for genome assembly in Nim would be worthwhile. Bioinformaticians use Python a lot but of course need to write fast code in C and C++. Nim's Pythonesque syntax is a big win.

andrea (orginal) [2017-04-06T09:42:05+02:00] view original

@cdunn2001 Thank you. I am no bioinformatician myself, but I am trying to learn some of the standard algorithms as they may have applications elsewhere. Hope Cello turns out to be useful, though!

@bpr If you see here, one can turn a string into a wavelet tree, which under the hood is represented with indexed bit vectors. This means that, when turning a DNA string into this representation, it will take 2 bits per base. I am not sure if this is what you mean. Cello does not support natively reading a format which stores 4 bases per byte, but it could easily do that once it becomes generic (right now I support only strings or char sequences, possibly disk backed, but it would make sense to implement all this for any collection over some alphabet)

bpr (orginal) [2017-04-06T17:09:40+02:00] view original

@andrea Yes, I think you understand me, the alphabet of DNA and RNA can and should be represented with two bits per character. There's a 'standard' for storing FASTA files as .2bit files for compression, but I am befuddled as to why they chose T-00, C-01, A-10, G-11. If they chose A-T and C-G to be bitwise complements of each other then certain operations become much simpler (e.g., you can reverse complement a kmer stored in a 32 or 64 bit value looplessly with bitops) and just makes more sense. I use A-00, C-01, G-10, T-11 which is easy to remember because of order.

I'll also note that while the thesis you point to is good and interesting, some things have changed since 2012 and I suggest you look at this for some other perspective on where genome assembly is headed, with much longer and noisier reads.

Libman (orginal) [2017-04-10T16:58:00+02:00] view original

You are of course free to release your software under any license you want, and I really don't want to hijack this thread, but your summation of the Apache License is not accurate.

Your summation ("it is as free as you can get, with the only requirement that if you actually do modifications to the library itself, you have to give credit to the original authors") actually describes the licenses that I encourage the Nim community to stick to, like MIT.

AL2's three pages of legalese contain other restrictions and side-effects, and mixing them into an MIT-licensed software ecosystem creates complications. As some other copyfree advocates had put it:

"I prefer to avoid contributing to projects where a momentary lapse or scripting bug could expose me to the risks of litigation, and to avoid incorporating anything into my own projects that would require other people to shoulder those risks (and the burden of using commit hook scripts to help insulate them against legal risk)." *

"The Apache license has cumbersome requirements such as the requirement to 'cause any modified files to carry prominent notices stating that You changed the files'."

"I hate software licenses. When I read a software license, what I see is a bunch of officious, mind-numbing lawyerly doublespeak. Blah, blah, blah.. kill me now." *

"The MIT License is pretty short, and pretty simple [...]. The Apache License has a few more restrictions. Notably, if you change any Apache-licensed code, you must state so. There are also rules about use of the project name [...]" *

The current prevalence of copyfree licenses in Nim's ecosystem (85% of Nimble's package records are MIT, BSD, ISC, or public domain) makes it the ideal "safer than C" language for various userland projects. It's a category in which Nim is far ahead of its competitors, and it's a noteworthy niche for Nim to fill. Those pure-copyfree projects would have to filter out any restrictive licenses (including AL2, LGPL, MPL, OpenSSL, Zlib) and anything that depends on them.

I know it's annoying, but it's easier to make the right licensing decisions early on than later. Just stick to copyfree licenses, and all this legalese annoyance goes away.

andrea (orginal) [2017-04-10T17:31:20+02:00] view original

I ask you the favour to be polite and leave the thread for the discussion of the library itself

Krux02 (orginal) [2017-04-10T18:03:00+02:00] view original

Well it is for genetics. No wonder that I initially didn't get what it was about. I am not a genetic engineer. I saw that you had a spill datatype. I think it is related to this issue, it would be nice, if you could also give a comment on it, because I think not all libraries should define this with their own name for it.

https://github.com/nim-lang/Nim/issues/5437

bpr (orginal) [2017-04-10T19:48:21+02:00] view original

Well it is for genetics. No wonder that I initially didn't get what it was about. I am not a genetic engineer.

Genetics is one of the applications, but these algorithms are generally applicable in text analysis. In the C++ world, SDSL is a comparable library. One of the reasons bioinformaticians are interested in this is that genomic data is large: > 3GB for an H. Sapiens genome and you typically want coverage 20-50X so it starts to get really tough to keep all of that in RAM which is where you'd like it.

FYI, the sort of people who'd use this in genome assembly (which Andrea said he's interested in in the Cello docs) are bioinformaticians, not genetic engineers.

bpr (orginal) [2017-04-10T19:53:54+02:00] view original

... and I really don't want to hijack this thread

I think a better solution is for you to post in another thread which refers to this one, where you can put forth your arguments. Be careful though, as you nag you run the risk of alienating people who might otherwise be sympathetic. Please consider Andrea's very polite request, which I violated in replying here :-(

Krux02 (orginal) [2017-04-10T22:54:05+02:00] view original

@bpr sorry, bioinformatics then. What I originally mean is 'people who write programs that handle genes' and I naïvely used the term "genetic engineer" which is of course wrong. I am just an application developer. All strings I handle are the strings I have written manually that are necessary for the program to some text, and maybe some grep and sed on the command line. So no need for any external library.

andrea (orginal) [2017-04-11T10:24:26+02:00] view original

@ Krux02 As bpr remarked, the library is really about strings - especially large ones. One key application is searching, both exact and approximate, but I have also started adding similarity measures for strings, and data structures such as suffix arrays have many other applications (for example the Burrows-Wheeler transform can be used in string compression).

Not everything is there yet, but I hope to cover a few more topics and be able to make it work on data on disk for the cases where the datasets are too big with respect to the available memory.

It just happens that bioinformatics is a sector where large strings are particularly prevalent, but it is not the only one. The library could as well be used to, well, do actual searching - that is finding stuff in large text corpora - or finding recurrent patterns in time series - e.g. the latest 100 data points follow a pattern of UP UP DOWN ... has a similar pattern ever appeared before?

About the issue you linked, I am not sure what to comment on. I started writing spills because I found it convenient to have some kind of disk based sequence, and it was pretty easy to do by memory mapping files. I called it like this because when memory is scarce you can just spill the sequence to disk. The topic discussed there does not seem to have anything to do with memory-mapped data structures, but maybe I am misunderstanding your remark

Krux02 (orginal) [2017-04-11T14:47:29+02:00] view original

@andrea

About the issue. You are right I should explain a bit more. I try to get something as std::span from c++ into the standard library of Nim. That document explains it much better that I could do it in this post. The idea is that you could easily transform a memory mapped files, strings and seq types into a span, and then use this as an interface for your API. There is no need to know that the data is actually from a memory mapped file in your library.

I should actually put that link to the pdf to the original issue, too. I was not able to find this originally when I wrote the issue.

andrea (orginal) [2017-04-11T17:00:06+02:00] view original

@Krux02 I could just use a concept and be done. BUT it turns out I actually need to know that I am working with data from disk, because in this case I want to use disk-based sequences also for the intermediate data structures that I use in the constructions - something that I am not doing right now but I plan to support.

Mirror of forum.nim-lang.org

2896 :: Cello: a library of string algoritms using succinct data structures