nimforum mirror - zipping strings

cdunn2001 [2016-07-12T10:10:55+02:00]

I have this:

from sequtils import nil
from unicode import nil
var
  dna_norm = "ACGTacgtNn-"
  dna_comp = "TGCAtgcaNn-"
  a = sequtils.toSeq(unicode.runes(dna_norm))
  b = sequtils.toSeq(unicode.runes(dna_comp))
  rcmap = sequtils.zip( a, b )

echo(rcmap)

That works:


[(a: A, b: T), (a: C, b: G), (a: G, b: C), (a: T, b: A), (a: a, b: t), (a: c, b: g), (a: g, b: c), (a: t, b: a), (a: N, b: N), (a: n, b: n), (a: -, b: -)]

But I was wondering if there was a way to do that with UTF8 values. Or, put another way, how can I iterate over the UTF8 values of a string, and turn those into a seq?

zielmicha [2016-07-12T10:14:17+02:00]

If you need to iterate over Unicode characters, use the runes iterator in the unicode module (http://nim-lang.org/docs/unicode.html).

cdunn2001 [2016-07-12T17:03:00+02:00]

Sorry. I forgot to copy the import unicode into my code snippet.

Yes, I currently use runes as I see no alternative. How do I iterate over utf8 values instead?

Krux02 [2016-07-12T17:27:20+02:00]

What do you mean by a utf8 value? Do you mean a rune? Since the documentation about runes in Nim is a bit sparse, the Go documentation might help you more. Go has a very similar definition of strings and runes to Nim's.

wiffel [2016-07-12T20:00:35+02:00]

I'm not sure why you would need it, but I think the following code does what you are looking for.

import sequtils, unicode

const
  dna_norm = "ACGTacgtNn-"
  dna_comp = "TGCAtgcaNn-"
  a = dna_norm.toRunes().mapIt(it.toUTF8)
  b = dna_comp.toRunes().mapIt(it.toUTF8)

echo a.zip(b)

jibal [2016-07-12T22:04:32+02:00]

How do I iterate over utf8 values instead?

There's no such thing as a utf8 value. UTF8 is a multibyte encoding of Unicode code points.
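To make the byte versus code point distinction concrete, here is a minimal sketch. It assumes the source file is saved as UTF-8; the sample string and the counter names are only for illustration.

```nim
import unicode

let text = "Öhm"

# len counts bytes (chars); runeLen counts Unicode code points.
echo text.len      # 4, because 'Ö' takes two bytes in UTF-8
echo text.runeLen  # 3

# Iterating the string directly yields one char (one byte) at a time.
var byteCount = 0
for ch in text:
  inc byteCount

# runes() decodes the UTF-8 bytes into code points.
var runeCount = 0
for r in text.runes:
  inc runeCount

echo byteCount, " bytes, ", runeCount, " runes"
```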

OderWat [2016-07-13T00:02:55+02:00]

Iterating over UTF-8 "values" would be iterating over the bytes (chars) which form a valid Unicode code point (rune). So it would advance 1 to 4 bytes at a time in the UTF-8 encoded string. This means no decoding to Unicode, but working in the encoded format.

See this... I still think stuff like this should be added to the unicode module.

Edit: I made a PR for the simple way to do it: https://github.com/nim-lang/Nim/pull/4481

import unicode

iterator utf8(str: string): string =
  var offset = 0
  while offset < str.len:
    let len = runeLenAt(str, offset)
    yield str[offset.. (offset+len-1)]
    offset += len

let text = "Öhm? 漢兩秦先"

for c in text.utf8:
  write stdout, c & " "

Output:


Ö h m ?   漢 兩 秦 先


cdunn2001 [2016-07-13T00:35:35+02:00]

Folks, I'm aware of the definition of UTF8. Nim strings are stored as arrays of 8-bit values, whatever you want to call them. In fact, when you index a string, you get the 8-bit value, not the unicode character. Given those facts, what astonishes me is the difficulty of zipping two strings interpreted as ASCII, or even as 8-bit integers.

In my case, all the values are ASCII, so UTF8 is precisely the 8-bit character. That's why I do not care about encodings. I like the runes function, but I don't see why I cannot call toSeq(string) to get a sequence of 8-bit numbers -- char, or uint8, or something like that.

@OderWat, very interesting. Thanks. Is that equivalent to (but less efficient than) converting each of runes() to a string?
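One way to check the equivalence part of that question is to compare the two approaches directly. This is only a sketch: the iterator is OderWat's from above (with the local variable renamed to n), and it says nothing about which one is faster.

```nim
import unicode, sequtils

# OderWat's iterator, with the local renamed so it does not shadow len.
iterator utf8(str: string): string =
  var offset = 0
  while offset < str.len:
    let n = runeLenAt(str, offset)
    yield str[offset .. offset + n - 1]
    inc offset, n

let text = "Öhm? 漢兩秦先"

# Slice the encoded bytes directly, without decoding.
let sliced = toSeq(utf8(text))

# Decode to runes, then re-encode each rune as a UTF-8 string.
let reencoded = toSeq(runes(text)).mapIt(it.toUTF8)

echo sliced == reencoded
```

For valid UTF-8 input the two sequences hold the same strings. They may differ on malformed input, since decoding to runes can substitute a replacement character.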

@wiffel, thanks. That works. But do I really need map?

These also work:

proc charSeq(s: string): seq[char] =
  result = newSeq[char](s.len)
  for i in 0 .. s.high:
    result[i] = s[i]
let
  a = charSeq(dna_norm)
  b = charSeq(dna_comp)
  rcmap = sequtils.zip( a, b )

iterator charYield(s: string): char {.inline.} =
  for i in 0 .. s.high:
    yield s[i]
let
  a = sequtils.toSeq(charYield(dna_norm))
  b = sequtils.toSeq(charYield(dna_comp))
  rcmap = sequtils.zip( a, b )

I kind of think that charSeq or charYield should be in the standard library, since it is a common goal to convert a string to a sequence of char.

And this is interesting, if we want an array:

template toArrayChars(s: string{`const`}): expr =
  type
    x = array[0..s.high, char]
  var
    res: x
  for i in 0 .. s.high:
    res[i] = s[i]
  res
var
  a = toArrayChars(dna_norm)
  b = toArrayChars(dna_comp)

That uses Parameter Constraints. I'm starting to see the power of Nim.

OderWat [2016-07-13T01:12:42+02:00]

I think the utf-8 iterator is somewhat more efficient if you want to stay in the utf-8 domain. It could be written more efficiently by using some kind of real slices, and could omit a call. But all in all it should be more efficient, as it does not convert back and forth and uses less memory for many "standard" languages, esp. the iso-8859-x family.

Lol. I did not see that you just want to zip ASCII strings. Why not just write it?

proc zip(a, b: string): seq[char] =
  newSeq(result, a.len + b.len)
  
  var i = 0
  var o = 0
  while i < a.len or i < b.len:
    if i < a.len:
      result[o] = a[i]
      inc o
    if i < b.len:
      result[o] = b[i]
      inc o
    inc i

let a = "abcd"
let b = "1234"

echo zip(a, b)

cdunn2001 [2016-07-13T01:48:56+02:00]

Why not just write [zip]?

Because in Python we need only this:

So now I have a sequence of tuples of char. I can construct a table via the "pairs" constructor. And I can view it.

import sequtils, tables

# Copy the bytes of a string into a seq[char].
proc charSeq(s: string): seq[char] =
  result = newSeq[char](s.len)
  for i in 0 .. s.high:
    result[i] = s[i]
const
  dna_norm = "ACGTacgtNn-"
  dna_comp = "TGCAtgcaNn-"
  rclist = sequtils.zip(charSeq(dna_norm), charSeq(dna_comp))
var
  rcmap = tables.newTable(rclist) # cannot be const
#echo(rclist)
echo(rcmap)


    {A: T, a: t, C: G, c: g, G: C, g: c, -: -, N: N, n: n, T: A, t: a}
But how can I serialize it to JSON? The %* macro is not working.

import json
...
var j = %* rcmap[]



    graph_to_utgs.nim(20, 9) template/generic instantiation from here
    lib/pure/json.nim(729, 42) Error: undeclared field: 'data'
Any ideas? ... Oh. JSON dictionaries require strings, not characters, as keys. Nvm.
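Since JSON object keys have to be strings, one workaround is to stringify each char while building the JSON object by hand. A minimal sketch, using a smaller table than the one above purely for illustration:

```nim
import json, tables

# A cut-down complement table, only for this example.
let rcmap = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}.toTable

# Turn each char key and value into a one-character string.
var j = newJObject()
for k, v in rcmap:
  j[$k] = %($v)

echo j
```

The key order in the printed object is not something to rely on.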

wiffel [2016-07-13T11:47:08+02:00]

@cdunn2001: For what (I guess) you are trying to achieve, I would go for something like:

import tables

proc complementsFrom(normal, complement: string): Table[char,char] =
  result = initTable[char,char]()
  for ix, ch in normal:
    result[ch] = complement[ix]

proc reverseComplement(str: string): string =
  const complements = complementsFrom("ACGTacgtNn-", "TGCAtgcaNn-")
  result = newString(str.len)
  for ix, ch in str:
    result[str.high - ix] = complements[ch]

echo reverseComplement("Gattaca")

Nim has the nice feature that const values are evaluated at compile time. So

const complements = complementsFrom("ACGTacgtNn-", "TGCAtgcaNn-")

will only be evaluated at compile time.
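The same compile-time idea also works with a plain lookup array indexed by char instead of a Table. This is a variation, not wiffel's code: complementArray is a made-up name, and mapping unlisted bytes to themselves is an assumption made here (the Table version would raise on a missing key instead).

```nim
# Build a 256-entry lookup table. Bytes without a listed complement
# map to themselves (an assumption of this sketch).
proc complementArray(normal, complement: string): array[char, char] =
  for i in 0 .. 255:
    result[chr(i)] = chr(i)
  for ix, ch in normal:
    result[ch] = complement[ix]

# const makes the compiler evaluate the call; no table is built at run time.
const complements = complementArray("ACGTacgtNn-", "TGCAtgcaNn-")

proc reverseComplement(str: string): string =
  result = newString(str.len)
  for ix, ch in str:
    result[str.high - ix] = complements[ch]

echo reverseComplement("Gattaca")  # tgtaatC
```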

jibal [2016-07-14T07:20:24+02:00]

In my case, all the values are ASCII

People don't have ESP. You should have stated this up front.

so UTF8 is precisely the 8-bit character.

This has nothing to do with UTF8.

Krux02 [2016-07-14T16:05:03+02:00]

I can only second what jibal said. Your question was wrong; you should have asked about ASCII, not UTF8. But I strongly discourage you from doing so, unless you are writing a toy ASCII art editor, where vertical alignment is important and you don't want to deal with varying lengths in UTF8. But even then you might be better off using UTF32. The extra 3 bytes per char won't kill you. In all other situations, please learn to use UTF8 properly.

wiffel [2016-07-14T16:26:10+02:00]

@Krux02: I was also wrong-footed by the original question.

But, looking more closely at the content of the program, it looks like cdunn2001 is dealing with DNA sequencing data. That is one of those exceptions. The alphabet used in such encodings is very limited (well within the ASCII range).

Files with such an encoding can easily be 40GB in size or more. Moving to UTF32 would increase the size to 160GB (or use an enormous amount of program memory). In such a case, it does add a lot of pain with no gain.

So I understand that cdunn2001 is looking at the byte encoding in this specific case.

Krux02 [2016-07-14T16:47:31+02:00]

Ok, when memory consumption is the problem and should be minimized, then there are also better alternatives to ASCII. But that again is a different question.

If I understand the Wikipedia article about "Nucleic acid notation" correctly, and ignoring Uracil (U) (which is in RNA, not DNA, according to Wikipedia), then each letter can be represented by 4 bits. It's even possible to give each bit a meaning (A, C, G and T in "Bases Represented"), so DNA sequences can be represented with half the storage of ASCII. Indexing is still possible; just [], []= and the iterator need to be redefined for a DNA sequence.
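A rough sketch of that 4-bit idea, with two bases per byte and one bit each for A, C, G and T. Everything here (PackedDna, pack, encode, decode, the bit assignment) is hypothetical naming for illustration. It only handles the symbols from this thread, it loses upper/lower case, and it does no bounds checking against the base count.

```nim
type
  PackedDna = object
    count: int        # number of bases stored
    data: seq[uint8]  # two 4-bit codes per byte

const
  bitA = 1'u8
  bitC = 2'u8
  bitG = 4'u8
  bitT = 8'u8

proc encode(base: char): uint8 =
  case base
  of 'A', 'a': bitA
  of 'C', 'c': bitC
  of 'G', 'g': bitG
  of 'T', 't': bitT
  of 'N', 'n': bitA or bitC or bitG or bitT  # any base
  of '-': 0'u8                               # gap
  else: raise newException(ValueError, "unsupported base: " & base)

proc decode(code: uint8): char =
  case code
  of bitA: 'A'
  of bitC: 'C'
  of bitG: 'G'
  of bitT: 'T'
  of 15'u8: 'N'
  of 0'u8: '-'
  else: '?'  # other IUPAC combinations are not handled in this sketch

proc `[]`(d: PackedDna, i: int): char =
  let b = d.data[i div 2]
  # Even indices live in the low nibble, odd indices in the high nibble.
  if i mod 2 == 0: decode(b and 0x0F'u8) else: decode(b shr 4)

proc `[]=`(d: var PackedDna, i: int, base: char) =
  let code = encode(base)
  if i mod 2 == 0:
    d.data[i div 2] = (d.data[i div 2] and 0xF0'u8) or code
  else:
    d.data[i div 2] = (d.data[i div 2] and 0x0F'u8) or (code shl 4)

proc pack(s: string): PackedDna =
  result.count = s.len
  result.data = newSeq[uint8]((s.len + 1) div 2)
  for i, ch in s:
    result[i] = ch

iterator items(d: PackedDna): char =
  for i in 0 .. d.count - 1:
    yield d[i]

let d = pack("ACGTN-A")
echo d[2]        # G
echo d.data.len  # 4 bytes for 7 bases
```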

cdunn2001 [2016-07-14T17:35:33+02:00]

To summarise, given that indexing of strings is for the bytes, it could be easier to treat strings as a sequence of bytes.

toSeq(foo.items) did the trick. However, foo.items.toSeq did not work. I think that's odd. But all I really need are examples.
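For reference, a minimal sketch of the toSeq(foo.items) form applied to the original zipping problem. Feeding the zipped pairs to toTable is an addition here, not something spelled out in the thread.

```nim
import sequtils, tables

const
  dna_norm = "ACGTacgtNn-"
  dna_comp = "TGCAtgcaNn-"

# items is the iterator a plain `for ch in s` uses; toSeq collects what it yields.
let a = toSeq(dna_norm.items)  # seq[char]
let b = toSeq(dna_comp.items)

# zip pairs the elements up; the pairs can seed a table directly.
let rcmap = zip(a, b).toTable

echo rcmap['A']  # T
```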

I definitely appreciate that Nim handles Unicode so well via the runes() iterator, and that it stores in bytes. That's all very sensible.

Someone asked why I want to use Python idioms. That's easy: I see Nim as the single best choice to replace the pattern of Python + extension modules. People use Python because it's easy, with a huge standard library, and they use extension modules when they need speed. And they might integrate with C or C++ executables. I need a single language so I can simplify installation and integration. It must be statically typed because people tend to use such gigantic test-cases (against my admonitions) that typos are seen too late. And it needs to be terse, because researchers do a lot of experimental prototyping; there is never a "spec". (R is very popular.)

I think Nim is easy and concise enough to replace Python for researchers (yes, in biosciences) who would never have time to learn something like Ocaml. Julia is a bit too limiting. I think Rust is too verbose (and the borrow-checker too finicky) for these users. Go is a fair alternative, but I really think these folks are better off with exceptions and generics. I guess Java is possible too, but it's so verbose. Scala could be ok if it didn't take eons to compile. C# could suffice, but it's not portable. Languages like Swift allow static typing but do not require it, and they might be slow for numerics. So I've decided to try Nim.

In order to bring colleagues into this world, I need to be able to say, "Where you did this in Python, you can do this in Nim." I'll have to build up a small library of helpers.

OderWat [2016-07-14T20:45:35+02:00]

1. Strings are indexed by char, at least if you use [] or items() on them. Nim has a "byte" type, but it is not used frequently and is just an alias for "uint8". It also has int8, uint8 and char, which are all "a byte".

2. items() is not a function, so you can't call it like one. It is an iterator. Look up how it is implemented for strings: here and in general here

3. toSeq() is not a function either, but a template which will generate just the code one would write manually. I think you should really look up how it is implemented: here

4. I really want you to look at the linked code in 2 and 3. That is "Nim" and how it works. That's important! It also hopefully makes some stuff clear when you think about point 5.

5. All "systems library" code is literally used when you compile, same as the code you have written yourself. There is no "better (faster) system library implementation" as in Python. Well, besides that those may sometimes have some clever tricks ... or not.

6. Using constructs like the ones in Python may be as slow as they are in Python! Python's implementations of some language constructs are just what Nim uses too. It is just native code working with garbage collected memory in a mostly naive fashion. So you may just get equivalent speeds. The real deal is to really use Nim.
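To illustrate points 2 and 3: the loop below is roughly what one would write by hand instead of toSeq(s.items). It is a sketch of the idea, not the actual template expansion from the standard library.

```nim
import sequtils

let s = "ACGT"

# Hand-written version: drive the items iterator with a for loop
# and collect what it yields.
var manual: seq[char] = @[]
for ch in s.items:
  manual.add ch

# The template form gives the same sequence.
echo manual == toSeq(s.items)
```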

IMHO: Using Nim "counterfeit" as Python will bite you pretty soon. Nim is not Python and never will be. Learning Nim after having used Python is a good thing though.


EDIT: I hope this does not sound rude. I just want to point out that Nim is so much more than "a better Python". It could be used to make Python.

cdunn2001 [2016-07-18T19:11:23+02:00]

I hope this does not sound rude.

Not at all. I appreciate the pointers. The distinction between function and iterator is important.

I have to find ways to help researchers "migrate" to Nim. It's a daunting language if you look at the full manual.

OderWat [2016-07-18T20:53:34+02:00]

As always I suggest some kind of Cookbook examples. One way to do this would be a blog series on how you approached Nim for the tasks you need. I still think that something like this has the biggest impact. Especially if there is a comments section in the Blog or platform used to present code + examples.

So you may just start with a collection of problems they recognize and explain how each can be done in Nim, building a little library of tools or just examples. I think that could be done in a way so that not only the first solution is discussed, but also the background on how it works and what the options are. You could let people who have more experience with Nim look over the code and maybe get some insights you would otherwise miss.

Mirror of forum.nim-lang.org

2371 :: zipping strings