Hello,
Before processing a giant txt file, I need to know in advance how many lines that file has. Since I will have to process multiple files it would be important to perform this line counting operation as quickly as possible.
What is the fastest way to know how many lines a txt file has?
I am currently using the following:
var i = 0
for line in lines("input.txt"):
  inc(i)
echo i
The code is very simple, so I suspect there must be a faster way to do it...

It sounds like you will have many regular files (i.e., random-access/seekable inputs, as opposed to things like Unix pipes). On Linux with glibc, memfiles.open is probably the fastest approach; it uses memchr internally to find line boundaries. E.g. (right from the memfiles documentation),
import memfiles
var i = 0
for line in lines(memfiles.open("foo")):
  inc(i)
Your mileage may vary from OS to OS or libc to libc. I have no idea which, if any, Microsoft/Windows versions have well-optimized libc memchr() implementations.

Even faster (avoiding some string allocations):
import memfiles
var i = 0
for line in memSlices(memfiles.open("foo")):
  inc(i)
@jlp765 - good catch. I thought of that, too (I actually wrote that memSlices stuff), and almost went back and added a note later, but you beat me to it. :-)
I am still unaware of relative timings on platforms other than the ones I personally use and would be interested to hear reports, but on Linux/glibc memSlices (or more generally mmap+memchr, however that is invoked) is always fastest in my tests.
https://lemire.me/blog/2017/02/14/how-fast-can-you-count-lines/
Guys, thank you for your help.
@Stefan_Salewski, yes, speed is an important point for me. I found the link you provided (about SIMD) very interesting... however, I do not know how to do this using Nim. Could you please help?
To help newbies like me, it might also be worth including the responses in this thread in the cookbook wiki being created, as per https://forum.nim-lang.org/t/3259
Please also compare this thread:
https://forum.nim-lang.org/t/1164#18006
I have not yet used SIMD instructions myself in Nim, but there are some hints in the Forum already.
For line counting, the different end-of-line marks for Unix/Windows/Mac make it a bit more complicated, unfortunately.
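One detail keeps pure '\n' counting simpler than it sounds: every Windows "\r\n" pair contains exactly one '\n', so counting '\n' bytes gives the same answer for Unix and Windows files; only old classic-Mac files with bare '\r' endings would need special handling. A minimal sketch (the proc name is mine, not a stdlib API):

```nim
# Count '\n' bytes; correct for both Unix ("\n") and Windows ("\r\n") files,
# since each "\r\n" contains exactly one '\n'. Classic-Mac files that end
# lines with a bare '\r' are the only case this misses.
proc countNewlines(s: string): int =
  for c in s:
    if c == '\n':
      inc(result)

doAssert countNewlines("a\nb\n") == 2      # Unix endings
doAssert countNewlines("a\r\nb\r\n") == 2  # Windows endings
```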
So, one other thing that is probably obvious but bears mentioning just in case it didn't occur to @alfrednewman: the number of addressable/seekable bytes in a file is usually maintained by any modern filesystem as cheaply accessed metadata. So, if what you really need is not an exact count but a "reasonable bound" that may be violated once in a while, then you may be able to really accelerate your processing.
As one example text corpus that we all have access to, the current Nim git repo's lib/nim/**.nim files have an average length of 33 bytes and a standard deviation of about 22 bytes as assessed by this little program:
import os, memfiles

when isMainModule:
  var inp = memfiles.open(paramStr(1))  # file to scan, given on the command line
  var counter = 0
  var sumSq = 0
  for slc in memSlices(inp):
    inc(counter)
    sumSq += slc.size * slc.size
  echo "num lines: ", counter
  let mean = float(inp.size) / float(counter)
  echo "mean length: ", mean
  let meanSq = float(sumSq) / float(counter)
  echo "var(length): ", meanSq - mean*mean
You could probably reasonably bound the average number of bytes per line as, say, (average + 4*stdev), which in this case is about 33 + 4*22 =~ 121 bytes... maybe round up to 128. Then you could do something like:
import os
# myPath is the path to your input file
var reasonableUpperBoundOnLineCount = int(float(getFileInfo(myPath).size) / 128.0)
If you use that bound to allocate something, then you are unlikely to over-allocate memory by more than about 4X, which isn't usually considered "that bad" in this kind of problem space. Depending on what you are doing, you can tune that parameter, and you might need to be prepared in your code to "spill" past a very, very rare 4-standard-deviations tail event. This optimization will beat the pants off even an AVX512 routine that iterates over all the file bytes, at least for this aspect of the calculation. It basically eliminates a whole pass over the input data in a case that is made common by construction.
Since you have seemed pretty dead set on an exact calculation in other posts, a small elaboration on the "embedded assumptions" in this optimization may be warranted. All that is really relied upon is that some initial sample of files predicts the distribution of line lengths "well enough" to estimate some threshold (that "128" divisor) with "tunably rare" spill-overs, rare enough not to cause much slowdown in whatever ultimate calculation you are actually doing, which you have not been clear about.
Another idea along these lines, if, say, the files are processed over and over again, is to avoid re-computing all those memchr()s by writing a little proc/program to maintain an off-to-the-side file of byte indexes to the beginnings of lines. The idea is that you have two sets of files: your actual input files and, paired with each "foo", a "foo.idx" containing just a bunch of binary ints in the native format of the CPU, either byte offsets or line lengths, effectively caching the answers of the memchr calls.
If you had such index files, then when you want to know how many lines a file has, you can getFileInfo on the ".idx" file and know immediately. You could be careful and check modification times on the .idx file and the original data file and that sort of thing, too. Why, you could even add a "smarter" memSlices that checked for such a file and skipped almost all of its work, and an API call numSlices that skips all the work if the ".idx" file is up-to-date according to time stamps.
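A rough sketch of that index-file scheme (the ".idx" naming, the one-int64-per-line format, and the Unix-only '\n' assumption are all mine, not an established API):

```nim
import os, memfiles, times

# Write "foo.idx": one int64 byte offset per line start (assumed format).
proc buildIndex(path: string) =
  var mf = memfiles.open(path)
  defer: mf.close()
  var idx = open(path & ".idx", fmWrite)
  defer: idx.close()
  var offset = 0
  for slc in memSlices(mf):
    var o = int64(offset)
    discard idx.writeBuffer(addr o, sizeof(o))
    offset += slc.size + 1   # +1 for the '\n'; assumes Unix line endings

# Line count == index size / 8; rebuild if the index is missing or stale.
proc numLines(path: string): int =
  let ip = path & ".idx"
  if not fileExists(ip) or
      getLastModificationTime(ip) < getLastModificationTime(path):
    buildIndex(path)
  int(getFileInfo(ip).size) div 8
```

With this in place, repeated line counts become a single getFileInfo call, with the time-stamp check guarding against stale caches.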
Basically, except for actually "just computing file statistics", it seems highly unlikely to me that you should really be optimizing the heck out of newline counting in and of itself beyond what memSlices/memchr() already do.
How giant is a "giant text file"?
On my machine a 75M file takes roughly 0.12 sec to count the lines (it is dummy data, so not very random).
If gigabytes in size, then close enough might be good enough ;-)
I didn't see @cblake mention it, but you could count the bytes needed to read 100 lines of a big file and use that to estimate the overall number of lines.
Yeah..Depending on what he's doing, same-file dynamic estimation might also work. Good point, @jlp765.
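A minimal sketch of that same-file sampling idea (the helper name and the 100-line default are just illustrative; it assumes Unix '\n' endings and reasonably uniform line lengths):

```nim
import os

# Estimate total lines from the mean length of the first `sample` lines.
proc estimateLineCount(path: string; sample = 100): int =
  var bytesRead = 0
  var n = 0
  for line in lines(path):
    bytesRead += line.len + 1   # +1 for the '\n'; assumes Unix endings
    inc(n)
    if n >= sample: break
  if n == 0: return 0
  # total size / average bytes per sampled line
  int(float(getFileInfo(path).size) * float(n) / float(bytesRead))
```

This reads only the first `sample` lines, so the cost is tiny regardless of file size, at the price of being an estimate rather than an exact count.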
On my system the 5.4 MB of /usr/share/nim/**.nim gets counted in about 4 milliseconds, over 1.3 GB/sec, probably faster than all but the most powerhouse NVMe/disk-array IO. This is why I suspect @alfrednewman might be re-calculating things instead of saving the answer either in RAM or as files.
I'm sure a pre-pass calculating the number of lines can avoid certain complexities. However, once you start doing assembly hijinks that are not even portable within a given CPU family (e.g., using SSE, AVX2, AVX512, ...), performance becomes very deployment-sensitive. Meanwhile, eliminating the entire pre-pass by merging it with per-line allocations/whatever costs complexity, too, but yields portable performance gains. If it's really ineliminable, then have at it, asm-wise, I guess. I just suspect it's off-track.
Hi, thanks for the help of all of you.
Yes, I'm pre-calculating things. In the data-orchestration process I'm involved in, I can usually estimate the time of a rendering based on the number of rows I'm processing. It is a linear process, and the processing time is typically not much affected by the size of each line.
The files average several gigabytes in size. Some have tens of millions of lines. Knowing the number of lines before processing guarantees some interesting benefits for the user.
I'm lucky because the data is on an SSD server... I already had a performance gain using the tips provided by @cblake (using import memfiles).
All the best AN
Just for fun, I ported and hacked together a self-contained Nim version of Daniel Lemire's avxcount: https://gist.github.com/aboisvert/3f89bc0ae0a2168fcf35ccca98177f6a
(I didn't bother with the loop-unrolled versions)
Nice, @boia01! In my same /usr/share/nim/**.nim test I get 768 microseconds for your version and 2080 microseconds for just doing the memSlices approach... so a 2.7X speed-up, a bit less than the 4X I saw when I last compared the approaches in two C versions... maybe the unrolling. Dunno.
@alfrednewman - if the statistics of these data files are stable over the whole file, you could always stop after the first gigabyte (or maybe less), figure out the average line length, and use the file size to estimate the number of lines; then it would only be a ~1 second delay. A MemFile knows its size, as does each slice... so this is pretty easy to code. Of course, if the first gig is not always representative of the remainder, that might not work, but it sounds like you probably don't need an exact answer. This is sort of a simplified version of one of my/@jlp765's suggestions. Also, 2.7 * 1.3 GB/s =~ 3.5 GB/s is faster IO than most people have, so even using @boia01's code you may not see a great speed-up.