nimforum mirror - Fastest way to pack CSV file

stbalbach (orginal) [2017-04-21T23:18:36+02:00] view original

What would be the fastest way to remove lines from a medium-size CSV file?

For example given a CSV:


Bill,0,0,0
Bill,1,1,1
Mary,0,0,0
Todd,0,0,0
etc..

And a list of names to delete:


Mary
Todd

Remove the lines with those names. The CSV's are anywhere from 20,000 to 100,000 lines long. The list of names to remove are typically from 1 to 500 names long. The list of names are deleted across ~50 different CSV files. Reading a CSV into memory is OK for each file, but not all CSVs at once (the real CSVs are larger).

The method I use now is straight forward: read the CSV and list of delete-names into an array, for each CSV line check if it's in the deletion list and if not write the line back out otherwise skip.

The existing code in GNU Awk:


  
  c = split(readfile(namefile), a, "\n")
  d = split(readfile(csvfile), b, "\n")
  
  while(i++ < d) {
    if(b[i] == "") continue
    split(b[i],g,",")
    mark = j = 0
    while(j++ < c) {
      if(a[j] == "") continue
      regname = "^" regesc2(a[j]) "$"  # regex match first field
      if( strip(g[1]) ~ regname) {
        mark = 1                       # In delete list - skip
        break
      }
    }
    if(!mark) {                        # Not in delete list - write out line
      print b[i]
    }
  }

Looking for new method ideas before I rewrite in Nim.

Araq (orginal) [2017-04-21T23:48:22+02:00] view original

Import the CSVs into an sqlite database and run a DELETE query on it. Then use the database for further queries and throw away "The Art of Unix Programming".

cheatfate (orginal) [2017-04-22T00:08:58+02:00] view original

First of all you need to memory map your CSV file. After that you can access your file contents in most optimized way. https://en.wikipedia.org/wiki/Memory-mapped_file

stbalbach (orginal) [2017-04-22T04:07:45+02:00] view original

Import the CSVs into an sqlite database and run a DELETE query on it. Then use the database for further queries and throw away "The Art of Unix Programming".

It's funny.. I'm working on a project with someone else who is creating a similar app and he went with the SQL approach. Because of the abstraction of the database he misses tons of problems in the data. Meanwhile with my clear text files and unix tools I'm constantly finding new ways to explore the data and find problems. It just depends what you're doing and what's important. The speed issue isn't a big deal.

stbalbach (orginal) [2017-04-22T04:12:49+02:00] view original

First of all you need to memory map your CSV file

Got it, thanks. Found Module memfiles and will give it a try.

jlp765 (orginal) [2017-04-22T07:00:56+02:00] view original

Another way would be to have the program produce a grep regular expression that is the concatenation of the namefile values into a string

and run a grep command in a shell, like


grep -Ev '^(name1|name2|name3|...),' csvfile > newfile

but then you could do this in nimscript or bash or ...

Araq (orginal) [2017-04-22T09:36:51+02:00] view original

Because of the abstraction of the database he misses tons of problems in the data. Meanwhile with my clear text files and unix tools I'm constantly finding new ways to explore the data and find problems. It just depends what you're doing and what's important.

"abstraction of the database"? well you are supposed to know the basics of SQL at least. And learning SQL is far less work than learning a combination of awk, grep, bash and you're rewarded with a tool that understands you're dealing with a list of columns instead of blobs of bytes. It's really not rocket science to do select * from table limit 1000 to get an impression of the data.

You cannot compare employee A with tools T and employee B with tools S and draw any conclusions from that. You need to compare employee A with tools T vs tools S. (Or employee A vs employee B using the same tool T.)

chemist69 (orginal) [2017-04-22T09:45:35+02:00] view original

you could also look into the nimble packages and see, if there is something suitable for your needs (cough csvtable cough). There are other libs for dealing with csv files there, too and there is also one in the stdlib (parsecsv).

stbalbach (orginal) [2017-04-23T18:13:31+02:00] view original

@jlp765 - implemented the grep solution and is ridiculously fast (10 mins for awk vs 2 seconds for grep). Not sure why the awk script is slow.. ah I see now, it doesn't use "|". For a 30,000 line log file and 1000 line names file = 30,000,000 regex matches.. ugh.

@Araq - you're right SQL (or maybe JSON) is the right way but this is working without the plumbing. They are just log files basically without much dynamic access other than an occasional delete before re-running the app.

jlp765 (orginal) [2017-04-23T21:51:19+02:00] view original

grep doesn't use the PCRE back-tracking library, so is probably faster than implementing it in Nim (which uses PCRE, I believe).

How it would compare with Nim matching strings rather than doing RE matching? I would guess grep would be quicker than Nim unless you can avoid string allocations in Nim (use all cstring stuff).

You would have to try it and see.

Of course, the grep solution doesn't let you play with the data like the SQL solution does :-)

Libman (orginal) [2017-04-24T04:01:04+02:00] view original

Araq is always right... Except when he tells you to throw away TAoUP. (Or makes me use RST, which doesn't let me italicize a link.) :?

There's the minimalist school of software design, perhaps best summarized by cat-v.org's "Harmful Software" page. A lot of people dismiss it as a hairshirt cult, but it's still important to understand their position. I think the comparison of Nim and Golang is a great counter-argument against taking minimalism to impractical extremes, but their arguments should still be treated with respect.

Anyway the question was "what would be the fastest way" (not the most elegant, powerful, interoperable, future-proof, etc) to filter 100,000 CSV records. This seems like a real-world business problem, and re-engineering the whole business process to do it all "the right way" isn't always an option. And there are downsides to using SQLite for high-volume logs that could screw up much more critical operations than this log filter. So, regardless of where one stands on Ivory Tower anti-cat -v-ism questions, this is a matter of finding the fastest grep.

I think this rustland article provides some insights: blog.burntsushi.net/ripgrep :)

Araq (orginal) [2017-04-24T09:47:03+02:00] view original

There's the minimalist school of software design, perhaps best summarized by cat-v.org's "Harmful Software" page. A lot of people dismiss it as a hairshirt cult, but it's still important to understand their position.

Throwing away the structure of your data (or pretending that you can ignore it) has nothing to do with "minimalism".

Anyway the question was "what would be the fastest way" (not the most elegant, powerful, interoperable, future-proof, etc) to filter 100,000 CSV records. This seems like a real-world business problem, and re-engineering the whole business process to do it all "the right way" isn't always an option.

"I cannot import the data into a database because then it wouldn't be a real-world business problem anymore and it would be a re-engineering of a whole business process". Come on, you should know by now that you cannot fool me with big words.

I think the comparison of Nim and Golang is a great counter-argument against taking minimalism to impractical extremes, but their arguments should still be treated with respect.

Oh there is no reason to not respect Pike, since he mostly agrees with me anyway: http://doc.cat-v.org/bell_labs/good_bad_ugly/slides.pdf

"Compare the famous spell pipeline with an interactive spell-checker."
"Tool approach works only at a particular level; for instance, integration is limited in scope. Again, compare spell vs. spell-checkers; pipes of commands vs. COM; pushing data into troff vs. importing through OLE, etc."
"Once, the commonest Unix program was grep. Today, it’s emacs or mozilla. People prefer integrated environments and browsers."

So, regardless of where one stands on Ivory Tower anti-cat -v-ism questions, this is a matter of finding the fastest grep.

No, the original problem was about removing entries from a CSV file, something which grep cannot do particularly well... ;-)

RayoGundead (orginal) [2017-04-25T13:49:42+02:00] view original

SQLite is ridiculously fast and CLI sqlite3 and the resilience cemented it as my go to Application File Format for my projects(even app config files are sqlite files).

@Araq Ya, best way to handle CSV files so far.

Krux02 (orginal) [2017-05-02T11:38:00+02:00] view original

No, the original problem was about removing entries from a CSV file, something which grep cannot do particularly well...

challenge accepted:


grep -f delete -v data.csv > filtered.csv
rm data.csv
mv filtered.csv data.csv

delete is the file with the names to delete from the data, and data.csv is just your data to operate on. filtered.csv is a temporary file. It can also be done without a temporary file by using sed instead of grep, but then it would not be grep anymore.

I don't want to criticise the SQL solution, though.

My suggestion to process the data in nim would be the following, assuming that you wat to further process the filtered list in Nim:

read the names you want to delete and transform them into a single regular expression (concatenate them with or)

create an iterator that uses the line iterator but emits real data objects:

type
  DataElement = object
    name:string
    x,y,z: int

use the previous iterator and match the name with the regular expression (this can declare another another iterator or you can simply use your filtered data at this point)

Mirror of forum.nim-lang.org

2925 :: Fastest way to pack CSV file