nimforum mirror - Performance of fastRows at module db

alfrednewman (orginal) [2016-10-09T03:02:21+02:00] view original

Greetings,

I have a big production SQLite table with 4 Gig+ and around 12M records. (named mydb.db3)

The code below is looking for 10 records (of lookup_input.csv) at mydb using the FTS feature available at SQLite.

Given the thread http://forum.nim-lang.org/t/2539#15760 I am using the latest DLL of SQlite in my project.

The Nim code is spending around 14 secs to do the job. My Python code does exactly the same spending just 9 secs.

Just some thoughts:

Does the Nim module db_sqlite is properly optimized ? (comparing with the Python one)

Am I doing something wrong in my Nim code ?

Best,

import strutils, db_sqlite, streams, parsecsv, times

proc lookup_match(): string =
  var
    p: CsvParser
    company : string
    lookup_id = ""
    lookup_name_source = ""
  
  let mydb = open(r"C:\Users\AlfredN\mydb.db3", nil, nil, nil)
  p.open(r"C:\Users\AlfredN\lookup_input.csv")
  p.readHeaderRow()
  
  while p.readRow():
    for col in items(p.headers):
      lookup_id = p.rowEntry(col).split('|')[0]
      lookup_name_source = p.rowEntry(col).split('|')[1].toUpperAscii()
      for row in mydb.fastRows(sql"select count(*) from tb_reference_pj where company match ?", (lookup_name_source)):
        if row[0] != "":
          for r in mydb.fastRows(sql"select * from tb_reference_pj where company match ?", (lookup_name_source)):
            if len(r[1]) <= 80:
              company = r[1]
              if lookup_name_source == company:
                echo company
  
  mydb.close()
  p.close()

when isMainModule:
  let t0 = cpuTime()
  echo lookup_match()
  let t1 = cpuTime()
  echo "* Time elapsed *: ", $(t1 - t0)

flyx (orginal) [2016-10-09T11:27:10+02:00] view original

I cannot really answer the question whether there are missed optimization opportunities in Nim's SQLite wrapper, but being a wrapper, it sounds unlikely.

There are some things about your code I can comment on:

lookup_id = p.rowEntry(col).split('|')[0]
lookup_name_source = p.rowEntry(col).split('|')[1].toUpperAscii()

This is inefficient because you are splitting the string multiple times. You can do this instead:


let colContent = p.rowEntry(col).split('|')
template lookup_id(): untyped = colContent[0]
template lookup_name_source(): untyped = colContent[1]

But that concerns the CSV and should not be much of a performance hog (unless your CSV is very large).

Now the actual part where you could use optimization are these lines:

for row in mydb.fastRows(sql"select count(*) from tb_reference_pj where company match ?", (lookup_name_source)):
  if row[0] != "":

What are they for? the result will never be "", because count(*) always returns some number (correct me if I'm wrong, I haven't done SQL for some time). You are issuing a potentially expensive query here for no reason.

Onto the next query:

for r in mydb.fastRows(sql"select * from tb_reference_pj where company match ?", (lookup_name_source)):

You are querying every column value from a table although you are using only one value afterwards. Okay, to be fair, the following lines are probably not your actual code, right? Note that if you did the template thing I showed, you have to use lookup_name_source() here.

Your code would probably be faster by just removing the first query.

Stefan_Salewski (orginal) [2016-10-09T12:03:34+02:00] view original

The split was the only optimization I had discovered myself just before the detailed answer of flyx.

But we do not need a template. This should work too:

var lookup_id, lookup_name_source: string

(lookup_id, lookup_name_source) = p.rowEntry(col).split('|')[0..1]
lookup_name_source = lookup_name_source.toUpperAscii

I would prefer this solution, it is simpler, no template involved. I think that unfortunately both solutions do a full split and may generate a large temporary array of strings, while only the first two are desired. And for toUpperAscii() we may like to have a variant which works in place instead allocating a new string.

flyx (orginal) [2016-10-09T12:35:25+02:00] view original

Yes, Stefan's solution is somewhat nicer, didn't think about that. If there is more data involved, split may be further accelerated by setting maxsplit:

let (lookup_id, lookup_name_source) = p.rowEntry(col).split('|', 2)[0..1]

abdulhaq (orginal) [2016-10-09T18:13:48+02:00] view original

You could profile both the python and nim code and compare percentages of run time for each line of code between the two languages.

If you do that then I'd like to see the results :-)

alfrednewman (orginal) [2016-10-09T21:02:47+02:00] view original

Thank you for the support guys.

The weird statements surrounding the operations with SQL were created just for benchmarking's sake with Python.

Anyways, optimizing the handlers of the CSV, due it is not that big, brought no real gain of performance. Comparing the elapsed time of Nim and Python, even so, the second is still faster (around 20%).

So, I decided to make some modifications in both codes, as per the snipped below, where I removed the SQL statement related to the "select count(*)".

For my surprise, now, both codes are running at the same speed !!

I will profile both asap... however, one doubt: - What is the best way to profile a code in Nim ? Can you pls provide me some directions ?

Cheers,

while p.readRow():
    for col in items(p.headers):
      (lookup_id, lookup_name_source) = p.rowEntry(col).split('|')[0..1]
      lookup_name_source = lookup_name_source.toUpperAscii
      
      counter = 1
      for r in stingraydb.fastRows(sql"select cnpj, company from tb_reference_pj where company match ?", (lookup_name_source)):
        if counter > 20:
          echo r[1]
          break
        
        counter += 1
        if len(r[1]) <= 80:
          company = r[1]
          if lookup_name_source == company:
            echo company

Libman (orginal) [2016-10-09T21:29:27+02:00] view original

Just curious about a few things things:

How do Nim compiler settings (--d:release --opt:speed) affect this benchmark? Did you make any tweaks to nim.cfg that might explain slow performance? Are you using PyPy for Python?

Y U no use PostgreSQL? Sometimes staying with SQLite is perfectly valid, even for 4GB+ in production, but sometimes it's not...

alfrednewman (orginal) [2016-10-09T21:58:55+02:00] view original

Im using Nim 0.15 with -d:release and --opt:speed.

My Python is a 64 bit v2.7.12 without Pypy.

I need SQLite mainly because it is a part of a desktop app, where any other SQL server is not suitable. And BTW this .db3 is used just for readonly operations. No CUD is done there.

Mirror of forum.nim-lang.org

2571 :: Performance of fastRows at module db_sqlite