I wrote some string-slurping code in Nimrod to see how it would work out for the task, and I was so surprised by the poor performance compared to Java and D that I figured I must be doing something wrong.
The program reads a text file of tab-separated fields, one of which is a label. It takes some of the other fields, which contain whitespace-separated (not tab-separated) words, and stores the number of occurrences of each word on both a per-label and a global basis. Very straightforward, and the Nimrod code looks easy too, but it is unbelievably slow compared to a similar Java version I wrote which uses the Java collections. Is there something obviously wrong with the code below?
import os, re, strutils, streams, tables
addK,V
label : string = fields[1]
title : string = fields[2]
desc : string = fields[3]
words = split(toLower(title & desc), wsRegex)
replace(classTable, label, classInfo)
close(f)
I'm not sure, but putting everything in a main proc so that it doesn't use global variables would be a start. Avoiding copies of the TTable would be next. Using a count table for the counting is another idea. And instead of
words = split(toLower(title & desc), wsRegex)
for word in words:
use for word in split(...), which keeps things lazy. Also compile with -d:release and use a profiler.
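The suggestions above (a main proc, a count table, and iterating over the split directly) can be sketched roughly as follows. This is a minimal sketch that assumes a current Nim compiler rather than the Nimrod release discussed in this thread, so the names differ from the original code: CountTable, toLowerAscii and splitWhitespace are the current standard library names, while countWords and the sample input are made up for illustration.

```nim
import strutils, tables

# Hypothetical helper: counts lower-cased words across all lines.
proc countWords(lines: openArray[string]): CountTable[string] =
  result = initCountTable[string]()
  for line in lines:
    # Looping over the iterator directly avoids building a seq of words first.
    for word in splitWhitespace(toLowerAscii(line)):
      result.inc(word)

# Keeping the driver code in a proc avoids global variables.
proc main() =
  let counts = countWords(["Foo bar", "foo baz"])
  echo counts["foo"]  # prints 2

main()
```

Compiling with -d:release, as suggested above, is still needed before comparing timings.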
I think you're right about the copying; Java made me lazy, with virtually everything being by pointer.
I see the count table, but I'd like to make it work with plain old hash tables first. Making everything a ref leads to some issues though. When I have a hash table whose values are refs to another table, I'm having a bit of trouble making it work. If someone could tell me what's wrong with the following shortened example, I'd appreciate it.
import tables
type
  TClassInfo = tuple
    total : int64
    count : ref TTable[string, int64]
  PClassInfo = ref TClassInfo
  TClassifier = tuple
    wordCount : ref TTable[string, int64]
    classes : ref TTable[string, PClassInfo]
  PClassifier = ref TClassifier
proc newPClassifier() : PClassifier =
  var result : PClassifier
  new(result)
  new(result.wordCount)
  new(result.classes)
  return result
let label : string = "foo"
var classifier : PClassifier = newPClassifier()
if classifier == nil:
  echo("classifier is nil")
else:
  echo("classifier is not nil")
if classifier.classes == nil:
  echo("classifier.classes is nil")
else:
  echo("classifier.classes is not nil")
if classifier.wordCount == nil:
  echo("classifier.wordCount is nil")
else:
  echo("classifier.wordCount is not nil")
echo("classifier.classes has ", len(classifier.classes[]), " elements")
if hasKey(classifier.classes[], label): # SEGFAULT here
  echo("The classifier contains ", label)
else:
  echo("The classifier does not contain ", label)
That code prints the following when run:
classifier is not nil
classifier.classes is not nil
classifier.wordCount is not nil
classifier.classes has 0 elements
Traceback (most recent call last)
mytables.nim(40) mytables
tables.nim(108) hasKey
tables.nim(73) RawGet
SIGSEGV: Illegal storage access. (Attempt to read from nil?)
Assuming that I do want to continue along with the ref TTable, directly translating my Java code into Nimrod, how would I initialize it? I saw initTable, but it doesn't work with refs.
It was trivial to write the Java code (no doubt because I know Java), but writing performant Nimrod is quite a bit more involved, it seems. If you translated the segfaulting code above into Java it would be a bit more verbose, but it would work as expected.
result.classes[] = initTable[string, PClassInfo](1)
in proc newPClassifier before the return works. The argument to initTable (the initial size of the table) must be a power of two; it fails if it is 0.
Also, just FYI, in your newPClassifier() proc you don't need the var result : PClassifier or return result lines, since in Nimrod any procedure which returns something has an implicit result variable (of the return type) which is implicitly returned at the end.
And one more thing, for performance, don't execute code in global space. Put your code in a proc and call it. Can't remember exactly why that is slower (global vars vs thread heap or something), but it is apparently.
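The two points above (initializing the referenced tables, and relying on the implicit result variable) can be sketched together. This is a minimal sketch that assumes a current Nim compiler, not the Nimrod release from this thread: TableRef and newTable are the current standard library names for a heap-allocated table and its constructor, and the ClassInfo, Classifier and newClassifier names are simplified stand-ins for the types in the question.

```nim
import tables

type
  ClassInfo = ref object
    total: int64
    count: TableRef[string, int64]
  Classifier = ref object
    wordCount: TableRef[string, int64]
    classes: TableRef[string, ClassInfo]

proc newClassifier(): Classifier =
  # 'result' is declared implicitly; no 'var result' or 'return result' needed.
  new(result)
  # newTable allocates and initializes the table in one step.
  result.wordCount = newTable[string, int64]()
  result.classes = newTable[string, ClassInfo]()

proc main() =
  let classifier = newClassifier()
  echo classifier.classes.hasKey("foo")  # prints false
  classifier.classes["foo"] = ClassInfo(total: 0, count: newTable[string, int64]())
  echo classifier.classes.len            # prints 1

main()
```

The lookup that crashed in the question now runs on an initialized table, so it simply reports that the key is absent.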
I knew about the syntax for result. I don't particularly care for it, and it seems that I get
nbayes.nim(24, 6) Hint: 'result' is declared but not used [XDeclaredButNotUsed]
warnings when I do it. As I reworked the code, I also noticed that omitting the return result led to segfaults, so the behavior changes just by commenting out a 'return result' line.
Using refs in many places and following the 'all code in a proc' suggestion gets it to take about 40% longer than the Java version, which isn't too bad. The code is much prettier to my eyes too. I'm curious, is there a way to get a qualified name from some module, like tables.initTable or some such?
In order to make the code more Java-like, I added a generic creation routine (analogous to Java's new) like this
proc create[T](value : T) : ref T =
var result : ref T
new(result)
result[] = value
return result
Hopefully that isn't anti-Nimrodic.
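For comparison, the same idea can be written with the implicit result variable. This is a minimal sketch that assumes a current Nim compiler; the proc is renamed to newRef here (a hypothetical name) only to steer clear of the create allocation proc in the system module.

```nim
import tables

# Hypothetical helper: copies a value into a freshly allocated ref.
proc newRef[T](value: T): ref T =
  new(result)       # allocate the ref; 'result' is the implicit return variable
  result[] = value  # copy the value into it

proc main() =
  let boxed = newRef(42)
  echo boxed[]  # prints 42

  # The same helper works for a table value.
  let counts = newRef(initTable[string, int]())
  counts[]["foo"] = 1
  echo counts[].len  # prints 1

main()
```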
deviluno said: Directly translating my Java code into Nimrod
That's your problem, buddy. You can't directly translate code in any language to another. If you translated that Java to C++ you'd see drastic reductions in performance as well. Design a Nimrod program, don't design a Java program and write it in Nimrod. Furthermore, you need to learn a bit more Nimrod before you start claiming it's hard to write performant code - especially since we have idiomatic code outperforming C++ in ray tracers and games, Ada in text operations, etc.
And let me go back and emphasize - Ada is incredibly fast with text, you will not find Java going faster than Ada on text operations, but Nimrod outperforms the Ada in every single text procedure in the Martyr Mega Project (I use it to get familiar with a language), sometimes by an order of magnitude.
As well, have you bothered to profile the code for either version? Where is your bottleneck? What compiler flags are you using? What are your testing methods? Are you taking system time for Nimrod and only run time for Java? Are you not running them a hundred thousand times to get significant data?
P.S. I'm trying not to be rude, please don't take it that way. I'm just naturally a terse person on the internet.