Can I make this code faster? On my computer it takes 14 seconds for a file of 3 million lines. Using memfiles makes it one second faster. Text editors perform the same operation much faster: AkelPad takes 7 seconds, Notepad++ 2-3 seconds.
PS: Experimenting with the buffer size (the bufferSize parameter) had no effect.
import nre,os,times

proc main()=
  var
    fIn,fOut:File
    ARGS = commandLineParams()
    line:TaintedString = ""
    fileTmp = joinPath(getEnv("TEMP"), "tmp.tmp")
    pattern,fileSource,fileDest:string
    countRepl,countLine = 0
    bufferSize = 1024 * 10
  if (ARGS.len < 3 ):
    echo "repl - utility to replace the lines in the file"
    echo "Copyright(C): Gary Galler, 2016. All rights reserved\n"
    echo "Error: Not enough arguments\n"
    echo "Example: repl source.txt \"search string\" \"line for replacement\" dest.txt"
    echo "Example: repl source.txt \"search string\" \"line for replacement\""
    quit()
  fileSource = expandFilename(ARGS[0])
  if not open(fIn,ARGS[0], fmRead, bufferSize): echo "Error: Could not open file:", fileSource;quit()
  if not open(fOut,fileTmp,fmWrite, bufferSize): echo "Error: Could not open file:", fileTmp; quit()
  pattern = ARGS[1]
  while(fIn.readLine(line)):
    countLine+=1
    if line.match(pattern.re).isSome:
      countRepl+=1
      fOut.writeLine(line.replace(pattern.re, ARGS[2]))
    else: fOut.writeLine(line)
  fOut.close()
  fIn.close()
  if (ARGS.len > 3 ):
    fileDest = expandFilename(ARGS[3])
    removeFile(fileDest); moveFile(fileTmp, fileDest)
  else:
    removeFile(fileSource); moveFile(fileTmp, fileSource)
  echo "Produced: line " & $countLine & ", replacements " & $countRepl
# end main

var startTime = epochTime()
main()
var endTime = epochTime()
echo "Time: ",endTime - startTime," seconds"
Do you want to search for regular expressions or for plain strings?
Plain strings should be faster of course.
if line.match(pattern.re).isSome:
does not look very smart to me. You may be creating a new regex on each loop iteration.
What Stefan means is doing something like this:
pattern = ARGS[1]
let rex = pattern.re
and then using rex instead of pattern.re in the two spots inside the while(fIn.readLine(line)) loop.
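As a minimal sketch of that change, the loop below compiles the regex once and reuses it for both the test and the replacement. The proc name replaceLines and the in-memory seq of lines are hypothetical, used only to keep the example self-contained. It also uses find rather than match, on the assumption that the goal is to detect the pattern anywhere in the line (nre's match is anchored to the start of the string).

```nim
import std/[nre, options]

# Hypothetical helper: works on an in-memory seq instead of files.
proc replaceLines(lines: seq[string], pattern, replacement: string): (seq[string], int) =
  let rex = re(pattern)          # compiled once, outside the loop
  var output: seq[string] = @[]
  var count = 0
  for line in lines:
    if line.find(rex).isSome:    # reuse the compiled regex
      inc count
      output.add line.replace(rex, replacement)
    else:
      output.add line
  result = (output, count)

let (res, n) = replaceLines(@["foo bar", "baz", "a foo"], "foo", "qux")
echo res, " ", n
```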
That one change made this run about 4X faster on my machine. [ With memfiles you might be able to get it faster than the editors you mention. ]
Edit: Fixes like yglukhov mentions may also help, but how much depends a lot on the density of matches in the file.
You're welcome, Garry.
Stefan - a simple check for whether pattern has any regex metacharacters (e.g. '.', '*', etc.) can decide if a pattern is definitely not a regular expression. If there are any metacharacters in pattern, well only the user can know and there would have to be a command-line switch to force one interpretation vs. another.
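A rough sketch of such a check follows. The metacharacter set and the name isDefinitelyPlain are assumptions for illustration; the set is a conservative guess and not an exhaustive list of everything PCRE treats specially.

```nim
import std/strutils

# Assumed set of characters that may give a pattern regex meaning.
const regexMeta = {'.', '*', '+', '?', '(', ')', '[', ']', '{', '}', '^', '$', '|', '\\'}

# Hypothetical helper: true only if the pattern cannot be a regex.
proc isDefinitelyPlain(pattern: string): bool =
  for c in pattern:
    if c in regexMeta:
      return false
  result = true

let pattern = "search string"
if isDefinitelyPlain(pattern):
  # Plain substring replacement, no regex engine involved.
  echo "one search string two".replace(pattern, "X")
```

When the check returns false, the pattern might still be meant literally, which is where the command-line switch would come in.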
In terms of absolute maximum performance, one usually wants to not do the search "line at a time" at all. Rather, one wants to use an assembly-optimized "strchr"/"memchr"-like search on the biggest chunks of input possible. In this context that means scanning the input for the least likely character that is definitely necessary for a match. There may be characters much less likely than newlines, and so a big boost may be possible (though note that input data sets do vary). Once that byte offset is known, from there one determines if it is part of a real match. That is all quite a bit more bookkeeping/logic/hassle, obviously.
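To illustrate the idea for a plain-string needle, here is a small sketch that scans a whole buffer for one chosen "rare" character and only then verifies a full match around that offset. The name countMatches and the choice of rare character are hypothetical, and the sketch relies on strutils.find for the character scan; a tuned version would run a memchr-style search directly over the memory-mapped file.

```nim
import std/strutils

# Hypothetical helper: `rare` must be a character that occurs in `needle`.
proc countMatches(data, needle: string, rare: char): int =
  let offset = needle.find(rare)   # where the rare char sits inside the needle
  assert offset >= 0, "rare must occur in needle"
  var pos = 0
  while pos < data.len:
    let hit = data.find(rare, pos) # scan the whole buffer, not line by line
    if hit < 0:
      break
    let start = hit - offset       # candidate start of a full match
    if start >= 0 and data.continuesWith(needle, start):
      inc result
    pos = hit + 1

echo countMatches("xx\nfoo_z bar\nfoo_z\nzz", "foo_z", 'z')
```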
cblake, thanks.
I was just trying to work with memSlices, but I got strange results and I'm at a loss. How do I write this properly? Here is my test code:
import memfiles
import strutils
import nre

var count = 0
var line:string
var fMem = memfiles.open("1.txt", fmRead)
let pattern = "a"
let rex = pattern.re
for slice in fMem.memSlices:
  if slice.size > 0 :
    line = $(cast[cstring](slice.data)) # correct to you?
    if line.match(rex).isSome:
      #echo nre.replace(line, rex, "b") # this does not work - the program hangs
      echo strutils.replace(line,pattern, "b") # it works
      inc(count)
echo count
fMem.close
I think you can't simply cast slice.data to a cstring. A cstring has to be terminated by a 0 byte, and in your code that byte may only be there by accident. What you really want is a string built from exactly slice.size bytes of memory.
Looking into the memfiles module, you can see how lines are returned as strings. I guess that should help you work out how to create a string from the slice.
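A minimal sketch of building a string from a pointer and a length, which is the general shape of the conversion. The helper name sliceToString is hypothetical, and a plain string stands in for the memory-mapped data so the example runs without a file.

```nim
# Hypothetical helper: copies exactly `size` bytes, no terminator needed.
proc sliceToString(data: pointer, size: int): string =
  result = newString(size)
  if size > 0:
    copyMem(addr result[0], data, size)

# Stand-in for mapped file contents: two lines, no 0 byte between them.
var buf = "abc\ndef"
let first = sliceToString(addr buf[0], 3)
let second = sliceToString(addr buf[4], 3)
echo first, " ", second
```

Casting addr buf[0] to cstring instead would read on past the first line until it happened to meet a 0 byte, which matches the strange results described above.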
OderWat, thank you. Everything was actually easier than I thought: memfiles has a stringify operator $ for the MemSlice type. Interestingly, the code using the lines iterator runs somewhat faster than the code using the memSlices iterator. Why?
This runs in 4 seconds on a file of 3_000_000 lines:
pattern = ARGS[1]
rex = pattern.re
for line in fMem.lines:
  countLine+=1
  if line.match(rex).isSome:
    countRepl+=1
    fOut.writeLine(line.replace(rex, ARGS[2]))
  else: fOut.writeLine(line)
# end for
This runs in 4.8 seconds on a file of 3_000_000 lines:
pattern = ARGS[1]
rex = pattern.re
for slice in fMem.memSlices:
  if slice.size > 0:
    countLine+=1
    line = $slice
    if line.match(rex).isSome:
      countRepl+=1
      fOut.writeLine(line.replace(rex, ARGS[2]))
    else:
      fOut.writeLine(line)
# end for