Hi there, an aspiring programmer and total Nim noob is asking for your wisdom. To dive deeper into the adventure that programming is, I decided to have a go at Nim. I thought it would be a good idea to reimplement something I already did in Bash, just to have the logic out of the way. This project involves indexing large parts of / and ~, but I want to leave out logs, caches and some other stuff.
Given following dir-structure,
.__file_a
|
|__folder_1
|  |__file_1.a
|  |__file_1.b
|
|__folder_2
   |__file_2.a
   |__file_2.b
I tried the following:
import os, re
for file in walkDirRec ".":
  if file.match(re"\S*folder_1\S*"):
    echo "NO!"
    continue
  echo file
this would output:
./file_a
NO!
NO!
./folder_2/file_2.a
./folder_2/file_2.b
Now I can achieve my goal with that and e.g. write only the paths I want to a file or something. But there are two things bugging me about this: First: In Bash, I used find to pipe the paths into a file. With the prune flag I could stop find from descending into those directories completely and therefore save quite a bit of time. The above method would iterate over every single file anyway. Second: I'd prefer to have an array where the folders to be excluded are stored. As strings maybe? (I played a bit with Python and their os.walk can do that). But my attempts to get this working based on string comparison were futile, to say the least.
Now I know that, as a newcomer to programming, I might very well be off on a completely wrong track and have to tackle things in another way to begin with. But it somehow stumps me that I was able to figure this out in two other languages and am so lost here in Nim. I guess the "easy to pick up" part is more about the syntax than the approach?
Anyway, any little bit of guidance, regardless of direction, would be much appreciated. Greetings, Markus
Python's os.walk is exceptionally convenient and supports this use case. The iterator yields three components: "path", "dirs" and "files". The user has to enumerate "files" (or "dirs") themselves and join them to "path" to get full file names, but can also ignore "dirs" or modify it in place: the iterator only recurses into the entries still listed in "dirs" when it is resumed. So if you ignore "dirs", you get a standard recursion; if you empty it out, you get no recursion below this path; and if you filter it, you get selective recursion. It has a few more bells and whistles that cover just about all use cases I've encountered: https://docs.python.org/2/library/os.html?highlight=walk#os.walk
Worth adding to standard library, I think.
I guess the link to the Python 2 version of the library was only by accident. If some new functionality in Nim should be modeled after Python, refer to the documentation for Python 3.
For most older libraries, there shouldn't be a big difference, but for newer libraries there may be, and even older modules might be improved in Python 3.
So, here's the link: https://docs.python.org/3/library/os.html#os.walk . Note that I used /3/ in the URL, so you'll get the documentation for the most recent Python 3 version. If you select a specific version (e.g. 3.8) from the drop-down menu at the top of the page, you'll get the documentation as of that version.
@chalybeum ..the feature being mentioned seems to be the FilterDescend predicate function of the referenced package. You would just load up a Nim HashSet (from the sets module) with the paths to be skipped and pass some predicate like path notin blacklist, with blacklist probably being a captured closure variable.
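For anyone who wants to see the HashSet idea without any package, here is a minimal stdlib-only sketch. It is not the package's FilterDescend API. It assumes that matching on directory base names (rather than full paths) is good enough, and the names walkPruned, pending and skip are made up for illustration.

```nim
import os, sets

proc walkPruned(root: string; skip: HashSet[string]): seq[string] =
  ## Returns the files below `root`, never descending into a directory
  ## whose base name is in `skip` (hypothetical helper, not a library proc).
  var pending = @[root]              # directories still to be listed
  while pending.len > 0:
    let dir = pending.pop()
    for kind, path in walkDir(dir):
      case kind
      of pcDir:
        if lastPathPart(path) notin skip:
          pending.add path           # only non-excluded dirs get visited
      of pcFile, pcLinkToFile:
        result.add path
      of pcLinkToDir:
        discard                      # symlinked dirs are not followed

when isMainModule:
  let skip = toHashSet(["folder_1", "cache"])
  for path in walkPruned(".", skip):
    echo path
```

Unlike filtering the output of walkDirRec, an excluded directory is never opened here, which is roughly what find's prune does. The order of the returned paths depends on the file system, so sort the result if you need it stable.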
While we are resurrecting a zombie-ish thread to promote a package ;-), I can say something about performance expectations that may be uncommon knowledge. The GNU coreutils find goes through contortions to be able to traverse file hierarchies that are deeper than the limit on open file descriptors. This results in find using something like 3.5x the syscalls, 2.5x the CPU time, and 3x the RAM of more direct implementations. If the data must be read off a persistent IO device, those costs will not be the bottleneck, but on a fully cached run they will be. So, a decent speed-up relative to GNU find on non-pathological file trees is sometimes possible, if that sort of speed-up motivates anyone.
Of course, that doesn't control recursive descent which was @chalybeum's driving use case but you did use a smiley. :-)
Based on possible broader interest and a general trend lately of trying to be less abstract, I just added a template-based tree iteration to cligen/dents.nim: https://github.com/c-blake/cligen/commit/633da63a997269486f3e00432ec4ce37521fb530 with a fully worked out example utility in examples/chom.nim as well as 4 inline cligen.dispatchMulti-driven example usages.
The short of it is that you can make things about 2x-8x faster on Linux if you just trust d_type and you only need path names, not, say, i-node data from lstat/stat/etc. Performance only matters for large directory hierarchies, obviously.
And just to close out the example more fully for @chalybeum, since he said he was a beginning programmer, he could probably start from the code below (after a nimble install 'cligen@#head') to do whatever it was he wanted to six months ago if he even stuck with programming, with Nim and/or this forum:
import sets, posix, cligen/[dents, statx, posixUt]

proc chalybeum(prunePath="", recurse=0, chase=false,
               xdev=false, roots: seq[string]) =
  ## ``prunePath`` file fmt is one base name per line.
  var prune: HashSet[string]
  for line in lines(prunePath): prune.incl line
  for root in roots:
    forPath(root, recurse, false, chase, xdev, depth,
            path, nameAt, ino, dt, lst, st, recFail):
      case errno
      of EXDEV, ENOTDIR: discard   # Expected sometimes
      of EMFILE, ENFILE: return    # Too deep;Stop recurse
      else:
        let m = "chalybeum: \"" & path & "\")"
        perror cstring(m), m.len
    do:
      echo path  # chalybeum logic on `path` goes here
    do:  # Pre-recurse: skip dirs w/base names in prune
      if dt == DT_DIR and path[nameAt..^1] in prune:
        continue
    do:  # Post-recurse; only `path` valid now
      if recFail: echo "did not recurse into ", path

when isMainModule:
  import cligen; cligen.dispatch(chalybeum)
Replacing HashSet checking with regex prunes/excludes is not so hard. It is fast, does avoid symlink infinite loops (when optionally chasing), and conditionally avoids cross-device links, which is kind of the standard set of Unix tree walk functionality, BUT I totally admit it's not a very easy to use programming interface. The logic of the recursive loop leaks out plenty. I just threw it together. I'm not sure it's much better than the example expanded recursion code would be. Maybe a little.
For what it's worth, at least for systems where one can assume post-POSIX.2008 APIs like openat and fstatat (really any vaguely recent Linux/BSD), it is possible to roll your own recursion that, in my timings, is about 4x faster hot cache than walkDirRec (note no trailing 't'). What boost one gets depends on whether you need that Stat metadata (e.g. file times, sizes, owner, perms, etc.) or just path names. Those ideas are in the current cligen/dents.nim:forPath template for Unix users. (It could actually be sped up a couple ways still, but not very portably.)
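To make the "regex instead of HashSet" remark concrete, here is a hand-rolled recursion using only the standard os and re modules. It is a sketch, not the forPath template and not the openat/fstatat approach, so it will not show the speed-ups discussed above. The proc name walkPrunedRe and the example pattern are assumptions for illustration.

```nim
import os, re

proc walkPrunedRe(root: string; skip: Regex): seq[string] =
  ## Returns the files below `root`; a directory whose base name
  ## matches `skip` is never opened (hypothetical helper).
  var pending = @[root]              # directories still to be listed
  while pending.len > 0:
    let dir = pending.pop()
    for kind, path in walkDir(dir):
      if kind == pcDir:
        if not contains(lastPathPart(path), skip):
          pending.add path
      elif kind in {pcFile, pcLinkToFile}:
        result.add path
      # pcLinkToDir is ignored, so symlinked dirs are not followed

when isMainModule:
  # Example pattern: prune dirs named exactly log, logs, cache or .cache
  for path in walkPrunedRe(".", re"^(log|logs|cache|\.cache)$"):
    echo path
```

The pattern is tested against the base name only, so it should be anchored with ^ and $ if whole-name matches are wanted. Not following symlinked directories is the simplest way to stay clear of symlink loops, at the price of not chasing links at all.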
Of course, depending on the scenario/hotness of caches, the boost may not matter much. Costs from the recursion may be tiny compared to IO/other work. Or it could dominate. Personally, I do a lot of work out of a tmpfs /dev/shm bind mount to /tmp which never has any IO.
Mostly I was just giving yet another syntax for packaging up recursions..one that lets the guts hang out more and the calling code has to/gets to be aware of that while maybe having delegated the low-level system stuff to the template author. Nim is pretty great like that.
BTW, I did re-arrange the order of the 4 event clauses to always, preRec, postRec, recFail and provide a recFailDefault template to make things read more nicely. So, my above code example won't quite work as written anymore. Best to start from one of the 4 worked out examples after the template in dents.nim if you want to use it.