nimforum mirror - Can I "prune" directories with walkDirRect?

chalybeum [2019-11-14T03:11:06+01:00]

Hi there, an aspiring programmer and total Nim noob is asking for your wisdom. In order to dive deeper into the adventure that programming is, I decided to have a go at Nim. I thought it would be a good idea to re-implement something I already did in Bash, just to have the logic out of the way. Now this project involves indexing large parts of / and ~, but I want to leave out logs, caches and some other stuff.

Given the following directory structure,


.__ file_a
|
|__folder_1
|     |__file_1.a
|     |__file_1.b
|
|__folder_2
       |__file_2.a
       |__file_2.b

I tried the following:


import os, re

for file in walkDirRec ".":
  if file.match (re"\S*folder_1\S*"):
    echo "NO!"
    continue
  echo file

this would output:


./file_a
NO!
NO!
./folder_2/file_2.a
./folder_2/file_2.b

Now I can achieve my goal with that and e.g. write only the paths I want to a file or something. But there are two things bugging me about this: First: In Bash, I used find to pipe the paths into a file. With the prune flag I could stop find from descending into those directories completely and therefore save quite a bit of time. The above method would iterate over every single file anyway. Second: I'd prefer to have an array where the folders to be excluded are stored. As strings maybe? (I played a bit with Python and their os.walk can do that). But my attempts to get this working based on string comparison were futile, to say the least.

Now I know that as a newcomer to programming I might very well be off on a completely wrong track and have to tackle things in another way to begin with. But it somehow stumps me that I was able to figure this out in two other languages and am so lost here in Nim. I guess the easy-to-pick-up part is more in the syntax than in the approach?

Anyway, any little bit of guidance, regardless of direction, would be much appreciated. Greetings, Markus

sky_khan [2019-11-14T15:01:00+01:00]

walkDirRec recursively searches all files and dirs. If you don't want to enter some directories, I guess you need to implement your own recursive "walking" logic with walkDir; then you can have an "exclude_dirs : seq[string]" variable.
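A minimal sketch of that suggestion, using only the standard os module. The names walkDirPruned and excludeDirs are made up for illustration; the sketch assumes directories are excluded by base name and that symlinked directories are not followed.

```nim
import os

iterator walkDirPruned(root: string, excludeDirs: seq[string]): string =
  ## Hypothetical helper: yields the files under `root` and never enters
  ## a directory whose base name is listed in `excludeDirs`.
  var stack = @[root]              # directories still to be read
  while stack.len > 0:
    let dir = stack.pop()
    for kind, path in walkDir(dir):
      case kind
      of pcFile, pcLinkToFile:
        yield path
      of pcDir:
        # Pruning happens here: an excluded directory is never pushed,
        # so nothing below it is ever read.
        if lastPathPart(path) notin excludeDirs:
          stack.add path
      of pcLinkToDir:
        discard                    # not followed in this sketch

when isMainModule:
  for path in walkDirPruned(".", @["folder_1"]):
    echo path
```

An explicit stack is used because Nim's inline iterators cannot call themselves recursively. The order in which paths come out depends on the file system, so it should not be relied on.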

chalybeum [2019-11-14T16:38:24+01:00]

Thanks, I was hoping to avoid that. But it will be a good exercise. I am also thinking of just calling the existing find and reading what I need from a temp file. Or could I get significantly faster by implementing it myself?
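For reference, calling find does not necessarily need a temp file, since the output can be captured directly. This is only a sketch: findPruned is a hypothetical wrapper, it assumes a find executable on PATH, and it assumes no file name contains a newline.

```nim
import osproc, strutils

proc findPruned(root, pruneName: string): seq[string] =
  ## Hypothetical wrapper: runs the external `find`, pruning every
  ## directory named `pruneName`, and returns the printed file paths.
  let output = execProcess("find",
    args = [root, "-name", pruneName, "-prune", "-o", "-type", "f", "-print"],
    options = {poUsePath})         # no shell involved, args passed as-is
  for line in output.splitLines:
    if line.len > 0:               # skip the empty line after the last path
      result.add line

when isMainModule:
  for path in findPruned(".", "folder_1"):
    echo path
```

Passing the arguments as a list rather than as one command string avoids shell quoting problems with unusual directory names.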

juancarlospaco [2019-11-14T16:59:21+01:00]

One of the overloads of walkdir takes a Posix Glob that can be recursive and filtering at the same time, kinda "**/*.pyc" or similar.

chalybeum [2019-11-15T11:22:45+01:00]

But that would still not omit certain directories, or am I mistaken about something?

cumulonimbus [2019-11-17T12:51:27+01:00]

The Python os.walk is exceptionally convenient and supports such a use case. The iterator returns 3 components: "path", "dirs" and "files". The user has to enumerate "files" (or "dirs") themselves and join them to "path" to get the list of files, but can also ignore "dirs" or modify it. The iterator will only recurse into the entries still listed in "dirs" when it is resumed, so: if you ignore "dirs", you get a standard recursion; if you empty it out, you get no recursion down from this path; and if you filter it, you get selective recursion. It has a few more bells and whistles that cover just about all use cases I've encountered: https://docs.python.org/2/library/os.html?highlight=walk#os.walk
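To make the "filter dirs to steer the recursion" idea concrete in Nim, here is a rough sketch built on walkDir. It is not a stdlib API: walkLikePython and WalkVisitor are invented names, a callback stands in for Python's generator, and symlinked directories are left out for brevity.

```nim
import os, sequtils

type
  # Hypothetical callback type: `dirs` is passed as `var`, so the
  # callback can remove entries to stop descent into them.
  WalkVisitor = proc (dir: string, dirs: var seq[string], files: seq[string])

proc walkLikePython(top: string, visit: WalkVisitor) =
  var dirs, files: seq[string]
  for kind, name in walkDir(top, relative = true):
    case kind
    of pcDir: dirs.add name
    of pcFile, pcLinkToFile: files.add name
    of pcLinkToDir: discard        # ignored in this sketch
  visit(top, dirs, files)          # the callback may shrink `dirs`
  for d in dirs:                   # only recurse into what is left
    walkLikePython(top / d, visit)

when isMainModule:
  proc show(dir: string, dirs: var seq[string], files: seq[string]) =
    dirs = dirs.filterIt(it != "folder_1")   # selective recursion
    for f in files: echo dir / f
  walkLikePython(".", show)
```

Leaving dirs untouched gives a full recursion, and setting it to an empty seq stops the descent below that directory, mirroring the three cases described above.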

Worth adding to standard library, I think.

sschwarzer [2019-11-17T21:16:06+01:00]

I guess the link to the Python 2 version of the library was only by accident. If some new functionality in Nim should be modeled after Python, refer to the documentation for Python 3.

For most older libraries, there shouldn't be a big difference, but for newer libraries there may be, and even older modules might be improved in Python 3.

So, here's the link: https://docs.python.org/3/library/os.html#os.walk . Note that I used /3/ in the URL, so you'll get the documentation for the most recent Python 3 version. If you select a specific version (e.g. 3.8) from the drop-down menu at the top of the page, you'll get the documentation as of that version.

HVN [2020-05-25T08:25:05+02:00]

I was asking the same question on IRC and just found this. Coming from Python, I wrote my first Nim program by converting an existing script that scans a directory of 300k files to filter out 25k files. The Nim version takes ~17s to run because it scans all the directories, whereas find -prune, or Python's os.walk with the excluded dirs removed from dirs, runs in 1s. This would be a really great feature to have in the stdlib.

timothee [2020-05-27T01:51:17+02:00]

this is probably what you're looking for https://github.com/citycide/glob but IMO there should be something equivalent in stdlib

cblake [2020-05-27T15:25:48+02:00]

@chalybeum: the feature being mentioned seems to be the FilterDescend predicate function of the referenced package. You would just load up a Nim HashSet (from the sets module) with the paths to be skipped and pass a predicate like path notin blacklist, with blacklist probably being a captured closure variable.
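The predicate-plus-HashSet idea can be sketched without the package. This does not use the glob package's actual API; walkFiltered and makeSkipFilter are hypothetical stand-ins that only illustrate how a closure over a HashSet can decide whether to descend.

```nim
import os, sets

proc walkFiltered(root: string,
                  descend: proc (path: string): bool): seq[string] =
  ## Hypothetical walker: asks `descend` before entering each directory.
  var stack = @[root]
  while stack.len > 0:
    let dir = stack.pop()
    for kind, path in walkDir(dir):
      if kind == pcFile:
        result.add path
      elif kind == pcDir and descend(path):
        stack.add path

proc makeSkipFilter(skipped: HashSet[string]): proc (path: string): bool =
  ## Returns a closure that captures `skipped`.
  result = proc (path: string): bool = path notin skipped

when isMainModule:
  # Paths are compared exactly as walkDir reports them, e.g. "./folder_1".
  let skip = toHashSet(["./folder_1"])
  for path in walkFiltered(".", makeSkipFilter(skip)):
    echo path
```

A HashSet keeps the membership test cheap even with a long skip list, which is the main reason to prefer it over a seq here.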

While we are resurrecting a zombie-ish thread to promote a package ;-), I can say something about performance expectations that may be uncommon knowledge. GNU find (from findutils) goes through contortions to be able to traverse file hierarchies that are deeper than the limit on open file descriptors. As a result, find uses about 3.5x the syscalls, 2.5x the CPU time, and 3x the RAM of more direct implementations. If the data must be read off a persistent IO device, those costs will not be the bottleneck, but on a fully cached run they will be. So, a decent speed-up relative to GNU find on non-pathological file trees is sometimes possible, if that sort of speed-up motivates anyone.

juancarlospaco [2020-05-27T17:13:24+02:00]

... but walkPattern() does take a glob pattern. :)
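A small sketch of what walkPattern gives you. The wrapper name filesMatching is made up, and the note about matching is my understanding rather than something stated in this thread: the pattern is handed to the platform's globbing facility, so it selects matching paths but offers no hook for skipping a directory during a recursive walk.

```nim
import os

proc filesMatching(pattern: string): seq[string] =
  ## Hypothetical wrapper that collects everything `walkPattern` yields.
  for path in walkPattern(pattern):
    result.add path

when isMainModule:
  # Lists the *.a entries directly inside folder_2, nothing deeper.
  for path in filesMatching("folder_2/*.a"):
    echo path
```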

cblake [2020-05-27T19:14:31+02:00]

Of course, that doesn't control recursive descent which was @chalybeum's driving use case but you did use a smiley. :-)

Based on possible broader interest and a general trend lately of trying to be less abstract, I just added a template-based tree iteration to cligen/dents.nim: https://github.com/c-blake/cligen/commit/633da63a997269486f3e00432ec4ce37521fb530 with a fully worked out example utility in examples/chom.nim as well as 4 inline cligen.dispatchMulti-driven example usages.

The short of it is that you can make things about 2x-8x faster on Linux if you just trust d_type and you only need path names, not, say, i-node data from lstat/stat/etc. Performance only matters for large directory hierarchies, obviously.

kaushalmodi [2020-05-27T20:04:34+02:00]

Wow!.. why is it in cligen though? It looks like it can be a separate find-competition package :)

cblake [2020-05-27T20:08:34+02:00]

I put various things that "might be useful" for Unix CLI utilities under cligen/ so client code can just have cligen as a leaf/sole dependency. Directory tree recursion fits that pattern, and stdlib walkDir* never seemed quite right to me. There is already cligen/posixUt.recEntries, for example.

cblake [2020-05-27T21:12:05+02:00]

And just to close out the example more fully for @chalybeum, since he said he was a beginning programmer, he could probably start from the code below (after a nimble install 'cligen@#head') to do whatever it was he wanted to six months ago if he even stuck with programming, with Nim and/or this forum:

import sets, posix, cligen/[dents, statx, posixUt]
proc chalybeum(prunePath="", recurse=0, chase=false,
              xdev=false, roots: seq[string]) =
  ## ``prunePath`` file fmt is one base name per line.
  var prune: HashSet[string]
  for line in lines(prunePath): prune.incl line
  for root in roots:
    forPath(root, recurse, false, chase, xdev, depth,
            path, nameAt, ino, dt, lst, st, recFail):
      case errno
      of EXDEV, ENOTDIR: discard # Expected sometimes
      of EMFILE, ENFILE: return  # Too deep;Stop recurse
      else:
        let m = "chalybeum: \"" & path & "\")"
        perror cstring(m), m.len
    do:
      echo path # chalybeum logic on `path` goes here
    do: # Pre-recurse: skip dirs w/base names in prune
      if dt == DT_DIR and path[nameAt..^1] in prune:
        continue
    do: # Post-recurse; only `path` valid now
      if recFail: echo "did not recurse into ", path
when isMainModule:
  import cligen; cligen.dispatch(chalybeum)

Replacing HashSet checking with regex prunes/excludes is not so hard. It is fast, does avoid symlink infinite loops (when optionally chasing), and conditionally avoids cross-device links which is kind of the standard set of Unix tree walk functionality, BUT I totally admit it's not a very easy to use programming interface. The logic of the recursive loop leaks out plenty. I just threw it together. I'm not sure it's much better than the example expanded recursion code would be. Maybe a little.

timothee [2020-06-01T23:54:07+02:00]

see also walkDirRecFilter since https://github.com/nim-lang/Nim/pull/14501 but for now it's internal use only until API is deemed good

cblake [2020-06-08T17:41:05+02:00]

For what it's worth, at least for systems where one can assume post-POSIX.2008 APIs like openat and fstatat (really any vaguely recent Linux/BSD), it is possible to roll your own recursion that, in my timings, is about 4x faster hot cache than walkDirRec (note no trailing 't'). What boost one gets depends on whether you need that Stat metadata (e.g. file times, sizes, owner, perms, etc.) or just path names. Those ideas are in the current cligen/dents.nim:forPath template for Unix users. (It could actually be sped up a couple ways still, but not very portably.)

Of course, depending on the scenario/hotness of caches, the boost may not matter much. Costs from the recursion may be tiny compared to IO/other work. Or it could dominate. Personally, I do a lot of work out of a tmpfs /dev/shm bind mount to /tmp which never has any IO.

Mostly I was just giving yet another syntax for packaging up recursions..one that lets the guts hang out more and the calling code has to/gets to be aware of that while maybe having delegated the low-level system stuff to the template author. Nim is pretty great like that.

BTW, I did re-arrange the order of the 4 event clauses to always, preRec, postRec, recFail and provide a recFailDefault template to make things read more nicely. So, my above code example won't quite work as written anymore. Best to start from one of the 4 worked out examples after the template in dents.nim if you want to use it.

Mirror of forum.nim-lang.org

5506 :: Can I "prune" directories with walkDirRect?