Incremental compilation (IC) has been a long-standing goal for Nim—the ability to recompile only what changed, dramatically speeding up development cycles. I planned to surprise you with a somewhat working IC feature for Christmas, but unfortunately the progress, while steady, is slower than anticipated.
So instead I will talk about how it's being developed and how it will work. The design is taken from Nimony: there is a pre-pass that inspects the import structure of your program and produces a dependency graph. This step uses a helper tool called "Nifler" and is incremental too: it does not reparse your entire program, only the modules that changed. We can do this reliably because the pure parsing step into a tree structure can be done without a symbol table. The dependency graph is then turned into a "nifmake" file (similar to a makefile, but for .nif files); nifmake gives us incremental and parallel builds!
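To make that concrete, here is a toy sketch of such an import-structure scan. It is illustrative only: Nifler parses modules into NIF trees rather than scanning lines, and the proc names below (importsOf, depGraph) are made up for this example.

```nim
import std/[strutils, os, sets, tables]

proc importsOf(file: string): seq[string] =
  # Hypothetical stand-in for the parsing step: collect the modules named in
  # top-level `import a, b, c` statements, without needing a symbol table.
  for line in lines(file):
    let l = line.strip()
    if l.startsWith("import "):
      for m in l[len("import ") .. ^1].split(','):
        result.add m.strip()

proc depGraph(root: string): Table[string, seq[string]] =
  # Walk the import structure starting at `root`, visiting each module once.
  var todo = @[root]
  var seen = initHashSet[string]()
  while todo.len > 0:
    let m = todo.pop()
    if m in seen: continue
    seen.incl m
    result[m] = importsOf(m.addFileExt("nim"))
    for d in result[m]: todo.add d
```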
The nifmake file mostly consists of nim m invocations. The new m switch is a key feature here: nim m x.nim precompiles the file to an x.nif internal representation, but x's dependencies are not recompiled; they are loaded from their respective .nif files instead. The m switch does not detect which files are outdated, that is nifmake's job. (The m switch is also what I use for development of the IC feature.)
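As a rough illustration of that division of labor, here is what such a staleness check could look like; this is an assumed sketch, not nifmake's actual implementation:

```nim
import std/[os, times]

proc outdated(module: string; deps: seq[string]): bool =
  ## A module's precompiled output is stale when it is missing, older than
  ## its source, or older than any dependency's precompiled output.
  let nif = module & ".nif"
  if not fileExists(nif): return true
  let built = getLastModificationTime(nif)
  if getLastModificationTime(module & ".nim") > built: return true
  for d in deps:
    if getLastModificationTime(d & ".nif") > built: return true
```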
After every module has its precompiled .nif file, the code generation backend runs over the set of .nif files and produces C/C++ code as usual. This step is not incremental for the first versions of the IC feature. Instead it uses Nim's existing dead-code-elimination (DCE) infrastructure and only produces code that ends up in the binary. DCE creates a challenge for incremental compilation: If you start using a symbol from a precompiled module that was previously unused (and thus eliminated), you have to recompile that module to regenerate the code.
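A tiny hypothetical example of that scenario (module and proc names made up):

```nim
# lib.nim
proc used*(): int = 1
proc unused*(): int = 2   # no caller yet, so DCE emits no C code for it

# main.nim
import lib
echo used()
# If a later edit adds `echo unused()`, lib.nim itself is unchanged, yet its
# C code has to be regenerated, because a previously eliminated symbol now
# has to end up in the binary.
```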
The bad news is that it's not working yet. The good news is that this implementation of IC finally seems to meet my performance goals: the precompiled modules are not too large, they are fast to load from disk, and the loading is lazy, so what is not needed is not processed at all.
Merry Christmas! I'm looking forward to sharing more progress in the new year.
I hope macros are fully completed and only a fairly straightforward process of creating C code is left?
Macros are fully precompiled and cause no overhead, yes. The process of creating C code is still quite involved, though, as we have to inject destructors etc. and that step is not cached. It can be, but every cache means more files and disk IO, so it's always unclear whether it's beneficial.
It can be cached, but every cache means more files and disk IO, so it's always unclear whether it's beneficial.
Perhaps it could be an option. Some users may choose to mount a filesystem from memory to make I/O overhead almost zero, so I could see that being very useful if tuned properly.
Macros are fully precompiled and cause no overhead, yes.
This sounds like a bit of a trap, or at least highly case-dependent. We can assume that parsing Nim and NIF is about the same for unexpanded code, so if we look at generic expansion, template expansion and macro expansion, it's easy to construct cases that are faster or slower with or without pre-processing. It's easiest to argue for a macro: a macro that outputs a lot of code and doesn't do any significant computation is not going to benefit from pre-expansion, because parsing the outcome will be slower than performing the computation, but a macro that reduces its output (say, hashes the AST or whatever) will benefit.
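A pair of hypothetical macros, not taken from the post, that illustrate the two extremes:

```nim
import std/[macros, hashes]

# Expands its body `n` times: pre-expanding this only trades a cheap
# compile-time loop for parsing/loading a much larger output.
macro unroll(n: static int; body: untyped): untyped =
  result = newStmtList()
  for _ in 0 ..< n:
    result.add copyNimTree(body)

# Reduces its argument to a single literal (a hash of the AST's text):
# caching this expansion genuinely avoids redoing the computation.
macro astDigest(e: untyped): untyped =
  result = newLit(hash(e.treeRepr))

unroll(4):
  echo "hi"                        # expansion is four times the call's size
echo astDigest(foo(1) + bar(x))    # expansion is just one int literal
```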
It's similar with a template: a small template might be more efficient to pre-expand if it appears in a context with complex overload matching (which is an expensive computation), but in many cases I'd assume it's more efficient to load an unexpanded template and expand it during compilation, since the template version is a more compact representation of the same code, especially when considering nested expansions: a NIF file that expands templates recursively will be much larger than its corresponding unexpanded version.
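For instance, with hypothetical templates, the fully expanded form grows at every nesting level:

```nim
template double(x: untyped): untyped = (x + x)
template quad(x: untyped): untyped = double(x) * double(x)
template octo(x: untyped): untyped = quad(x) - quad(x)

echo octo(3)   # expands to ((3 + 3) * (3 + 3)) - ((3 + 3) * (3 + 3))
```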
This brings us to generics. The C++ route of expanding everything in every translation unit (TU) and then reconciling is one of the main reasons compiling C++ is slow: it takes time to generate the same code over and over. In the C++ case, LTO works around many of the performance problems related to this, since optimization can happen post-deduplication and it's the optimization that is computation-heavy. In fact, what one would really like to cache is the post-optimization version of each function, using a key derived from the code structure.
Similarly, pre-expanding per module would cause an explosion in NIF size, and all of that redundancy must later be removed in a process not dissimilar to C++'s one-definition rule. In C++, it's very easy to violate that rule, ending up with broken compiles and rendering the whole process brittle.
So the trap here really is to perform the IC measurements on a simple codebase (where IC arguably doesn't matter) and expect the results to carry over to a complex codebase, where compounded expansions may very well change the equation of what ends up being a "total compilation time win". Beyond getting it to work at all (due to the additional brittleness of the pipeline), this is where similar efforts in both the LTO-caching and JIT worlds, which solve more or less the same problem, have struggled.
a macro that outputs a lot of code and doesn't do any significant computation is not going to benefit from pre-expansion, because parsing the outcome will be slower than performing the computation, but a macro that reduces its output (say, hashes the AST or whatever) will benefit.
What you seem to be missing here is that we don't necessarily have to parse the expanded NIF code in subsequent runs at all. A NIF file has an index, and we use mmap and offsets into the file to load sections on demand. It's not a purely text-based format, it's actually a hybrid.
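A minimal sketch of that "index + mmap + load on demand" idea; the index layout, section names and proc names here are assumptions for illustration, not the actual NIF loader:

```nim
import std/[memfiles, tables]

type
  SectionIndex = Table[string, tuple[offset, len: int]]  # assumed layout
  LazyNif = object
    mf: MemFile          # the whole file stays memory-mapped
    index: SectionIndex  # read once up front: name -> (offset, len)

proc section(n: LazyNif; name: string): string =
  ## Materializes a single section only when it is actually requested;
  ## sections that are never asked for are never copied out of the mapping.
  let (offset, len) = n.index[name]
  result = newString(len)
  if len > 0:
    copyMem(addr result[0], cast[pointer](cast[int](n.mf.mem) + offset), len)

# Usage: open once, then touch only what the current compilation needs.
# let nif = LazyNif(mf: memfiles.open("x.nif"), index: readIndex())  # readIndex is hypothetical
# let body = nif.section("some.section.name")
```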
Later iterations of the design can optimize it further: the text aspect of NIF, which currently helps tremendously during development and debugging, doesn't have to remain. A code generator that does not require the old PNode structures but instead operates on the NIF token streams is also feasible; it is what Nimony already does, after all.
don't necessarily have to parse the expanded NIF code in subsequent runs at all.
This is great - but what happens when you do have to parse it?
This is the main question above, and it seems to me it arises every time you have to monomorphize a generic, re-expand a template or re-apply a macro, which in Nim happens quite often, i.e. even for seemingly trivial changes.
The thing I hope to get from IC (apart from its more obvious tooling applications, where again you need to be able to expand templates and macros fast and not just work with post-expansion data) is a semantic per-function cache: basically what you get after performing the per-function simplification passes. The really interesting thing there is that you can de-duplicate the code. Basically, Table.len is the same for all monomorphizations of Table; most Table instantiations are also the same for all ref types, since the size of the entry is always the same (sizeof ptr) and most ref types semantically behave the same from the point of view of the table, and so on. In this world, each function would be checked for semantic equivalence and cached accordingly, but above all, it could then be optimized only once by the later optimization passes (imagine integrating this with the C/LLVM optimization passes).
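A small sketch of that de-duplication argument, with hypothetical types and a made-up container:

```nim
type
  Node = ref object
  Widget = ref object
  Bag[T] = object
    items: seq[T]

# Does not touch T at all: identical code for every instantiation.
proc size[T](b: Bag[T]): int = b.items.len

# Relies only on properties every ref type shares (pointer-sized entries,
# reference equality), so Bag[Node].has and Bag[Widget].has lower to the
# same code.
proc has[T: ref](b: Bag[T]; x: T): bool =
  for it in b.items:
    if it == x: return true

var a: Bag[Node]
var b: Bag[Widget]
# A per-function cache keyed on a hash of each simplified body could emit
# and optimize `size` (and `has`) once instead of once per instantiation.
discard a.size + b.size
```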
Is the index fine-grained enough for something like this? Deduplicating semantically, similar to how `treetab` does it, would drastically reduce the amount of work that the compiler has to do end-to-end.