I'm happy to announce that GuildenStern 0.9 has just been released:
https://github.com/olliNiinivaara/GuildenStern
If you are interested in multithreading or web programming, please check it out.
I'll publish 1.0 to nimble after serious bugs (if any) have been detected and fixed.
My original idea was to use Weave, but as its support for low latency is still in the research phase, I stuck with a threadpool for now. But the opportunity is there...
Any feedback welcome.
Cheers!
For those interested in latency-optimized / IO-optimized multithreading, here are a couple of interesting pointers.
A budget system to prevent one task from hogging the scheduler, as used in Tokio, Rust's main multithreaded IO runtime.
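To make the idea concrete, here is a toy sketch of a budget-limited round-robin scheduler (this is just an illustration of the concept, not Tokio's actual mechanism; `BUDGET` and the task names are made up):

```python
# Toy illustration of the "task budget" idea: each task may perform at
# most BUDGET units of work per scheduling turn, after which it must
# yield back to the scheduler, so no single task can monopolize it.
from collections import deque

BUDGET = 4  # hypothetical per-turn work budget

def run(tasks):
    """Round-robin scheduler; tasks are generators yielding once per work unit."""
    queue = deque(tasks)
    order = []
    while queue:
        name, task = queue.popleft()
        for _ in range(BUDGET):
            try:
                next(task)                  # perform one unit of work
                order.append(name)
            except StopIteration:
                break                       # task finished: drop it
        else:
            queue.append((name, task))      # budget exhausted: requeue

    return order

def work(units):
    for _ in range(units):
        yield

# A "greedy" 10-unit task no longer starves the short 2-unit one:
print(run([("greedy", work(10)), ("short", work(2))]))
```

Without the budget the greedy task would run to completion first; with it, the short task gets a turn after four units.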
For IO, I find the documentation on Windows I/O Completion Ports (IOCP) quite interesting:
So basically they advocate a threadpool waiting on an IOCP (or APC, Asynchronous Procedure Call: https://docs.microsoft.com/en-us/windows/win32/api/processthreadsapi/nf-processthreadsapi-queueuserapc), and when any completion is ready, processing continues on an available thread.
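A rough userland analogue of that pattern (a real IOCP is a kernel object; the `queue.Queue` here is just a stand-in for illustration): all worker threads block on one completion queue, and whichever thread is free picks up the next completed operation.

```python
# Worker threads all block on a shared "completion port"; any free
# thread dequeues and handles the next completed IO operation.
import queue
import threading

completion_port = queue.Queue()
results = []
lock = threading.Lock()

def worker():
    while True:
        item = completion_port.get()
        if item is None:              # sentinel: shut this worker down
            return
        with lock:
            results.append(f"handled {item}")

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

for i in range(8):                    # pretend 8 IO operations completed
    completion_port.put(i)
for _ in threads:                     # one shutdown sentinel per worker
    completion_port.put(None)
for t in threads:
    t.join()

print(sorted(results))
```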
On Linux, it seems way messier with epoll, io_uring or even AIO:
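As a minimal illustration of the readiness style that epoll exposes, here is a sketch using Python's `selectors` module (which wraps epoll on Linux): the kernel reports that a descriptor is readable, and we then perform the non-blocking read ourselves — in contrast to completion-based APIs like io_uring or IOCP, where the kernel hands back an already-finished operation.

```python
# Single-threaded readiness loop: register a socket, wait for the
# kernel to report it readable, then read without blocking.
import selectors
import socket

sel = selectors.DefaultSelector()        # epoll under Linux
a, b = socket.socketpair()
a.setblocking(False)
sel.register(a, selectors.EVENT_READ)

b.sendall(b"hello")                      # make `a` readable

received = None
for key, events in sel.select(timeout=1):
    received = key.fileobj.recv(1024)    # readiness reported: won't block

sel.unregister(a)
a.close()
b.close()
print(received)
```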
From those I get that an efficient multithreading runtime for IO will need:
A couple of very interesting Rust write-ups on this topic:
- https://aturon.github.io/blog/2016/08/11/futures/
- https://aturon.github.io/blog/2016/09/07/futures-design/
- https://jblog.andbit.net/2019/11/10/rust-async-execution/
- https://tmandry.gitlab.io/blog/posts/optimizing-await-1/
In particular, they distinguish between traditional completion-based futures and Rust's own poll-based futures. Completion-based futures require a buffer per future and therefore more memory allocations, which are problematic because they stress the GC and lead to memory fragmentation in long-running applications. The poll approach is attractive because it eases cancellation (just don't poll) and, since there is no heap indirection for the future, the compiler can do deep optimizations.
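A toy model of the poll-based design may help (all names here are made up for illustration; real Rust futures also involve wakers so the executor doesn't busy-poll): the executor repeatedly polls, the future returns a pending marker until its result is ready, and cancellation is literally "stop polling".

```python
# Minimal poll-based future and a trivial blocking executor.
PENDING = object()   # sentinel meaning "not ready yet"

class CountdownFuture:
    """Becomes ready after being polled `ticks` times."""
    def __init__(self, ticks, value):
        self.ticks, self.value = ticks, value

    def poll(self):
        self.ticks -= 1
        return self.value if self.ticks <= 0 else PENDING

def block_on(future, max_polls=100):
    """Poll in a loop until the future is ready."""
    for _ in range(max_polls):
        result = future.poll()
        if result is not PENDING:
            return result
    # Cancellation in this model is simply ceasing to poll:
    raise TimeoutError("gave up polling")

print(block_on(CountdownFuture(3, "done")))
```

Note there is no per-future result buffer handed to the kernel and no callback registration; the state lives inside the future object itself, which is what enables the optimizations mentioned above.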
Note: this is from the perspective of writing a full-blown high-performance multithreaded IO runtime. As you can see, there are many design tradeoffs to consider: futures API (completion- vs poll-based), kernel vs userspace (task budget tracking), OS event primitives and resumable function primitives.
Obviously you can always go the current way, threadpool + the current async, which can already give decent performance (GuildenStern and Httpbeast use this).
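In Python terms, that baseline amounts to dispatching blocking handlers onto a pool and collecting the results (`handle` is a stand-in for real request handling):

```python
# The "threadpool" baseline: hand each request to a pool of worker
# threads and gather the responses.
from concurrent.futures import ThreadPoolExecutor

def handle(request):
    return f"response to {request}"   # placeholder for real work

with ThreadPoolExecutor(max_workers=4) as pool:
    responses = list(pool.map(handle, ["req1", "req2", "req3"]))

print(responses)
```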
The service I'm using (http://htmlpreview.github.io/) is apparently facing the kiss of death due to Nim Forum popularity.
Try again when traffic goes down, and the docs should be OK.
That's a lot of knowledge to digest... Let's not hijack this thread, but continue the research within Weaver's guild. But I cannot resist adding yet another interesting piece here: https://kristoff.it/blog/zig-colorblind-async-await
Httpbeast and Guildenstern are quite different beasts under the hood.
Httpbeast runs one selector loop per physical core and handles requests with async/await and non-blocking I/O.
GuildenStern runs one selector loop and spawns a new thread for every request, using blocking I/O and no async/await constructs.
In theory, Httpbeast fares better when there are few cores, when writes may block, or when there are lots of "light" requests. GuildenStern should dominate when many CPU cores are available, a reverse proxy is used to buffer outgoing writes, and requests are "heavy" (requiring lots of CPU work).
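The GuildenStern-style model described above can be sketched like this (heavily simplified: a real server would accept in a loop and handle errors; this one serves a single local request to show the shape):

```python
# One accept loop spawns a fresh thread per connection; each thread
# uses plain blocking I/O, no async/await.
import socket
import threading

def handle(conn):
    with conn:
        data = conn.recv(1024)           # blocking read, fine in its own thread
        conn.sendall(b"echo: " + data)   # blocking write

server = socket.socket()
server.bind(("127.0.0.1", 0))            # OS-assigned free port
server.listen()
port = server.getsockname()[1]

def serve_one():
    conn, _ = server.accept()
    threading.Thread(target=handle, args=(conn,)).start()

threading.Thread(target=serve_one).start()

client = socket.create_connection(("127.0.0.1", port))
client.sendall(b"hi")
reply = client.recv(1024)
client.close()
server.close()
print(reply)
```

Because each request owns a whole thread, a slow or CPU-heavy handler blocks only itself — which is exactly the "heavy requests, many cores" case where this design should shine.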
There's a simple benchmark against Httpbeast in the repo.
We are talking about performance differences on a sub-millisecond scale, which don't matter in practical web development - just think of what performance Django or Rails offer...