Ok, this is a rather newbie question, but despite all my experiments I'm still confused.
Let's say we want an array of ints.
In Nim, I would do it either like this:
var myArray: array[200, int]
or like this:
var myArray: seq[int] = newSeq[int](200)
But what are the real differences between the two options? I mean, obviously, the first one declares a fixed-size array, so if our goal is to have an array that grows and shrinks at runtime, I guess the only option is to use a seq. But what if we actually do want to set an upper limit on our array?
Long story short: in the stack-based bytecode VM I'm currently experimenting with, I find myself mostly using the first type (that is: "simple" fixed-size arrays). What are the drawbacks (memory-, safety-, performance-wise)? What should I pay attention to?
I think that question has been answered many times already, and I am nearly sure you know the answer already. At least you are no newbie; I have seen your name in this forum for at least a few months already :-)
A Nim array is basically a C array located on the stack, so just a fixed block of memory. As it is located on the stack, no alloc/dealloc is needed; allocation is just a matter of increasing the stack pointer. Don't ask me how returning an array from a proc works exactly, but I assume it also works without alloc/dealloc.
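For instance, a tiny sketch to see that the array value is just the block of memory itself (assuming a typical 64-bit target where int is 8 bytes):

proc demo() =
  var a: array[200, int]   # lives entirely on demo()'s stack frame
  echo sizeof(a)           # 1600 on a 64-bit target: 200 * 8 bytes, no header
  # no alloc/dealloc: the block simply vanishes when demo() returns

demo()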
seq and string are basically objects: at least a length field and a pointer to the payload, which is the real data content. The content is dynamically allocated and deallocated. When data elements are added, the data buffer may get exhausted; then a new payload buffer is allocated and the data is copied there. There may be exceptions: when the area after the buffer is still unused, the OS may allow growing a buffer already in use, an operation called realloc() or something like that. seq and string work very similarly to C++ std::vector. Details may vary for Nim v2, newruntime, ARC memory management and whatever else may come.
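A tiny sketch to actually watch that reallocation happen (addresses will vary; the exact growth policy is an implementation detail):

var s = newSeq[int](4)
let before = cast[uint](addr s[0])
for i in 0 ..< 1000: s.add i        # exhausts the small buffer, forcing a new allocation
let after = cast[uint](addr s[0])
echo before == after                # usually false: the payload moved to a bigger buffer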
Note that C may allow dynamically sized arrays living on the stack. That can work for special cases, as the stack pointer can be increased dynamically at runtime. There is a forum post about this behaviour.
And finally there is openArray, basically a container for passing an array or seq to procs, so I assume it is a length field and a pointer to the payload.
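Which is why a proc taking an openArray accepts both, e.g. (minimal sketch):

proc total(xs: openArray[int]): int =
  for x in xs: result += x

var fixed = [1, 2, 3]        # array[3, int], on the stack
var growable = @[4, 5, 6]    # seq[int], heap-allocated payload
echo total(fixed)            # both calls pass (pointer, length) under the hood
echo total(growable)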
Thanks a lot for the explanation!
This is pretty much what I expected - but there's no harm in getting a second opinion to make sure you got things right, no? :)
Basically, I have experimented with pretty much everything A LOT, benchmarked it, and checked the produced C code, but I wanted to make sure... And from what I've observed and from what you say, we're definitely on the same page.
My latest experiments point to the same conclusion as well:
var arr: array[10,byte]
echo cast[uint64](addr arr)
echo cast[uint64](addr arr[0])
echo cast[uint64](addr arr[1])
echo cast[uint64](addr arr[2])
The first two - quite obviously - point at the beginning of the array. And the next ones point at the consecutive "cells" of memory, one byte apart...
In any case, again: thanks ;-)
Maybe one additional note:
When we access an element of an array, that generally happens with an offset from the stack pointer. The stack pointer is generally fixed while we are inside a proc, and it is generally kept in a register, so to access an array element, an offset is added to the stack pointer to locate the element. If the array index is a constant, then the offset is constant too. If the array index is a variable, then the offset is the product of element_size and index, so there is a runtime multiplication in the latter case.
To access an element of a seq, there is some indirection: first the pointer to the payload (the base address) is moved into a CPU register, and an offset = (element_size * index) is added to locate the element. When we access many elements of a seq, the payload address needs to be moved into a CPU register only once, so accessing many elements is not really slower for a seq than for an array.
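To make those cases concrete (the comments sketch roughly what the generated code does; exact instructions depend on the optimizer):

var a: array[8, int]
var s = newSeq[int](8)
let i = 3
a[5] = 1   # constant index: folded into a fixed stack-pointer offset
a[i] = 1   # variable index: address = base + i * sizeof(int), computed at runtime
s[i] = 1   # seq: one extra step, the payload pointer is loaded into a register first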
This is a very interesting point.
Given that what I'm currently doing is building a stack-machine bytecode interpreter (to replace my AST-walking interpreter) - and I'm already seeing some serious performance gains - I'm trying to use every trick available, so these subtle details do matter.
Currently, my simple "stack" looks pretty much like this:
#[######################################################
Constants
======================================================]#
const
  MAX_STACK_SIZE = 10_000_000
#[######################################################
Types
======================================================]#
type
  Stack[T] = array[MAX_STACK_SIZE, T]
  Value = uint64
#[######################################################
Global variables
======================================================]#
var
  MainStack*: Stack[Value]   # my main stack
  MSP*: int                  # pointer to the last element
#[######################################################
Implementation
======================================================]#
template push*(v: Value) =
  let tmp = v       # evaluate v first: push(pop()+pop()) must read the operands...
  inc(MSP)          # ...before this inc reserves the new top slot
  MainStack[MSP] = tmp
template pop*(): Value = dec(MSP); MainStack[MSP+1]
template popN*(x: int) = dec(MSP, x)
template top*(x: int = 0): Value = MainStack[MSP-x]
so... normally, an ADD instruction in my {.computedGoto.} interpreter loop would be something like this:
case OpCode
# other cases
of ADD_OP: push(pop()+pop()); inc(ip)
# let tmp = ((dec(MSP); MainStack[MSP+1]) + (dec(MSP); MainStack[MSP+1])); inc(MSP); MainStack[MSP] = tmp
# ...
which I'm optimizing further (I think... lol) by doing it like this:
case OpCode
# other cases
of ADD_OP: top(1) = top(0)+top(1); dec(MSP); inc(ip)
# MainStack[MSP-1] = MainStack[MSP]+MainStack[MSP-1]; dec(MSP)
# ...
So... lots of different things going on...
Any ideas to make it better (and more performant) are more than welcome! :)
This is a very interesting point.
And to make it more precise: when we access an arbitrary element of an array/seq, base_address + index * element_size is calculated to determine its memory address and access it. If multiple elements are accessed one after the other, a compiler is generally smart enough to see that only adding element_size to the current address is needed to reach the next element. Long ago, when compilers were not that smart, constructs like my_element_ptr++ were used to access the next element. But today, a plain index generally generates the same optimized code when possible.
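For illustration: with a plain indexed loop like the one below, the optimizer typically strength-reduces the per-element multiplication to a simple pointer increment, so indexing costs nothing extra:

var buf: array[1024, float]
for i in 0 ..< buf.len:
  buf[i] = buf[i] * 2.0   # i * sizeof(float) becomes "bump a pointer by 8" per iteration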
As a point of clarification: depending a bit on the OS and sysadmin settings, one can usually increase stack limits with the ulimit -s unlimited builtin of most shells, via /etc/limits or /etc/security/limits.conf on Linux, or even in-program via setrlimit (if not blocked by admin settings). (Nim's stdlib could use an RLIMIT_STACK constant to ease that last one.)
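For the in-program route, a sketch along these lines should work on POSIX systems (untested; since the stdlib lacks the constant, we pull RLIMIT_STACK from the C header ourselves):

import posix

var RLIMIT_STACK {.importc: "RLIMIT_STACK", header: "<sys/resource.h>".}: cint

var rl: RLimit
if getrlimit(RLIMIT_STACK, rl) == 0:
  rl.rlim_cur = rl.rlim_max   # raise the soft limit up to the hard limit
  if setrlimit(RLIMIT_STACK, rl) != 0:
    quit "setrlimit failed; the admin may have blocked stack limit increases"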
The default limits are usually much lower than, say, system memory divided by the number of expected threads. It would be very rare for all threads to use their limit anyway. When the limit is exceeded, your thread/process just crashes, much as if you had set the limit to unlimited and used all system memory.
Historically, low default stack limits mostly come from systems with "spinning rust" rather than NVMe backing store for swap. On such systems, an unbounded recursion could bring the machine to a crawl for all users, so the limit was a safety feature. These days, having no swap at all is a real possibility. Even when swapping is enabled, the limit is more about not having some unbounded recursion flush filesystem caches, forcing re-reads. Also, kernel code often has small stack limits for related but distinct reasons (non-swappability of the kernel stack, page fault handlers, etc.). Kernel developers tend to set default limits based on what they think is "good for you", based on their own very stack-conservative styles.
Anyway, the short of it is that if the stack is how you would prefer to manage your memory (for whatever reasons), there isn't anything intrinsically wrong with that related to "size". There's just a little deployment complexity, like maybe erroring out if admins have blocked stack limit increases and your setrlimit fails, which is atypical in my experience. (This is all Unix advice. I have little to no experience with how this plays out on Windows, but some more informed person can perhaps chime in.)