The string implementation in version 2.0 is still not so good that I'm eager to turn it into an ABI. Here is a design that might be better: A string under the hood is a (rawlen: int, data: pointer) pair. The 0th bit of rawlen is used to signal if the string is a "constant" or an "interned" string. The len operation is then computable as proc len(s: string): int = s.rawlen shr 1.
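To make the encoding concrete, here is a minimal sketch of the (rawlen, data) pair. The names RawStr, flagBit, mkRaw and isInterned are made up for illustration and are not part of any existing Nim API. The sketch also assumes that a set bit 0 means "constant or interned", which the description above does not spell out.

```nim
# Hypothetical layout; none of these names exist in the Nim standard library.
type
  RawStr = object
    rawlen: int                      # (length shl 1) or flag bit
    data: ptr UncheckedArray[char]   # not owned in this sketch

const flagBit = 1  # assumption: bit 0 set means "constant"/"interned"

proc mkRaw(p: ptr UncheckedArray[char]; n: int; interned: bool): RawStr =
  ## Packs the length and the flag into a single int.
  RawStr(rawlen: (n shl 1) or (if interned: flagBit else: 0), data: p)

proc len(s: RawStr): int = s.rawlen shr 1
proc isInterned(s: RawStr): bool = (s.rawlen and flagBit) != 0

var buf = ['h', 'e', 'l', 'l', 'o']
let a = mkRaw(cast[ptr UncheckedArray[char]](addr buf[0]), buf.len, interned = true)
let b = mkRaw(cast[ptr UncheckedArray[char]](addr buf[0]), buf.len, interned = false)
echo a.len, " ", a.isInterned
echo b.len, " ", b.isInterned
```

The cost of the flag is one shift per len call and half of the representable length range.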
Interned strings use bitwise copies and their destructor does nothing. Slicing can be done in O(1) by making data point into the interior of the underlying buffer. To turn a string into an "interned" string call the new proc called proc markIntern(s: var string): bool. To be able to free the memory call unsafeUnmarkIntern. This is an inherently unsafe operation as we don't keep track of the number of alive slices into the string.
However, there is also a safe variation:
template withIntern(s: var string; body: untyped) =
  let needsUnmark = markIntern(s)
  try:
    analyse(body)
  finally:
    if needsUnmark: unsafeUnmarkIntern(s)
Usage example:
var s = mmap("input.csv")
withIntern(s):
  var sum = 0.0
  for x in splitLines(s):
    for val in split(x, ';'):
      sum += parseFloat(val)
  echo sum
The body must be analysed to ensure that it doesn't touch globals, only uses .noSideEffect procs, etc. The analysis is pretty similar to what we need to check for "isolation".
Feedback appreciated.
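The O(1) slicing mentioned earlier (making data point into the interior of the underlying buffer) could look roughly like this. View, slice and toString are hypothetical names used only for this sketch. The view does not own its memory, so it is only valid while the backing string is alive and unchanged, which is exactly the hazard that makes unsafeUnmarkIntern unsafe.

```nim
# Hypothetical non-owning view; a stand-in for an "interned" string.
type
  View = object
    size: int
    data: ptr UncheckedArray[char]

proc slice(v: View; first, last: int): View =
  ## O(1): no bytes are copied, only the pointer and the length change.
  assert first >= 0 and last < v.size and first <= last + 1
  View(size: last - first + 1,
       data: cast[ptr UncheckedArray[char]](cast[uint](v.data) + uint(first)))

proc toString(v: View): string =
  ## O(n) copy, only used here to print the result.
  result = newString(v.size)
  for i in 0 ..< v.size: result[i] = v.data[i]

var backing = "key=value"
let whole = View(size: backing.len,
                 data: cast[ptr UncheckedArray[char]](addr backing[0]))
let value = whole.slice(4, 8)
echo value.toString
```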
Add a mandatory 0 terminator to the buffer and you suddenly get an easy compatibility with all C-shaped things (just pass the pointer to the buffer, it’s already const char*)
(No, there shouldn’t be restrictions on how many other 0 bytes are there, just that there is always one at the end. Whoever cares about integrating with C needs to make sure there are no null bytes in the middle, and Nim may just not care.)
Nim strings already are zero terminated
Damn! You cannot slice in O(1) with zero termination. :-)
As for the zero terminator preventing O(1) slicing: I would argue that converting to/from a C string is a less common operation than slicing, so just copy to a new string when you need a C string.
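A rough sketch of that "copy only at the C boundary" idea: the slice itself is just a pointer plus a length without a terminator, and a zero-terminated copy is made only when a C function needs one. toTerminatedCopy is a hypothetical helper. The sketch relies on today's Nim strings being zero-terminated (as noted above), and strlen is imported from the C standard library purely to check the result.

```nim
# C's strlen, imported only to demonstrate that the copy is C-compatible.
proc strlen(s: cstring): csize_t {.importc, header: "<string.h>".}

proc toTerminatedCopy(data: ptr UncheckedArray[char]; n: int): string =
  ## Hypothetical helper: copies n chars into a fresh Nim string,
  ## which carries its own trailing zero byte.
  result = newString(n)
  if n > 0: copyMem(addr result[0], data, n)

var line = "12.5;7.25"
let base = cast[ptr UncheckedArray[char]](addr line[0])
# An O(1) "slice" of the second field: interior pointer plus length, no terminator.
let fieldPtr = cast[ptr UncheckedArray[char]](cast[uint](base) + 5)
let field = toTerminatedCopy(fieldPtr, 4)
echo field, " ", int(strlen(field.cstring))
```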
On the ABI design: where is the capacity stored (is there a capacity at all)? An alternative design is to represent the string as a single pointer (char*). You then over-allocate the buffer to make room for the metadata fields in front of the characters, and the pointer you hand out is offset so that it points at the first character of the string.
The stb headers use this trick for stb_array, i.e. you can do the following:
float* my_array = NULL;
stb_arr_push(my_array, 0.0f);
stb_arr_push(my_array, 1.0f);
stb_arr_push(my_array, 2.0f);
for (int i = 0; i < stb_arr_len(my_array); ++i) {
    printf("%f\n", my_array[i]); // NOTE: you can index like a regular array in C
}
For this case, you would store (len, capacity, is_slice) (is_slice in len if preferred). Of course, since this is at the language level, this is probably not necessary since we have operator overloading.
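For illustration, the same header-in-front-of-the-pointer layout can be sketched in Nim. Header, newBuf, header, push and freeBuf are hypothetical names, and the field set simply mirrors the (len, capacity, is_slice) triple above. This is not how Nim strings are implemented.

```nim
# Hypothetical metadata block stored directly in front of the characters.
type
  Header = object
    len, cap: int
    isSlice: bool

proc header(p: ptr UncheckedArray[char]): ptr Header =
  ## Steps back from the character pointer to the metadata.
  cast[ptr Header](cast[uint](p) - uint(sizeof(Header)))

proc newBuf(cap: int): ptr UncheckedArray[char] =
  ## Over-allocates: header + cap chars + one zero byte (alloc0 zeroes everything).
  let base = alloc0(sizeof(Header) + cap + 1)
  let h = cast[ptr Header](base)
  h.cap = cap
  cast[ptr UncheckedArray[char]](cast[uint](base) + uint(sizeof(Header)))

proc push(p: ptr UncheckedArray[char]; c: char) =
  let h = header(p)
  assert h.len < h.cap   # no growth in this sketch
  p[h.len] = c
  inc h.len

proc freeBuf(p: ptr UncheckedArray[char]) =
  dealloc(header(p))     # free the original allocation, not the offset pointer

let s = newBuf(8)
s.push 'h'
s.push 'i'
echo header(s).len, " ", s[0], s[1]
freeBuf(s)
```

The handle is a single pointer that can be indexed like a plain char buffer, at the price of a pointer subtraction for every metadata access.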
One other consideration: if generalizing to slices, you could store the stride as well - I imagine this would be a handy feature.
Why not enable mutations to the substring?
Because you cannot mutate readonly memory.
Can we generalize this to a slice for any container? Is the intention behind this design to generalize to other containers, such as seq?
Yes for seq but for other containers ... who knows.
It would be better to a) make strings immutable by default, b) allow creating mutable strings with mut, and c) somehow have Nim use something like interfaces, so that all procs defined on string would also work on mutablestring.
let s = "some"
let ms = mut"some"
proc size(s: string): int = s.len
echo s.size # => 4
echo ms.size # => 4