nimforum mirror - Ideas about strings

Araq [2023-11-30T10:13:47+01:00]

The string implementation in version 2.0 is still not so good that I'm eager to turn it into an ABI. Here is a design that might be better: A string under the hood is a (rawlen: int, data: pointer) pair. The 0th bit of rawlen is used to signal if the string is a "constant" or an "interned" string. The len operation is then computable as proc len(s: string): int = s.rawlen shr 1.
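A minimal sketch of the pair described above, assuming the flag lives in bit 0 of rawlen and the length in the remaining bits. The names StrView, initStrView and isInterned are hypothetical and exist only for illustration; this is not the actual runtime layout.

```nim
# Hypothetical stand-in for the proposed layout; named StrView so that it
# does not clash with the built-in string type.
type
  StrView = object
    rawlen: int     # bit 0: constant/interned flag, remaining bits: length
    data: pointer   # start of the character data

proc initStrView(data: pointer; n: int; interned: bool): StrView =
  # Pack the length into the upper bits and the flag into bit 0.
  StrView(rawlen: (n shl 1) or ord(interned), data: data)

proc len(s: StrView): int = s.rawlen shr 1

proc isInterned(s: StrView): bool = (s.rawlen and 1) != 0

var buf = "hello"
let v = initStrView(cast[pointer](addr buf[0]), buf.len, interned = true)
echo v.len         # 5
echo v.isInterned  # true
```

Reading the length costs one shift, and checking the flag costs one mask.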

Interned strings use bitwise copies and their destructor does nothing. Slicing can be done in O(1) by making data point into the interior of the underlying buffer. To turn a string into an "interned" string, call the new proc markIntern(s: var string): bool. To be able to free the memory again, call unsafeUnmarkIntern. This is an inherently unsafe operation, as we don't keep track of the number of alive slices into the string.
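To illustrate the O(1) slicing claim, here is a rough sketch under the same assumed layout. StrView and slice are hypothetical names, and the `$` proc copies bytes only so that the result can be printed.

```nim
type
  StrView = object   # hypothetical stand-in for the proposed string layout
    rawlen: int      # bit 0: interned flag, remaining bits: length
    data: pointer

proc len(s: StrView): int = s.rawlen shr 1

proc slice(s: StrView; first, last: int): StrView =
  ## O(1): no bytes are copied, the result points into the buffer of s.
  assert 0 <= first and first <= last + 1 and last < s.len
  let n = last - first + 1
  StrView(rawlen: (n shl 1) or 1,   # the slice is marked as interned
          data: cast[pointer](cast[uint](s.data) + uint(first)))

proc `$`(s: StrView): string =
  # Copies the bytes out; used only for printing in this sketch.
  result = newString(s.len)
  if s.len > 0:
    copyMem(addr result[0], s.data, s.len)

var buf = "key;value"
let whole = StrView(rawlen: (buf.len shl 1) or 1,
                    data: cast[pointer](addr buf[0]))
let part = whole.slice(4, 8)
echo part   # value
```

Nothing here keeps buf alive for as long as part is in use, which is the lifetime problem that makes unsafeUnmarkIntern unsafe.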

However, there is also a safe variation:

template withIntern(s: var string; body: untyped) =
  let needsUnmark = markIntern(s)
  try:
    analyse(body)
  finally:
    if needsUnmark: unsafeUnmarkIntern(s)

Usage example:

var s = mmap("input.csv")
withIntern(s):
  var sum = 0.0
  for x in splitLines(s):
    for val in split(x, ';'):
      sum += parseFloat(val)
  echo sum

The body must be analysed to ensure that it doesn't touch globals, only uses .noSideEffect procs, etc. The analysis is pretty similar to what we need in order to check for "isolation".

Feedback appreciated.

SmutnyNosacz [2023-11-30T13:57:41+01:00]

Add a mandatory 0 terminator to the buffer and you suddenly get easy compatibility with all C-shaped things (just pass the pointer to the buffer, it’s already a const char*).

(No, there shouldn’t be restrictions on how many other 0 bytes are there, just that there is always one at the end. Whoever cares about integrating with C needs to make sure there are no null bytes in the middle, and Nim may just not care.)

enthus1ast [2023-11-30T14:24:40+01:00]

Nim strings are already zero-terminated.

Araq [2023-11-30T14:44:42+01:00]

> Nim strings already are zero terminated

Damn! You cannot slice in O(1) with zero termination. :-)
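A small sketch of why the two properties conflict: a slice that only points into a larger buffer is followed by the next character of that buffer, not by a terminator. The variable names are made up for illustration, and the example relies on Nim strings being zero-terminated, as stated above.

```nim
var buf = "key;value"
let sliceStart = 0
let sliceLen = 3   # the slice "key"

# Hand the start of the slice to something that expects a C string.
let asC = cast[cstring](addr buf[sliceStart])
echo asC.len                              # 9, not 3: scanning runs to the end of buf
echo buf[sliceStart + sliceLen] == '\0'   # false, the next byte is ';'

# A copy gets its own terminator, at O(n) cost.
let copy = buf[sliceStart ..< sliceStart + sliceLen]
echo cstring(copy).len                    # 3
```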

termer [2023-11-30T19:11:31+01:00]

Shouldn't rawlen be a hint, especially since you're using the first bit as a flag? The length will never be less than 0, so you're just throwing away size for no reason otherwise.

Araq [2023-11-30T19:19:55+01:00]

I suppose you mean uint, and the answer is: probably, but why care about a detail like this at this point?

morturo [2023-11-30T19:41:03+01:00]

Good enough for me! I hope we can have more control over lifetimes in the future.

termer [2023-11-30T19:54:31+01:00]

Yeah, I meant uint. I care because a signed length would cut the maximum capacity of the string in half for no reason, and that's significant when designing an ABI that will potentially affect the language forever.

Araq [2023-11-30T23:21:20+01:00]

Well it's a machine word in either case so an ABI doesn't care.
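Both positions can be checked with a few lines. This sketch assumes only that one bit of the word is reserved for the flag; the variable names are made up for illustration.

```nim
# int and uint each occupy one machine word, so the layout is the same.
doAssert sizeof(int) == sizeof(uint)
doAssert sizeof(int) == sizeof(pointer)

# What differs is the largest length left after reserving bit 0 for the flag.
let maxLenSigned = high(int) shr 1
let maxLenUnsigned = high(uint) shr 1
echo maxLenSigned
echo maxLenUnsigned

# The unsigned variant can represent one more bit of length.
doAssert maxLenUnsigned == uint(maxLenSigned) * 2 + 1
```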

mig [2023-12-01T02:04:04+01:00]

Questions:

Why not enable mutations to the substring?

Can we generalize this to a slice for any container? Is the intention behind this design to generalize to other containers, such as seq?

As for zero termination preventing O(1) slicing: I would argue that conversion to/from a C string is a less common operation than slicing, i.e. just copy to a new string if you need to convert to a C string.

On the ABI design: where is the capacity stored (is there a capacity at all)? An alternative design is to use a single pointer (char*) to represent the string. You can then over-allocate memory to hold additional metadata fields in front of the character data. The pointer you return has to be offset so that it points at the start of the string buffer.

The stb headers use this trick for stb_array, i.e. you can do the following:

float* my_array = NULL;
stb_arr_push(my_array, 0.0f);
stb_arr_push(my_array, 1.0f);
stb_arr_push(my_array, 2.0f);

for (int i = 0; i < stb_arr_len(my_array); ++i) {
    printf("%f\n", my_array[i]); // NOTE: you can index like a regular array in C
}

For this case, you would store (len, capacity, is_slice) (is_slice in len if preferred). Of course, since this is at the language level, this is probably not necessary since we have operator overloading.
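A rough Nim sketch of this over-allocation trick, assuming a header that holds only a length and a capacity. Header, newBuf, header, add and free are hypothetical names, and the buffer does not grow in this sketch.

```nim
type
  Header = object   # hypothetical metadata stored in front of the characters
    len: int
    cap: int

proc newBuf(cap: int): ptr UncheckedArray[char] =
  # One allocation: the header, then cap bytes, then a zero terminator.
  let base = cast[ptr Header](alloc0(sizeof(Header) + cap + 1))
  base.len = 0
  base.cap = cap
  cast[ptr UncheckedArray[char]](cast[uint](base) + uint(sizeof(Header)))

proc header(p: ptr UncheckedArray[char]): ptr Header =
  # Step back from the data pointer to reach the metadata.
  cast[ptr Header](cast[uint](p) - uint(sizeof(Header)))

proc add(p: ptr UncheckedArray[char]; c: char) =
  let h = p.header
  assert h.len < h.cap   # no growth in this sketch
  p[h.len] = c
  inc h.len

proc free(p: ptr UncheckedArray[char]) =
  # Release the whole allocation, starting at the header.
  dealloc(p.header)

let s = newBuf(8)
s.add 'h'
s.add 'i'
echo s.header.len       # 2
echo cast[cstring](s)   # hi
s.free()
```

The value handed around is a single pointer that C code can read directly, at the price of pointer arithmetic on every metadata access.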

One other consideration: if generalizing to slices, you could store the stride as well - I imagine this would be a handy feature.

Araq [2023-12-01T08:26:47+01:00]

> Why not enable mutations to the substring?

Because you cannot mutate readonly memory.

> Can we generalize this to a slice for any container? Is the intention behind this design to generalize to other containers, such as seq?

Yes for seq but for other containers ... who knows.

alexeypetrushin [2023-12-01T13:06:09+01:00]

It would be better to a) make strings immutable by default, b) allow creating mutable strings with mut, and c) somehow make Nim use something like interfaces, so that all procs defined on string would also work on mutablestring.

let s = "some"
let ms = mut"some"

proc size(s: string): int = s.len

echo s.size # => 4
echo ms.size # => 4

Araq [2023-12-01T16:06:14+01:00]

That would accomplish nothing.

Mirror of forum.nim-lang.org

10706 :: Ideas about strings