In D, string literals don't allocate. (The same is true in C, but there they decay to pointers, so let's leave C out of this discussion.)
D:
// D20180718T163602
import std.stdio;
extern(C) void fun(const(char)* str){
  printf("std:{%s}\n", str);
}
void main(){
  string a = "foo";
  // a.ptr points into read-only memory, at a block of size 3+1 (the last byte is 0, so the literal converts to a C string without allocating)
  assert(a.length == 3);
  assert(a.ptr[3] == 0);
  fun(a.ptr);
  auto a2 = new char[](a.length);
  a2[]=a;
  assert(a2.ptr[3] == 0); // probably fails: a2 has no terminator and this reads past its end
  fun(a2.ptr); // undefined behavior, since it's not null terminated
}
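For comparison, here is a rough C++17 sketch of what a safe version of the a2 copy above could look like: allocate length+1 bytes and write the terminator explicitly. The names terminated_copy and c_length are hypothetical helpers made up for this post, not part of D, Nim, or any library.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string_view>
#include <vector>

// Hypothetical helper: copies `s` into a fresh buffer that is one byte larger
// than the text and ends in '\0', so buf.data() can be handed to a C API.
std::vector<char> terminated_copy(std::string_view s) {
    std::vector<char> buf(s.size() + 1, '\0');   // last byte stays 0
    std::copy(s.begin(), s.end(), buf.begin());
    return buf;
}

// Hypothetical helper: the length a C function would see (scan to first '\0').
std::size_t c_length(const std::vector<char>& buf) {
    return std::strlen(buf.data());
}
```

The cost is the one discussed in this thread: one extra byte and one allocation per conversion, paid only at the point where a C string is actually needed.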
Seems like Nim could use these 2 tricks as well:
Doesn't Nim make the same mistake as C++ by adding the zero termination in the string implementation?
The best solution in my opinion is to have an external "CString" implementation. That will be a penalty for libraries that use zero terminated strings. I'm willing to take that penalty for finally removing the historical zero termination mistake.
At first, C++ strings (std::string) were not required to be zero terminated internally, but C++11 effectively added that requirement because of the C-string accessor (method c_str()). That's the penalty they paid for not having an external C-string implementation. Because of this they then had to introduce half-measures like string_view.
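A small C++17 sketch of the c_str() versus string_view point, under the assumption that the slice lies inside a std::string that is still alive. The names byte_after_slice and owned_copy are made up for illustration.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: returns the byte that follows a slice of `owner`.
// Reading it is valid only because the slice lies inside `owner`, whose
// storage ends in '\0' since C++11. substr throws if start > owner.size().
char byte_after_slice(const std::string& owner, std::size_t start, std::size_t len) {
    std::string_view slice = std::string_view(owner).substr(start, len);
    return slice.data()[slice.size()];
}

// Hypothetical helper: a slice is not terminated in general, so turning it
// into a C string costs a copy; the returned std::string owns the '\0'.
std::string owned_copy(std::string_view slice) {
    return std::string(slice);
}
```

For "hello world", the slice covering "hello" is followed by a space, not by a zero, so passing its data() pointer to a C function would read too far. Only a slice that ends where the owner ends happens to be terminated.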
Fully sliceable strings and arrays are always better.
Ok, you're willing to pay the "penalty". I am not.
> Fully sliceable strings and arrays are always better.
The terminating zero doesn't preclude slicing (you can have a flag that indicates whether the terminator exists) but slicing has inherent ownership problems that the more convoluted (string, startIndex) solution lacks. They are certainly not "always better".
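To make the "flag" idea concrete, here is a minimal C++ sketch of a slice that remembers whether a terminator follows it. This is only an illustration under my own assumptions: Slice, from_literal and sub are hypothetical names and say nothing about how Nim or D actually lay out strings.

```cpp
#include <cstddef>

// Hypothetical slice type: pointer, length, and a flag recording whether
// ptr[len] is known to be '\0'.
struct Slice {
    const char* ptr;
    std::size_t len;
    bool terminated;
};

// A literal of N chars (counting its '\0') yields a terminated slice.
template <std::size_t N>
Slice from_literal(const char (&lit)[N]) {
    return Slice{lit, N - 1, true};
}

// Sub-slice [start, start + count), clamped to the parent's bounds.
// The flag survives only if the sub-slice ends where a terminated parent ends.
Slice sub(const Slice& s, std::size_t start, std::size_t count) {
    if (start > s.len) start = s.len;
    if (count > s.len - start) count = s.len - start;
    return Slice{s.ptr + start, count, s.terminated && start + count == s.len};
}
```

A caller that needs a C string would pass ptr directly when terminated is set and copy otherwise. Note that the sketch does nothing about the ownership problem: a Slice stays valid only as long as whatever it points into.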
In Nim string literals do not allocate either, but assignments copy.
I wasn't sure whether "assignments" also covered let a="abc", so I checked: after inserting logging inside copyString, it looks like let a="abc" does allocate (please correct me if I'm wrong):
let a="abc"
generates C code:
...
// this doesn't allocate yet
STRING_LITERAL(TM_4dVrjrGda9bN4zEljhxY9bVg_4, "abc", 3);
...
// but this does allocate; calls copyString => allocStrNoInit => newObjNoInit
a = copyString(((NimStringDesc*) &TM_4dVrjrGda9bN4zEljhxY9bVg_4));
Looking at the addresses via repr(a[0]) and comparing them to those of a GC-allocated string also confirms this (the addresses are similar, not from ROM).
for i in 0..n:
  let a="abc"
  # as you can also see by looking at addresses changing in each iter: echo repr(unsafeAddr a[0])
The new string/seq implementation will eliminate more copies.
Great, is there any kind of link/doc/design notes to that work?
Indeed, all strings are currently 0 terminated in Nim; in my idea of an ideal string type, null termination would only occur for string literals (not on slices). I'll write more on this later; it requires a full design writeup to make sure there are no holes.
E.g., the following will allocate 10 times (in D: 0 times):
Put it in a main proc. Allocates 0 times then. ;-)
> in my idea of an ideal string type, null termination would only occur for string literals (not on slices). I'll write more on this later, it requires a full design writeup to make sure there are no holes.
The null termination is useful for every path related proc as paths will be passed to the OS. Yes, Nim is not D. We make our own mistakes here, we don't copy D's.
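To illustrate the path case, a hedged C++17 sketch of what happens at the OS boundary when the path arrives as an unterminated slice: it has to be copied into terminated storage before the C runtime can use it. can_open is a hypothetical helper; std::fopen and std::fclose are the standard C library calls.

```cpp
#include <cstdio>
#include <string>
#include <string_view>

// Hypothetical helper: tries to open `path` for reading via the C runtime.
// The view may be a slice without a trailing '\0', so it is copied first.
bool can_open(std::string_view path) {
    std::string terminated(path);   // this copy supplies the trailing '\0'
    std::FILE* f = std::fopen(terminated.c_str(), "rb");
    if (f == nullptr) return false;
    std::fclose(f);
    return true;
}
```

With always-terminated strings the copy disappears for whole strings, which is the convenience argued for above. With slices it is needed either way.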
The copy doesn’t worry me if the behavior is consistent. But, there is something I don’t understand.
In this program
proc p() =
  let x = "abc"
there is a string copy (and an allocation) for "x".
But in this one
proc p() =
  var a = "abc"
  let x = a
there is no string copy for "x", which may cause problems if "a" is modified later. "x" is indeed an alias for "a" (with all the problems of aliases which Nim normally avoids).
There is no risk to directly assign the pointer in the first case. So why is there a questionable optimization in the second case and not in the first case where it would be safe?
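The aliasing worry can be shown with a C++17 analogy (not Nim semantics, just the same hazard in another language): a view that shares storage observes later writes to the original, while a copy does not. AliasDemo and mutate_after_binding are hypothetical names for this sketch.

```cpp
#include <string>
#include <string_view>

// What each binding observes after the original has been modified.
struct AliasDemo {
    std::string seen_by_alias;
    std::string seen_by_copy;
};

AliasDemo mutate_after_binding() {
    std::string a = "abc";
    std::string_view alias = a;   // shares a's storage, like an aliased x
    std::string copy = a;         // owns its own storage
    a[0] = 'X';                   // in-place write, no reallocation
    return {std::string(alias), copy};
}
```

The alias ends up seeing "Xbc" while the copy still holds "abc". An immutable literal can never be written through like this, which is why sharing it directly would be the safe case.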
> The terminating zero doesn't prevent slicing (you can have a flag that indicates whether the terminator exists) but slicing has inherent ownership problems that the more convoluted (string, startIndex) solution has not. They are certainly not "always better".
Is that how strings are implemented today, i.e. the zero is only appended if needed?