nimforum mirror - Why null terminated strings?

ReneSac [2013-02-15T17:49:05+01:00]

Why, oh, why null terminated strings? Well, I'm probably not the first one to ask this, but I couldn't find a discussion about it on the forums.

The original reason for the terminating null was to save memory, and the main disadvantage was not knowing the total size. Nimrod stores the terminating null as well as the size, so this actually wastes more memory than either approach, and still has the smaller disadvantage that it doesn't let one store nulls safely in the string. So, why the null?

Even the example code, crafted to show the usefulness of it, can be made shorter, more generic, safer and simpler when not using the terminating null. Compare:


if s[i] == 'a' and s[i+1] == 'b' and s[i+2] == '\0':
  # no need to check whether ``i < len(s)``!
  ...

With:


if i < len(s) and s[i] == 'a' and s[i+1] == 'b':
if s[-2:] == "ab":
if s.endswith("ab"):

A "sufficiently smart" compiler could optimize the last two cases.

And for most string manipulations, you would use a for loop. Compare this fragment from the standard library:


var i = 0
while true:
  if prefix[i] == '\0': return true
  if s[i] != prefix[i]: return false
  inc(i)

With:


for i in 0 .. len(prefix)-1:
  if s[i] != prefix[i]: return false
return true

I can't see where this Nimrod-style string is a good trade-off. Besides that, interesting language!

Araq [2013-02-15T19:06:04+01:00]

Interoperability with C requires a terminating zero anyway, and C interop is considered much more important than saving a single byte (which you often cannot save due to alignment). If Nimrod strings weren't zero terminated, you would have to append \0 all the time to pass them to C. Of course the compiler could do that for you implicitly, but then there is a hidden concat operation which somebody would complain about ... ("it is a systems programming language!")

For lexing/parsing the terminating zero is nice, and no, endswith etc. doesn't cut it. I should know, I wrote a couple of parsers ...

That said, you can store \0 in a string if you know which operations can handle that (add, == etc can!) and indeed the stdlib should use len more often to support that.
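A minimal sketch of the two points above, assuming a present-day Nim compiler with the C backend rather than 2013 Nimrod. `c_strlen` is only a local, illustrative name for C's `strlen`; nothing here is taken from the stdlib sources discussed in the thread.

```nim
# Sketch only: assumes current Nim and the C backend.
# `c_strlen` is an illustrative local name bound to C's strlen.
proc c_strlen(s: cstring): csize_t {.importc: "strlen", header: "<string.h>".}

let plain = "hello"
let withNul = "ab\0cd"

# The string's buffer already ends in \0, so it can be handed to C
# without first building a terminated copy.
echo c_strlen(cstring(plain))

# The length is tracked separately, so the embedded \0 is kept ...
echo withNul.len
# ... but C stops at the first \0 and only sees "ab".
echo c_strlen(cstring(withNul))
```

The last two lines show why an embedded \0 is only safe as long as the string stays within operations that use the stored length.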

ReneSac [2013-02-18T18:46:37+01:00]

Even considering alignment, on average you waste a byte because the extra null can be the one that makes a string overflow to the next alignment size. One can even argue that power of two sized strings are relatively more common, and thus you waste more than a byte on average. But I agree that this isn't the most important consideration.
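The averaging argument can be checked with a little arithmetic. The sketch below assumes, purely for illustration, that string payloads are rounded up to multiples of 8 bytes and ignores the length header; the real allocator's granularity may differ, and `roundUp` is a made-up helper.

```nim
# Sketch only: assumes 8-byte allocation granularity (an assumption, not a
# statement about Nimrod's actual allocator). `roundUp` is a hypothetical helper.
proc roundUp(n, align: int): int =
  (n + align - 1) div align * align

for n in [5, 7, 8, 15, 16]:
  # columns: payload length, bytes without terminator, bytes with terminator
  echo n, " ", roundUp(n, 8), " ", roundUp(n + 1, 8)
```

Under that assumption only lengths that are exact multiples of 8 grow, and they grow by a full 8 bytes; if lengths are spread evenly, that averages out to one byte per string.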

About interoperability with C, well, that is a fair point. I guess Nimrod programs have a much closer relationship with C ones than most other languages. This may be a good trade-off. In D one has to explicitly call std.string.toStringz (not the best name choice...) to convert D strings to C strings, and incur some performance hit, though one can simply pass a pointer and deal with the \0 problem oneself.

And you can't expect the programmer to memorize which operations he can or can't use with strings containing an inner \0, as they are kinda random now... unless it can be reduced to a very well defined and very short list. In any case, he also can't rely on it, as his code can break if he starts using a third party library, for example. So, I wouldn't call it safe.

And every string manipulation that you write, you must ensure that you handle the terminating null correctly. This likely isn't too hard, but it is extra overhead for the programmer.

The terminating null also prevents an effective implementation of D-style slices for strings.

I'm not familiar with writing parsers, but I don't think it is really more difficult to put your tokenizer code inside a loop that checks for the end of the string using the length. And in any case, for such a specialized operation, you are free to append 0 by yourself to make your algorithm simpler.
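The two scanning styles being compared here can be sketched side by side. This assumes present-day Nim, where reading the terminator is only possible through a `cstring`; `WordChars`, `scanWordSentinel` and `scanWordLen` are illustrative names, not code from any Nimrod parser.

```nim
# Sketch only: all names here are illustrative.
const WordChars = {'a'..'z', 'A'..'Z', '0'..'9', '_'}

# Sentinel style: relies on the buffer ending in \0. Since \0 is not in
# WordChars, the loop stops at the terminator without a length check.
proc scanWordSentinel(buf: cstring, start: int): int =
  result = start
  while buf[result] in WordChars: inc result

# Length style: checks the index against the stored length on every step,
# so it does not depend on a terminator being present.
proc scanWordLen(s: string, start: int): int =
  result = start
  while result < s.len and s[result] in WordChars: inc result

let src = "foo_1 bar"
echo scanWordSentinel(cstring(src), 0)  # index just past "foo_1"
echo scanWordLen(src, 6)                # index just past "bar"
```

Both procs return the index of the first character that is not part of the word; the only difference is where the end-of-input test lives.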

Araq [2013-02-18T21:10:09+01:00]

> In any case, he also can't rely on it, as his code can break if he starts using a third party library, for example. So, I wouldn't call it safe.

A third party library can also easily pass the string to some C library without you noticing it. And this is a problem for D too.

> And every string manipulation that you write, you must ensure that you handle the terminating null correctly. This likely isn't too hard, but it is extra overhead for the programmer.

There is no cognitive overhead whatsoever; len doesn't include the \0, for instance. Note that strings in Delphi also have a hidden terminating zero for C interoperability.

> The terminating null also prevents an effective implementation of D-style slices for strings.

Can't imagine how it would prevent that.

renoX [2013-02-18T23:39:55+01:00]

> > The terminating null also prevents an effective implementation of D-style slices for strings.
> Can't imagine how it would prevent that.

Well in D, strings are not 0-terminated, so a substring (slice of string) can be done "in place". This isn't possible with 0-terminated strings.

Araq [2013-02-19T00:37:14+01:00]

Of course, but you can easily pass a tuple[data: pointer, len: int] around in Nimrod, and no slice is compatible with Nimrod's native string data type anyway, because string is a nominal type (and, once again, the interior pointer is a bad idea for the GC) ...
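As a rough sketch of what such a pointer-plus-length pair could look like in present-day Nim: `StrView`, `sliceView` and `toString` are hypothetical names, not stdlib API and not the forum examples referred to later in the thread. The interior pointer is exactly the part the GC knows nothing about, so the view must not outlive or survive a resize of the source string.

```nim
# Sketch only: `StrView`, `sliceView` and `toString` are illustrative names.
type StrView = tuple[data: pointer, len: int]

# Build a view over s[first .. last] without copying any characters.
# The view is only valid while `s` is alive and not resized.
proc sliceView(s: string, first, last: int): StrView =
  (data: cast[pointer](unsafeAddr s[first]), len: last - first + 1)

# Copy the viewed bytes into a real, terminated string when one is needed,
# for example before passing it to C.
proc toString(v: StrView): string =
  result = newString(v.len)
  if v.len > 0:
    copyMem(addr result[0], v.data, v.len)

var text = "hello world"
let w = sliceView(text, 6, 10)
echo toString(w)
```

A view that ends in the middle of the source, such as `sliceView(text, 0, 4)`, is not followed by \0, which is why it has to be copied before it can be used as a C string.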

It's not like Nimrod is without its flaws (yet) but this "it differs from D, it must be bad" is getting tiresome already. If I want an unsound and yet complex const system or a language design that's incredibly hostile to a GC, I know where to find D.

leledumbo [2013-02-19T07:02:21+01:00]

At the Nimrod level, the terminating null is "invisible" to the programmer. This is also the case in Object Pascal, from which Nimrod inherits this design. The terminating null is mainly there for interfacing with C, and is otherwise used only in very rare cases. So the programmer has nothing to do with it. The standard string operations deal with it without the programmer needing to care.

renoX [2013-02-19T13:52:39+01:00]

My post was definitely not an "it differs from D, it must be bad" post (I knew about D before I learned about Nimrod, yet I still find Nimrod more interesting than D) but a reply to your point "can't imagine how it would prevent that": it wouldn't totally prevent it of course, but it makes it much more difficult.

This is a trade-off between C compatibility and in-place substrings: you can't have both at the same time.

Your point about GC-unfriendly interior pointers is interesting: I'm not sure that this is useful, but I think that "an interior pointer structure" is possible. It would contain two elements: a "base" pointer (for the GC) and an "interior" pointer for accessing the element.

Araq [2013-02-19T22:12:15+01:00]

> it wouldn't totally prevent it of course but it makes it much more difficult.

As I already said, the terminating \0 is irrelevant for an implementation of slices in Nimrod. (The forum contains examples of how to implement them in Nimrod.)

Mirror of forum.nim-lang.org
