nimforum mirror - Immutable String Type Vs. Shallow/Freeze Procedure

Varriount [2015-11-11T20:19:17+01:00]

Araq and I have already talked some about this on IRC, however since this is a rather interesting problem with two different solutions, I thought it might be fun to discuss it here.

Currently Nim's built-in string type is copy-on-assignment. This means that whenever a string is assigned or moved, a copy is made of the entire string. This limitation is due to the fact that strings have a mutable length and are represented as a reference to a dynamically allocated block of memory. When the string needs to be resized, a new block of memory must be allocated, which may reside at a different point in memory. If multiple pointers to the string were allowed, they would all need to be updated, which is technically infeasible.
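As a small illustration of the copy-on-assignment behaviour described above, here is a minimal sketch (the variable names are made up for the example). It only shows the observable semantics, not how the copy is implemented:

```nim
# Strings have value semantics: assignment gives an independent copy.
var original = "hello"
var duplicate = original        # the character data is copied here
duplicate.add(" world")         # mutating the copy ...

doAssert original == "hello"    # ... leaves the original untouched
doAssert duplicate == "hello world"
```

The copy is what keeps the two variables independent, and it is also the cost this thread is about when the copy is never modified.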

The problem with this approach is that a large number of unintentional copies are made, even when the string may never be modified. To try and fix this situation, Araq and I have each come up with a solution.

Varriount [2015-11-11T20:19:47+01:00]

The solution I propose is to introduce a new "ImmutableString" type to the standard library, an immutable, never-nil sequence of characters.

The new string type would be implicitly convertible to a string (by copying itself into a regular string type) and vice-versa. The plan after the introduction would be to slowly shift the standard library towards using ImmutableStrings. After sufficient adoption, we could then make "ImmutableString" the default string type, relegating the old string type to something like "StringBuffer". Additionally, a conversion program could be produced to swap the types automatically.

This has the advantage of being type safe and adding strong compile-time guarantees about string behavior, at the cost of being complex (adding a new string type, retrofitting all the string functions).

Varriount [2015-11-11T20:20:00+01:00]

The solution proposed by Araq is to continue using shallowCopy/shallow, but to add run-time checking to each string-modification procedure to make sure that a frozen string isn't modified. As the logic for shallow already exists, this wouldn't cost as much code-wise. However, the additional checks for shallow sequences may slow down string modifications, and violations would only be detected at run time.
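To make the trade-off concrete, here is a rough sketch of what a run-time freeze check could look like. FreezableString, freeze and the ValueError are hypothetical names chosen for this example; this is not how shallow is implemented in the compiler or the standard library:

```nim
type
  FreezableString = object      # hypothetical wrapper, not a stdlib type
    data: string
    frozen: bool

proc freeze(s: var FreezableString) =
  ## Marks the string as read-only from now on.
  s.frozen = true

proc add(s: var FreezableString, suffix: string) =
  ## Every mutating proc has to pay for this check at run time.
  if s.frozen:
    raise newException(ValueError, "cannot modify a frozen string")
  s.data.add(suffix)

var s = FreezableString(data: "abc")
s.add("d")                      # allowed: not frozen yet
s.freeze()
try:
  s.add("e")                    # rejected, but only when this line runs
except ValueError:
  echo "modification rejected at run time"
doAssert s.data == "abcd"
```

The mistake is caught only on the code path that actually executes, which is the weakness compared with a separate immutable type that the compiler can check.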

Arrrrrrrrr [2015-11-11T20:59:58+01:00]

IMO immutable strings are easier for the user to understand, and you don't have to be annoyed with the shallow thing. And if you actually need to perform many modifications, a stringbuilder would come in handy.

cblake [2015-11-11T21:39:59+01:00]

Probably totally obvious, but it seems like the runtime checks for Araq's proposal could be/should be something disabled by -d:release, as with the many other correctness/safety checks. So performance impacts shouldn't be a huge issue.

Also, the proposals do not seem incompatible - both could be in flight at the same time for a long time (obviously depending on implementor interest/energy/time). Sussing out the popularity of ImmutableString by getting something in the stdlib and seeing how it integrates sounds like a wise first step.

OderWat [2015-11-11T22:29:35+01:00]

Related? Why again do a and c have different memory addresses?

let a = "string a"
var b = "string b"

let c = a
var d = b

echo a.repr
echo b.repr
echo c.repr
echo d.repr

jlp765 [2015-11-11T22:30:19+01:00]

Nim is not v1.0 yet.

I'm all in favour of the best solution going forward, even if changes need to be made that wreck backward compatibility, and especially if we do it before v1.0.

+1 for stronger compile time, faster running.

filcuc [2015-11-11T22:56:23+01:00]

Sorry if my post is not correct or if I haven't fully understood the problem, but I don't get the current behaviour.

Varriount: Currently Nim's built-in string type is copy-on-assignment. This means that whenever a string is assigned or moved, a copy is made of the entire string. This limitation is due to the fact that strings have a mutable length and are represented as a reference to a dynamically allocated block of memory. When the string needs to be resized, a new block of memory must be allocated, which may reside at a different point in memory. If multiple pointers to the string were allowed, they would all need to be updated, which is technically infeasible.

Isn't a simple COW (copy-on-write) mechanism sufficient to avoid all these copies without introducing a new datatype? I don't really see the need for a copy during assignment. Instead, if the string is resized, a new string should be allocated for the writer while the other references keep pointing to the old one.

filcuc [2015-11-11T22:57:22+01:00]

So taking the OderWat example


let a = "string a"
var b = "string b"

let c = a
var d = b

echo a.repr
echo b.repr
echo c.repr
echo d.repr

a and c should point to the same heap-allocated string (as should b and d)
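A rough sketch of the copy-on-write idea, for illustration only. Payload, CowString, owners and share are hypothetical names, and share stands in for what plain assignment would do if the built-in string worked this way; the real string type is not implemented like this:

```nim
type
  Payload = ref object
    chars: string
    owners: int                 # number of handles sharing this buffer
  CowString = object
    p: Payload

proc newCowString(s: string): CowString =
  CowString(p: Payload(chars: s, owners: 1))

proc share(s: CowString): CowString =
  ## Stand-in for assignment: hands out another handle, copies no characters.
  inc s.p.owners
  CowString(p: s.p)

proc add(s: var CowString, suffix: string) =
  ## Copies the buffer only if somebody else still shares it.
  if s.p.owners > 1:
    dec s.p.owners
    s.p = Payload(chars: s.p.chars, owners: 1)
  s.p.chars.add(suffix)

var a = newCowString("string a")
var c = a.share()
doAssert a.p == c.p             # same heap buffer, nothing copied yet
c.add("!")                      # first write triggers the copy
doAssert a.p != c.p
doAssert a.p.chars == "string a"
doAssert c.p.chars == "string a!"
```

The price is the ownership check on every write, plus the bookkeeping needed to keep the owner count correct.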

Araq [2015-11-11T23:18:39+01:00]

@filcuc: You're right, that is the third solution to solve the problem which Varriount forgot to mention (but which I kept talking about ;-) ). Bonus points: We can map Nim's string to C++'s STL string class with all the interoperability benefits this implies. (If only Qt and wxWidgets would use std::string ...)

jboy [2015-11-21T08:41:43+01:00]

Hi @Araq, thanks for taking a look at the proposal.

The proposal was intended to be a concrete starting point for discussion on how (if possible) to meet the preferences & feature requests that I listed at the start of my post.

I split it up into parts so that these parts could be considered & discussed independently (eg, "I agree with parts 7 & 9, but I disagree with the design in part 8 because ..."). It would be possible to implement some parts without implementing others. For example, strlit might be sufficient to satisfy most people who want an immutable string, without needing an istr too. (Also, I'm not at all married to any of these names; I just picked plausible names for reference purposes, and moved forward with them.)

Yes, the proposal changes some meanings of existing language syntax (obviously, I think those changes are for the better) and yes, it would cause some breakages.

The top breakage in my mind is code like var s = "hello", where s was previously mutable but now it's immutable, because the type has changed from string to strlit. I suggest that this could be fixed using a script that inserts @ before " in all typeless variable definitions. For example, a regex that changes "([^=!])=( *)\"" to "\\1=\\2@\"" (if my off-the-top-of-my-head regex syntax is correct).

Is there any benefit to the existing behaviour that @someString -> seq[char]? Does anyone ever make use of this? In the rare cases where such behaviour is needed, would a stdlib addition of the following pack & unpack not suffice?


proc pack(sc: seq[char]): string
proc unpack(s: string): seq[char]
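Only the two signatures come from the post. One possible way to fill in the bodies, as a sketch under that assumption:

```nim
proc pack(sc: seq[char]): string =
  ## Builds a string from a sequence of characters.
  result = newStringOfCap(sc.len)
  for ch in sc:
    result.add(ch)

proc unpack(s: string): seq[char] =
  ## Splits a string into a sequence of its characters.
  result = newSeqOfCap[char](s.len)
  for ch in s:
    result.add(ch)

doAssert pack(@['a', 'b', 'c']) == "abc"
doAssert unpack("abc") == @['a', 'b', 'c']
doAssert pack(unpack("round trip")) == "round trip"
```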

Is there any benefit at all to the existing behaviour that var s: string and var st: seq[T] are both default-initialized to invalid states by the language? To me, this is a language flaw.

Again, I ask sincerely whether anyone relies upon the result of $nonStringType being mutable in their code.

Are there any other significant breakages that I've missed?

Here is a brief summary of what I see as the intended benefits of this proposal:

seq[T] & string default-initialize to something that is always treated as valid.

seq[T] & string are specifically defined to behave almost-identically (if they don't already do so). (Justifiable differences would include a possible \0 on the end of a string for C-compatibility, and a possible Short String Optimization in a future string implementation.) This enables the programmer to learn & understand just one behaviour model that applies to both of these types. This makes it easier for a programmer to learn the language & use the language.

This is reinforced if @ becomes applicable to both seq[T] & string.

Also, the @ operator changes from "the operator I put in front of an array to make a seq" to "the dynamic-allocation, copy-on-write" operator.

The strlit type represents a first-class type for immutable string literals that don't ever need to be memory-allocated at runtime, and can be passed around & assigned without copying. Off the top of my head, these are useful for specifying tokens for equality-testing during parsing. (Also, strlit should be the type returned by $anEnum.)

The strlit type also meaningfully represents the string type that is stored in a const variable, in addition to InstantiationInfo.filename and a hypothetical someproc.__name__.

The proposed \0-termination of strlit (when targeting the C backend) is intended to ensure that strlit can actually be compiled to a const char [] when you're compiling to C...

The istr type is intended to satisfy a section of programmers (including myself) who for whatever reason prefer to express some programming solutions in terms of immutable strings (ie, expressiveness) that can be passed around & assigned without copying (ie, efficiency of CPU & memory).

The proposed design of the istr type implements some amount of Short String Optimization, and ensures that a default-initialized istr instance is a valid instance.

The istr type also enables some (limited) non-copying slicing.

The proposed \0-termination of istr (when targeting the C backend) is intended to make it as easy as possible to use C output functions upon the result of $someType. (I wrestled with this, believe me. I was initially hoping that the C output functions that accept a length instead of a \0 would be sufficient, but I ultimately came to believe that the functionality-coverage of these functions is insufficient.)

The proposed openStr type (which could also be openString) is purely by analogy with openArray (which I actually don't really like as a name), to make it easier to write code that can accept any sort of string for reading.

Forgive my lack of knowledge, but what results of these proposed changes would make the language like D or Java? D never impressed me enough for me to become familiar with it, and I haven't programmed in Java for 16 years (since I was a university undergrad).

I focussed on the feature implementations to demonstrate that what I was suggesting was actually feasible. I'm absolutely happy to go through some examples of what the code would end up looking like.

Araq [2015-11-21T09:39:53+01:00]

jboy: Again, I ask sincerely whether anyone relies upon the result of $nonStringType being mutable in their code.

I know that my code does rely on this.

jboy: Are there any other significant breakages that I've missed?

Most likely.

jboy: Is there any benefit to the existing behaviour that @someString -> seq[char]? Does anyone ever make use of this?

Not sure, but the question is weird. It's emergent behaviour of @ and Nim's typing rules.

jboy: Is there any benefit at all to the existing behaviour that var s: string and var st: seq[T] are both default-initialized to invalid states by the language? To me, this is a language flaw.

nil as a special default state for pointers/string/seq/procs is IMO a different topic and I'm actively working on a branch where not nil becomes the default. So please leave out nil in the discussion of how Nim's strings suck.

jboy: The strlit type represents a first-class type for immutable string literals that don't ever need to be memory-allocated at runtime, and can be passed around & assigned without copying.

I can do that with today's language. There is also cstring which already acts like your strlit as far as I can tell.

jboy [2015-11-21T10:06:08+01:00]

Araq: I know that my code does rely on this.

Interesting. Do you ever use $ in a situation where you couldn't use a regex to convert $ -> @$?

Araq: Most likely.

Please elaborate. I have no desire to trash Nim, so I would like to consider & address all the problems.

Here is something else that has occurred to me: It would be necessary to work out how strlit should interact with string{lit} and static[string].

Araq: Not sure, but the question is weird. It's emergent behaviour of @ and Nim's typing rules.

OK, thanks for the confirmation that it's "emergent behaviour". This means that you didn't intentionally design it for a purpose.

Could you change Nim's type rules so that string is not an openArray? Alternatively, redefine @ so that its parameter type is array[N, T] rather than openArray?

(Update: I just realised that proc `@`[IDX, T](a: array[IDX, T]): seq[T] is already different to proc `@`[T](a: openArray[T]): seq[T]. So why is the second one needed? For seq[T], string & varargs[T], I guess. Any other types?)

Araq: nil as a special default state for pointers/string/seq/procs is IMO a different topic and I'm actively working on a branch where not nil becomes the default. So please leave out nil in the discussion of how Nim's strings suck.

OK, I will.

Can I ask how the not nil solution will work? It is very useful to allow strings & seqs to be default-initialized, so it would suck if Nim required a specific initialization at the point of variable declaration.

(I just want the default initialization to be a completely-valid empty string or seq.)

Araq: I can do that with today's language.

How?

Araq: There is also cstring which already acts like your strlit as far as I can tell.

Yes, but cstring is specifically for CFFI-compatibility, which goes hand-in-hand with "impure second-class citizen" and "unsafe".

Also, correct me if I'm wrong, but doesn't a cstring also require runtime allocation when it is created? And isn't it also garbage collected? strlit is intended to be a pointer into static const memory...

Araq [2015-11-21T11:11:42+01:00]

jboy: How?

Parameter passing doesn't copy; var x = foo() doesn't copy but moves; let x = y doesn't copy but moves; var x = y does copy, but I can use shallowCopy instead of = for that. I can also call shallow on a string and then it's not copied ...

jboy: Please elaborate. I have no desire to trash Nim, so I would like to consider & address all the problems.

Well, I cannot elaborate because I cannot foresee what the changes you outline will break in the code that is out there. What I can say, however, is that there is always something missing that will only be detected in the real world, after we have implemented your proposals. I know for a fact that the short-string optimization that doesn't change semantics will break quite some code which relies on the low level representation of strings. That's ok, we are not 1.0 yet, but again, that is something which doesn't even change semantics. Your proposals happily change semantics, so it's completely up in the air for me what it would break. In other words: Nobody can consider and address all the problems, sorry. That doesn't mean your proposals are without merit, of course!

But please, focus on the design, not on the implementation of what the unions look like under the hood, and not on whether regexes can update the code automatically or not.

And when I say "focus on the design" the first step would be to evaluate systematically if there is actually a real problem. And sorry, "I like my strings to be immutable to avoid copies" (happily ignoring the fact that every CPU cache out there works by copying data...) is not an evaluation. ;-)

Araq [2015-11-21T11:39:06+01:00]

jboy: So why is the second one needed? For seq[T], string & varargs[T], I guess. Any other types?

It's for openArray. Surprising, huh?

proc takesSeq(s: seq[int]) = ...

proc takesOpenarray(a: openarray[int]) =
  takesSeq(@a)
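A compilable variant of the snippet above, as a sketch: the elided body is replaced by a made-up one that returns the length, purely so that the conversion can be checked.

```nim
proc takesSeq(s: seq[int]): int =
  ## Illustrative body only; the original post elides it.
  s.len

proc takesOpenArray(a: openArray[int]): int =
  # `@` turns the openArray view into a real seq that takesSeq accepts.
  takesSeq(@a)

doAssert takesOpenArray([1, 2, 3]) == 3    # called with an array
doAssert takesOpenArray(@[1, 2]) == 2      # called with a seq
```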

Varriount [2015-11-21T20:01:02+01:00]

For anyone who's interested, I've written up a (highly experimental!) immutable string data type here.

The data type roughly follows the semantics of Python's string data type, with the exception that it appears mutable (eg, the '&=' operator is supported).

Please note that slice and substring operations still make copies - I couldn't implement non-copying behavior and still maintain implementation simplicity.

andrea [2015-11-22T04:03:56+01:00]

Interesting, I will try it. By the way, is there any reason why you importc memcpy instead of using the one in system?

jboy [2015-11-22T05:03:29+01:00]

Hi @Araq: OK, I'm currently making a list of string use cases -> test cases for evaluation.

Hi @Varriount: Thanks for posting that. I'm looking at it now. Based upon a very quick first skim, I have one question & one comment:

Question: In type ImmutableStringDesc, there is an attribute reserved, that doesn't seem to be used for anything. What is/was the intention of this attribute?

Comment: In proc allocImmutableString(length: int): ImmutableString, there is a comment: "Allocate an extra byte for the null, so that cstring conversion is O(1)".

In general, I would not trust non-stdlib C or C++ functions to respect constness, so I would suggest that a cstring conversion should always require a string-buffer copy.

It's possible to cast away const in C & C++, so some programmer somewhere will do it. So if the cstring is not a fresh copy, someone can modify the shared string-buffer content without informing you in advance. This is a problem regardless of whether your string is immutable or copy-on-write. (It's not even sufficient to mark a copy-on-write string as "now shared with another string". That copy-on-write mechanism expects the other string to respect that flag & make a copy rather than modifying the shared copy; general C or C++ code won't do that.)

I would trust most C stdlib functions (other than those that specifically cast away constness to return a non-const char *, such as strchr or index). Thus, I would suggest it is safe to invoke a specific set of trustworthy C stdlib functions upon our valid-C-string internals within the Nim stdlib implementation, but I would not trust any arbitrary C or C++ function that non-Nim-stdlib code calls.

(I'll continue reading the code in more detail.)

Varriount [2015-11-22T09:52:07+01:00]

@jboy The shared string has the same structure as Nim's native string type. This allows me to simplify much of the implementation via casts, borrowing the behavior of the string type.

Mirror of forum.nim-lang.org

1793 :: Immutable String Type Vs. Shallow/Freeze Procedure