Araq and I have already talked some about this on IRC, however since this is a rather interesting problem with two different solutions, I thought it might be fun to discuss it here.
Currently Nim's built-in string type is copy-on-assignment. This means that whenever a string is assigned or moved, a copy is made of the entire string. This limitation is due to the fact that strings have a mutable length and are represented as a reference to a dynamically allocated block of memory. When the string needs to be resized, a new block of memory must be allocated, which may reside at a different point in memory. If multiple pointers to the string were allowed, they would all need to be updated, which is technically infeasible.
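As a minimal sketch of the copy-on-assignment behaviour described above (the variable names are only illustrative), assigning one string variable to another gives an independent copy, so changing the copy leaves the original untouched:

```nim
var original = "hello"
var duplicate = original     # assignment copies the whole string
duplicate.add(" world")      # only the copy grows (and may be reallocated)

echo original    # hello
echo duplicate   # hello world
```

The copy is made at the assignment itself, whether or not the new variable is ever modified afterwards.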
The problem with this approach is that a large number of unintentional copies are made, even when the string may never be modified. To try and fix this situation, Araq and I have each come up with a solution.
The solution I propose is to introduce a new "ImmutableString" type to the standard library, an immutable, never-nil sequence of characters.
The new string type would be implicitly convertible to a string (by copying itself into a regular string type) and vice-versa. The plan after the introduction would be to slowly shift the standard library towards using ImmutableStrings. After sufficient adoption, we could then make "ImmutableString" the default string type, relegating the old string type to something like "StringBuffer". Additionally, a conversion program could be produced to swap the types automatically.
This has the advantage of being type safe and adding strong compile-time guarantees about string behavior, at the cost of being complex (adding a new string type, retrofitting all the string functions).
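To make the idea a little more concrete, here is a rough sketch of how the implicit conversions might look. Everything in it is an assumption for illustration: the `data` field, the two converters, and the `shout` and `byteLen` procs are hypothetical, and a real design would hide the backing store rather than expose it.

```nim
type
  ImmutableString = object
    data: string   # hypothetical backing store, exposed here only for brevity

# Hypothetical converters modelling the implicit (copying) conversions.
converter toImmutable(s: string): ImmutableString =
  ImmutableString(data: s)

converter toMutable(s: ImmutableString): string =
  s.data

proc shout(s: string): string =
  ## Takes a regular (mutable) string.
  s & "!"

proc byteLen(s: ImmutableString): int =
  ## Takes the immutable type.
  s.data.len

let im = ImmutableString(data: "hello")
echo shout(im)        # ImmutableString -> string via toMutable
echo byteLen("abc")   # string -> ImmutableString via toImmutable
```

With converters in both directions, existing procs that take `string` keep working while the library is migrated, at the price of a copy at each conversion.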
Probably totally obvious, but it seems like the runtime checks for Araq's proposal could (and probably should) be disabled by -d:release, as with the many other correctness/safety checks. So performance impacts shouldn't be a huge issue.
Also, the proposals do not seem incompatible - both could be in flight at the same time for a long time (obviously depending on implementor interest/energy/time). Sussing out the popularity of ImmutableString by getting something in the stdlib and seeing how it integrates sounds like a wise first step.
Related? Why again do a and c have different memory addresses?
let a = "string a"
var b = "string b"
let c = a
var d = b
echo a.repr
echo b.repr
echo c.repr
echo d.repr
Nim is not v1.0 yet.
I'm all in favour of the best solution going forward, even if changes need to be made that wreck backward compatibility, and especially if we do it before v1.0.
+1 for stronger compile time, faster running.
Sorry if my post is not correct or if I haven't fully understood the problem, but I don't get the current behaviour.
Varriount: Currently Nim's built-in string type is copy-on-assignment. This means that whenever a string is assigned or moved, a copy is made of the entire string. This limitation is due to the fact that strings have a mutable length and are represented as a reference to a dynamically allocated block of memory. When the string needs to be resized, a new block of memory must be allocated, which may reside at a different point in memory. If multiple pointers to the string were allowed, they would all need to be updated, which is technically infeasible.
Isn't a simple COW (copy-on-write) mechanism sufficient for avoiding all these copies without introducing a new datatype? I don't really see the need for a copy during assignment. Instead, if the string is resized, then a new string should be allocated and the other variables should keep pointing to the old one.
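A copy-on-write scheme along those lines could be sketched roughly as follows. This is not how Nim's strings work; `CowBuffer`, `CowString`, `share` and the manual `owners` counter are all hypothetical names invented for the illustration, and `share` stands in for what plain assignment would do.

```nim
type
  CowBuffer = ref object
    data: string
    owners: int        # hypothetical manual count of handles sharing this buffer

  CowString = object
    buf: CowBuffer

proc newCowString(s: string): CowString =
  CowString(buf: CowBuffer(data: s, owners: 1))

proc share(s: CowString): CowString =
  ## Stands in for assignment: the new handle points at the same buffer.
  inc s.buf.owners
  CowString(buf: s.buf)

proc add(s: var CowString, suffix: string) =
  ## Copy the buffer only if another handle still shares it.
  if s.buf.owners > 1:
    dec s.buf.owners
    s.buf = CowBuffer(data: s.buf.data, owners: 1)
  s.buf.data.add(suffix)

var a = newCowString("string a")
var c = share(a)       # no copy yet
echo a.buf == c.buf    # true: both handles use the same heap buffer
c.add("!")             # the first write triggers the copy
echo a.buf == c.buf    # false
echo a.buf.data        # string a
echo c.buf.data        # string a!
```

The cost moves from every assignment to the first write after sharing, and every mutating operation has to check the counter first.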
So taking the OderWat example
let a = "string a"
var b = "string b"
let c = a
var d = b
echo a.repr
echo b.repr
echo c.repr
echo d.repr
a and c should point to the same heap-allocated string (as should b and d).
Hi @Araq, thanks for taking a look at the proposal.
The proposal was intended to be a concrete starting point for discussion on how (if possible) to meet the preferences & feature requests that I listed at the start of my post.
I split it up into parts so that these parts could be considered & discussed independently (eg, "I agree with parts 7 & 9, but I disagree with the design in part 8 because ..."). It would be possible to implement some parts without implementing others. For example, strlit might be sufficient to satisfy most people who want an immutable string, without needing an istr too. (Also, I'm not at all married to any of these names; I just picked plausible names for reference purposes, and moved forward with them.)
Yes, the proposal changes some meanings of existing language syntax (obviously, I think those changes are for the better) and yes, it would cause some breakages.
proc pack(sc: seq[char]): string
proc unpack(s: string): seq[char]
Here is a brief summary of what I see as the intended benefits of this proposal:
Forgive my lack of knowledge, but what results of these proposed changes would make the language like D or Java? D never impressed me enough for me to become familiar with it, and I haven't programmed in Java for 16 years (since I was a university undergrad).
I focussed on the feature implementations to demonstrate that what I was suggesting was actually feasible. I'm absolutely happy to go through some examples of what the code would end up looking like.
Again, I ask sincerely whether anyone relies upon the result of $nonStringType being mutable in their code.
I know that my code does rely on this.
Are there any other significant breakages that I've missed?
Most likely.
Is there any benefit to the existing behaviour that @someString -> seq[char]? Does anyone ever make use of this?
Not sure, but the question is weird. It's emergent behaviour of @ and Nim's typing rules.
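For readers unfamiliar with the behaviour being discussed, a small sketch: since a string is accepted where an openArray[char] is expected, applying @ to a string yields a seq[char] containing a copy of its characters.

```nim
let chars = @"abc"                      # seq[char], copied from the string
doAssert chars == @['a', 'b', 'c']
echo chars.len                          # 3
```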
Is there any benefit at all to the existing behaviour that var s: string and var st: seq[T] are both default-initialized to invalid states by the language? To me, this is a language flaw.
nil as a special default state for pointers/string/seq/procs is IMO a different topic and I'm actively working on a branch where not nil becomes the default. So please leave out nil in the discussion of how Nim's strings suck.
The strlit type represents a first-class type for immutable string literals that don't ever need to be memory-allocated at runtime, and can be passed around & assigned without copying.
I can do that with today's language. There is also cstring which already acts like your strlit as far as I can tell.
I know that my code does rely on this.
Interesting. Do you ever use $ in a situation where you couldn't use a regex to convert $ -> @$?
Most likely.
Please elaborate. I have no desire to trash Nim, so I would like to consider & address all the problems.
Here is something else that has occurred to me: It would be necessary to work out how strlit should interact with string{lit} and static[string].
Not sure, but the question is weird. It's emergent behaviour of @ and Nim's typing rules.
OK, thanks for the confirmation that it's "emergent behaviour". This means that you didn't intentionally design it for a purpose.
Could you change Nim's type rules so that string is not an openArray? Alternatively, redefine @ so that its parameter type is array[N, T] rather than openArray?
(Update: I just realised that proc `@`[IDX, T](a: array[IDX, T]): seq[T] is already different to proc `@`[T](a: openArray[T]): seq[T]. So why is the second one needed? For seq[T], string & varargs[T], I guess. Any other types?)
nil as a special default state for pointers/string/seq/procs is IMO a different topic and I'm actively working on a branch where not nil becomes the default. So please leave out nil in the discussion of how Nim's strings suck.
OK, I will.
Can I ask how the not nil solution will work? It is very useful to allow strings & seqs to be default-initialized, so it would suck if Nim required a specific initialization at the point of variable declaration.
(I just want the default initialization to be a completely-valid empty string or seq.)
I can do that with today's language.
How?
There is also cstring which already acts like your strlit as far as I can tell.
Yes, but cstring is specifically for CFFI-compatibility, which goes hand-in-hand with "impure second-class citizen" and "unsafe".
Also, correct me if I'm wrong, but doesn't a cstring also require runtime allocation when it is created? And isn't it also garbage collected? strlit is intended to be a pointer into static const memory...
How?
Parameter passing doesn't copy; var x = foo() doesn't copy but moves; let x = y doesn't copy but moves; var x = y does copy, but I can use shallowCopy instead of = for that. I can also call shallow on a string and then it's not copied ...
Please elaborate. I have no desire to trash Nim, so I would like to consider & address all the problems.
Well, I cannot elaborate because I cannot foresee what the changes you outline will break out there. What I can say, however, is that there is always something missing that will only be detected in the real world, after we have implemented your proposals. I know for a fact that the short-string optimization, which doesn't change semantics, will break quite some code that relies on the low-level representation of strings. That's ok, we are not 1.0 yet, but again, that is something which doesn't even change semantics. Your proposals happily change semantics, so it's completely up in the air for me what they would break. In other words: nobody can consider and address all the problems, sorry. That doesn't mean your proposals are without merit, of course!
But please, focus on the design, not on the implementation of what the unions look like under the hood, and not on whether regexes can update the code automatically or not.
And when I say "focus on the design" the first step would be to evaluate systematically if there is actually a real problem. And sorry, "I like my strings to be immutable to avoid copies" (happily ignoring the fact that every CPU cache out there works by copying data...) is not an evaluation. ;-)
So why is the second one needed? For seq[T], string & varargs[T], I guess. Any other types?
It's for openArray. Surprising, huh?
proc takesSeq(s: seq[int]) =
  discard  # body elided in the original post

proc takesOpenarray(a: openArray[int]) =
  # @ copies the openArray's elements into a new seq
  takesSeq(@a)
For anyone who's interested, I've written up a (highly experimental!) immutable string data type here.
The data type roughly follows the semantics of Python's string data type, with the exception that it appears mutable (eg, the '&=' operator is supported).
Please note that slice and substring operations still make copies - I couldn't implement non-copying behavior and still maintain implementation simplicity.
Hi @Araq: OK, I'm currently making a list of string use cases -> test cases for evaluation.
Hi @Varriount: Thanks for posting that. I'm looking at it now. Based upon a very quick first skim, I have one question & one comment:
Question: In type ImmutableStringDesc, there is an attribute reserved, that doesn't seem to be used for anything. What is/was the intention of this attribute?
Comment: In proc allocImmutableString(length: int): ImmutableString, there is a comment: "Allocate an extra byte for the null, so that cstring conversion is O(1)".
In general, I would not trust non-stdlib C or C++ functions to respect constness, so I would suggest that a cstring conversion should always require a string-buffer copy.
It's possible to cast away const in C & C++, so some programmer somewhere will do it. So if the cstring is not a fresh copy, someone can modify the shared string-buffer content without informing you in advance. This is a problem regardless of whether your string is immutable or copy-on-write. (It's not even sufficient to mark a copy-on-write string as "now shared with another string". That copy-on-write mechanism expects the other string to respect that flag & make a copy rather than modifying the shared copy; general C or C++ code won't do that.)
I would trust most C stdlib functions (other than those that specifically cast away constness to return a non-const char *, such as strchr or index). Thus, I would suggest it is safe to invoke a specific set of trustworthy C stdlib functions upon our valid-C-string internals within the Nim stdlib implementation, but I would not trust any arbitrary C or C++ function that non-Nim-stdlib code calls.
(I'll continue reading the code in more detail.)