I was struggling to understand this zippy example:
import zippy/ziparchives, std/tables
# This example shows how to easily create an in-memory zip archive that can be
# written to disk or uploaded to a server, etc.
# First, add the entries you want in the zip archive.
# The key is the path (must be relative) and the value is the content bytes.
var entries: Table[string, string]
entries["file.txt"] = "Hello, Zip!"
entries["data/blob.json"] = "{}"
and I asked Copilot to explain it... I couldn't relate Nim's string type to raw byte content, and Copilot's explanation clarified it with this sentence:
"In Nim, string is binary-safe, so you can store arbitrary bytes (not just text)"
I was a bit surprised because I had missed that clarification in the Nim manual; I recalled that a string was basically a sequence of chars terminated by '\0'. Further explanations followed, and I later asked Copilot to provide its references and "chain of thought" in case of inference. This was the response:
* Initial understanding came from reading Nim’s system-library documentation and implementation notes (e.g., in stringimpl.nim).
* The deepWiki article gives insight into the string's internal structure, revealing it stores both a pointer and a length field, so no null-termination is needed. [deepwiki.com]
* The Compile7 article explicitly verifies my interpretation, stating that embedded \0 characters are fully preserved in Nim strings and that .cstring conversion, not the string itself, uses null-termination. [compile7.org]
* Practical experience further reinforced this: using readFile on a PNG or PDF, manipulating as string, then writing it back with writeFile preserves every byte exactly.
It seems consistent and reasonable, but I wish to double-check: is it correct? And is there any reason why it's not explicitly mentioned in the Nim manual where string and cstring are introduced? I apologize, my programming knowledge is quite limited and maybe this is totally evident for 99% of coders... However, for a strictly typed programming language I would have expected either a different type (e.g. something like a rawbytes type, which could have been just an alias of the string type) or a sentence like the one Copilot produced above about string being binary-safe and able to store arbitrary bytes. Thank you.
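The round-trip in the last bullet can be sketched without relying on an existing PNG or PDF (a minimal sketch; the file name is hypothetical):

```nim
# Write arbitrary bytes through a string, read them back, compare byte-for-byte.
let blob = "\x89PNG\r\n\x1a\n\0\0\xff"  # PNG-like header with embedded zeros
writeFile("blob.bin", blob)             # Nim does no newline translation
let back = readFile("blob.bin")
doAssert back == blob                   # every byte preserved exactly
doAssert back.len == 11                 # embedded \0 bytes are counted too
```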
P.S. I have to add that I am more and more impressed by LLM chatbot skills to dig into documentation and explain technical topics in general terms like in this case. This is the language programming's "abstraction" the other way round: look at the implementation ( Practical experience further reinforced this ...) and derive the underlying concept.
It's in the Manual - see String Type:
A string in Nim is very similar to a sequence of characters. However, strings in Nim are both zero-terminated and have a length field. One can retrieve the length with the builtin len procedure; the length never counts the terminating zero. The terminating zero cannot be accessed unless the string is converted to the cstring type first.
Nim’s native string type can indeed hold any binary data, since it stores the length explicitly instead of relying on null-termination.
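A quick way to verify that claim (a minimal sketch):

```nim
let s = "ab\0cd"
doAssert s.len == 5            # the embedded \0 counts toward the length
doAssert s[2] == '\0'          # and the byte itself is fully accessible
doAssert s & "!" == "ab\0cd!"  # string operations don't stop at the \0
```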
I recalled that string was basically a sequence of chars terminated by '\0'
This is how strings are implemented in C, JS and some other languages. Nim provides a compatible cstring type, which should only be used for C/JS FFI.
however for a strictly typed programming language I would have expected either a different type (e.g. something like a rawbytes type, that could have been just an alias of string type) or a sentence like the one Copilot produced above about string being binary safe and able to store arbitrary bytes.
I have been advocating for years to remove string usage for raw bytes in stdlib and to use seq[byte]: https://github.com/nim-lang/RFCs/issues/32
And building an API that can handle both is easy: https://github.com/mratsim/constantine/blob/8e117d7/constantine/hashes.nim#L40-L61
func hash*[DigestSize: static int](
       HashKind: type CryptoHash,
       digest: var array[DigestSize, byte],
       message: openArray[byte],
       clearMem = false) {.genCharAPI.} =
  ## Produce a digest from a message
  static: doAssert DigestSize == HashKind.type.digestSize

  var ctx {.noInit.}: HashKind
  ctx.init()
  ctx.update(message)
  ctx.finish(digest)
  if clearMem:
    ctx.clear()

func hash*(
       HashKind: type CryptoHash,
       message: openArray[byte],
       clearMem = false): array[HashKind.digestSize, byte] {.noInit, genCharAPI.} =
  ## Produce a digest from a message
  HashKind.hash(result, message, clearMem)
I ingest openArray[byte] by default and I have a genCharAPI macro that adds the following dispatcher:
template toOpenArrayByte[T: byte|char](oa: openArray[T]): openArray[byte] =
  when T is byte:
    oa
  else:
    oa.toOpenArrayByte(oa.low, oa.high)
Note: in pure Nim code you can use a generic proc foo[T: byte|char](oa: openArray[T]) to accept both strings and byte sequences. But in my case I needed a library that could be consumed from C/Go/Rust, hence the macro approach, as generic procs cannot be exposed without wrappers.
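That pure-Nim generic approach can be sketched like this (the proc name countZeros is hypothetical):

```nim
# One generic proc accepts both string and seq[byte] via openArray[T].
proc countZeros[T: byte|char](data: openArray[T]): int =
  for b in data:
    if int(b) == 0:      # int() conversion works for both char and byte
      inc result

doAssert countZeros("a\0b") == 1           # called with a string
doAssert countZeros(@[byte 0, 1, 0]) == 2  # called with a byte sequence
```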
@tcheran, are you asking if "binary safe" means a Nim string may have a byte of any value 0..255, including null-byte (all-zero bit pattern) and non-printing characters in it?
@janAkali, when I read the documentation, "However, strings in Nim are both zero-terminated and have a length field." that suggests, to me, that Nim strings terminate at the null character.
When I encounter a seeming conflict like this, I rely on Timmy. Timmy is my Nim sandbox. It's just an instance of a VS Code editor that I keep to the side with the file timmy.nim ready to compile. I throw in some code that I want to test and I see what the result is. When I learn something new, I save timmy.nim to something more meaningful.
let s1 = "abc\0def"
echo s1
echo s1.len
var s2 = s1
s2.add("\xff")
echo s2
echo s2.len
Looking at the output, I'd say Nim strings may hold any byte value 0..255. Nim's len for strings returns the byte count, including bytes of any value 0..255. However, echo appears to print the value 255 (0xff) but elide the null char (more likely, the \0 byte is written and the terminal simply doesn't display it).
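The same experiment can be restated with assertions, so the conclusion doesn't depend on what the terminal renders (a sketch):

```nim
let s1 = "abc\0def"
doAssert s1.len == 7        # the embedded \0 is counted
var s2 = s1
s2.add("\xff")
doAssert s2.len == 8        # \xff is stored like any other byte
doAssert s2[3] == '\0'      # the \0 is still there, mid-string
doAssert s2[7] == '\xff'
```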
Thank you all for the answers. Using string for byte blobs does work, with no major issues except some concerns expressed in the first link posted by @mratsim (thank you for the additional insights).
So I may assume that the Nim manual does not explicitly expose this opportunity offered by the string type because it's not necessarily the most elegant/appropriate way (someone called it "string abuse") to store a sequence of bytes.
@janAkali
Nim documentation does report this:
However, strings in Nim are both zero-terminated and have a length field. One can retrieve the length with the builtin len procedure; the length never counts the terminating zero. The terminating zero cannot be accessed unless the string is converted to the cstring type first.
@dwhall256 yeah, your experiment confirms just what Copilot reported too:
The Compile7 article explicitly verifies my interpretation, stating that embedded \0 characters are fully preserved in Nim strings and that .cstring conversion, not the string itself, uses null-termination
Nim itself does not rely on null-termination; the terminating null character (not optional, it is always there, "hidden") is not counted in the string length, so when echoing the characters of a string, a \0 is emitted only if it lies within the string length.
Strings are always 0-terminated so that cstring(s) becomes a zero-copy operation. Nim strings are binary-safe, but cstring is not. Also, in Nim 3 the 0 termination might disappear, causing slower conversions to cstring but saving memory.
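The difference that zero-copy conversion makes can be sketched as:

```nim
var s = "ab\0cd"
doAssert s.len == 5    # the Nim string sees all five bytes
let c = cstring(s)     # zero-copy today: reuses the hidden terminating \0
doAssert c.len == 2    # the C view stops at the first (embedded) \0
```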
That said, I use string instead of seq[byte] all the time for newly written code as it's more convenient. Maybe we should unify byte and char for Nim 3...
@Araq:
I use string instead of seq[byte] all the time for newly written code as it's more convenient.
Is that because there are no convenient procs in the standard library for properly dealing with binary data (i.e. seq[byte])?
@Araq:
Maybe we should unify byte and char for Nim 3...
In other words, you suggest to dump the type-safe distinction between a byte used as part of UTF-8 encoded text and a byte used as a piece of arbitrary binary data. I disagree. The two are not the same.
I spent a significant part of my earlier career dealing with binary data files, and conversions between different binary encodings. I thought the Nim distinction between char and byte was a real step forward from the C/C++ conflation of the two.
More to the point, I agree with @mratsim above. I find it disappointing that binary data is treated as a second-class citizen in the standard library of a language that touts itself to be a systems programming language. I think the right thing to do is address the library shortcoming(s), not to weaken type safety in the language.
I am very willing to help by submitting PRs to the standard library, as long as I am confident that I'm not wasting my time struggling against entrenched opinions.
I thought the Nim distinction between char and byte was a real step forward from the C/C++ conflation of the two.
As long as 123 (int) is distinguished from 123u8 (byte) I think much of the type safety is preserved. C/C++'s design has other problems on top of that, signed vs unsigned char, easy typos like if (*p) vs if (p) are possible etc.
I think for Nim this unification makes lots of sense and we don't lose much safety, if any.
In other words, you suggest to dump the type-safe distinction between a piece of UTF-8 encoded text and a piece of arbitrary binary data. I disagree. The two are not the same.
That's not what a Nim string is. It's UTF-8 by convention, but you can put UTF-16 or latin-1 or whatever else in it just as well.
Still, I don't see how char could be unified with byte either, since $c does something different if c is a char or a byte. (I say this even though I also use openArray[byte] in my character coding library; string is easier to type, but it's very inconvenient for consumers.)
since $c does something different if c is a char or a byte.
Great point, I didn't consider this.
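For reference, the $ difference mentioned above can be checked directly:

```nim
doAssert $'A' == "A"         # char: rendered as the character itself
doAssert $65'u8 == "65"      # byte: rendered as a number
doAssert $byte('A') == "65"  # the same bit pattern, different meaning
```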
IMO much of the string vs seq[byte] inconvenience could be resolved by having something like

proc toBytes(s: sink string): seq[byte]
proc toString(b: sink seq[byte]): string

in the standard library. The first would remove the trailing \0, the second would check the last element and add a \0 if it isn't there already. Libraries could return whatever, and one could switch to the preferred form safely (avoiding copies and terminating-\0 gotchas). For me at least, the inconvenience mainly comes from the fact that (loosely speaking) a string is an openArray[char|byte], but an openArray[char|byte] is not a string. sink is not very useful here because I rarely have something to sink, just a view coming either from a fixed-size array or another string.
The fix is simple even in current Nim: just stop taking string if you only need a view. Say I have a proc that handles string; if I want to pass a fixed-size array or a substring, I have to copy it first. So I change it to openArray[char], but now I can't call strutils.find on it, because everything in strutils is declared to take string.
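A sketch of that situation (findComma is a hypothetical stand-in for the missing openArray overloads):

```nim
import std/strutils

proc findComma(data: openArray[char]): int =
  ## Manual replacement for strutils.find, which is declared for string only.
  result = -1
  for i, c in data:
    if c == ',': return i

let s = "a,b,c"
doAssert s.find(',') == 1                          # fine on a string
doAssert findComma(s.toOpenArray(2, s.high)) == 1  # fine on a view, no copy
```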
It seems the question boils down to this:
Should Nim types distinguish between text data (a collection of characters, encoded or otherwise) and binary data (a collection of arbitrary binary values)?
The bike shedding part of me says yes. Text data is processed in a very different manner from binary data, so the underlying types should reflect that distinction.
The pragmatic part of me says why bother. Nim's text types are not enforced to be valid text, so the current distinction simply gets in the way of efficient, simple coding. Making char a simple alias of byte, and string a simple alias of seq[byte] would make things a lot easier. I was comfortable with this kind of thing in my C days, and I would be comfortable with this now.
Either way, I need a standard library that does not force me to do type coercion in order to use it for binary data.