nimforum mirror - Nim 2.0 -- thoughts

Araq (orginal) [2021-05-15T09:01:03+02:00] view original

It's a bit too early for an official plan/RFC, so instead I'm writing this down here. Once Nim can bootstrap via --gc:orc we should make this the default GC as it works best with destructor based custom memory management. Since this is all based on the "new runtime" which isn't ABI compatible this deserves the 2.0 version. This should also be our next LTS version.

We should take the opportunity and clean up the standard library -- system.nim is too big and I don't see why io.nim and assert.nim should be part of it. I also long for better versions of json.nim, os.nim, strutils.nim etc.

os.nim

For example, os.nim could be split up into its different tasks:

path handling (paths.nim)

directory handling (dirs.nim): creation, removal, iteration, existence checking

file handling (files.nim): creation, removal, existence checking

environment variables handling (envvars.nim)

symlinks handling (symlinks.nim)

OS error reporting (oserrors.nim)

json could be split up into:

jsonbuilder

jsonparser

jsonlexer

jsontree

jsonmapper: it maps Nim types (objects) to Json types

The bundling process

Now ... these reworked modules should be available as Nimble packages, yet we like to keep Nim's "batteries included" conveniences. A new tool is required -- I call it the "bundler". The bundler takes specific commits of Nimble packages and makes it part of the official Nim distribution (that is, the zips and tarballs you can download from our website). Documentation of these modules is included, there is CI integration and to bring a new commit into the distribution, the changes need to have been reviewed, exactly like stdlib PRs are handled. I hope this will keep the benefits of today's monorepo but with more flexibility -- we can then more easily replace modules. These modules are to be imported as import dist / jsonbuilder (for example). To use the Nimble package instead, tell Nimble to install the package and change the import to import $packagename / jsonbuilder. The old way of doing `import std / json` would be deprecated, but would continue to work!

The bundler tool will have its own "C sources" so that the Nim compiler itself can use everything that is included in the distribution, we can continue to eat our own dog food.

Language changes

We already got user definable numeric literals for better bignum and decimal support.

We already got a new concepts implementation and should try it out for the outlined new modules.

We should make --threads:on the default. (Minor change.)

We should remove the "GC safety" effect as ORC doesn't need it. Instead access to global variables should be done in a .global environment (RFC to be written)...

We will unify addr and unsafeAddr (Accepted RFC).

"FAQ"

How compatible is Nim version 2 with version 1?

Very, the import statements will change but you can easily write code that works with both versions.

Can I use the distribution version of a module while my dependency uses the Nimble version of the module?

Yes.

Will there be a version 1.6 before version 2.0?

I don't know.

Have you lost your mind?

sekao (orginal) [2021-05-15T09:43:26+02:00] view original

I'm happy to hear the source level breakages would (mostly or entirely?) be in import statements -- though for large codebases, even that could be a lot of work to account for. Ideally it would be handled automatically, maybe by specifying the "edition" your module/package uses à la rust. But all it would do is rewrite the imports for you, rather than syntax changes.

The following isn't directed at this post in particular, but to advocates of more aggressive breaking changes. I'm wary of the false sunk cost argument -- that because some breaking changes will happen, we may as well dogpile them. Each one is an additional cost, after all. The insidious thing about such changes is that they benefit newcomers while taxing those who have invested heavily in the language, the latter being the most valuable users.

Araq (orginal) [2021-05-15T10:42:06+02:00] view original

I'm wary of the false sunk cost argument -- that because some breaking changes will happen, we may as well dogpile them. Each one is an additional cost, after all.

That's a most important point, but I think there is no objective best way here -- "death by a thousand little cuts (every minor version breaks something)" vs "one big planned update you can trust to cause work". Both approaches have downsides.

JPLRouge (orginal) [2021-05-15T11:06:27+02:00] view original

We should remove the "GC safety" effect as ORC doesn't need it. Instead access to global variables should be done in a .global environment (RFC to be written)...

much more readable even if it is more verbose

zetashift (orginal) [2021-05-15T12:43:16+02:00] view original

I was always hoping to see pattern matching, nicer variants and a better pragma syntax for 2.0!

I'm glad to see compatibility as a priority for 2.0.

Clonk (orginal) [2021-05-15T13:05:01+02:00] view original

The bundler tool will have its own "C sources" so that the Nim compiler itself can use everything that is included in the distribution, we can continue to eat our own dog food.

Can you clarify this sentence ? Does that mean the generated C source of the "dist" stdlib will be committed ?

Willyboar (orginal) [2021-05-15T13:26:18+02:00] view original

I believe Nim 2.0 also needs a central package hosting. The package.json file is already huge.

Araq (orginal) [2021-05-15T14:02:12+02:00] view original

Can you clarify this sentence ? Does that mean the generated C source of the "dist" stdlib will be committed ?

No, it means the bundler helper tool can be built without a Nim compiler, so that CI can build the tool, the bundler fills dist/ and then bootstrapping can begin. It's a minor implementation detail, sorry for having it brought up.

xigoi (orginal) [2021-05-15T14:27:49+02:00] view original

Are you considering making not nil the default?

Araq (orginal) [2021-05-15T15:04:03+02:00] view original

Are you considering making not nil the default?

No. 2.0 is about --gc:orc and stdlib refactorings, I don't want to do too many things at once. I have big plans for not nil but I don't want to talk about it, it would only sidetrack the discussions. If you have too many nil related bugs, use fewer ref's in your code. ;-)

adrianv (orginal) [2021-10-14T16:14:01+02:00] view original

cool, this could open the way for other backends.

jordyscript (orginal) [2021-10-16T17:25:01+02:00] view original

Thanks for the constructive and thoughtful reply and the suggestion you made. I hadn't considered that yet. It raises many important questions on how to best model a Path type. Modelling a Path type as a simple distinct string seems to me clearly an improvement over the current situation of using a plain string, since it introduces greater type safety by neatly separating two conceptually very different entities currently using the same type.

From that point on any additional inherited subtypes or generic types are less clearly an improvement. They may well be improvements, but now you really have to consider the specifics of the problem space for this domain. Once you go beyond the naive Path = distinct string type by adding subtypes or generic type parameters you have to consider what the cost is of embedding the additional filesystem metadata into the type. The tradeoff will be slightly different for each approach.

I've given a lot of thought to the generic type approach and I've done research as I was considering your suggestion and I've come up with the following considerations.

Path manipulation is fundamentally like interacting with a database which is the source of truth that your runtime Type information has to accurately reflect at all times if you are going to embed additional metadata into the generic type. Except the DBMS is the OS/Kernel and the database is the filesystem.

When dealing with Paths we have to account for and distinguish between abstract and concrete paths. Abstract paths don't exist on the filesystem, but are valid paths that could be created on the filesystem. Concrete Paths are Paths that currently exist on the filesystem.

When dealing with Paths one inevitably has to deal with untrusted input at runtime (such as command line arguments) where the metadata is initially unknown. This calls into question how much of the implementation can rely on static parameters and when, which may adversely affect runtime performance.

Some of the metadata may get lost when performing Path operations.

5. Relevant metadata for the parameters of the Path type includes:

Existence (Abstract, Concrete)

Entity (File, Directory, Device)

Locality (Absolute, Relative, UNC)

Link (Hardlink, Softlink)

Of course the metadata for all of these axes could be unknown (see point 3 and 4), so an enum that lists all the options including Unknown for each axis would be the appropriate way to model the parameters to the type.

The only metadata that can be deduced with certainty (notwithstanding any path manipulations) is Locality, since that information can be deduced from the actual Path. Existence, Entity, and Link can all become unknown due to various manipulations.

Keeping metadata synchronized with the filesystem for every Path manipulation may be unnecessary and costly overhead, since many consecutive manipulations may be necessary and not every intermediate parameter state may need to be fully known in between.

The generic type parameters multiply the effective range of Path types that exists, such that for the parameters listed in point 5 the total amount of combinations (including unknown options for each parameter except Locality) becomes 108. This can be further reduced to 84 effective types when recognizing that the Existence of Relative Paths cannot be known until the Path is made Absolute (i.e. until you know what it is relative to, which is typically, but not always, the current working directory, you can't know if a path exists or not)

These 84 types fall into intersectional groups with respect to what operations are permitted and what return type is produced (if any). For example, some procedures can only operate on Paths where the Entity is Directory, and Existence is Concrete, but Locality and Link are irrelevant.

Aliases for these groups can be defined, for example:
type AnyDirectory = Path[
    Directory,
    Abstract | Concrete | UnknownExistence,
    Absolute | Relative | UNC,
    HardLink | Softlink |UnknownLink]
The syntax used in the above example is invalid, but is used for brevity. (Maybe it would be nice if such a syntax could be added to the language. It's certainly easier to write than type AnyDirectory = Path[Directory, ...] | Path[Directory, ...] | etc. and it's arguably easier to read as well. )

The implications of point 8 and 9 are that the type information will become very complex and great care and much thought need to be invested in getting all the type interactions correct. This means that it will become harder for users of the library to define new procs using the generic Path[Existence, Entity, Locality, Link] = distinct string type and they need to invest more time in learning to use the library.

I also question the runtime performance cost of having such a versatile generic type.

Given point 10 it may be wiser to leave some of the metadata out of the generic type's parameters and have the user perform explicit checks when needed, or throw exceptions when invalid operations are performed and have these be handled by the user.

Alternatively, maybe it is possible to define some syntactic sugar that automatically updates the generic type's parameters when explicit checks are performed which effectively makes all of the parameters "invisible" to the end user. Requiring explicit checks for highly variable metadata eliminates the need to define procs for whole groups of effective instantiations of the generic type and thereby reduces the types one is concerned with effectively. This begs the question of whether it is needed to encode the metadata in the type if you are having to check explicitly anyway, but it would still prevent operations on types that haven't been checked explicitly.

12. A reduced set of metadata for the Path type would result in a much more manageable set of types. Metadata could be reduced to:

Entity (File, Directory, Device, UnknownEntity)

Locality (Absolute, Relative, UNC)

This yields only 12 effective types. A few aliases such as the ones in the example below can be defined (this syntax doesn't work, but is again used for brevity).
type
    FileSystemEntity* = enum
      File
      Directory
      Device
      UnknownEntity
    
    FilesystemLocality* = enum
      Absolute
      Relative
      UNC
    
    Path[FileSystemEntity, FilesystemLocality] = distinct string
    
    # Aliases
    AnyDirectory = Path[Directory, Absolute | Relative | UNC]
    AnyFile = Path[File, Absolute | Relative | UNC]
    AnyDevice = Path[Device, Absolute | Relative | UNC]
    UnknownPath = Path[UnknownEntity, Absolute | Relative | UNC]
    RelPath = Path[Directory | File | Device | UnknownEntity, Relative]
    AbsPath = Path[Directory | File | Device | UnknownEntity, Absolute]
    UNCPath = Path[Directory | File | Device | UnknownEntity, UNC]
    LocalPath = Path[File | Device | Directory | UnknownEntity, Absolute | Relative]
    LocalFile = Path[File, Absolute | Relative]
    LocalDirectory = Path[Directory, Absolute | Relative]
    LocalDevice = Path[Device, Absolute | Relative]
  
  proc `$`*(p: Path): string = p.string
  
  
etc. can be created to accurately type the parameters of Procs so that these procedures can be overloaded to work with different combinations, or left unimplemented if a particular combination is invalid.

Other metadata would then need to be queried explicitly when the need arises and will not be embedded in the type.

I think a generic parameterized Path type as mentioned in point 12 has potential. I can imagine a convenient API with good use of proc overloading and some nice constructors for it. The benefit it has over the non-generic type Path = distinct string is that you don't have to check explicitly for every cornercase in every proc you define, but can instead overload procs to work correctly based on the types involved.

Any feedback? Which would be preferable type Path = distinct string or type Path[FilesystemEntity, FilesystemLocality] = distinct string ? Am I missing something from my considerations of the problem space or the cost benefit analysis? After taking into account any feedback, would a sample implementation that wraps current Standard Library functionality be desirable?

Araq (orginal) [2021-10-16T17:41:43+02:00] view original

Any feedback? Which would be preferable type Path = distinct string or type Path[FilesystemEntity, FilesystemLocality] = distinct string ?

My solution which uses 4 distinct types would still be preferable to me ;-) However, if you want fewer types, at least distinguish between directories and files which have almost nothing in common. openFile cannot open directories and even if it could, what the heck would readBytes mean for a directory...

I think the API that I use in the compiler works well and has been battle-tested as the compiler does a suprising amount of path manipulations.

After taking into account any feedback, would a sample implementation that wraps current Standard Library functionality be desirable?

Yes.

There is one more missing aspect to this though, keeping paths as string (distinct or not does not matter for this point) is not optimal for the Windows target where Paths are in UTF-16 natively and so involve a translation step. This translation step is done repeatedly whereas otherwise it could be done only once.

jordyscript (orginal) [2021-10-16T21:02:27+02:00] view original

Thank you for the feedback!

My solution which uses 4 distinct types would still be preferable to me ;-) However, if you want fewer types, at least distinguish between directories and files which have almost nothing in common. openFile cannot open directories and even if it could, what the heck would readBytes mean for a directory...

I was under the impression that it's possible to distinguish with the generic type too, and I was planning to, but it turns out that the actual behaviour is different. Here's a comparison:

# implementation in compiler/pathutils
type
  AbsoluteFile = distinct string
  AbsoluteDir = distinct string
  RelativeFile = distinct string
  RelativeDir = distinct string
  AnyPath = AbsoluteFile | AbsoluteDir | RelativeFile | RelativeDir

# some procs to test things out
proc `$`*(p: Path): string = p.string
proc open*(file: AbsoluteFile) =
  echo "opening " & file.string
proc mkdir*(baseDir: AbsoluteDir, childDir: RelativeDir) =
  echo "making " &  childDir.string & " in " & baseDir.string

# proposed implementation
type
  FileSystemEntity* = enum
    File
    Directory
    Device
    UnknownEntity
  FilesystemLocality* = enum
    Absolute
    Relative
    UNC
  
  Path[FileSystemEntity, FilesystemLocality] = distinct string
  
  AbsoluteFile = Path[File, Absolute]
  RelativeFile = Path[File, Relative]
  AbsoluteDir = Path[Directory, Absolute]
  RelativeDir = Path[Directory, Relative]

# some procs to test things out
proc `$`*(p: Path): string = p.string
proc open*(file: AbsoluteFile) =
  echo "opening " & file.string
proc mkdir*(baseDir: AbsoluteDir, childDir: RelativeDir) =
  echo "making " &  childDir.string & " in " & baseDir.string

In my mind, the two implementations should behave identically, but I guess that because different values of an enum are still considered to be of the same type then each generic parameters only has a single type no matter what value it holds... In order to make the generic approach work, then, we would have to use some sort of collection of distinct types similar to an enum, but I'm unaware of the existence of such a construct.

Also, there doesn't seem to be a way to make inheritance work with distinct string. Either way, inheritance seems to be a more complicated approach than your approach.

So then (something similar to) your approach is the only viable one to get the behaviour I wanted in the first place. I will study the pathutils and os modules and see what I can learn from there.

But now I'm left wondering about the usefulness of a construct similar to an enum which defines a collection of distinct types that could serve as type parameter flags for a generic type that encodes metadata as type information... Is there such a thing? If not, I guess it could be implemented as a macro, though I've got no clue how that would work.

There is one more missing aspect to this though, keeping paths as string (distinct or not does not matter for this point) is not optimal for the Windows target where Paths are in UTF-16 natively and so involve a translation step. This translation step is done repeatedly whereas otherwise it could be done only once.

I didn't know about that. Seems to be a good problem to solve at the same time. Could a UTF16 string type string16 be defined and then distinct string16 be used for Path types when compiling for the windows target? Or is that too naive?

jordyscript (orginal) [2021-10-16T23:24:38+02:00] view original

But now I'm left wondering about the usefulness of a construct similar to an enum which defines a collection of distinct types that could serve as type parameter flags for a generic type that encodes metadata as type information... Is there such a thing? If not, I guess it could be implemented as a macro, though I've got no clue how that would work.

I was able to implement a typed enum macro such that the previous examples using the generic type parameters behave identically to the 4 different distinct string types approach from the pathutils module. So I may still end up using the generic type parameters to embed the metadata.

I'll publish something eventually for those interested to have a look at.

Araq (orginal) [2021-10-17T05:37:36+02:00] view original

Seems to be a good problem to solve at the same time. Could a UTF16 string type string16 be defined and then distinct string16 be used for Path types when compiling for the windows target? Or is that too naive?

That should work.

ElegantBeef (orginal) [2021-10-17T07:41:34+02:00] view original

I'm not sure the issue with the enum approach aside from not using static enums, given the following:

type
  FileSystemEntity* = enum
    File
    Directory
    Device
    UnknownEntity
  FilesystemLocality* = enum
    Absolute
    Relative
    UNC
  
  Path[Entity: static FileSystemEntity, Locale: static FilesystemLocality] = distinct string
  
  AbsoluteFile = Path[File, Absolute]
  RelativeFile = Path[File, Relative]
  AbsoluteDir = Path[Directory, Absolute]
  RelativeDir = Path[Directory, Relative]

# some procs to test things out
proc `$`*(p: Path): string = p.string
proc open*(file: AbsoluteFile) =
  echo "opening " & file.string
proc mkdir*(baseDir: AbsoluteDir, childDir: RelativeDir) =
  echo "making " &  childDir.string & " in " & baseDir.string
open(AbsoluteFile("Hello"))
#open(AbsoluteDir("Hello")) # doesnt compile
mkDir(AbsoluteDir("Hello"), RelativeDir("./"))
#mkDir(AbsoluteFile("Hello"), RelativeDir("./")) # nor do these
#mkDir(AbsoluteFile("Hello"), AbsoluteDir("./"))

jordyscript (orginal) [2021-10-17T12:25:06+02:00] view original

You are absolutely right! Thanks for pointing that out. I wasn't aware, but the presence of static in front of the enum type is what makes the difference. Upon further reflection, despite some runtime untrusted input I can get away with using static after all, so I'll be using this :)

Lennart (orginal) [2021-10-20T14:13:10+02:00] view original

Incremental Compilation ☺

Mirror of forum.nim-lang.org

7983 :: Nim 2.0 -- thoughts