nimforum mirror - Looking for advices on data design for a dynamic language

gcao (orginal) [2021-01-30T05:01:07+01:00] view original

Hi,

I've been trying to create a toy dynamic language called Gene(https://github.com/gcao/gene.nim), and was able to get a lot done thanks to Nim. Now I'm running into some roadblocks and would love some expert advices on designing the data structure etc.

Gene is an interpreted language and currently supports these features:

Dynamic and flexible

Homoiconic like Lisp

Other than int, string, map, vector etc, it has a new basic type called "gene", which is like a HTML element that has a type, a list of properties and a list of children.

Support Functional Programming

Support Object Oriented Programming (single inheritance)

Support macro expansions

Support pattern matching

The problems I ran into are

Data structure is limiting

Performance is bad (most likely because of the design of data structure and the interpreter. I've used Valgrind to benchmark and optimized as much as possible but it's still pretty slow)

I would like to re-look at the data structure to make it meet below requirements

Performant

Extensible (can be extended by application developers thru Nim-based extensions, and user-types should have comparable performance as standard types)

Memory safe

My current data structure is like below. Any comments or suggestions on how this can be improved will be highly appreciated. Even if I have to re-write most of my code that's fine because this is just a hobby project.

type
  GeneKind* = enum
    GeneBool
    GeneInt
    ...
    GeneGene
    GeneInternal    # internal data types like Function, Class etc
    GeneAny         # any data type
  
  GeneValue* {.acyclic.} = ref object
    case kind*: GeneKind
    of GeneBool:
      bool*: bool
    of GeneInt:
      int*: BiggestInt
    ...
    of GeneGene:
      gene*: Gene
    of GeneInternal:
      internal*: Internal
    of GeneAny:
      anyType*: int   # Can be used to search for type information
      any*: pointer
  
  Gene* {.acyclic.} = ref object
    `type`*: GeneValue
    props*:  OrderedTable[int, GeneValue]
    data*:   seq[GeneValue]
  
  Internal* = object
    case kind*: GeneInternalKind
    of GeneApplication:
      app*: Application
    of GeneFunction:
      fn*: Function
    ...

drkameleon (orginal) [2021-01-30T07:37:05+01:00] view original

Hi!

The topic is obviously very broad. For starters, I would suggest you start looking into existing programming language implementation (and the more similar they are conceptually to what you have in mind, the better).

In Nim, there already several ones:

Arturo (shameless self-promotion lol)

Spry

A few ideas off the top of my head on how to improve performance:

Re-Implement the interpreter as a bytecode VM (stack- or register-based). https://www.smartspate.com/how-to-create-an-interpreter-of-bytecodes/

Packing values as floating-point numbers (so, basically, avoid objects and hide values - numbers, or object pointers in a IEEE-754 float). https://docs.oracle.com/cd/E19957-01/806-3568/ncg_math.html#:~:text=IEEE%20754%20specifies%3A,and%20occupies%2064%20bits%20overall.

Stefan_Salewski (orginal) [2021-01-30T07:46:21+01:00] view original

an interpreted language

Then I assume that you have already read the great free book

https://craftinginterpreters.com/

gcao (orginal) [2021-01-30T18:13:40+01:00] view original

Thank you! I'll take a look at the projects and resources you mentioned. I'm curious how fast is Arturo. Is it comparable with Ruby, Python etc or even faster?

I would expect we can build a tree-walker interpreted language that is as fast as those, but maybe not.

I would love to build a bytecode VM, or maybe compile Gene to wasm. However because Gene is a dynamic language. It might be too hard and too much work.

Any advice on how to change my data structure to support user types in a memory-safe way?

gcao (orginal) [2021-01-30T18:18:09+01:00] view original

Thank you. I read some chapters but didn't follow it to build anything. I may go back to read the whole thing.

Do you have any suggestions on what a performant data structure for a dynamic language should look like?

drkameleon (orginal) [2021-02-01T07:01:22+01:00] view original

Well, since you're asking about speed... let me share some thoughts.

At the beginning, I was totally obsessed about speed. I mean... could I create the fastest interpreted language out there? That was some challenge.

I ran my benchmark for every single code change and super-ultra-optimized every single line, to the point where the code was looking like hieroglyphics, but performant ones.

Did it beat Ruby and Python? Yes, it did (since none of them is famed for its speed). It could single-handedly beat Tcl, Rebol/Red and all those too. (Btw, the fastest, most minimal, interpreted language - no matter how much I dislike it - is Lua, so this is where you'd have to set the bar if you want to set it as high as possible.)

But, being fast per-se, doesn't mean so much.

I mean... if I really wanted absolute speed, first of all I would have written a compiler. So, remember that.

Now, being fast, of course is a good thing. But over-optimizing too much, too soon - is not.

Fast-forward to now, after 2-3 complete re-writes of Arturo (imagine: Arturo was initially written in D, then I moved to C, then to Nim, then back to C, then again back to Nim), I'll tell you something:

I decided - for once - NOT to focus on speed, and NOT to waste any time for now on silly micro-optimizations. My goal is to make a fully 100%-functional programming language and interpreter. And once I have that, optimize it as much as I like.

And you know what? Even though I haven't run any benchmarks whatsoever, the language feels reasonably fast and... most important of all: it's a real language with an interpreter that works.

Just my few cents, from someone that has been troubled by all this for too long. ;-)

drkameleon (orginal) [2021-02-01T07:09:24+01:00] view original

Regarding the data structure you mentioned, let me share some thoughts about it too, in an oversimplified way to get what I mean... (though, I'd have to say that I'm not doing it like that right now, simply because as I said, my decision is to make a working language first, and then deal with these things... it's easier this way).

Switch to your bitwise-mind and let's go.

Let's see where you could "fit" all of your values...

Would 8bits (1 byte) work? I guess not. Unless, you want to be able to store positive integer up to 256, or any number between -128/128.

Would 16bits work? I still guess not.

Now, let's take 64bits for example, which is widely support. Suppose you "split" those bits into 2 halves. The right-hand 32-bits could easily accomodate any integer (so there you have 32bit integer support), booleans too, and pointers too - since they also fit into 32 bits (in a few words, any "dynamic" object, strings, arrays, whatever). Floating-point numbers could easily use more than 32 bits, but you already have them, don't you?

And... you could use the first bits for storing any relevant information: for example, what type of data is it holding?

I know I'm over-simplifying it, but this is the whole idea in a nutshell. ;)

ynfle (orginal) [2021-02-01T12:46:27+01:00] view original

You could also use an enum and with bitsets you could have union types etc. All the binary stuff is handled for you, though you will need more memory because it's one bit per option

Mirror of forum.nim-lang.org

7449 :: Looking for advices on data design for a dynamic language