Hi,
I've been trying to create a toy dynamic language called Gene(https://github.com/gcao/gene.nim), and was able to get a lot done thanks to Nim. Now I'm running into some roadblocks and would love some expert advices on designing the data structure etc.
Gene is an interpreted language and currently supports these features:
The problems I ran into are
I would like to re-look at the data structure to make it meet below requirements
My current data structure is like below. Any comments or suggestions on how this can be improved will be highly appreciated. Even if I have to re-write most of my code that's fine because this is just a hobby project.
type
GeneKind* = enum
GeneBool
GeneInt
...
GeneGene
GeneInternal # internal data types like Function, Class etc
GeneAny # any data type
GeneValue* {.acyclic.} = ref object
case kind*: GeneKind
of GeneBool:
bool*: bool
of GeneInt:
int*: BiggestInt
...
of GeneGene:
gene*: Gene
of GeneInternal:
internal*: Internal
of GeneAny:
anyType*: int # Can be used to search for type information
any*: pointer
Gene* {.acyclic.} = ref object
`type`*: GeneValue
props*: OrderedTable[int, GeneValue]
data*: seq[GeneValue]
Internal* = object
case kind*: GeneInternalKind
of GeneApplication:
app*: Application
of GeneFunction:
fn*: Function
...
Hi!
The topic is obviously very broad. For starters, I would suggest you start looking into existing programming language implementation (and the more similar they are conceptually to what you have in mind, the better).
In Nim, there already several ones:
A few ideas off the top of my head on how to improve performance:
an interpreted language
Then I assume that you have already read the great free book
Thank you! I'll take a look at the projects and resources you mentioned. I'm curious how fast is Arturo. Is it comparable with Ruby, Python etc or even faster?
I would expect we can build a tree-walker interpreted language that is as fast as those, but maybe not.
I would love to build a bytecode VM, or maybe compile Gene to wasm. However because Gene is a dynamic language. It might be too hard and too much work.
Any advice on how to change my data structure to support user types in a memory-safe way?
Thank you. I read some chapters but didn't follow it to build anything. I may go back to read the whole thing.
Do you have any suggestions on what a performant data structure for a dynamic language should look like?
Well, since you're asking about speed... let me share some thoughts.
At the beginning, I was totally obsessed about speed. I mean... could I create the fastest interpreted language out there? That was some challenge.
I ran my benchmark for every single code change and super-ultra-optimized every single line, to the point where the code was looking like hieroglyphics, but performant ones.
Did it beat Ruby and Python? Yes, it did (since none of them is famed for its speed). It could single-handedly beat Tcl, Rebol/Red and all those too. (Btw, the fastest, most minimal, interpreted language - no matter how much I dislike it - is Lua, so this is where you'd have to set the bar if you want to set it as high as possible.)
But, being fast per-se, doesn't mean so much.
I mean... if I really wanted absolute speed, first of all I would have written a compiler. So, remember that.
Now, being fast, of course is a good thing. But over-optimizing too much, too soon - is not.
Fast-forward to now, after 2-3 complete re-writes of Arturo (imagine: Arturo was initially written in D, then I moved to C, then to Nim, then back to C, then again back to Nim), I'll tell you something:
I decided - for once - NOT to focus on speed, and NOT to waste any time for now on silly micro-optimizations. My goal is to make a fully 100%-functional programming language and interpreter. And once I have that, optimize it as much as I like.
And you know what? Even though I haven't run any benchmarks whatsoever, the language feels reasonably fast and... most important of all: it's a real language with an interpreter that works.
Just my few cents, from someone that has been troubled by all this for too long. ;-)
Regarding the data structure you mentioned, let me share some thoughts about it too, in an oversimplified way to get what I mean... (though, I'd have to say that I'm not doing it like that right now, simply because as I said, my decision is to make a working language first, and then deal with these things... it's easier this way).
Switch to your bitwise-mind and let's go.
Let's see where you could "fit" all of your values...
Would 8bits (1 byte) work? I guess not. Unless, you want to be able to store positive integer up to 256, or any number between -128/128.
Would 16bits work? I still guess not.
Now, let's take 64bits for example, which is widely support. Suppose you "split" those bits into 2 halves. The right-hand 32-bits could easily accomodate any integer (so there you have 32bit integer support), booleans too, and pointers too - since they also fit into 32 bits (in a few words, any "dynamic" object, strings, arrays, whatever). Floating-point numbers could easily use more than 32 bits, but you already have them, don't you?
And... you could use the first bits for storing any relevant information: for example, what type of data is it holding?
I know I'm over-simplifying it, but this is the whole idea in a nutshell. ;)