Programming Language

Check out Rust's incremental build mode (https://blog.rust-lang.org/2016/09/08/incremental.html) and Adaptiva's incremental builds (https://vimeo.com/122066659).

Binary Representation of Source Code

  • Represents the source code. Is not any kind of executable code (ASM/bytecode). Not an IR.
  • Take any valid text source code, turn it into the binary representation and back again, and end up with the same byte-for-byte file (see the sketch after this list).
  • Not storing individual tokens (ie no LEFT_BRACE). But do need to keep things like whitespace and comments.
  • Edit source code not text.
  • But still allows for people to use standard text editors.
  • Also allows for non-text, source-code-specific editors.
    • Quick and efficient editing of the binary format (ie quickgo/quickrust concept programs).
    • Graphically represent source code (not the same as a graphical programming language, ie Blockly; just an easier way to read code).
    • Having things like frames around data structures and function definitions.
    • Could have UML like representations (Not advocating for UML specifically, but it's a possibility).
    • Easy/quick navigation of source code. Things like goto definition would be much easier to represent.
  • Makes tooling much easier. Can allow for libraries for manipulation of the code that tooling can use.
  • Downside: any time you have invalid syntax, everything breaks. But that happens anyway with normal code...
  • Could use a virtual filesystem to automatically convert stored binary to text or vice versa.
  • Any text you edit could basically have any syntax you like, although obviously a standardised version would be best.
  • Could allow for syntax changes.
  • Could allow for special keywords for editing with a basic text editor (ie 'def myfunctionname' could be hooked to actually insert a function definition nearby on file save and the 'def' keyword removed).
  • Would be easier with a well defined syntax for the source code (ie define tabs vs spaces, number of newlines between functions).
  • But might be better to just store tabs/spaces and newlines in the binary format.
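
A minimal sketch in Rust of the round-trip invariant. The encode/decode pair here are placeholder identity functions standing in for the hypothetical binary codec; only the property itself comes from the notes above.

// Placeholder codec: a real implementation would emit and parse the
// binary source format. Identity bodies just make the property testable.
fn encode(text: &[u8]) -> Vec<u8> {
    text.to_vec()
}

fn decode(binary: &[u8]) -> Vec<u8> {
    binary.to_vec()
}

#[test]
fn round_trip_is_byte_for_byte() {
    // Whitespace and comments must survive, not just tokens.
    let src = b"fn main() {\n\t// a comment\n}\n".to_vec();
    assert_eq!(decode(&encode(&src)), src);
}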

Non Binary?

  • Perhaps just use something like JSON? Rust can output JSON ASTs... It would be less work for things like parsing. Would have bigger source code sizes (who cares?). Would let git etc. work with it.
  • Capnproto/flatbuffers?
  • Could define the language as an API and not care about the storage format (see the sketch after this list).
    • addNewFunction("function name") -> UUID
    • Need to define 'types'.
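
A hypothetical sketch of the language-as-an-API idea; none of these names come from a real library, and FunctionId stands in for the UUID mentioned above.

// The storage format becomes irrelevant: any backend that can answer
// these calls is a valid representation of the source code.
struct FunctionId(u128);

#[derive(Default)]
struct Module {
    next_id: u128,
    functions: Vec<(u128, String)>, // (id, name)
}

impl Module {
    fn add_new_function(&mut self, name: &str) -> FunctionId {
        let id = self.next_id;
        self.next_id += 1;
        self.functions.push((id, name.to_string()));
        FunctionId(id)
    }
}

fn main() {
    let mut module = Module::default();
    let _id = module.add_new_function("function name");
}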

Programming language, program thyself

Rust has build.rs and compiler plugins.

Go has 'go generate'.

These could be used to implement things similar to generics, via code generation.
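
For instance, a build.rs along these lines generates one concrete function per type, a poor man's generics; the sum_* functions are invented for the example, while OUT_DIR and include! are standard Cargo/Rust machinery.

// build.rs - Cargo runs this before compiling the crate itself.
use std::{env, fs, path::Path};

fn main() {
    let out_dir = env::var("OUT_DIR").unwrap();
    let mut code = String::new();
    // Emit a concrete function per element type: codegen standing in
    // for generics.
    for ty in ["i32", "i64", "f64"] {
        code.push_str(&format!(
            "pub fn sum_{ty}(xs: &[{ty}]) -> {ty} {{ xs.iter().copied().sum() }}\n"
        ));
    }
    fs::write(Path::new(&out_dir).join("generated.rs"), code).unwrap();
}

The crate then pulls the generated code in with include!(concat!(env!("OUT_DIR"), "/generated.rs"));.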

'Schemas' not 'data structures'

  • struct definitions are normally mixed in with the procedural instruction source code.
  • Structures are a binding of **data types** to **variable names**.
  • Separate the **representation** from the **implementation** (sketched after this list).
    • Standard native in-memory layout with the same performance and so on.
      • Allow for a separate memory layout. Some architectures (for example the Cell processor) require memory padding.
      • In-memory ordering.
      • Endianness?
    • Serialisation.
    • Database backed.
  • Older OOP languages like C++ and Java also bind **methods/member functions** to **data structures**.
  • Newer languages like Rust and Go move away from OOP and use interfaces (ie traits) primarily.
  • Conceptually design it as **APIs**, not bound functions.
    • Allow for standard native function calls, or RPC calls, etc... IPC or networked.
    • Some kind of distributed backend (Raft, Blockchain)
  • Could allow security definitions in the schema. Ie, who can edit this variable. Allows you to separate the security implementation stuff from the data structure.
  • Would capnproto be usable? What about things like generics? IIRC there are programming constructs capnproto doesn't support...
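
A sketch of the schema/backend split, with invented names: the trait plays the role of the schema, the in-memory case keeps native performance, and a database-, RPC- or Raft-backed implementation could be swapped in without touching callers.

use std::collections::HashMap;

// The 'schema': names and types, with no storage decision attached.
trait UserStore {
    fn get_score(&self, user: &str) -> Option<u64>;
    fn set_score(&mut self, user: &str, score: u64);
}

// Plain in-memory representation; same cost as an ordinary map/struct.
struct InMemory(HashMap<String, u64>);

impl UserStore for InMemory {
    fn get_score(&self, user: &str) -> Option<u64> {
        self.0.get(user).copied()
    }
    fn set_score(&mut self, user: &str, score: u64) {
        self.0.insert(user.to_string(), score);
    }
}

// A serialised or database-backed UserStore would implement the same
// trait, and is also where per-field security checks could live.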

'Context' not 'arguments'

  • Define implicit contexts. Don't want to specify them every time...
  • Probably need multiple contexts. A global context. Per-function context. (Maybe per thread? Initialisation context? Can a data structure have a context? Module/namespace context? Event context, such as socket programming?)
  • main's global context for example could include:
    • Command line arguments
    • OS functions/syscalls: os.gettime, filesystem access, etc., for example.
  • Contexts need to be overridable. So I could call Main from another program and pass in the stdio streams and stuff... Unit testing and so on (see the sketch below).
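
A sketch of an explicit, overridable context, with invented names. A real design would make this implicit at call sites; it is spelled out here to show the override-for-testing idea.

struct Context {
    args: Vec<String>,
    gettime: fn() -> u64, // stand-in for an OS clock syscall
}

fn app_main(ctx: &Context) -> u64 {
    // Uses only what the context provides: no ambient globals.
    (ctx.gettime)() + ctx.args.len() as u64
}

#[test]
fn main_is_callable_from_a_test() {
    // Override the context: fixed clock, fake arguments.
    let ctx = Context { args: vec!["prog".into()], gettime: || 42 };
    assert_eq!(app_main(&ctx), 43);
}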

What about garbage collection/memory models being specified in the context?

What about supporting execution transfer? Moving a program across multiple cores, to other systems, etc. OS calls could become RPC ones, or stay local OS calls on the new machine. The transfer strategy would probably need to be part of the context.

What about a mapping from literals to datatypes?

For example int->int32/int64. But you could make a function that does basic arithmetic using floating point numbers, complex numbers or some custom type, for example. Allowing for dependent types.

Do literals make sense in a language without text/syntax? How are these types passed in?

What about the equivalent of 'keywords'? You could 'define' a for loop in the context. For example a restricted for loop that has a hard time limit.

Consider number types as mathematical 'fields'?

Compile-time contexts vs runtime contexts? A function context would have the arguments and other stack stuff in it. Compile time would have type information. Allowing the compile-time context at runtime would allow for a Python-like dynamic programming language.

A data structure with a context could be considered a generic.

'APIs' not 'methods'

  • Similar to the 'Schemas' vs 'data structures'.
  • Separate implementation from definition.
  • Implicit definitions.
  • Allow functions to be bound to a type. But they could be used as native function calls (like normal). Or remote RPC calls.
  • Security stuff needs to be worked out. Local function calls don't need it. Remote function calls will need to be authenticated and the contents verified. This also requires logic-level security (ie you shouldn't have permission to modify another person's struct).
  • Could there be an 'ownership' model for security?

API Versioning

  • API version as a hash of the binary representation of the API? (See the sketch after this list.)
    • Need to deal with non-breaking things. Like changing the order of functions.
  • Function definitions and the like could be tracked, and breaking changes to syntax noted automatically.
    • Allow adding fields with defaults without api change.
    • Allow optional named arguments.
  • Implementation stuff is harder (ie we changed the format of the string this function returns but the signature is the same).
    • Functions that have no source code changes can be safely ignored.
    • Changing the implementation doesn't mean the result is different (ie optimisation).
    • Changing the implementation of a function could accidentally change the result (bug). Being told when that happens is handy.
    • Allow specifying functions for specific API versions so if you do change the implementation you can keep backwards compatibility.
    • How do consumers choose which version (ie the specific version they used, or 'latest')?... Compiled binaries could keep a list of the API versions used.
    • Unit tests could provide a hint. (ie if this unit test changed...), but doing something like adding an extra test or changing the order doesn't mean the implementation's result is different.
      • Automatic 'quickcheck' when possible? The compiler can implement a unit test with no effort from the programmer and log results. But you won't know when it's possible (ie halting problem, use of globals/statics, side effects, etc...). Maybe just best effort (ie if it didn't finish in 1 second and/or used more than 512kb of RAM, kill the test). Don't store the result of tests that return a lot of stuff. Do store the meta information about killed tests and the number of items returned (or even better a hash of the items returned; pointers would be a pain though...).
      • 'quickbench'? To benchmark performance? Obvious problems of different hardware, but could still be useful. Probably not for API versioning.
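
A sketch of the version-as-hash idea. Sorting the signatures first makes reordering non-breaking, per the note above. std's DefaultHasher is not stable across Rust releases, so a real design would pin a specific algorithm (sha3, say); this only shows the shape.

use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

fn api_version(signatures: &[&str]) -> u64 {
    let mut sigs = signatures.to_vec();
    sigs.sort_unstable(); // reordering functions must not change the version
    let mut h = DefaultHasher::new();
    for s in &sigs {
        s.hash(&mut h);
    }
    h.finish()
}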

Everything works as a module/library

  • Main is just a function that gets passed os.args, stdin, stdout, etc.
  • Have an 'Environment' struct/context that has all that in it. Could make an implicit 'global' one but allow it to be overridden in functions (ie remap printf to fprintf). Does that affect performance, and how to deal with compiled libraries? Don't want extra bytes getting passed to every function, or an extra redirect on things like print statements... Look at how printing happens in ASM... Maybe printf just remaps to fprintf(stdio...)
    • stdio
    • stderr
    • ENV variables
    • filesystem
    • Operating system functions...
  • Could sandbox things by just removing things like the os.filesystem (unless you can somehow manually do a function pointer to it).

Do you need static libs if you have dynamic libs?

Can't you just produce dynamic libraries and use them as static ones if desired at compile time?

Everything as an interface

  • For example, file access.
    • No global fopen("filename");
    • Instead open(os::filesystem, "filename"). Although an os.open wrapper could be used for the lazy, maybe it should be avoided, since its use should be discouraged, especially in libraries. You don't want to use a library that forces a config file to be stored in a specific location when you want to use a database as a backend store for configuration files.
    • In many ways a file over a network connection is the same as a file on a harddisk.
      • An HDD can die, become full or be removed. A network cable can be unplugged (plus the files on the other end are going to be stored on a hard drive anyway, which has the same problems).
      • Differences?
        • Metadata stuff.
          • Renaming files: you can't normally rename over HTTP. Renaming doesn't affect the actual file; it affects the filesystem's index.
          • Linking files: you can't link an HTTP file locally (well, not without the OS doing it, or a virtual filesystem library, but that's out of scope for a programming language). Once again not about files, but about the underlying filesystem.
          • Timestamps, ditto.
          • Synchronisation and atomic operations. A file stored on a disk can be fsynced so you know it's stored. An atomic operation can allow a file to be replaced in place. A database, on the other hand, might be holding the file in RAM.
          • Files are lockable. Prevent multiple processes trying to write to the same file at the same time.
  • println("This is bad as it's pain to override and isn't testable...");
    • fprintln(stdout, "This is better but more typing");
    • Monkey patching can work, but it deals with globals, which adds threading issues... (See the sketch below.)
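
The fprintln idea as it looks in today's Rust: take the output stream as a parameter. Production code passes io::stdout(); a test passes a Vec<u8> and inspects it. The greet function is invented for the example.

use std::io::{self, Write};

fn greet(out: &mut impl Write, name: &str) -> io::Result<()> {
    writeln!(out, "hello, {name}")
}

#[test]
fn output_is_inspectable() {
    let mut buf = Vec::new(); // Vec<u8> implements Write
    greet(&mut buf, "world").unwrap();
    assert_eq!(buf, b"hello, world\n");
}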

Stop allowing stuff in 'empty space'

  • Syntactically quarantine globals/singletons/statics (if they are allowed). A better idea might be to make them a 'context', but syntactically hide it (Same idea as main's environment context).
globals {
   int hello = 0;
}
  • You do still need things like function calls and so on.
  • Don't syntactically need to specify the module namespace if it's determined by the filename/location.

Generics

  • Generics seem to really just be code generation.
  • How do generics in dynamic libraries work?
  • Maybe allow the program to 'hook' into the compile steps. For example when a request for a struct with a specific identifier doesn't find it... Could allow for many other things (Rust has build.rs, Go has generate).
  • Compiler plugins.
  • Maybe make a very simple 'core' compiler and then the full compiler defines itself using the same kind of hooks.
  • Maybe the hooks could allow a program to define its own keywords and so on (not sure if keywords are a thing in a 'binary' language; see the sketch after this list).
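
A hypothetical sketch of such a hook; the API is entirely invented. When the compiler fails to resolve a symbol, a plugin may generate it, which is one way generics, build.rs-style codegen and 'go generate' could unify.

trait CompileHook {
    // Return generated source for `missing`, or None to decline.
    fn on_unresolved(&self, missing: &str) -> Option<String>;
}

struct VecOfT; // invented plugin: expands Vec_<T> requests

impl CompileHook for VecOfT {
    fn on_unresolved(&self, missing: &str) -> Option<String> {
        // A request for "Vec_i32" instantiates the template with T = i32.
        let elem = missing.strip_prefix("Vec_")?;
        Some(format!("struct {missing} {{ ptr: *mut {elem}, len: usize }}"))
    }
}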

File Structure (Ideas)

  • Make 'flat'.
  • List of all symbol names.
  • Changing the name should result in all uses of the name being updated.
  • Use hash for id or index number?
    • A name change with a hash would require all call sites to be updated as well.
    • Index ID would be easier to update.
    • An Index ID would allow name changes without breaking the API (sketched after this list).
    • A hash might be needed for cross API compatibility.
    • An ID would be quicker to look up since it's just an index in an array. A hash would require a hashmap.
    • What about a UUID? It would be unique so cross library boundaries would be ok. No changes on rename. But slowish to look up.
    • Index IDs could be put into namespaces, which would keep cross-API references working.
    • UUIDs could remove the need for namespaces.
  • How to remap from symbol id to uses...
    • Keep a mapping of all the sites that use the symbol?
  • Could have a local identifier number that maps to a global UUID. Could also combine the UUID + a specific version, which would let you use different versions of the same function in the program. Could also save space (no 128-bit number for every symbol, just a smaller one, but that might make merging problematic).
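
A sketch of the index-ID option, with invented names: call sites store a small integer, so a rename touches one table entry and no use sites.

struct SymbolTable {
    names: Vec<String>, // a symbol's local ID is its index in this Vec
}

impl SymbolTable {
    fn name_of(&self, id: usize) -> &str {
        &self.names[id] // plain array indexing; a hash ID would need a map
    }
    fn rename(&mut self, id: usize, new_name: &str) {
        // Call sites carry `id`, not the text, so none of them change.
        self.names[id] = new_name.to_string();
    }
}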

Binary File Design - Ideas for Good Practices

  • Magic Number - Make it easy to identify files. Might as well use human readable ASCII...
  • Versioning - Make it future proof.
  • Backwards compatibility when possible
    • Might be kind of hard.
    • With a fixed structure, adding a new element would break things.
    • Would need to specify each element and give each an ID.
    • Could encode size of the 'block' and make sure new elements are added in order. Older consumers can skip the extra bits.
    • Older consumers can leave the extra bits in place when saving to prevent loss of information.
      • Blocks left in place without understanding could break things (ie I rename something in an old version that is mentioned in a new section but don't update its entry).
      • Could maybe be avoided by making sure the first version, ie the 'core', is ready for new things. For example, if rename is a possible issue, make sure there is a structure in the first version that the new block uses, rather than the new block having to be updated.
    • Maybe embed a 'minimum' version number as well. Ie this new feature breaks compatibility. Maybe a read only minimum version too.
    • Worst case scenario: could put scripts into the files themselves and specify that they are run at appropriate times (ie on open, on save). That could massively increase binary size since you might need to implement the entire parser in the script (although this is only on a breaking change). Alternatively allow the old parser to be used by the script. Also possible security/abuse/prank issues. Probably better to just break compatibility.
    • Maybe the parser could have an updateable plugin system. If you have a new incompatible file, it could tell you to download the new version, which is a dynamic library/script put in the correct place but external to the file. Still seems like a lot of work.
  • Try and use a flat structure.
  • 'Pointers' are a pain. Only an issue for serialisation of stuff that uses pointers. Just don't use pointers; instead have some index. Having said that, pointers would be somewhat faster at runtime. Of course they would need to be serialised to an index anyway, so that's really an implementation optimisation.
  • Specify a fixed endianness (https://en.wikipedia.org/wiki/Endianness). Both RISC-V and x86 use little. That's what I like anyway.
  • Have a nice specification.
  • For an added bonus make it parsable (not sure of many decent binary file schemes, maybe capnproto?).
  • UTF8 for strings.
  • Would a 'capabilities' block be of any use (Wouldn't this be stored in the parser based on the version number)?
  • Don't store date/time. Breaks hashing.
  • Hash/checksum (at the end?). Obviously the hash/checksum shouldn't itself be hash/checksummed. Alternatively it could be stored as part of the normal file format but zeroed during the hashing process (see the sketch after this list).
  • Store expected file size? Is there any point? Might provide a quick way to find if the file is half missing. If the 2nd half is missing the checksum would be too...
  • Have a look at NBT.txt (Minecraft)
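
A sketch of the zeroed-hash rule, assuming the third-party sha3 crate; the 16-byte offset is invented for illustration, and 64 bytes is the SHA3-512 output size used in the file properties below.

use sha3::{Digest, Sha3_512};

const HASH_OFFSET: usize = 16; // invented: wherever the hash field lives
const HASH_LEN: usize = 64;    // SHA3-512 output size

fn file_hash(file: &[u8]) -> [u8; HASH_LEN] {
    let mut copy = file.to_vec();
    // The stored hash must not feed into its own computation.
    copy[HASH_OFFSET..HASH_OFFSET + HASH_LEN].fill(0);
    let mut out = [0u8; HASH_LEN];
    out.copy_from_slice(&Sha3_512::digest(&copy));
    out
}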

File Structure

  • File signature.
  • Version.
[4244 4e49 535f 4352] <- File Signature (https://en.wikipedia.org/wiki/List_of_file_signatures). "DBIN_SRC" (the hex is shown as little-endian 16-bit hexdump words; the raw bytes are the ASCII string "DBIN_SRC").
[9ff0 84a6 000a] <- These bytes should be ignored.
[0x00 0x00] <- Binary file format versioning... Is 2 bytes good? Should it be in human readable ASCII? This hopefully shouldn't change. If these are anything higher than 0x00 the parser should check the next number to see if the new version is backwards compatible.
<strike>[XXXX] <- A 2 byte offset. The parser should advance by this amount. This is to allow for an extra emergency block of information at the top level. Normally it will be 0x0000.
[number of bytes specified above] - This will contain any emergency data that is needed. It should not be used for adding any custom user data. A future version parser might need to read information from it (is this needed? probably should just be in a generic 'extra stuff' category).</strike>
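
A sketch of a header check matching the layout above (8-byte signature, 6 ignored bytes, 2-byte version); the error strings are invented.

fn parse_header(bytes: &[u8]) -> Result<u16, String> {
    if bytes.len() < 16 {
        return Err("file truncated".into());
    }
    if &bytes[0..8] != b"DBIN_SRC" {
        return Err("bad file signature".into());
    }
    // bytes[8..14] are the ignored bytes.
    // Fixed little-endian, per the design notes above.
    let version = u16::from_le_bytes([bytes[14], bytes[15]]);
    Ok(version) // caller compares this against the versions it supports
}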

Types

  • 0x01 - List
[0x01][u64: number of elements]([element])... - num=Number of elements, followed by elements. TODO: Should lists include the size in bytes? What about the 'type' they contain? Mixed types?
  • NoEnum - Version. Binary file version (2 bytes). Not the source code version.
  • 0x02 - Name
[0x02][u8]["UTF8_Name_Goes_Here"] - u8 refers to an unsigned 8-bit value.

Root List Enums

  • 0x01 - File Properties List
  • 0x0X - Defined symbol names.

Binary File Properties Types

  • 0x00 - sha3 File hash. This must be zeroed when actually calculating the hash.
[0x00][64 bytes]
  • 0x01 - Minimum binary file format read/write version. If the parser wasn't coded against this version or greater, it should bail out. For the first version, if these 2 bytes are anything other than 0x00, you should error with some kind of "Unsupported version" message. If this is missing, assume the minimum version is the same as the one in the file header.
[0x01][version]
  • 0x02 - Minimum parser file format read-only version. If the parser wasn't coded against this version, it can still read the file, but it should prevent saving/re-serialisation of the file format as that could result in data loss. It should either error out or prompt the user. It's also acceptable to just error out and not bother to implement a read-only mode.
[0x02][version]

Compiler is a Daemon

  • Watch the filesystem for changes and start to parse/compile, with cancellation (see the sketch below).
  • Continuously compile the file in memory while it's still being edited, if supported.
  • Cache all parts of the compile pipeline. No need to re-parse a file that hasn't been edited. No need to recompile a function that hasn't changed.
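
A sketch of the watch loop, assuming the third-party notify crate; the recompile step is a placeholder, and a real daemon would debounce events and cancel in-flight compiles.

use notify::{recommended_watcher, RecursiveMode, Watcher};
use std::path::Path;

fn main() -> notify::Result<()> {
    let mut watcher = recommended_watcher(|res: notify::Result<notify::Event>| {
        if let Ok(event) = res {
            // Placeholder: invalidate caches for event.paths and recompile
            // only the functions whose source actually changed.
            println!("changed: {:?}", event.paths);
        }
    })?;
    watcher.watch(Path::new("src"), RecursiveMode::Recursive)?;
    std::thread::park(); // stay alive as a daemon
    Ok(())
}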

Implicit unit testing

  • Quickcheck-style unit tests. Results stored to inform you if the results change. Alerts you to common corner-case errors.

Solve the halting state problem 🤑

  • Obviously not really.
  • Look into termination analysis.
  • Look into dependent types, etc...
  • Best effort. 'Halts', 'Doesn't halt', 'halts when input is X', 'Unknown'...
  • Look into subsets of programs that **can** be checked.
    • When there are no loops/recursion.
    • When loops are fixed at compile time.
    • When there is an obvious infinite loop `for(;;){}`.
    • Self modifying programs.
    • A lot is dependent on external inputs.
    • Limitable loops.
      • Injectable via a context.
      • A for loop that can't run longer than 10 seconds. "This function will halt, because I will halt it if it doesn't" (sketched after this list).
      • A loop that can't do more than X iterations (X could be the size of an array, that size would be limited by the size type).
      • Could be a source of bugs though...
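
A sketch of a limitable loop: a wall-clock budget plus the natural iteration cap of the slice, the kind of limit a context could inject. The names are invented.

use std::time::{Duration, Instant};

fn bounded_find(xs: &[u64], target: u64, budget: Duration) -> Option<usize> {
    let start = Instant::now();
    // Iterations are capped by the slice length; the clock supplies the
    // "this will halt, because I will halt it" guarantee.
    for (i, &x) in xs.iter().enumerate() {
        if start.elapsed() > budget {
            return None; // out of time: exactly the bug source noted above
        }
        if x == target {
            return Some(i);
        }
    }
    None
}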