Skip navigation

For whatever reason, I decided to start working on a Regex Parser. Though, I did something funny… for testing I allowed for statically compiled regexes to be defined. This caused some rather crazy code:

Regex_Hack::Regex reg( "Integer", //Value, used for name
                       Regex::ONE, //times to find in a row...
                       Regex::ROOT, //root node, basically says to ignore the previous two arguments
                       //Following == [1-9][0-9]*
                       Regex_Hack::Group(Regex_Hack::Regex::ONE, //repeat finding of group once only
                                         Regex_Hack::Range('1','9') //range of numbers 1-9 repeated once ([1-9])
                                         << //craziness to solve variadic argument problem... basically rigged an ostream
                                         Regex_Hack::Digit(Regex_Hack::Regex::ZERO_OR_MORE))); //range 0-9 repeated zero+ ([0-9]*)

So… I didn’t get it finished last night… and probably wont have it done tonight either. Well… I finished it last night, but there are too many inefficiencies for it to do any good even as test code. Also the code is REALLY messy. Actually there’s a total of 10 functions I’ll have to re-write as I’d prefer to release code using the STL instead of my private library. So far I’ve also noticed at least 5 functions in which I assumed things about the input. That needs fixing… and finally the code for deciding how many dictionary entries to have hasn’t even been written. Yet another issue is when it picks dictionary byte pairs to use it has to re-check the data for pairs that no-longer exist due to the changes… this is a bit of an issue and how I implemented it required too much memory and time.

-cQuaid

Recently, I’ve been rather distracted by reddit and haven’t put too much work into writing the compiler (well, translator).  BUT That’s different now… I got bored of it.  Now this post is actually about compression, so I’ll just skip over to that.

I was thinking about reverse code engineering and compression formats.  Now I’ve dabbled with the idea of writing a proprietary format before, but I’ve never gotten motivated enough to do so.  I actually found a use for a proprietary compression format though.  The use being my fileserver.  While I could use a known format, I’ve decided a custom compression format combined with a custom encryption should work for privacy reasons.  The encryption is already done.  That is, however, another story and not the topic.

 

So, this compression is defined as follows:

struct Header{
char magic_1 : ‘Q’; //Quaid’s
char magic_2 : ‘C’; //Compression
char magic_3 : ‘G’; //Garbage :D

unsigned char entry_size : 2; //length of each dictionary entry; default to 2. if 0 the dictionary  entries are variable length.
unsigned char dictionary_size; //length of the dictionary… I have it programatically assigned.

Dictionary_Entry* Dictionary; //array of dictionary_entries (defined bellow)
unsigned char* data; //the file data

};

 

struct Dictionary_Entry{
unsigned char index; //well, the byte representation used to denote the entry
unsigned char* entry; //data for the entry
};

The algorithm should first parse the file for any unused byte numbers (1-255). 0 is not used and cannot be part of a compression entry unless it is variable length entries, in which case, the entries are 0 terminated.
The unused numbers are placed into an array as possible indexes. Next the file data is parsed for byte pairs (or whatever the designated length is) as well as the occurrences of each byte pair. Now, there is some math involved to figure out what number of dictionary entries (under the maximum allowed) will yield the smallest compressed size. After that is done, the indexes used are assigned a byte pair. Then all occurrences of those byte pairs are replaced with the corresponding index number. And… That’s all!

It’s fairly simple. I’m sure it’s been done before, but I don’t really care. I have most of the code already written by this point. And I assume it will take a bit of tweaking. I actually am using an Array class that’s part of my personal standard library. Even with that I ended up adding four new functions to it. I expect to be done tomorrow, but I know that my predictions don’t always tend to be correct. If anything a very rudimentary implementation will be made.

-cQuaid

For the first version of the implemented language, I’ve decided for it to be rather C-like (as already stated). For this version, there is only one function, main. The initial main is laid out as follows:

int main(void){}

I’ve also thought about

void main(void){}

and instead of having a return statement use a global “exit(0),” I pretty much ruled this possibility out.

I removed the “int argc, char** argv” for now… too much trouble to mess with.
Now, the rest of the usable syntax is:
Types:
* long
* ulong -unsigned long
* llong -long long
* ullong -unsigned long long
* short
* ushort -unsigned short
* char
* uchar -unsigned char
* byte -unsigned char with some special properties
* float
* double
Other Terminals:
* ‘+’ -addition
* ‘-’ -subtraction
* ‘*’ -multiplication, pointer op also… but will through a syntax error for now
* ‘/’ -division
* ‘%’ -modulo
* ‘++’ -increment
* ‘–’ -decrement
* ‘(‘ -grouping expressions
* ‘)’
* ‘=’
Miscellaneous:
* exit(x)
* print(x)
Type Specifications:
* [0-9]+ -integer
* 0(x|X)[0-9A-Fa-f]+ -hex integer
* [0-7]+(o|O) -octal integer
* [0-1]+(b|B) -binary
* [0-9]*’.’[0-9]+(f|F) -float
* [a-fA-F_]+[0-9a-fA-F_]* -id

As you can see this is a fairly sparse lexicon for now. There are character constants and string literals (the string literals being of no use for now). If you’ll notice, the octal specification is it ends in an ‘o’ I did this because I like having the ability to place as many ’0′s in front of my base 10 integers as I wish.

-cQuaid

Well, the compiler now effectively tokenizes and groups the subset of C that I will be using as a basis for my language.  I figured it’d be better to work with something I’m used to first and evolve it into my language…

-cQuaid

Due to the hurricanes in Alabama I’ve been left without electricity for the past 5 days and as such have not written any code.  However, I did have time to lay out a definite syntax.  For a short description right not, Wunschkind is a “Portable” assembly language, much like C is considered, with enough abstraction that you can write platform independent code easily, but supply platform dependent code if you wish.  The current design implementation I am looking at is entirely retargetable in the simplest way I knew how.  XML style definitions are used for different assembly translations as well as definitions for platform syscalls.

-cQuaid (MAAAAN do hot showers feel nice when you’ve been without)

Back when I was attempting to write a pre-compiler called Corruption,  I realized what I was doing was writing a source-to-source translator.  In that, I mean I was practically writing a new language on top of C++ (not C).  Now, this is exactly what the original implementation of C++ did.  I love the idea, and the translator would be bootstrapped to an extent.  BUT, the reason I disbanded the project is the code was horrible written, and I’d have to make translations of base language features to library functions, this I did not like.  I feel that a language should have an abstraction from the Operating System to the point that functions such as write(x) and read(x) or similar would be standard language features, not something from a header that calls to an OS header (os.h).  This is when I first started to have accelerated thought toward designing my own language.  In such, without Corruption I would have no means of advancement.  Further research led me to realize C++ and even C could have been much better designed.  Just by thinking of this I came up with a partial list of features my language should have.  One was that instead of just having a Code file (ex: .c) and a Header file (ex: .h),  you’d have a Module file (extension of the language) and a Redefinition file (redefinition of internal keywords).  Those two files would be interpreted in a compiler specific way, but would have a standard attached to them as they modify how the compiler parses the language.

-cQuaid

I’ve been thinking for the last month or so about scrapping the idea of writing a code generator for now and just use llvm as the backend.  Here are the benefits and liabilities I have come up with so far:

Benefits

  1. Faster to write
  2. Support for many platforms
  3. Can focus strictly on higher level work

Liabilities

  1. Dependent on an outside library/tool
  2. Near Impossible to bootstrap
  3. Requires learning a new syntax (I’d have to do this regardless, but… still!)
  4. Reduced amount of extendability I would like to have
Now some B&L’s for writing my own backend:
Benefits
  1. Syntax can be whatever I want it to be
  2. Extensions can access code generation if they like
  3. Support for new machines just by a simply defined XML module
Liabilities
  1. Extremely hard to write
  2. Having to write in Assembler
  3. Only one platform supported to begin with, x86 Linux
Just so~ many decisions.  I want to get this out as soon as possible (though work has been leaving me with a reduced availability).
-cQuaid

I’ve noticed I get a lot of questions about my for loops from my professors.  Mainly, “why?”  I admit I do declare them in an odd manner, I just like my pointers (also, it stands for iterator, I usually use “i”, but on my assignments I use “it”):

unsigned long int *it = new unsigned long int(0);

for(*it = 0; *it < 255; ++(*it)) { /* … */ }

delete it; //deleting that god awful clown from that one Steven King novel, obviously.

-cQuaid

I re-wrote the lexer today (it’s not completely bug free ^_^).  But while doing so I realized I was using a C-style mindset using structs and just plain functions.  So then I was like “EVERYTHING SHOULD BE OBJECTS!” and proceeded to make each section of the already written code a class.  This worked out great.  I got rid of four two source files doing so.  Also, the current compiler is technically my backend.  I’m writting a pseudo-assembler language for use as a generic code generator (although I’m translating to nasm currently, I eventually plan to generate machine code myself).  The backend will be useful when I write a higher level language.  I HATE assembler, but love it’s structure.  So the backend language keeps the structure but is much more coder friendly.  Well… for me at least.

-cQuaid

Follow

Get every new post delivered to your Inbox.