At work, I’ve been using Node.JS to make a bridge between a proprietary,
third-party application and our internal tools. This application has the cool
ability to receive insructions in its native language over a network socket.
Its language is however not widely known or used, and so while it would be awesome to
use its network control feature, it would be preferable to do it in a more
familiar langauge (like, say, Python).
Most of this problem is solved by using Node.JS to broker client connections to
the application, and Jison to interpret the application’s output and convert it
to javascript. From there, it’s a snap to convert the values to json, and
rehydrate them into live Python objects on the other side.
This is my first experience with Jison, and it’s so awesome, I thought I should
share my happy experiences. Jison is the
work of Zach Carter. It will probably be familiar if I mention that it’s what
compiles CoffeeScript to JavaScript. It is completely awesome, and if I ever
meet Zach, I’m going to buy him a pizza.
I’m just going to focus on the part of translating the application’s expression
language into JavaScript. There’s a lot more to the overall problem that I
hope to be able to get into later.
And rather than continue calling the application “the application,” I’m
henceforth going to call it Steve.
The Goal
We want to take the Steve’s output and convert it to
javascript. For example, here’s how Steve gives me a list of floats:
float[] => float[] {1.0, 1.0, 2.0, 3.0, 5.0}
It’s saying, “Your function, calling for a list of floats, returned a list of
floats, namely 1.0, 1.0, 2.0, 3.0, and 5.0.”
I just want to convert that to the following JavaScript list:
I don’t care about the type declarations in this case. I’m going to throw them
away, because JavaScript figures out types dynamically.
The Process
To do this, I want to figure out the grammar that Steve uses. I don’t have
access to the actual grammar rules of Steve’s language, so I’m going to have to
guess.
I’m going to poke at Steve and get as many different responses as I can. Each
time I find a response my program can’t interpret, I’m going to make a unit
test case for it. Once all my tests pass, and I can’t find any other types of
response Steve might give, I win.
Parsing with Jison
Jison is named after Bison, which is part of the GNU toolset (along with Flex),
which are named after the Unix tools Lex and Yacc. Lex is for lexical
analysis, i.e., figuring out where words begin and end, and how to break up the
pieces of a sentence into their minimal components. Yacc, for Yet Another C
Compiler, lets you declare a grammar, i.e., the rules for interpreting the
meaning of a stream of tokens that Lex produced, and convert that grammar into
a program in your language of choice. My language of choice is not actually C,
but JavaScript, hence the use of Jison.
So like its Lex and Yacc antecendents, Jison lets you declare the lexer and the
grammar.
Why not just dive in with a really simple example. The following lets you
parse numerals into an integer object. That is, we can take the string “42”
and turn it into 42. Not very exciting, but it exposes many of the important
things we have to deal with.
Here’s a Jison file to do this:
The sections of the file are articulated by the characters %%. A single %
begins a special keyword. The lexer section ends with /lex.
In the %lex section, we have some rules that explain how the program should
break up the text it finds. These rules look like regular expression matches
on the left. On the right is what to do when any of the rules matches.
Here, any number of digits comprise an INT. The - sign is the “–” sign.
The special <> means end-of-file, and is converted to a token called
EOF. I get to make these tokens up. I could just as well have called it
INIGO_MONTOYA. Finally, any remaining characters, the . expression, are
interpreted as TOTAL_DISASTER, which I don’t handle in any rules, and which
will therefore crash the parser. In other words, no other characters should
match than the ones I had previously specified.
Then there’s a precedence rule, %left UMINUS. If you were writing a
calculator or something, you might have a bunch of precendence rules, so I
wanted to include one. This says that the token I’m calling UMINUS has
left-precedence. More on this below.
Then we have the grammar section. I happened to say %start grammar, but you
could say %start winning or anything you want. Whatever you %start, that’s
where the parser will begin. So here it looks at my grammar definition
first.
These rules have the form name : pattern | pattern | pattern ... ;. Each
pattern has an action associated with it. The parser will try to match the
first pattern it can. So here, a successful grammar is one that matches
either an integer followed by EOF, or jut EOF. What about integer?
Where’d that come from? I made it up. I get to do that. I define it just
below. You can probably guess what the return $integer and return null do.
Ok, integer. An integer is defined as either an INT token (as defined in
the lexical analysis section, i.e., one or more numerals, [0-9]+), or a -
sign followed by an integer with left-precendence. The super powerful thing
here is that integer can refer to itself recursively.
What about the actions on the right? Those are all JavaScript with some
special variables thrown in. $$ is the return value it passes along in the
parser. $1, $2, $3, etc., are the first, second, third, etc., things
that match. I can also match by name where I use one, like I did in $integer
in the grammar section.
Finally, yytext is the text that the parser returned. So Number(yytext)
means “Make a JavaScript number out of the text you found.”
Ok, that’s a simple parser of strings to numbers. Now how do you use it?
Using It
To compile the grammar, you save this file as something with the .jison suffix and run jison on it. If I save this as numfoo.jison, I do this:
This yields a numfoo.js module. It’s just sitting there, waiting to do your
bidding. Use it like so:
$ node
> var parse = require('./numfoo').parse
> parse("42")
42
> parse("-42")
-42
> parse("41") + 1
42
w00t! String to integer. It’s that easy. Now, if you try to parse anything
that doesn’t match the grammar, like “+42”, or something with whitespace, it
will crash. We’ll deal with whitespace below.
A Fuller Example
To begin parsing the output from Steve, I want to do a few more things.
- Ignore whitespace
- Deal with floats, ints, strings, etc.
- Deal with lists and extensible structures
I’m also going to ignore half of Steve’s output. Like I said, I don’t care
about type declarations. (Well, not in the case of strings and ints. If Steve
were returning other objects, I’d have to care and do something with them to
convert them to some kind of JSON-stringifiable representation.)
Remember I said that Steve has a very particular way of expression lists of
floats:
float[] => float[] { 1.0, 4.2 }
Ditto for string[], int[], etc.
Here’s a grammar that will handle those things. (Note that this is not actually
the grammar I use to talk to Steve – it’s not even part of it.)
Here are a few important points by line:
6. Ignore whitespace. Find one or more whitespace characters and do nothing.
9. To find a quoted string, we discard the quotes and return the contents.
Hence the use of yylen and yytext.substr. Also, as we have a compound
command, squiggly brackets are needed.
22. I can put JS like console.log in there for debugging, yes.
32. In my simple type declaration, I can have either “list of type” or “type”
itself. I need to match the type declarations for the parser, but I ignore the
values. Hence there are no functions after the rules.
37. JS doesn’t really distinguish between int and float.
43. It’s useful to have a primitive type definition. I’ll use this to
construct lists of things below.
44. This is cool. A list of primitives is one or more primitive separated
by a ,. Note that the solitary primitive is returned as the single element
in a list. A list of primitives is defined by pushing the last primitive onto
a list of primitives. Recursively, this will result in making a list of a
single primitive, and then pushing each successive one onto it. Useful
pattern.
46. This is Steve-speak. All I do is return the result of primitives, which
will be my comma-separated list of primitives.
47. Finally, the expression itself. It’s a list or a primitive.
Note that my grammar does nothing to test whether Steve’s output is valid
Steve-speak. For instance, if Steve said string[] => int[] { 3.14 }, my
grammar would not complain. Do I care? In this case, no. I trust Steve to
say the right thing.
Test Cases
For a hacky, reverse-engineering job like this, having test cases is
invaluable. As I go, I accumulate a test case for each result, because as my
grammar grows in complexity, I need to be sure that everything still works.
You could have a module like this for use with nodeunit:
var testCase = require('nodeunit').testCase;
var parse = require('your-parser').parse;
module.exports = testCase({
testFloatList: function(test) {
result = parse("float[] => float[] {1.2, 2.3, 3.4}");
test.ok(result[1] === 2.3);
test.done();
},
});
And so on.
Translating to Other Languages
Now that we have live JavaScript from Steve-speak, we can JSON.stringify our
results and ship them off to our clients. A Python client can, say,
simplejson.loads the result and think that Steve is just speaking Python.
Awesome.
Further Reading
I hope this is useful. Jison has turned out to be a huge help to me. Many thanks to
Zach Carter for this fantastic tool.