Saturday, April 4, 2009

Fast, validating UTF8-aware JSON parser in 53 lines of code

Lex(1) and yacc(1) both suffer from the curse of abstraction--too much complexity! Like graphviz, cscope, and vi, these are not tools that you can use once and then pick up again quickly a month later. However, sometimes this complexity is worth it, as this writeup hopefully demonstrates.

This log entry gives a simple example of how to use lex and yacc to parse and validate a text message formatted using JSON.

Initially, to solve this problem I explored a couple of JSON libraries available in C. But as I read through the code, I found it hard to understand--one library used reference counting to handle memory and the other had a ton of code dealing with Unicode. This message parsing will run inside a daemon for which stability and performance are paramount, so I was leery of integrating memory-handling code I didn't grok, or a bunch of code I didn't need.

So I decided to take a crack with lex and yacc. After all, it's a tight fit--the messages are text with a fixed grammar. Lex generates a lexical analyzer (a tokenizer), and yacc generates a parser from a grammar definition.

Sample Message

{ code:"abc123",
  children: [
    { name:"asb", weight:50 },
    { name:"def", weight:25 }]}

A first lex file: print out the tokens

The lexer defines what we care about in the message. For example, I want to know when

  • an object starts (an "{", or "open brace"),
  • an object ends (a "close brace"),
  • we hit a key in a key/value pair (e.g., "code:" is the first key in the message above),
  • an array starts (an "["),
  • etc., etc.

These definitions are created using regular expressions.

%{
#include <stdio.h>
%}

%%
"{"                 printf("OBRACE ");
"}"                 printf("CBRACE ");
"["                 printf("OBRACKET ");
"]"                 printf("CBRACKET ");
\"[^"]*\"           printf("TEXTSTRING='%s' ", yytext);
[0-9]+"."[0-9]*     printf("FLOAT ");
[0-9]+              printf("INT ");
[a-zA-Z]+" "*:      printf("KEY ");
.|\n                /* Ignore all other characters. */

Matching notes:

  • if more than one regular expression matches, lex takes the one that matched the longest text string.
  • if more than one regular expression matches the longest text string, lex takes the first regular expression defined in the file.
  • the TEXTSTRING match assumes there are no internal double quotes. In production, I'll use a more complex regular expression that handles escaping and still lets UTF8 characters through.
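Such a production rule might look like the following--a sketch only: it accepts any backslash escape, which is looser than JSON's escape rules, and it passes UTF8 bytes through because a multibyte sequence never contains a double quote, backslash, or newline byte:

```
\"([^"\\\n]|\\.)*\"     printf("TEXTSTRING='%s' ", yytext);
```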

Compile

$ lex pub.l && cc lex.yy.c -o pub -ll

and run. Note that when you run this executable, it sits there waiting for you to type in text. The text you type in is what gets processed. In the log below, the lowercase text is what I typed in and the uppercase words are what the lexer output. You press ^D to end the standard input (or ^C, which aborts the lexer).

$ ./pub
{code:"abc"}
OBRACE KEY TEXTSTRING='"abc"' CBRACE ^D
$

So, now we have a lexer that identifies the tokens in a message.

Lex works fine with UTF8 characters

$ cat t.c
#include <stdio.h>

int
main(void)
{
        printf("{code:\"");

        /* an "o" with an umlaut, encoded as UTF8. */

        putchar(0xc3);
        putchar(0xb6);

        printf("\"}");
        return 0;
}
$ cc t.c -o t
$ ./t
{code:"ö"} $ ./t | ./pub
OBRACE KEY TEXTSTRING='"ö"' CBRACE $

So as long as our parser's contract is that the input and output are encoded as UTF8, there is no Unicode-specific code to write. Lex FTW!

Use yacc to enforce a structure

Briefly,

  • instead of outputting token strings like "OBRACE", we define symbolic constants in yacc, and have lex return them when it recognizes a pattern
  • we tell yacc to exit on any syntax error
  • we define what a valid token sequence is in a message

The lex source is very similar--we add one include and change the rules so they return a symbolic constant instead of printing the token name to stdout:

%{
#include <stdio.h>

#include "y.tab.h"
%}

%%
"{"                 return(OBRACE);
"}"                 return(CBRACE);
"["                 return(OBRACKET);
"]"                 return(CBRACKET);
\"[^"]*\"           return(TEXTSTRING);
[0-9]+"."[0-9]*     return(FLOAT);
[0-9]+              return(INT);
[a-zA-Z]+" "*:      return(KEY);
.|\n                /* Ignore all other characters. */

The interesting part of the yacc source is towards the end, where the valid message structure is defined in terms of the tokens:

%{
#include <err.h>
#include <stdio.h>
#include <string.h>

void
yyerror(const char *msg)
{
        errx(1, "error: %s", msg);
}

int
yywrap(void)
{
        return 1;
}

int
main(void)
{
        yyparse();
        return 0;
}
%}

%token OBRACE CBRACE OBRACKET CBRACKET TEXTSTRING FLOAT INT KEY
%%
pub      : OBRACE KEY TEXTSTRING KEY OBRACKET children CBRACKET CBRACE
         ;

children : /* empty */
         | children child
         ;

child    : OBRACE KEY TEXTSTRING KEY INT CBRACE
         ;
%%
#include "lex.yy.c"

When we now compile and run the executable, we get an error if the message has invalid syntax.

$ lex pub.l
$ yacc -dv pub.y
$ cc y.tab.c -o pub -ll -ly
$ ./pub
{abc}
pub: error: syntax error
$

But a valid message is accepted without complaint:

$ ./pub
{ code: "abc", children: [] }
^D
$
$ ./pub
{ code: "abc", children: [ { name:"a", type:1}]}
{ code: "abc", children: [ { name:"a", type:1}, {name:"b", type:2}]}

Conclusion

With 53 lines of code, we have built a validating parser that is UTF8 capable and is as fast as they come.

However, the documentation took far more text than the code! If you are like me, you'll stick these steps in a Makefile so months from now you can come back to a working example and get up to speed quickly.
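A minimal Makefile for these steps (assuming the files are named pub.l and pub.y as above; recipe lines must start with a tab). Note that only y.tab.c is compiled, because pub.y #includes lex.yy.c:

```
pub: pub.y pub.l
	lex pub.l
	yacc -dv pub.y
	cc y.tab.c -o pub -ll -ly

clean:
	rm -f pub lex.yy.c y.tab.c y.tab.h y.output
```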

Next steps are to have lex and yacc process a string and store the results in a structure, instead of processing stdin and writing to stdout.
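A sketch of the first half of that step, using flex's yy_scan_string() to lex from a memory buffer instead of stdin. This fragment relies on symbols the generated lexer defines, so it would go in pub.y after the #include "lex.yy.c" line; yyerror() would also need to stop calling errx() so that yyparse() can return its failure code to the caller:

```
int
parse_message(const char *msg)
{
        YY_BUFFER_STATE buf;
        int rc;

        buf = yy_scan_string(msg);   /* lex from msg instead of stdin */
        rc = yyparse();              /* 0 on success, nonzero on error */
        yy_delete_buffer(buf);
        return rc;
}
```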

For the right problem, lex and yacc are great tools!

References

  1. Lex and YACC primer/HOWTO
  2. The Roles of Lex and YACC
  3. OpenBSD Reference Manual, FLEX(1)
  4. OpenBSD Reference Manual, YACC(1)
  5. Adding utf-8 Encoding to Lex, Eric Prud'Hommeaux: regular expressions for byte ranges
