Friday, April 17, 2009

Debugging Text Encoding Issues

If you tell software to expect UTF-8, and then feed it Unicode, it will be unhappy; for example, Django will complain with the message:
Could not decode to UTF-8 column
In these cases, I have found two things very handy: To see what bytes are in your text, capture the problematic text to a file. Then examine it with od:
$ od -c -tx1 q.out
A  l  l     M  e  n  '  s     P  r  o  t  ?  g
41 6c 6c 20 4d 65 6e 27 73 20 50 72 6f 74 e9 67
?     o  n     S  a  l  e     4  /  1  2  -  4
e9 20 6f 6e 20 53 61 6c 65 20 34 2f 31 32 2d 34
/  1  8  \n            
2f 31 38 0a 
$
Note: od output is modified a bit so it will fit. The last letter in first line is a "g"; if you can't see it, you are missing something. Try reducing your browser font.

The problematic byte is in the first line; the ASCII version is a "?" and the hex is "e9". From this we know it's not UTF-8, as there is one byte with a hex value greater than 127 and a following byte with a value less than 128. Then, there's a second problem byte, with the same hex value, two bytes later.

So the entire problem-causing sequence is : P r o t <e9> g <e9>.

If you open the Unicode/UTF-8 Chart and scroll down to the Unicode code point U+00E9, you'll see that character is a "LATIN SMALL LETTER E WITH ACUTE", which makes our mystery word: Protégé

So in this case, the text feed is encoded as ISO8859-1, and not as UTF-8.

Converting ISO 8859-1 to UTF-8

Sunday, April 5, 2009

Will a slow client bog down a single-threaded server?

Single-threaded code is attractive; it is shorter and much less complex. If a single-threaded FastCGI daemon is really fast, what happens if there is a really slow connection; for example, a web browser on a slow modem? Do other clients have to wait? I expect the answer is no, but wanted to test to be sure.

Background

Recently, while debugging an issue with a multi-threaded daemon and nanosleep(2), I was surprised to find that the single-threaded version of the daemon ran four times as fast (in wall clock time) as the multi-threaded version. (The system and user times were the same.) I was surprised because there was very little locking--only the call to FCGX_Accept was synchronized and the protected section of code was really really fast--no file reads, no database calls--just memory access.

Initial Benchmark

  1. Created a FastCGI daemon that responds with a big web page (94kB).
  2. Ran http_load with ten clients in parallel, making requests as fast as they could for 60 seconds.
This gave me an initial benchmark:
  • 15,473 total fetches
  • 258 fetches/second
  • 2.5 x 10^7 bytes/second
  • 0.23 mean msecs/connect
  • 37.3 mean msecs/first response
  • 15,473 HTTP 200 responses

Benchmark With Slow Client

The second test ran the same http_load call, but this time after starting a second http_load call in another console that throttled the client to receive approximately 1,000 bytes/second. (This is much slower than a modem, which can generally at least to 33,000 bytes/second.) Note that the http_load call to do this is not documented in the manual page; it is:
$ http_load -Throttle 10000 -parallel 5 -seconds 60 urls
Results:
  • 13,757 fetches
  • 229 fetches/second
  • 2.2 x 10^7 bytes/second
  • 0.23 mean msecs/connect
  • 41.8 mean msecs/first-response
  • 13,757 HTTP 200 responses
Here's the result from the throttled client:
  • 1 fetch
  • 0.01 fetches/second (ran for 90 seconds)
  • 1,079 bytes/second
  • 0.61 mean msecs/connect
  • 4.2 mean msecs/first-response
  • 1 HTTP 200 response
It was throttled so much that it could only complete one request during the entire 90 seconds.

Conclusion

After writing this up, I think the test is in fact worthless! Here's why--watching netstat, I can see the Recv-Q of the socket is filled when running the throttled http_load. I think this means the data has been transmitted over the network (well, between sockets--the test was run on a single machine), and it is queued somewhere in the operating system. This is not an accurate test of a slow network. Drat, back to the drawing board.

Addendum

After thinking about this a bit, I took a different approach:
  • for throttled client, create a modified http_get utility that uses a socket with a minimal receive buffer (512bytes) and sleep one second between each read call
  • run full-speed http_load at same time.
This seems like a decent simulation of a (really!) slow client connection. Watching netstat shows one socket with tens of thousands of bytes in its send queue (the web server) and another with 192 bytes in its receive queue (the throttled client). And the full-speed http_load ran just fine, processing 13,000 fetches in 60 seconds and transferring 2 x 10^7 bytes/second. So as long as the socket send buffer size on the server is big enough to hold the entire response, for sure there is no problem with using a single-threaded server, even for pathologically slow client connections.

Addendum 2

In fact, I suspect with Lighttpd, even if the SO_SNDBUF is not large enough to hold the entire response, it will still fly.

Lightty uses the select loop and multiple send sockets; if a send socket can't hold the entire response, and is backed up in sending the data it does have to the client socket, I would expect the select loop will simply continue to serve the other sockets that are ready to send more data and which are connected to faster clients.

Saturday, April 4, 2009

Fast, validating UTF8-aware JSON parser in 53 lines of code

Lex(1) and yacc(1) both suffer from the curse of abtraction--too much complexity! Like graphviz, cscope, and vi these are not tools that you can use once and then pick up quickly a month later. However, sometimes this complexity is worth it, as this writeup hopefully demonstrates.

This log entry gives a simple example of how to use lex and yacc to parse and validate a text message formatted using JSON.

Initially, to solve this problem I explored a couple json libraries available in C. But as I read through the code, I found it hard to understand--one used reference counting to handle memory and the other had a ton of code dealing with unicode. This message parsing will be inside a daemon for which stability and performance are paramount, so I was leery to integrate memory handling code I didn't grok or a bunch of code I didn't need.

So I decided to take a crack with lex and yacc. After all, it's a tight fit--the messages are text with a fixed grammer. Lex is a parser and yacc provides a way to define a grammer.

Sample Message

{ code:"abc123",
  children: [
    { name:"asb", weight:50 },
    { name:"def", weight:25 }]}

A first lex file: print out the tokens

The lexer defines what we care about in the message. For example, I want to know when

  • an object starts (an "{", or "open brace"),
  • an object ends (a "close brace"),
  • we hit a key in a key/value pair (e.g., "code:" is first key in message above),
  • when an array starts (an "[")
  • etc., etc.

These definitions are created using regular expressions.

%{
#include <stdio.h>
%}

%%
"{"                 printf("OBRACE ");
"}"                 printf("CBRACE ");
"["                 printf("OBRACKET ");
"]"                 printf("CBRACKET ");
\"[^"]*\"           printf("TEXTSTRING='%s' ", yytext);
[0-9]+"."[0-9]*     printf("FLOAT ");
[0-9]+              printf("INT ");
[a-zA-Z]+" "*:      printf("KEY ");
.|\n                /* Ignore all other characters. */

Matching notes:

  • if more than one regular expression matches, lex takes the one that matched the longest text string.
  • if more than one regular expression matches the longest text string, lex takes the first regular expression defined in the file.
  • the TEXTSTRING match assumes there are no internal double quotes. In production, I'll use a more complex regular expression that handles escaping and still lets UTF8 characters through.

Compile

$ lex pub.l && cc lex.yy.c -o pub -ll

and run. Note that when you run this executable, it sits their waiting for you to type in text. The text you type in is what gets processed. In the log below, the lowercase text is what I typed in and the uppercase words are what the lexer output. You press ^D to end the standard input (or ^C, which aborts the lexer).

$ ./pub
{code:"abc"}
OBRACE KEY TEXTSTRING='"abc"' CBRACE ^D
$

So, now we have a lexer that identifies the tokens in a message.

Lex works fine with UTF8 characters

$ cat t.c
#include <stdio.h>

int
main(void)
{
        printf("{code:\"");

        /* an "o" with an umlaut, encoded as UTF8. */

        putchar(0xc3);
        putchar(0xb6);

        printf("\"}");
        return 0;
}
$ cc t.c -o t
$ ./t
{code:"ö"} $ ./t | ./pub
OBRACE KEY TEXTSTRING='"ö"' CBRACE $

So as long as our parser's contract is that the input and output are encoded as UTF8, there is no unicode-specific code to write. Lex FTW!

Use yacc to enforce a structure

Briefly,

  • instead of outputing token strings like "OBRACE", we define symbolic constants in yacc, and have lex return them when it recognizes a pattern
  • we tell yacc to exit on any syntax error
  • we define what a valid token sequence is in a message

The lex source is very similar--we have added one include and change the rules so they return a symbolic constant instead of printing the token name to stdout:

%{
#include <stdio.h>

#include "y.tab.h"
%}

%%
"{"                 return(OBRACE);
"}"                 return(CBRACE);
"["                 return(OBRACKET);
"]"                 return(CBRACKET);
\"[^"]*\"           return(TEXTSTRING);
[0-9]+"."[0-9]*     return(FLOAT);
[0-9]+              return(INT);
[a-zA-Z]+" "*:      return(KEY);
.|\n                /* Ignore all other characters. */

The interesting part of the yacc source is towards the end, where the valid message structure is defined in terms of the tokens:

%{
#include <err.h>
#include <stdio.h>
#include <string.h>

void
yyerror(const char *msg)
{
        errx(1, "error: %s", msg);
}

int
yywrap(void)
{
        return 1;
}

int
main(void)
{
        yyparse();
        return 0;
}
%}

%token OBRACE CBRACE OBRACKET CBRACKET TEXTSTRING FLOAT INT KEY
%%
pub      : OBRACE KEY TEXTSTRING KEY OBRACKET children CBRACKET CBRACE
         ;

children : /* empty */
         | children child
         ;

child    : OBRACE KEY TEXTSTRING KEY INT CBRACE
         ;
%%
#include "lex.yy.c"

When we now compile and run the executable, we get an error if the message has invalid syntax.

$ lex pub.l
$ yacc -dv pub.y
$ cc y.tab.c -o pub -ll -ly
$ ./pub
{abc}
pub: error: syntax error
$

But a valid message is accepted without complaint:

$ ./pub
{ code: "abc", children: [] }
^D
$
$ ./pub
{ code: "abc", children: [ { name:"a", type:1}]}
{ code: "abc", children: [ { name:"a", type:1}, {name:"b", type:2}]}

Conclusion

With 53 lines of code, we have built a validating parser that is UTF8 capable and is as fast as they come.

However, the documentation took far more text than the code! If you are like me, you'll stick these steps in a Makefile so months from now you can come back to a working example and getting up to speed will be fast.

Next steps are to have lex and yacc process a string and store the results in a structure instead of processing stdin and outputting to stdout.

For the right problem, lex and yacc are great tools!

References

  1. Lex and YACC primer/HOWTO
  2. The Roles of Lex and YACC
  3. OpenBSD Reference Manual, FLEX(1)
  4. OpenBSD Reference Manual, YACC(1)
  5. Adding utf-8 Encoding to Lex, Eric Prud'Hommeaux: regular expressions for byte ranges