Friday, April 17, 2009

Debugging Text Encoding Issues

If you tell software to expect UTF-8 and then feed it text in some other encoding, it will be unhappy. For example, Django will complain with the message:
Could not decode to UTF-8 column
In these cases, I have found two things very handy: od and a Unicode chart. To see what bytes are in your text, capture the problematic text to a file, then examine it with od:
$ od -c -tx1 q.out
A  l  l     M  e  n  '  s     P  r  o  t  ?  g
41 6c 6c 20 4d 65 6e 27 73 20 50 72 6f 74 e9 67
?     o  n     S  a  l  e     4  /  1  2  -  4
e9 20 6f 6e 20 53 61 6c 65 20 34 2f 31 32 2d 34
/  1  8  \n            
2f 31 38 0a 
$
Note: the od output is modified a bit so it will fit. The last letter in the first line is a "g"; if you can't see it, the line is being cut off. Try reducing your browser font.

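The problematic byte is in the first line; od shows its character as "?" and its hex value as "e9". From this we know it's not UTF-8: in UTF-8, a byte with a value greater than 127 (hex 7f) is always part of a multi-byte sequence, so a lead byte like e9 must be followed by continuation bytes in the range 80 to bf. Here it is followed by 67 ("g"), which is less than 128. Then there's a second problem byte, with the same hex value, two bytes later, at the start of the second line.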
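The same check can be done programmatically. The following is a minimal sketch, not part of the original workflow: the function name first_invalid_utf8 is hypothetical, and the sample bytes are retyped from the od dump above. It relies on Python's strict UTF-8 decoder to report where decoding first fails.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first byte that breaks
    UTF-8 decoding, or None if the data is valid UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte the decoder rejected
        return err.start, data[err.start]
    return None


# Bytes retyped from the od dump (assumed to match the captured file)
sample = b"All Men's Prot\xe9g\xe9 on Sale 4/12-4/18\n"

print(first_invalid_utf8(sample))  # (14, 233), and 233 is hex e9
```

Offset 14 is the fifteenth byte of the dump, which is the first e9 in the od output.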

So the entire problem-causing sequence is: P r o t <e9> g <e9>.

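A lone byte above 127 suggests a single-byte encoding such as ISO 8859-1, whose byte values match the first 256 Unicode code points. If you open the Unicode/UTF-8 Chart and scroll down to the code point U+00E9, you'll see that the character is "LATIN SMALL LETTER E WITH ACUTE", which makes our mystery word: Protégé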
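If a chart isn't handy, the lookup can be sketched with Python's standard unicodedata module. The helper name describe is hypothetical, and treating the byte value e9 as the code point U+00E9 assumes a single-byte encoding like ISO 8859-1, where byte values and code points coincide.

```python
import unicodedata


def describe(code_point: int) -> str:
    """Format a code point as 'U+XXXX NAME' using the Unicode database."""
    return f"U+{code_point:04X} {unicodedata.name(chr(code_point))}"


# Assumption: the byte e9 maps directly to code point U+00E9
print(describe(0xE9))  # U+00E9 LATIN SMALL LETTER E WITH ACUTE
```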

So in this case, the text feed is encoded as ISO 8859-1 (or a close relative such as Windows-1252, which maps e9 to the same character), and not as UTF-8.

Converting ISO 8859-1 to UTF-8
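One possible way to do the conversion in Python, as a rough sketch: decode the bytes as ISO 8859-1, then encode the resulting text as UTF-8. The function name latin1_to_utf8 is hypothetical, and the sample bytes are retyped from the od dump rather than read from the original feed.

```python
def latin1_to_utf8(raw: bytes) -> bytes:
    """Reinterpret ISO 8859-1 bytes as text, then encode that text as UTF-8."""
    text = raw.decode("iso-8859-1")  # every byte maps to exactly one character
    return text.encode("utf-8")


# Bytes retyped from the od dump (assumed to match the captured file)
raw = b"All Men's Prot\xe9g\xe9 on Sale 4/12-4/18\n"
utf8 = latin1_to_utf8(raw)

# Each e9 byte becomes the two-byte UTF-8 sequence c3 a9
print(utf8[10:19])  # b'Prot\xc3\xa9g\xc3\xa9'
```

Decoding as ISO 8859-1 never raises an error, since all 256 byte values are defined, so a wrong guess about the source encoding produces wrong characters silently rather than an exception. It is worth confirming the encoding first, as in the od inspection earlier in this post.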
