Could not decode to UTF-8 column
In these cases, I have found two things very handy:
- od(1), and
- Unicode/UTF-8 Chart
od:
$ od -c -tx1 q.out
   A   l   l       M   e   n   '   s       P   r   o   t   ?   g
  41  6c  6c  20  4d  65  6e  27  73  20  50  72  6f  74  e9  67
   ?       o   n       S   a   l   e       4   /   1   2   -   4
  e9  20  6f  6e  20  53  61  6c  65  20  34  2f  31  32  2d  34
   /   1   8  \n
  2f  31  38  0a
$
Note: the od output is modified a bit so it will fit. The last letter in the first line is a "g"; if you can't see it, the line is being cut off. Try reducing your browser font.
The problematic byte is in the first line; the character row shows a "?" and the hex row shows "e9". From this we know the text is not UTF-8: e9 is greater than 0x7f (127), so in UTF-8 it would have to start a multi-byte sequence, yet the byte that follows it, 67 ("g"), is below 0x80 (128) and cannot be a continuation byte. Then, there's a second problem byte, with the same hex value, two bytes later, at the start of the second line.
So the entire problem-causing sequence is: P r o t <e9> g <e9>.
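If you would rather not eyeball the dump, the same check can be scripted. Here is a minimal Python sketch, not part of my original workflow, that reports where a byte string stops being valid UTF-8. The function name find_invalid_utf8 is made up for illustration, and the sample bytes are copied from the od dump above.

```python
def find_invalid_utf8(data: bytes):
    """Return (offset, byte value) for the first byte of each invalid UTF-8 sequence."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            # Decode the remainder; success means nothing else is wrong.
            data[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice we decoded.
            start = pos + err.start
            bad.append((start, data[start]))
            pos += err.end  # resume just past the offending bytes
    return bad


# Sample bytes taken from the od dump (hypothetical stand-in for q.out).
sample = b"All Men's Prot\xe9g\xe9 on Sale 4/12-4/18\n"
for offset, value in find_invalid_utf8(sample):
    print(f"offset {offset}: 0x{value:02x}")
```

For this sample it should flag offsets 14 and 16, the two e9 bytes on either side of the "g".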
If you open the Unicode/UTF-8 Chart and scroll down to the Unicode code point U+00E9, you'll see that the character is "LATIN SMALL LETTER E WITH ACUTE" (é), which makes our mystery word: Protégé
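If you don't have the chart open, the lookup can be approximated with Python's standard unicodedata module. This is only a sketch, and it assumes the byte should be read as ISO 8859-1, where each byte value equals the code point number; the variable names are mine.

```python
import unicodedata

byte_value = 0xE9  # the mystery byte from the od dump

# Assumption: interpret the byte as ISO 8859-1, so 0xE9 becomes U+00E9.
char = bytes([byte_value]).decode("iso-8859-1")

# Print the code point and its official Unicode name (ASCII-only output).
print(f"U+{ord(char):04X}", unicodedata.name(char))
```

If the name that comes back makes sense in context, as it does here, the encoding guess is probably right.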
So in this case, the text feed is encoded as ISO 8859-1 (Latin-1), in which the byte e9 maps directly to U+00E9, and not as UTF-8.
Converting ISO 8859-1 to UTF-8
- Converting iso-8859-1 Into utf-8, by Sean B. Palmer
- C version, by the same
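For comparison, here is a minimal Python sketch of the same conversion. It is not the code from either link above, just the standard decode-then-encode approach; the helper name latin1_to_utf8 and the sample bytes (taken from the od dump) are assumptions for illustration.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO 8859-1 bytes as UTF-8."""
    # Every byte is valid ISO 8859-1, so the decode step cannot fail;
    # it will silently produce wrong characters if the guess is wrong.
    return data.decode("iso-8859-1").encode("utf-8")


# Sample bytes from the od dump (hypothetical stand-in for q.out).
sample = b"All Men's Prot\xe9g\xe9 on Sale 4/12-4/18\n"
converted = latin1_to_utf8(sample)

# Show the raw bytes: each e9 should now be the two-byte sequence c3 a9.
print(converted)
```

The result is two bytes longer than the input, since each é grows from one byte to two.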