Is it safe to replace CP850 with UTF-8 encoding?

I have an old project that reads files using CP850 encoding, but it handles accented characters incorrectly (e.g., Montréal becomes MontrÚal). I want to replace CP850 with UTF-8. The question is:

Is it safe? In other words, can we assume that UTF-8 is a superset of CP850 and that it encodes characters the same way CP850 does?

Thanks

I tried hexdump. Below is a sample of my CSV file. Is it UTF-8?

000000d0 76 20 64 65 20 4d 61 72 6c 6f 77 65 2c 2c 4d 6f |v de Marlowe,,Mo|
000000e0 6e 74 72 c3 a9 61 6c 2c 51 43 2c 48 34 41 20 20 |ntr..al,QC,H4A |

1 Answer

If by superset you mean that UTF-8 can represent every character in CP850, then trivially yes: every CP850 character has a Unicode code point, and UTF-8 can encode all valid Unicode code points using a variable-length encoding (1–4 bytes).
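As a quick check of the superset claim, here is a minimal Python sketch. It assumes Python's built-in cp850 codec matches the code page your files actually use. It decodes all 256 byte values as CP850 and re-encodes the result as UTF-8:

```python
# All 256 possible single-byte values, interpreted as CP850.
all_cp850_bytes = bytes(range(256))

# Decoding succeeds for every byte, so each one maps to a Unicode character.
text = all_cp850_bytes.decode("cp850")

# Any Unicode string can be encoded as UTF-8.
utf8_bytes = text.encode("utf-8")

# Converting back recovers the original bytes, so nothing was lost.
round_trip_ok = utf8_bytes.decode("utf-8").encode("cp850") == all_cp850_bytes

print(len(text), round_trip_ok)  # 256 True
```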

If you mean whether characters are encoded the same way, then, as you've seen, they are not: é (U+00E9) is encoded as 82 in CP850 and as C3 A9 in UTF-8. Only the ASCII range (00–7F) has identical bytes in both.
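To see the byte-level difference for yourself, a small Python sketch can encode the same text both ways. The variable names are mine and purely illustrative:

```python
e_acute = "\u00e9"  # é

cp850_bytes = e_acute.encode("cp850")
utf8_bytes = e_acute.encode("utf-8")
print(cp850_bytes.hex(), utf8_bytes.hex())  # 82 c3a9

# Plain ASCII text is byte-for-byte identical in both encodings.
ascii_identical = "QC,H4A".encode("cp850") == "QC,H4A".encode("utf-8")
print(ascii_identical)  # True

# UTF-8 bytes misread as CP850 turn é into two unrelated characters.
misread = "Montr\u00e9al".encode("utf-8").decode("cp850")  # 'Montr├®al'
```

The c3 a9 pair is the UTF-8 form of é, so bytes like that need to be read as UTF-8; decoding them as CP850 will not give the intended text.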

I cannot find a character set or code page that encodes Ú as 82. However, Ú is encoded as E9 in CP850, and E9 is the ISO-8859-1 encoding of é. So it's possible you've got your conversion the wrong way around (i.e. you're converting your file from ISO-8859-1 to CP850, when you want to convert from CP850 to UTF-8).
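The suspected mix-up can be reproduced in a few lines of Python. This is only a sketch of the hypothesis above (ISO-8859-1 bytes being interpreted as CP850), not a diagnosis of your actual pipeline:

```python
original = "Montr\u00e9al"  # Montréal

# Encode as ISO-8859-1, where é becomes the single byte E9.
latin1_bytes = original.encode("iso-8859-1")
print(latin1_bytes.hex(" "))  # 4d 6f 6e 74 72 e9 61 6c

# Misreading those bytes as CP850 maps E9 to Ú.
garbled = latin1_bytes.decode("cp850")  # 'MontrÚal'

# Decoding with the encoding the bytes were really written in restores the text.
restored = latin1_bytes.decode("iso-8859-1")
print(restored == original)  # True
```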

Here's an example using hd and iconv:

hd test.cp850.txt
00000000 4d 6f 6e 74 72 82 61 6c |Montr.al|
00000008
iconv -f CP850 -t UTF-8 test.cp850.txt > test.utf8.txt
hd test.utf8.txt
00000000 4d 6f 6e 74 72 c3 a9 61 6c |Montr..al|
00000009
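If you would rather do the same conversion from code, here is a rough Python equivalent of the iconv step. The helper name convert_file and the temporary file setup are my own, for illustration only; the sample bytes are the ones from the hd output above:

```python
import tempfile
from pathlib import Path


def convert_file(src: Path, dst: Path,
                 from_enc: str = "cp850", to_enc: str = "utf-8") -> None:
    """Decode src using from_enc and write it to dst encoded as to_enc."""
    # Work on raw bytes so no newline translation happens along the way.
    text = src.read_bytes().decode(from_enc)
    dst.write_bytes(text.encode(to_enc))


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "test.cp850.txt"
    dst = Path(tmp) / "test.utf8.txt"
    src.write_bytes(b"Montr\x82al")  # "Montréal" encoded as CP850
    convert_file(src, dst)
    converted = dst.read_bytes()

print(converted.hex(" "))  # 4d 6f 6e 74 72 c3 a9 61 6c
```

A byte that is invalid in the source encoding raises UnicodeDecodeError, which is usually what you want: it tells you the file is not in the encoding you assumed.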
