Consider a text file that looks like this visually:
This’s ISO-8859-1
This’s UTF-8
Behind the scenes, the ’ curly quote character in the first line is encoded as ISO-8859-1, and the same ’ character in the second line is encoded as UTF-8.
This is how the file looks under cat -v (the -v option displays non-printing characters):
$ cat -v testing.txt
ThisM-4s ISO-8859-1
ThisM-bM-^@M-^Ys UTF-8
The goal is to standardize the file to UTF-8, meaning the first line needs to change and the second line MUST NOT change. However, if you attempt an ISO-8859-1 to UTF-8 conversion using iconv, recode, or similar tools, it will corrupt the second line of the file by converting the UTF-8 ’ into gibberish characters.
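The corruption happens because a blanket conversion reads every byte as one ISO-8859-1 character, including the three bytes that already form a UTF-8 ’. A minimal Python sketch of that effect, assuming the byte values shown by cat -v above (the names line1, line2 and latin1_to_utf8 are illustrative only, not part of iconv or recode):

```python
# Byte content of the two lines, taken from the cat -v output above.
line1 = b"This\xb4s ISO-8859-1"          # 0xB4 is a single ISO-8859-1 byte
line2 = b"This\xe2\x80\x99s UTF-8"       # E2 80 99 is the UTF-8 encoding of U+2019

def latin1_to_utf8(raw: bytes) -> bytes:
    """Blanket conversion: read every byte as ISO-8859-1, re-encode as UTF-8."""
    return raw.decode("iso-8859-1").encode("utf-8")

converted1 = latin1_to_utf8(line1)
converted2 = latin1_to_utf8(line2)

print(converted1)  # b'This\xc2\xb4s ISO-8859-1' (what we want)
print(converted2)  # b'This\xc3\xa2\xc2\x80\xc2\x99s UTF-8' (double-encoded)
```

The second line grows from three bytes for the quote to six, which is the gibberish seen after conversion.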
Here's an example using iconv demonstrating that the second line becomes mangled:
$ cat testing.txt | iconv -f iso-8859-1 -t utf-8
This´s ISO-8859-1
Thisâs UTF-8
recode behaves similarly, mangling the second line:
$ recode iso-8859-1..utf-8 testing.txt
$ cat testing.txt
This´s ISO-8859-1
Thisâs UTF-8
What I'd like it to do is skip conversion of the UTF-8 ’ character (but still pass it along to the output, DON'T strip it out), because it's already UTF-8, so there's no need to convert it.
But I haven't found any way to do this.
This simplified text file is just an example; I need a solution that will work for much larger files as well.
For example, a file might contain the UTF-8 ’ character on lines 30, 40, and 100, and the ISO-8859-1 ’ character on lines 50, 60, and 200. A file might not contain any instances of the ISO-8859-1 ’ character (in which case no changes to the file are needed). It is safe to assume that the file will not contain both the ISO-8859-1 ’ character and the UTF-8 ’ character on the SAME line, if that makes the problem easier to scope.
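To make that per-line scope concrete, here is a rough Python sketch that only reports which lines fail to decode as UTF-8, i.e. the lines that would need converting. The function name non_utf8_lines and the sample bytes are made up for illustration. It also assumes that ISO-8859-1 text never happens to form a valid UTF-8 sequence, which is usually but not always true:

```python
def non_utf8_lines(data: bytes) -> list:
    """Return the 1-based numbers of lines that are not valid UTF-8."""
    bad = []
    for number, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")      # pure ASCII and valid UTF-8 both pass
        except UnicodeDecodeError:
            bad.append(number)        # candidate ISO-8859-1 line
    return bad

# Same two lines as in testing.txt above.
sample = b"This\xb4s ISO-8859-1\nThis\xe2\x80\x99s UTF-8\n"
print(non_utf8_lines(sample))  # → [1]
```

An empty result would correspond to the case where no changes to the file are needed.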
I looked at this question: How to recode to UTF-8 conditionally? However, it doesn't seem to account for the scenario where a single file contains mixed ISO-8859-1 and UTF-8.
And yes, I know it's not a good idea to have mixed encodings in the same file, but it already happened years ago, and the goal is to get it all cleaned up so it won't be a problem again.
2 Answers
With errors="surrogateescape", Python's UTF-8 decoder can pass undecodable bytes through as lone surrogate code points U+DC80 – U+DCFF (which are normally illegal in UTF-8). Afterwards they can be found and re-decoded as something else:
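#!/usr/bin/env python3
import argparse
import re
import sys

parser = argparse.ArgumentParser()
parser.add_argument("input")
args = parser.parse_args()

with open(args.input, "rb") as fh:
    buf = fh.read()
    # Valid UTF-8 decodes normally; each undecodable byte 0x80-0xFF
    # becomes a lone surrogate U+DC80-U+DCFF instead of raising an error.
    buf = buf.decode("utf-8", errors="surrogateescape")
    # Turn each run of escaped bytes back into raw bytes and decode
    # those as ISO-8859-1.
    buf = re.sub(r"[\udc80-\udcff]+",
                 lambda m: (m.group(0)
                            .encode("utf-8", errors="surrogateescape")
                            .decode("iso8859-1")),
                 buf)
    sys.stdout.write(buf)

You can also do it by hand: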
#!/usr/bin/env python3
import argparse
import sys

parser = argparse.ArgumentParser()
parser.add_argument("input")
args = parser.parse_args()

def decipher_runes(fh):
    curr = None
    more = 0
    while buf := fh.read(1):
        ch = buf[0]
        if more == 0:
            # Expect a UTF-8 leading byte
            curr = bytearray([ch])
            if ch & 0b10000000 == 0b00000000:
                more = 0
            elif ch & 0b11100000 == 0b11000000:
                more = 1
            elif ch & 0b11110000 == 0b11100000:
                more = 2
            elif ch & 0b11111000 == 0b11110000:
                more = 3
            elif ch & 0b11111100 == 0b11111000:
                more = 4
            elif ch & 0b11111110 == 0b11111100:
                more = 5
            else:
                more = -1
        else:
            # Expect a continuation byte
            curr.append(ch)
            if ch & 0b11000000 == 0b10000000:
                more -= 1
            else:
                more = -1
        if more < 0:
            # Not a UTF-8 sequence: emit the collected bytes as ISO-8859-1
            more = 0
            yield curr.decode("iso8859-1")
        elif more == 0:
            try:
                text = curr.decode("utf-8")
            except UnicodeDecodeError:
                # Right bit pattern but not valid UTF-8 (overlong form,
                # surrogate, or 5/6-byte sequence): treat as ISO-8859-1
                text = curr.decode("iso8859-1")
            yield text
    if more:
        # Input ended in the middle of a multi-byte sequence
        yield curr.decode("iso8859-1")

with open(args.input, "rb") as fh:
    for ch in decipher_runes(fh):
        sys.stdout.write(ch)

.NET lets you supply a custom decoder fallback for invalid bytes, in addition to the default options (throw an exception on invalid bytes, or replace them with a user-specified string). So you can use any .NET-based language and write your own fallback that converts ISO-8859-1 bytes to UTF-8. I've written a simple PowerShell script to do that. Install PowerShell on Linux if you don't have it, and save the script below as convert.ps1:
class Decoder88591FallbackBuffer : System.Text.DecoderFallbackBuffer {
    [char]$c; [int]$idx  # Internal decoder state

    Decoder88591FallbackBuffer() { $this.Reset() }

    [bool] Fallback([byte[]]$bytesUnknown, [int]$index) {
        $this.idx = 1; $this.c = [char]::ConvertFromUtf32($bytesUnknown[0])
        return $true
    }
    [char] GetNextChar() {
        if ($this.idx -eq 1) { $this.idx = 2; return $this.c }
        return 0
    }
    [bool] MovePrevious() {
        if ($this.idx -eq 2) { $this.idx = 1; return $true }
        return $false
    }
    [int] get_Remaining() {
        if ($this.idx -eq 0) { if ($this.c -eq 0) { return 0 } else { return 1 } }
        return 0
    }
    [void] Reset() { $this.c = 0; $this.idx = 0 }
}

class Decoder88591Fallback : System.Text.DecoderFallback {
    Decoder88591Fallback() {}
    [Text.DecoderFallbackBuffer] CreateFallbackBuffer() {
        return [Decoder88591FallbackBuffer]::new();
    }
    [int] get_MaxCharCount() { return 1; }
}

$enc = [Text.Encoding]::GetEncoding(65001, `
    [Text.EncoderReplacementFallback]::new(), [Decoder88591Fallback]::new())
if ($PSVersionTable.PSVersion -ge [version]"6.0") {
    $content = Get-Content -AsByteStream -Raw $args[0]
} else {
    $content = Get-Content -Encoding Byte -Raw $args[0]
}
Set-Content -Path $args[1] -Encoding UTF8 -Value ($enc.GetString($content))

Then run the command as
./convert.ps1 testing.txt testing_out.txt
If you want to make it work for Windows-1252 then just change [char]::ConvertFromUtf32($bytesUnknown[0]) to [Text.Encoding]::GetEncoding(1252).GetString($bytesUnknown)[0]
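For comparison only, the same "decode invalid bytes with a different code page" idea can be sketched in Python. This is not the PowerShell script above, just an analogue of the Windows-1252 change. The function name decode_mixed and its fallback parameter are hypothetical. Note that Python's cp1252 codec leaves a few bytes (such as 0x81) undefined, so the sketch falls back to ISO-8859-1 for those:

```python
import re

def decode_mixed(data: bytes, fallback: str = "cp1252") -> str:
    """Decode as UTF-8; decode each invalid byte with `fallback` instead."""
    # Invalid bytes survive decoding as lone surrogates U+DC80-U+DCFF.
    text = data.decode("utf-8", errors="surrogateescape")

    def fix(match):
        # Recover the original single byte from its surrogate.
        raw = match.group(0).encode("utf-8", errors="surrogateescape")
        try:
            return raw.decode(fallback)
        except UnicodeDecodeError:
            # Byte is undefined in the fallback code page.
            return raw.decode("iso8859-1")

    return re.sub(r"[\udc80-\udcff]", fix, text)

print(decode_mixed(b"This\x92s Windows-1252"))    # 0x92 is ’ in Windows-1252
print(decode_mixed(b"This\xe2\x80\x99s UTF-8"))   # valid UTF-8 is left alone
```

Passing fallback="iso8859-1" gives the ISO-8859-1 behaviour instead.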
Sample output:
$ cat -v testing2.txt
ThisM-4s ISO-8859-1
Bx M-0 M-1 M-2 M-3 M-4 M-5 M-6 M-7 M-8 M-9 M-: M-; M-< M-= M-> M-?
Cx M-@ M-A M-B M-C M-D M-E M-F M-G M-H M-I M-J M-K M-L M-M M-N M-O
Dx M-P M-Q M-R M-S M-T M-U M-V M-W M-X M-Y M-Z M-[ M-\ M-] M-^ M-_
Ex M-` M-a M-b M-c M-d M-e M-f M-g M-h M-i M-j M-k M-l M-m M-n M-o
Fx M-p M-q M-r M-s M-t M-u M-v M-w M-x M-y M-z M-{ M-| M-} M-~ M-^?
ThisM-bM-^@M-^Ys UTF-8
Bx M-BM-0 M-BM-1 M-BM-2 M-BM-3 M-BM-4 M-BM-5 M-BM-6 M-BM-7 M-BM-8 M-BM-9 M-BM-: M-BM-; M-BM-< M-BM-= M-BM-> M-BM-?
Cx M-CM-^@ M-CM-^A M-CM-^B M-CM-^C M-CM-^D M-CM-^E M-CM-^F M-CM-^G M-CM-^H M-CM-^I M-CM-^J M-CM-^K M-CM-^L M-CM-^M M-CM-^N M-CM-^O
Dx M-CM-^P M-CM-^Q M-CM-^R M-CM-^S M-CM-^T M-CM-^U M-CM-^V M-CM-^W M-CM-^X M-CM-^Y M-CM-^Z M-CM-^[ M-CM-^\ M-CM-^] M-CM-^^ M-CM-^_
Ex M-CM- M-CM-! M-CM-" M-CM-# M-CM-$ M-CM-% M-CM-& M-CM-' M-CM-( M-CM-) M-CM-* M-CM-+ M-CM-, M-CM-- M-CM-. M-CM-/
Fx M-CM-0 M-CM-1 M-CM-2 M-CM-3 M-CM-4 M-CM-5 M-CM-6 M-CM-7 M-CM-8 M-CM-9 M-CM-: M-CM-; M-CM-< M-CM-= M-CM-> M-CM-?
$ cat testing2_out.txt
This´s ISO-8859-1
Bx ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
Cx À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Dx Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
Ex à á â ã ä å æ ç è é ê ë ì í î ï
Fx ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
This’s UTF-8
Bx ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
Cx À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Dx Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
Ex à á â ã ä å æ ç è é ê ë ì í î ï
Fx ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
Note that od -c or hd (included in most Linux distros by default) would be much better than cat -v because they make it easier to examine the byte values:
$ hd testing.txt
00000000  54 68 69 73 b4 73 20 49  53 4f 2d 38 38 35 39 2d  |This.s ISO-8859-|
00000010  31 0a 54 68 69 73 e2 80  99 73 20 55 54 46 2d 38  |1.This...s UTF-8|
00000020  0a 0a                                             |..|
00000022
$ od -c testing.txt
0000000   T   h   i   s 264   s       I   S   O   -   8   8   5   9   -
0000020   1  \n   T   h   i   s 342 200 231   s       U   T   F   -   8
0000040  \n  \n
0000042
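If neither tool is at hand, a similar byte-level view can be produced from Python. The sketch below is a simplified imitation of the hd layout, not a replacement for it. The function name hex_dump is made up, and the sample bytes are copied from the hd output above:

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Render bytes as offset, hex values, and printable ASCII per row."""
    pad = width * 3 - 1   # width of a full row of "xx xx ... xx"
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        # Non-printable and non-ASCII bytes are shown as "."
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{offset:08x}  {hex_part:<{pad}}  |{text_part}|")
    return "\n".join(rows)

# The 34 bytes of testing.txt, as listed by hd above.
sample = b"This\xb4s ISO-8859-1\nThis\xe2\x80\x99s UTF-8\n\n"
print(hex_dump(sample))
```

The single b4 byte and the e2 80 99 triple are then directly visible in the first two rows.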
For more information read