How can I perform an ISO-8859-1 to UTF-8 text file conversion while not changing any characters that are already valid UTF-8

Consider a text file that looks like this visually:

This’s ISO-8859-1
This’s UTF-8

Behind-the-scenes, the curly quote character in the first line is encoded as ISO-8859-1, and the same character in the second line is encoded as UTF-8

The file looks like this on cat -v (-v option displays unprintable characters):

$ cat -v testing.txt
ThisM-4s ISO-8859-1
ThisM-bM-^@M-^Ys UTF-8

The goal is to standardize the file to UTF-8, meaning the first line needs to change and the second line MUST NOT change. However, if you attempt an ISO-8859-1 to UTF-8 conversion using iconv, recode and others, it'll corrupt the second line of the file by converting the UTF-8 into gibberish characters

Here's an example using iconv demonstrating that the second line becomes mangled:

$ cat testing.txt | iconv -f iso-8859-1 -t utf-8
This´s ISO-8859-1
Thisâs UTF-8
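The mangling can be reproduced in a few lines of Python. This is only a sketch of what a blanket ISO-8859-1 to UTF-8 conversion does to bytes that are already UTF-8; the byte values are taken from the sample file, and the name `blanket_convert` is made up for illustration:

```python
# Bytes of the two sample lines, as they appear in testing.txt.
latin1_line = b"This\xb4s ISO-8859-1\n"   # 0xB4 is a lone ISO-8859-1 byte
utf8_line = b"This\xe2\x80\x99s UTF-8\n"  # E2 80 99 is U+2019 encoded as UTF-8


def blanket_convert(data: bytes) -> bytes:
    """Treat every byte as ISO-8859-1 and re-encode the result as UTF-8."""
    return data.decode("iso8859-1").encode("utf-8")


# The ISO-8859-1 line converts correctly: 0xB4 becomes C2 B4.
print(blanket_convert(latin1_line))
# The UTF-8 line is double-encoded: each of E2, 80, 99 becomes a 2-byte sequence.
print(blanket_convert(utf8_line))
```

Every byte above 0x7F is expanded to two bytes, whether or not it was already part of a valid UTF-8 sequence, which is why the second line turns into gibberish.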

recode behaves similarly, mangling the second line:

$ recode iso-8859-1..utf-8 testing.txt
$ cat testing.txt
This´s ISO-8859-1
Thisâs UTF-8

What I'd like it to do is skip over conversion of the UTF-8 ’ character (but still pass it along to the output, DON'T strip it out), because it's already UTF-8, so there's no need to convert it

But I haven't found any way to do this

This simplified text file is just being used as an example -- need a solution that will work for much larger files as well

For example, a file might contain the UTF-8 character on line 30, 40, 100; and the ISO-8859-1 character on line 50, 60, and 200. A file might not contain any instances of the ISO-8859-1 character (in which case no changes to the file are needed). Safe to assume that the file will not contain both the ISO-8859-1 character and the UTF-8 character on the SAME line, if that makes the problem scope easier.

I looked at this question: How to recode to UTF-8 conditionally?

however it doesn't seem to account for the scenario where the file contains mixed ISO-8859-1 and UTF-8

and yes I know it's not a good idea to have mixed encodings in the same file

but it already happened years ago and the goal is to get it all cleaned up so it won't be a problem again


2 Answers

Python's UTF-8 decoder, when called with errors="surrogateescape", can pass non-UTF-8 bytes through as the lone surrogate code points U+DC80 – U+DCFF (which are normally illegal in UTF-8). Afterwards they can be found and re-decoded as something else:

#!/usr/bin/env python3
import argparse
import re
import sys
parser = argparse.ArgumentParser()
parser.add_argument("input")
args = parser.parse_args()
with open(args.input, "rb") as fh:
    buf = fh.read()
    buf = buf.decode("utf-8", errors="surrogateescape")
    buf = re.sub(r"[\udc00-\udcff]+",
                 lambda m: (m.group(0)
                            .encode("utf-8", errors="surrogateescape")
                            .decode("iso8859-1")),
                 buf)
    sys.stdout.write(buf)
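To check the behaviour on the two sample lines without touching any files, the same technique can be wrapped in a function. This is a minimal sketch; the name `fix_mixed` and the in-memory sample are assumptions made for the demonstration, not part of the script above:

```python
import re


def fix_mixed(data: bytes) -> str:
    """Decode UTF-8 where valid; treat every other byte as ISO-8859-1."""
    # Undecodable bytes 0x80-0xFF become lone surrogates U+DC80-U+DCFF.
    text = data.decode("utf-8", errors="surrogateescape")
    # Turn each run of surrogates back into raw bytes, then decode as ISO-8859-1.
    return re.sub(
        r"[\udc80-\udcff]+",
        lambda m: m.group(0)
        .encode("utf-8", errors="surrogateescape")
        .decode("iso8859-1"),
        text,
    )


sample = b"This\xb4s ISO-8859-1\nThis\xe2\x80\x99s UTF-8\n"
fixed = fix_mixed(sample)
print(fixed.encode("utf-8"))
```

The first line gains a `C2` prefix in front of `B4`, while the `E2 80 99` sequence on the second line comes out byte-for-byte unchanged.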

You can also do it by hand:

#!/usr/bin/env python3
import argparse
import sys
parser = argparse.ArgumentParser()
parser.add_argument("input")
args = parser.parse_args()
def decipher_runes(fh):
    curr = None
    more = 0
    while buf := fh.read(1):
        ch = buf[0]
        if more == 0:
            # Expect a UTF-8 leading byte
            curr = bytearray([ch])
            if ch & 0b10000000 == 0b00000000:
                more = 0
            elif ch & 0b11100000 == 0b11000000:
                more = 1
            elif ch & 0b11110000 == 0b11100000:
                more = 2
            elif ch & 0b11111000 == 0b11110000:
                more = 3
            elif ch & 0b11111100 == 0b11111000:
                more = 4
            elif ch & 0b11111110 == 0b11111100:
                more = 5
            else:
                more = -1
        else:
            # Expect a continuation byte
            curr.append(ch)
            if ch & 0b11000000 == 0b10000000:
                more -= 1
            else:
                more = -1
        if more < 0:
            more = 0
            yield curr.decode("iso8859-1")
        elif more == 0:
            yield curr.decode("utf-8")
    if more:
        yield curr.decode("iso8859-1")

with open(args.input, "rb") as fh:
    for ch in decipher_runes(fh):
        sys.stdout.write(ch)
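Since the question allows assuming that no single line mixes both encodings, a simpler line-by-line fallback may be enough, and it streams through large files instead of reading them whole. A minimal sketch under that assumption (the function name `convert_lines` is hypothetical, and the `io.BytesIO` objects stand in for real files opened in binary mode):

```python
import io


def convert_lines(src, dst):
    """Copy binary stream src to dst, re-encoding non-UTF-8 lines from ISO-8859-1."""
    for raw in src:  # iterating a binary stream yields lines split on b"\n"
        try:
            raw.decode("utf-8")  # already valid UTF-8: pass the bytes through untouched
            dst.write(raw)
        except UnicodeDecodeError:
            # Not valid UTF-8: assume the whole line is ISO-8859-1.
            dst.write(raw.decode("iso8859-1").encode("utf-8"))


src = io.BytesIO(b"This\xb4s ISO-8859-1\nThis\xe2\x80\x99s UTF-8\n")
dst = io.BytesIO()
convert_lines(src, dst)
print(dst.getvalue())
```

If a line does contain both encodings after all, this version treats the whole line as ISO-8859-1 and mangles its UTF-8 part, so the byte-level approaches above are the safer choice when that assumption is in doubt.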

.NET allows you to create a custom encoder/decoder fallback for invalid characters, besides the default options (throw an exception on invalid characters, or replace them with a user-specified string), so you can use any .NET-based language and write your own decoder fallback to convert ISO-8859-1 characters to UTF-8. I've written a simple PowerShell script to do that. Install PowerShell on Linux if you don't have it, and save the script below as convert.ps1

class Decoder88591FallbackBuffer : System.Text.DecoderFallbackBuffer {
    [char]$c; [int]$idx # Internal decoder state

    Decoder88591FallbackBuffer() { $this.Reset() }

    [bool] Fallback([byte[]]$bytesUnknown, [int]$index) {
        $this.idx = 1; $this.c = [char]::ConvertFromUtf32($bytesUnknown[0])
        return $true
    }
    [char] GetNextChar() {
        if ($this.idx -eq 1) { $this.idx = 2; return $this.c }
        return 0
    }
    [bool] MovePrevious() {
        if ($this.idx -eq 2) { $this.idx = 1; return $true }
        return $false
    }
    [int] get_Remaining() {
        if ($this.idx -eq 0) { if ($this.c -eq 0) { return 0 } else { return 1 } }
        return 0
    }
    [void] Reset() { $this.c = 0; $this.idx = 0 }
}

class Decoder88591Fallback : System.Text.DecoderFallback {
    Decoder88591Fallback() {}

    [Text.DecoderFallbackBuffer] CreateFallbackBuffer() {
        return [Decoder88591FallbackBuffer]::new();
    }
    [int] get_MaxCharCount() { return 1; }
}

$enc = [Text.Encoding]::GetEncoding(65001, `
    [Text.EncoderReplacementFallback]::new(), [Decoder88591Fallback]::new())

if ($PSVersionTable.PSVersion -ge [version]"6.0") {
    $content = Get-Content -AsByteStream -Raw $args[0]
} else {
    $content = Get-Content -Encoding Byte -Raw $args[0]
}
Set-Content -Path $args[1] -Encoding UTF8 -Value ($enc.GetString($content))

Then run the command as

./convert.ps1 testing.txt testing_out.txt

If you want to make it work for Windows-1252 then just change [char]::ConvertFromUtf32($bytesUnknown[0]) to [Text.Encoding]::GetEncoding(1252).GetString($bytesUnknown)[0]
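For comparison, the same Windows-1252 variation can be sketched in Python using the surrogateescape technique rather than a .NET fallback. This is not a translation of the PowerShell script; the helper names `decode_stray` and `fix_mixed_cp1252` are made up, and keeping the ISO-8859-1 mapping for the five byte values that Python's cp1252 codec leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) is an assumption about the desired behaviour:

```python
import re


def decode_stray(raw: bytes) -> str:
    """Decode stray bytes as Windows-1252, one byte at a time."""
    out = []
    for b in raw:
        try:
            out.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            # Undefined in cp1252: fall back to the ISO-8859-1 mapping.
            out.append(chr(b))
    return "".join(out)


def fix_mixed_cp1252(data: bytes) -> str:
    """Keep valid UTF-8 as-is; decode everything else as Windows-1252."""
    text = data.decode("utf-8", errors="surrogateescape")
    return re.sub(
        r"[\udc80-\udcff]+",
        lambda m: decode_stray(m.group(0).encode("utf-8", errors="surrogateescape")),
        text,
    )


# 0x92 is the right single quotation mark (U+2019) in Windows-1252.
sample = b"This\x92s Windows-1252\nThis\xe2\x80\x99s UTF-8\n"
result = fix_mixed_cp1252(sample)
print(result.encode("utf-8"))
```

With Windows-1252 as the fallback, both sample lines end up with the same U+2019 curly quote.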

Sample output:

$ cat -v testing2.txt
ThisM-4s ISO-8859-1
Bx M-0 M-1 M-2 M-3 M-4 M-5 M-6 M-7 M-8 M-9 M-: M-; M-< M-= M-> M-?
Cx M-@ M-A M-B M-C M-D M-E M-F M-G M-H M-I M-J M-K M-L M-M M-N M-O
Dx M-P M-Q M-R M-S M-T M-U M-V M-W M-X M-Y M-Z M-[ M-\ M-] M-^ M-_
Ex M-` M-a M-b M-c M-d M-e M-f M-g M-h M-i M-j M-k M-l M-m M-n M-o
Fx M-p M-q M-r M-s M-t M-u M-v M-w M-x M-y M-z M-{ M-| M-} M-~ M-^?
ThisM-bM-^@M-^Ys UTF-8
Bx M-BM-0 M-BM-1 M-BM-2 M-BM-3 M-BM-4 M-BM-5 M-BM-6 M-BM-7 M-BM-8 M-BM-9 M-BM-: M-BM-; M-BM-< M-BM-= M-BM-> M-BM-?
Cx M-CM-^@ M-CM-^A M-CM-^B M-CM-^C M-CM-^D M-CM-^E M-CM-^F M-CM-^G M-CM-^H M-CM-^I M-CM-^J M-CM-^K M-CM-^L M-CM-^M M-CM-^N M-CM-^O
Dx M-CM-^P M-CM-^Q M-CM-^R M-CM-^S M-CM-^T M-CM-^U M-CM-^V M-CM-^W M-CM-^X M-CM-^Y M-CM-^Z M-CM-^[ M-CM-^\ M-CM-^] M-CM-^^ M-CM-^_
Ex M-CM- M-CM-! M-CM-" M-CM-# M-CM-$ M-CM-% M-CM-& M-CM-' M-CM-( M-CM-) M-CM-* M-CM-+ M-CM-, M-CM-- M-CM-. M-CM-/
Fx M-CM-0 M-CM-1 M-CM-2 M-CM-3 M-CM-4 M-CM-5 M-CM-6 M-CM-7 M-CM-8 M-CM-9 M-CM-: M-CM-; M-CM-< M-CM-= M-CM-> M-CM-?
$ cat testing2_out.txt
This´s ISO-8859-1
Bx ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
Cx À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Dx Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
Ex à á â ã ä å æ ç è é ê ë ì í î ï
Fx ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ
This’s UTF-8
Bx ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
Cx À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Dx Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
Ex à á â ã ä å æ ç è é ê ë ì í î ï
Fx ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

Note that od -c or hd (included in most Linux distros by default) would be much better than cat -v because they allow easier examining of the byte values

$ hd testing.txt
00000000  54 68 69 73 b4 73 20 49  53 4f 2d 38 38 35 39 2d  |This.s ISO-8859-|
00000010  31 0a 54 68 69 73 e2 80  99 73 20 55 54 46 2d 38  |1.This...s UTF-8|
00000020  0a 0a                                             |..|
00000022
$ od -c testing.txt
0000000   T   h   i   s 264   s       I   S   O   -   8   8   5   9   -
0000020   1  \n   T   h   i   s 342 200 231   s       U   T   F   -   8
0000040  \n  \n
0000042
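In a larger file it can help to find out which lines contain non-UTF-8 bytes before converting anything. A short sketch that reports them (the helper name `find_non_utf8` is hypothetical; the sample bytes are the ones shown in the dump above):

```python
def find_non_utf8(data: bytes):
    """Return (line_number, byte_offset_in_line, hex_of_offending_byte) tuples."""
    hits = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first undecodable byte in this line.
            hits.append((lineno, exc.start, format(line[exc.start], "02x")))
    return hits


sample = b"This\xb4s ISO-8859-1\nThis\xe2\x80\x99s UTF-8\n\n"
print(find_non_utf8(sample))  # [(1, 4, 'b4')]
```

Only the first offending byte of each line is reported, which is enough to see which lines a conversion would change.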

For more information read
