[tclug-list] Escaped unicode conversion

Thu Nov 12 13:49:12 CST 2015

On Wed, 28 Oct 2015, Wakefield, Thad M. wrote:

> ____________________________________
> From: tclug-list-bounces at mn-linux.org [tclug-list-bounces at mn-linux.org] on behalf of Mike Miller [mbmiller+l at gmail.com]
> Sent: Wednesday, October 28, 2015 3:07 AM
> To: TCLUG Mailing List
> Subject: Re: [tclug-list] Escaped unicode conversion
>
> On Tue, 27 Oct 2015, Wakefield, Thad M. wrote:
>
>>> This seems like it should be easy. So I'm suspecting my internet search skills are deficient.
>>>
>>> I have a text file with escaped Unicode that I want to convert to plain text.
>>>
>>> From:  Why We\u2019re in a New Gilded Age
>>> To:      Why We're in a New Gilded Age
>>
>> Tell us if this works for you:
>>
>> perl -pe 's/\\u([0-9A-Fa-f]{4})/chr(hex $1)/ge'
>>
>> It assumes there are always four hexadecimal digits following the "\u".
>> It will give warnings to stderr about "Wide character in print".
>>
>> Your example shows conversion to an ordinary apostrophe, like this:
>>
>> We're
>>
>> But my code will give you the UTF-8 character U+2019, like this:
>>
>> We’re
>>
>> And that is probably what you want.
>>
>> Mike
>
> This converted the text file with escaped Unicode to a UTF-8 file, which
> I was able to convert to an ASCII text file with Notepad++. I was unable
> to get iconv to do the conversion.

Cool.  But how did iconv deal with characters like U+2019?  When I try it, 
it fails on that character:

$ echo "Why We\u2019re in a New Gilded Age" | perl -pe 's/\\u([0-9A-Fa-f]{4})/chr(hex $1)/ge' | iconv --from-code=UTF-8 --to-code=ISO-8859-1
Wide character in print, <> line 1.
Why Weiconv: illegal input sequence at position 6

Maybe you used a different output encoding.  If you give iconv the -c 
option, it silently drops the U+2019 character and carries on.

Thanks.

Mike