[Kolab-devel] [SOLVED] Re: ASCII representation of unicode characters

Jeroen van Meeuwen (Kolab Systems) vanmeeuwen at kolabsys.com
Mon Dec 5 15:20:59 CET 2011


On 2011-12-05 11:51, Mathieu Parent wrote:
> Hi,
>
> 2011/12/5 Jeroen van Meeuwen (Kolab Systems) 
> <vanmeeuwen at kolabsys.com>:
>> On 2011-12-05 9:40, Aleksander Machniak wrote:
>>> On 05.12.2011 10:31, Jeroen van Meeuwen (Kolab Systems) wrote:
>>>
>>>> I have created a table of characters going from  through to
>>>> Ѐ[2] (there's more[3]) and I am seeking a logical, codified
>>>> approach to "normalizing" as much of the unicode to ascii. I would
>>>> appreciate your help in outlining what the rules would need to
>>>> be(come).
>>>
>>> You could try using iconv with //TRANSLIT.
>>>
>>>
>>> 
>>> http://stackoverflow.com/questions/4910627/php-iconv-translit-for-removing-accents-not-working-as-excepted
>>
>> I don't think this is satisfactory, iconv() outputs, given the
>> following code-snippet:
>>
>>     $original = 'Ü';
>>     $translated = iconv('UTF-8', 'ASCII//TRANSLIT', $original);
>>     print "$original\t$translated\n";
>>
>> $ php unicode-to-ascii.php
>> Ü       U
>> ü       u
>> $
>>
>> We currently use 'ue' as the substitute for 'ü' however
>> (bruederli at kolabsys.com for Thomas, for example).
>>
>
> It seems that setting LC_ALL to the intended language does
> transliteration right: 
> http://php.net/manual/en/function.iconv.php#105507
>

You're right, it does.

So it seems we want to set the locale / language depending on the
account being created or managed, so that the account details are
filled in the way the user in question expects.

The example case is our own systems, which use en_US.UTF-8 but hold
many German / Swiss names.

I suppose we can use preferredLanguage in LDAP[1] with valid content 
described in [2], and fall back to ... the active system language or any 
specifically configured preferred language perhaps.

[1] http://tools.ietf.org/html/rfc2798#section-2.7
[2] http://tools.ietf.org/html/rfc2068#section-14.13
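
As a rough sketch of that lookup (not the actual implementation), the
following turns a preferredLanguage value into a locale name. It is
written in Python 3 syntax. The function name preferred_locale and its
default argument are made up for illustration, and it assumes the
attribute holds an Accept-Language-style list such as
"de-CH, de;q=0.8, en;q=0.5".

```python
def preferred_locale(preferred_language, default='en_US'):
    """Return the highest-weighted language tag as ll_CC, or the default.

    `default` stands in for the system language or a configured
    fallback; both the name and the value are assumptions.
    """
    candidates = []
    for position, item in enumerate((preferred_language or '').split(',')):
        tag, _, params = item.strip().partition(';')
        tag = tag.strip()
        if not tag or tag == '*':
            continue
        quality = 1.0  # a tag without an explicit q-value weighs 1.0
        params = params.strip()
        if params.startswith('q='):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        # Sort by descending quality, then by order of appearance.
        candidates.append((-quality, position, tag))
    if not candidates:
        return default
    tag = min(candidates)[2]
    language, _, country = tag.partition('-')
    if country:
        return '%s_%s' % (language.lower(), country.upper())
    return language.lower()
```

The result ('de_CH' for the example value above) could then be passed
through locale.normalize() before being handed to iconv.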

Here's how a simple routine would look in Python (the language in
which this part of the code is primarily implemented):

     #!/usr/bin/python
     # -*- coding: utf-8 -*-

     # On the command line, this would look as follows:
     # $ echo "Brüderli" | env LANG=de_CH.ISO8859-1 \
     #       iconv -f 'UTF-8' -t 'ASCII//TRANSLIT' -s

     import locale
     import subprocess

     (locale_name,locale_charset) = locale.normalize('de_CH').split('.')
     locale.setlocale(locale.LC_ALL, (locale_name,locale_charset))

     command = [ '/usr/bin/iconv',
                 '-f', 'UTF-8',
                 '-t', 'ASCII//TRANSLIT',
                 '-s' ]

     process = subprocess.Popen(command,
                                stdin=subprocess.PIPE,
                                stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE,
                                env={'LANG': locale.normalize('de_CH')})

     print >> process.stdin, "Brüderli\n"
     print process.communicate()[0].strip()
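
The routine above hard-codes de_CH. A minimal sketch of the same idea
with the language as a parameter, in Python 3 syntax, could look like
the following. The function names iconv_call and transliterate are
hypothetical, and the sketch assumes iconv lives at /usr/bin/iconv and
that the requested locale is installed on the system.

```python
import locale
import subprocess


def iconv_call(language):
    """Return the (command, env) pair for a locale-aware iconv run."""
    command = ['/usr/bin/iconv',
               '-f', 'UTF-8',
               '-t', 'ASCII//TRANSLIT',
               '-s']
    # Only LANG is passed on, as in the routine above.
    env = {'LANG': locale.normalize(language)}
    return command, env


def transliterate(text, language):
    """Pipe text through iconv under the given language's locale."""
    command, env = iconv_call(language)
    result = subprocess.run(command,
                            input=text.encode('utf-8'),
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE,
                            env=env)
    return result.stdout.decode('ascii', 'replace').strip()
```

Calling transliterate("Brüderli", "de_CH") would then run the same
pipeline as the command line shown in the comment above; what comes
back depends on the locales available on the machine.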

That'll settle it, I think.

Kind regards,

Jeroen van Meeuwen

-- 
Senior Engineer, Kolab Systems AG

e: vanmeeuwen at kolabsys.com
t: +44 144 340 9500
m: +44 74 2516 3817
w: http://www.kolabsys.com

pgp: 9342 BF08



