[Kolab-devel] ASCII representation of unicode characters
Jeroen van Meeuwen (Kolab Systems)
vanmeeuwen at kolabsys.com
Mon Dec 5 10:31:00 CET 2011
Hello,
I'm hoping you can help me solve the following problem, or get closer
to a solution.
When, for example, a recipient policy^1 translates a givenName and sn
(surname) into an email address, the two names can contain characters
from virtually any script, but the email address must be in ASCII.
^1: A recipient policy uses data from an entity's name to compose
other attributes for that entity from a template. For example, "Jeroen"
(givenName) and "van Meeuwen" (surname) become, given the template
'%(givenname)s.%(sn)s@%(domain)s' for the email address,
'jeroen.vanmeeuwen at kolabsys.com'.
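To make the footnote concrete, here is a minimal sketch of how such a
template could be filled in with Python's %-style formatting. The
function name compose_mail is hypothetical, and the lower-casing and
whitespace removal are assumptions inferred from the example result,
not a description of the actual recipient policy code.

```python
# Template string as quoted in the footnote above.
TEMPLATE = '%(givenname)s.%(sn)s@%(domain)s'


def compose_mail(givenname, sn, domain, template=TEMPLATE):
    """Fill the template (hypothetical helper).

    Assumption: names are lower-cased and internal whitespace is
    dropped, which is what the 'van Meeuwen' example suggests.
    """
    values = {
        'givenname': ''.join(givenname.split()).lower(),
        'sn': ''.join(sn.split()).lower(),
        'domain': domain,
    }
    return template % values


print(compose_mail('Jeroen', 'van Meeuwen', 'kolabsys.com'))
```

Note that this step alone does nothing about non-ASCII input: a name
such as "Müller" would pass straight through into the address.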
Examples are people's names with accents (grave, acute, circumflex),
German characters (umlauts, ß[1]), and of course entirely non-Roman
alphabets such as Cyrillic and Greek.
My problem is translating these characters from the input value into
the output value. I only speak/understand a limited number of languages,
but from what I understand, non-ASCII characters are mostly
translated into their 'phonetic equivalent representation'. 'Ü' usually
becomes 'Ue', for example. I think for some characters or instances
thereof, however, it's not safe to just translate them. 'ß' for example,
I believe, can become 'ss' or 'sz', depending on a couple of rules that
humans understand but that are hard to codify.
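One possible starting point, sketched here under stated assumptions, is
a two-step approach: first apply an explicit override table for
characters with a conventional replacement, then fall back to Unicode
decomposition (NFKD) and drop whatever is still not ASCII. The
OVERRIDES table and the name to_ascii are illustrative only; in
particular, mapping 'ß' to 'ss' is one assumed choice and does not
settle the 'ss' versus 'sz' question.

```python
import unicodedata

# Assumed, German-flavoured overrides; other languages would need
# their own entries (or a different table altogether).
OVERRIDES = {
    'Ä': 'Ae', 'Ö': 'Oe', 'Ü': 'Ue',
    'ä': 'ae', 'ö': 'oe', 'ü': 'ue',
    'ß': 'ss',
}


def to_ascii(text, overrides=OVERRIDES):
    """Best-effort ASCII rendering of text (hypothetical helper)."""
    # Compose first so that 'u' + combining diaeresis matches 'ü'.
    text = unicodedata.normalize('NFC', text)
    replaced = ''.join(overrides.get(ch, ch) for ch in text)
    # Split remaining accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize('NFKD', replaced)
    # Drop everything that is still outside ASCII (marks, other scripts).
    return decomposed.encode('ascii', 'ignore').decode('ascii')


for name in ('Müller', 'Strauß', 'José'):
    print(name, '->', to_ascii(name))
```

Characters from scripts without a decomposition to Latin letters, such
as Cyrillic and Greek, are silently dropped by this fallback, so they
would still need a transliteration table of their own.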
I have created a table of characters going from through to
Ѐ[2] (there's more[3]) and I am seeking a logical, codified
approach to "normalizing" as much of Unicode as possible to ASCII. I
would appreciate your help in outlining what the rules would need to be.
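As a rough way to see how much of such a table needs hand-written
rules, the sketch below splits the letters in a code point range into
those that Unicode decomposition (NFKD) already reduces to some ASCII,
and those that it does not. The names survey and needs_manual_rule are
hypothetical. The upper bound U+0400 is the 'Ѐ' mentioned above; the
lower bound U+0080 is an assumption.

```python
import unicodedata


def needs_manual_rule(ch):
    """True if NFKD decomposition of ch leaves no ASCII characters."""
    decomposed = unicodedata.normalize('NFKD', ch)
    ascii_part = decomposed.encode('ascii', 'ignore').decode('ascii')
    return not ascii_part


def survey(start, end):
    """Split the letters in [start, end] into (auto, manual) lists."""
    auto, manual = [], []
    for cp in range(start, end + 1):
        ch = chr(cp)
        # Only consider letters (general categories Lu, Ll, Lt, Lm, Lo).
        if not unicodedata.category(ch).startswith('L'):
            continue
        (manual if needs_manual_rule(ch) else auto).append(ch)
    return auto, manual


auto, manual = survey(0x80, 0x400)
print(len(auto), 'letters reduce to ASCII via NFKD')
print(len(manual), 'letters need an explicit rule')
```

"Reduces to ASCII" here only means the accent is stripped ('é' gives
'e'), which is not necessarily the phonetic form a language expects
('Ü' gives 'U', not 'Ue'), so the automatic list would still need
review per language.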
Thanks in advance!
[1] http://en.wikipedia.org/wiki/%C3%9F
[2] http://hosted.kolabsys.com/~vanmeeuwen/unicodechars.htm
[3] http://ascii-table.com/unicode.php (17 times 65536)
Kind regards,
Jeroen van Meeuwen
--
Senior Engineer, Kolab Systems AG
e: vanmeeuwen at kolabsys.com
t: +44 144 340 9500
m: +44 74 2516 3817
w: http://www.kolabsys.com
pgp: 9342 BF08