Unicode

From Toolserver wiki

This page summarizes how to write programs that process Unicode strings correctly, with a focus on the Toolserver environment.

PHP

The byte-oriented functions ucfirst, strtoupper, lcfirst and strtolower, among others, are known to corrupt multi-byte character sequences[1]. Use the alternatives with the mb_ prefix instead: mb_strtoupper[2] and mb_strtolower. The mbstring extension has no direct replacement for ucfirst and lcfirst, so consider defining these:

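// Define the helpers only if mbstring is available and nothing else has
// already declared them. Both rely on the encoding set with mb_internal_encoding().
if (!function_exists('mb_ucfirst') && function_exists('mb_substr')) {
    // Upper-case the first character (not the first byte) of $string.
    function mb_ucfirst($string) {
        return mb_strtoupper(mb_substr($string, 0, 1)) . mb_substr($string, 1);
    }
}
if (!function_exists('mb_lcfirst') && function_exists('mb_substr')) {
    // Lower-case the first character (not the first byte) of $string.
    function mb_lcfirst($string) {
        return mb_strtolower(mb_substr($string, 0, 1)) . mb_substr($string, 1);
    }
}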
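
The helpers above depend on the mbstring internal encoding, which may not be UTF-8 on every installation. As a minimal sketch, assuming the input is always UTF-8, the encoding can be passed explicitly to each call. The names utf8_ucfirst and utf8_lcfirst are hypothetical and chosen only for this illustration.

```php
<?php
// Hypothetical helpers for illustration; they are not part of PHP or mbstring.
// Assumption: $string is valid UTF-8 and this file is saved as UTF-8.

function utf8_ucfirst($string) {
    $first = mb_substr($string, 0, 1, 'UTF-8');
    // Pass an explicit length so the call does not depend on how the
    // PHP version in use treats a null length argument.
    $rest = mb_substr($string, 1, mb_strlen($string, 'UTF-8'), 'UTF-8');
    return mb_strtoupper($first, 'UTF-8') . $rest;
}

function utf8_lcfirst($string) {
    $first = mb_substr($string, 0, 1, 'UTF-8');
    $rest = mb_substr($string, 1, mb_strlen($string, 'UTF-8'), 'UTF-8');
    return mb_strtolower($first, 'UTF-8') . $rest;
}

echo utf8_ucfirst('élan'), "\n";   // Élan
echo utf8_ucfirst('шапка'), "\n";  // Шапка
echo utf8_lcfirst('Ärger'), "\n";  // ärger
```

Passing 'UTF-8' on every call is more verbose than setting mb_internal_encoding('UTF-8') once, but it keeps the helpers correct even when other code changes the internal encoding.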

Notes

ucfirst, strtoupper, lcfirst and strtolower used to work in the sense that they changed Latin (ASCII) letters only and left all other bytes untouched, but this is no longer the case. In late 2010, Toolserver switched its default locale from C to en_US.UTF-8[3], and since then these plain string conversion functions can produce unexpected results. Some cases caused by the problem are listed at [4].
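
The locale dependence described above can be checked with a short script. The following is only a sketch: it assumes the mbstring extension is loaded and the file is saved as UTF-8, and the variable names are made up for the example. The result of the first strtoupper call depends on the PHP version and the active locale, so no output is claimed for it.

```php
<?php
// Sketch only; assumes mbstring is loaded and this file is saved as UTF-8.
$word = 'ünïcode';

// Passing the string "0" queries the current LC_CTYPE locale without changing it.
echo 'LC_CTYPE: ', setlocale(LC_CTYPE, '0'), "\n";

// Byte-oriented conversion under whatever locale is active.
$bytewise = strtoupper($word);
echo mb_check_encoding($bytewise, 'UTF-8')
    ? "strtoupper kept the string valid UTF-8\n"
    : "strtoupper corrupted the string\n";

// Under the C locale only ASCII letters are converted, so the bytes of
// multi-byte characters are left alone (the behaviour before the switch).
setlocale(LC_CTYPE, 'C');
$ascii_only = strtoupper($word);        // 'üNïCODE'

// The multi-byte function converts every letter, regardless of locale.
$full = mb_strtoupper($word, 'UTF-8');  // 'ÜNÏCODE'

echo $ascii_only, "\n", $full, "\n";
```

Resetting LC_CTYPE to C only restores the old ASCII-only behaviour. It does not make the plain functions Unicode-aware, so the mb_ functions remain the safer choice for text that may contain non-ASCII letters.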

References

  1. http://www.phpwact.org/php/i18n/utf-8#strtolower
  2. http://php.net/manual/en/function.mb-strtoupper.php
  3. https://jira.toolserver.org/browse/TS-852
  4. https://jira.toolserver.org/browse/TS-923