Shifting Unicode Codepoints

Aug 26 2009

While working on a word search generator that translated letter symbols (sort of like cryptograms, except this one’s impossible to solve without the key), php ran out of memory on a rather ordinary function.

function uniord($c) {
    $h = ord($c{0});
    if ($h <= 0x7F) {
        return $h;
    } else if ($h < 0xC2) {
        return false;
    } else if ($h <= 0xDF) {
        return ($h & 0x1F) << 6 | (ord($c{1}) & 0x3F);
    } else if ($h <= 0xEF) {
        return ($h & 0x0F) << 12 | (ord($c{1}) & 0x3F) << 6
                                 | (ord($c{2}) & 0x3F);
    } else if ($h <= 0xF4) {
        return ($h & 0x0F) << 18 | (ord($c{1}) & 0x3F) << 12
                                 | (ord($c{2}) & 0x3F) << 6
                                 | (ord($c{3}) & 0x3F);
    } else {
        return false;
    }
}
function unichr($c) {
    if ($c <= 0x7F) {
        return chr($c);
    } else if ($c <= 0x7FF) {
        return chr(0xC0 | $c >> 6) . chr(0x80 | $c & 0x3F);
    } else if ($c <= 0xFFFF) {
        return chr(0xE0 | $c >> 12) . chr(0x80 | $c >> 6 & 0x3F)
                                    . chr(0x80 | $c & 0x3F);
    } else if ($c <= 0x10FFFF) {
        return chr(0xF0 | $c >> 18) . chr(0x80 | $c >> 12 & 0x3F)
                                    . chr(0x80 | $c >> 6 & 0x3F)
                                    . chr(0x80 | $c & 0x3F);
    } else {
        return false;
    }
}
function asc_shift($string, $amount) {
 $key = substr($string, 0, 1);
 if(strlen($string)==1) {
   return unichr(uniord($key) + $amount);
 } else {
   return unichr(uniord($key) + $amount) . asc_shift(substr($string, 1, strlen($string)-1), $amount);
 }
}

Now I understand why there is a need for a lower level implementation of UTF-8 strings by default in php 6. Python support of UTF-8 support has already been included in Python 3.0 released December 3rd, 2008.

No responses yet