php regex multibyte

What is the best way to test whether a given character is uppercase or lowercase in PHP?


What is an ideal way to detect whether a character is uppercase or lowercase, regardless of the current locale or language?

Is there a more direct function?

Assumptions: the internal character encoding is set to UTF-8, the local browser session is en-US,en;q=0.5, and the Multibyte String extension is installed. Do not use ctype_lower or ctype_upper.

The test code below should be multibyte compatible.

$encodingtype = 'UTF-8';
// $character holds the single character under test.
$charactervalue = mb_ord($character, $encodingtype);

// Lowercase form and its code point.
$characterlowercase = mb_strtolower($character, $encodingtype);
$characterlowercasevalue = mb_ord($characterlowercase, $encodingtype);

// Uppercase form and its code point.
$characteruppercase = mb_strtoupper($character, $encodingtype);
$characteruppercasevalue = mb_ord($characteruppercase, $encodingtype);



// Diag Info (echo the values computed above; do not reassign them inside echo)
echo 'Input: ' . $character . "<br />";
echo 'Input Value: ' . $charactervalue . "<br />" . "<br />";
echo 'Lowercase: ' . $characterlowercase . "<br />";
echo 'Lowercase Value: ' . $characterlowercasevalue . "<br />" . "<br />";
echo 'Uppercase: ' . $characteruppercase . "<br />";
echo 'Uppercase Value: ' . $characteruppercasevalue . "<br />" . "<br />";
// Diag Info


if ($charactervalue == $characterlowercasevalue and $charactervalue != $characteruppercasevalue){
    $uppercase = 0;
    $lowercase = 1;
    echo 'Character is lowercase' . "<br />" . "<br />";
}

elseif ($charactervalue == $characteruppercasevalue and $charactervalue != $characterlowercasevalue ){
    $uppercase = 1;
    $lowercase = 0;
    echo 'Character is uppercase' . "<br />" . "<br />";
}

else{
    $uppercase = 0;
    $lowercase = 0;
    echo 'Character is neither lowercase nor uppercase' . "<br />" . "<br />";
}
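
For comparison, the same idea can be condensed into a small function. This is only a sketch: the name classify_by_case_mapping and its return values are made up for illustration and are not part of the code above. It assumes the Multibyte String extension is loaded and compares the strings directly instead of their code points.

```php
<?php
// Hypothetical helper (illustrative name): classifies a single character by
// comparing it with its lowercase and uppercase forms.
function classify_by_case_mapping(string $character, string $encoding = 'UTF-8'): string
{
    $lower = mb_strtolower($character, $encoding);
    $upper = mb_strtoupper($character, $encoding);

    if ($character === $lower && $character !== $upper) {
        return 'lower';
    }
    if ($character === $upper && $character !== $lower) {
        return 'upper';
    }
    // Digits, punctuation, and characters without case map to themselves both ways.
    return 'neither';
}

foreach (['A', 'z', '+', 'Ä'] as $character) {
    echo $character, ': ', classify_by_case_mapping($character), "\n";
}
```

Comparing the strings rather than the mb_ord() values avoids calling mb_ord() on a converted string that may be longer than one character.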

Solution

  • I feel the most direct way would be to write a regex pattern with alternations to determine the character type.

    In the following snippet, I'll search for an uppercase letter (including Unicode letters) in the first capture group, or a lowercase letter in the second capture group, or an empty match. If the pattern makes an empty match, the character is not a letter and only the full-string match element will be populated in the match array.

    A good reference for Unicode letters in regex: https://regular-expressions.mobi/unicode.html

    Writing two capture groups separated by a pipe means each type of letter will be slotted into a different indexed element in the output array. [0] is the full-string match; it is the only element in the array when the character is not a letter. [1] will hold the uppercase match (or be an empty placeholder element when there is a lowercase match). [2] will hold the lowercase match and will only be generated if there is a lowercase match.

    For this reason, we can assume the highest key in the matches array will determine the casing of the letter.

    If the input character is a non-letter, preg_match() will populate a single-element matches array. When this happens, the 0 key is used with the lookup to access 'neither'.

    Code: (Demo)

    $lookup = ['neither', 'upper', 'lower'];
    $tests = ['A', 'z', '+', '0', 'ǻ', 'Ͱ', ''];
    
    foreach ($tests as $test) {
        preg_match('~(\p{Lu})|(\p{Ll})|~u', $test, $out);
        printf("%s: %s\n", $test, $lookup[array_key_last($out)]);
        //printf("%s: %s\n", $test, $lookup[count($out) - 1]); // for PHP versions below 7.3
    }
    

    Output:

    A: upper
    z: lower
    +: neither
    0: neither
    ǻ: lower
    Ͱ: upper
    : neither
    

    This answer closely relates to this similar page: How to check if letter is upper or lower in PHP?
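
To make the array shapes described in this answer concrete, here is a minimal sketch that wraps the same pattern in a function and dumps the matches array for one input of each kind. The function name letter_case is hypothetical, and the guard for a failed match is my own addition; with the u modifier, preg_match() returns false when the subject is not valid UTF-8.

```php
<?php
// Hypothetical wrapper (illustrative name) around the pattern from the answer.
function letter_case(string $character): string
{
    $lookup = ['neither', 'upper', 'lower'];

    // Anything other than exactly one match (for example invalid UTF-8) is
    // treated as 'neither' here; that choice is an assumption of this sketch.
    if (preg_match('~(\p{Lu})|(\p{Ll})|~u', $character, $matches) !== 1) {
        return 'neither';
    }

    // Highest key: 0 = empty match, 1 = uppercase group, 2 = lowercase group.
    return $lookup[count($matches) - 1];
}

// Show the shape of the matches array for each kind of input.
foreach (['A', 'z', '+'] as $character) {
    preg_match('~(\p{Lu})|(\p{Ll})|~u', $character, $matches);
    echo $character, ' => ', json_encode($matches), ' => ', letter_case($character), "\n";
}
```

Because the empty alternative always succeeds at the start of the subject, only the first character of a longer string is classified.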