UTF-8 in PHP regular expressions

Accepted answer
Score: 35

Updated answer:
This is now tested and working

$post = '9999, škofja loka';
echo preg_match('/^\d{4},[\s\p{L}]+$/u', $post);

\w is not a good fit here: in addition to letters it also matches digits and the underscore ([0-9_]), and without Unicode-aware matching it does not cover all Unicode letters.
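As a rough illustration of the difference, the sketch below runs both escapes over a few sample strings. The sample values are made up for this example and are not from the question.

```php
<?php
// Hypothetical sample inputs, chosen only for illustration.
$samples = ['loka', 'škofja', 'loka_42'];

foreach ($samples as $s) {
    $word   = preg_match('/^\w+$/u', $s);     // letters, digits and underscore
    $letter = preg_match('/^\p{L}+$/u', $s);  // letters only
    echo $s, ': \w=', $word, ' \p{L}=', $letter, PHP_EOL;
}
```

The string 'loka_42' is accepted by \w but rejected by \p{L}, which is why \p{L} is the safer choice when only letters are allowed.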

The u modifier is also important: it activates Unicode mode.

If letters and whitespace can both appear after the comma, put them into the same character class. Your regex allows zero or more whitespace characters after the comma, and after that only letters.
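A minimal sketch of that difference, using \p{L} and an end anchor in both patterns so that only the placement of the whitespace changes (this is a variant for illustration, not the exact regex from the question):

```php
<?php
$post = '9999, škofja loka';

// Whitespace first, then letters only: the space before the second word cannot be matched.
$sequential = preg_match('/^\d{4},\s*\p{L}+$/u', $post);

// Whitespace and letters in one class: they may appear in any order.
$combined = preg_match('/^\d{4},[\s\p{L}]+$/u', $post);

var_dump($sequential, $combined); // int(0), int(1)
```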

See http://www.regular-expressions.info/php.html for php regex details

The \p{L} (Unicode letter) is explained here

The end-of-string anchor $ is also important: it ensures that the complete string is verified. Without it, the regex could, for example, match only up to the first whitespace and ignore the rest.
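To see the effect of the anchor, here is a small sketch with a hypothetical input that has unwanted digits at the end. The input string and variable names are assumptions made for this example.

```php
<?php
$post = '9999, škofja loka 123'; // hypothetical input with trailing digits

// Without $: the pattern matches a prefix and silently ignores the rest.
$unanchored = preg_match('/^\d{4},[\s\p{L}]+/u', $post, $m);

// With $: the whole string must consist of the allowed characters.
$anchored = preg_match('/^\d{4},[\s\p{L}]+$/u', $post);

var_dump($unanchored, $m[0], $anchored);
```

The unanchored pattern reports a match even though only '9999, škofja loka ' was consumed, while the anchored one rejects the string.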

Score: 8

[a-zA-Z] will match only letters in the ranges a-z and A-Z. Your input contains non-ASCII letters, so your regex won't match, regardless of the /u modifier. You need to use the word character escape sequence (\w).

$post = '9999,škofja loka';
echo preg_match('/^[0-9]{4},[\s]*[\w]+/u', $post);
Score: 7

The problem is your regular expression. You are explicitly saying that you will only accept a b c ... z A B C ... Z. š is not in the a-z set. Remember, š is as different from s as any other character.

So if you really just want a sequence of letters, you need to test for the Unicode properties, e.g.:

echo preg_match('/^[0-9]{4},[\s]*\p{L}+/u', $post);

That should work because \p{L} matches any Unicode character that is considered a letter, not just A through Z.

Score: 0

Add a u, and remember the trailing slash:

echo preg_match('/^[0-9]{4},[\s]*[a-zA-Z]+/u', $post);

Edited:

echo preg_match('/^\d{4},(?:\s|\w)+/u', $post);
