1 # This file is derived from
3 # http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
5 # which was created by Markus Kuhn <mkuhn@acm.org> - 2000-09-02
7 # Lines beginning with # and blank lines are ignored.
9 # Beyond that, this file consists of a series of test cases. Each test case consists of
14 # VALID : The string is a valid UTF-8 representation of valid Unicode
15 # INCOMPLETE : The string has a partial character at the end
16 # NOTUNICODE : The string is valid UTF-8, but the characters represented
17 # are not valid Unicode (
18 # OVERLONG : The string includes overlong sequences
19 # MALFORMED : The string is not valid UTF-8
20 # 3. If the status is VALID or NOTUNICODE, the UCS-4 representation of the string,
21 # as a series of hex numbers.
23 # 1 Some correct UTF-8 text
26 03ba 1f79 03c3 03bc 03b5
28 # 2.1 First possible sequence of a certain length
30 # FIXME - handle NUL characters?
80 # 2.3 Other boundary conditions
106 # 3.1 Unexpected continuation bytes
124 €�‚ƒ„…†‡ˆ‰Š‹Œ�Ž��‘’“”•–—˜™š›œ�žŸ¡¢£¤¥¦§¨©ª«¬®¯°±²³´µ¶·¸¹º»¼½¾¿
127 # 3.2 Lonely start characters
129 À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
131 à á â ã ä å æ ç è é ê ë ì í î ï
140 # 3.3 Sequences with last continuation byte missing
163 # 3.4 Concatenation of incomplete sequences
165 Àà€ð€€ø€€€ü€€€€ßï¿÷¿¿û¿¿¿ý¿¿¿¿
168 # 3.5 Impossible bytes
177 # Examples of an overlong ASCII character
190 # Maximum overlong sequences
203 # Overlong representation of the NUL character
216 # Illegal code positions
218 # Single UTF-16 surrogates
248 # Paired UTF-16 surrogates
284 # Some more tests, not from Markus Kuhn's file
287 # Mixed plane 0 and higher planes
291 41 00010000 42 10fffd 43