This will be a long post. It has absolutely nothing to do with homework; I am just curious. It won't have immediate practical benefits, but that is like pursuing pure science: you never know what you will get.
I am trying to calculate the total number of valid UTF-8 sequences of length N divided by the total number of N-byte sequences. I want an exact, fully reduced fraction, which I represent as an integer pair (numerator, denominator).
The total number of N-byte sequences is easy to calculate: it is just 256**N, or 1 << (8 * N) in Python. The number of valid UTF-8 sequences is trickier, but I know how to calculate it.
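A quick sanity check of that count in plain Python:

```python
# All N-byte sequences: 256 choices per byte, so 256**N == 2**(8*N).
assert 256**4 == 1 << (8 * 4) == 4294967296
```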
First, a UTF-8 character can be encoded as a byte sequence of length 1, 2, 3, or 4; the bit patterns of UTF-8 encoding are the following:
Code_Range | Byte_Length | Bit_Pattern | Data_Bits |
---|---|---|---|
U+0000..007F | 1 byte | 0xxxxxxx | 7 bits |
U+0080..07FF | 2 bytes | 110xxxxx 10xxxxxx | 11 bits |
U+0800..FFFF | 3 bytes | 1110xxxx 10xxxxxx 10xxxxxx | 16 bits |
U+10000..10FFFF | 4 bytes | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 21 bits |
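For instance, a quick illustration in plain Python that a three-byte character really follows the 1110xxxx 10xxxxxx 10xxxxxx pattern:

```python
# U+20AC (the euro sign) is in U+0800..FFFF, so it encodes as three bytes.
data = "\u20ac".encode()
print(data, [f"{b:08b}" for b in data])
# b'\xe2\x82\xac' ['11100010', '10000010', '10101100']
```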
Now, UTF-8 encoding doesn't use the code points in range(0xd800, 0xe000); the following will raise an exception in Python:
In [530]: chr(0xd800).encode()
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
Cell In[530], line 1
----> 1 chr(0xd800).encode()
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud800' in position 0: surrogates not allowed
These code points are used as surrogate pairs in UTF-16, and they are invalid in UTF-8. There are other reserved characters in Unicode, but those are still valid UTF-8:
In [531]: chr(0xe000).encode()
Out[531]: b'\xee\x80\x80'
Now, for a given byte sequence of length N, the chance of it being a valid N-byte UTF-8 encoding of a single character is:
First_Byte_Condition | Byte_Length | Data_Bits | Bit_Chunks | Valid_Range | Total | Proportion |
---|---|---|---|---|---|---|
(fb & 0x80) == 0x00 | 1 byte | 7 | 7 | 0x00..0x7f | 2**7 = 128 | 128/2**8 = 1/2 |
(fb & 0xe0) == 0xc0 | 2 bytes | 11 | 5, 6 | 0xc0..0xd7 | 24 * 2**6 = 1536 | 1536/2**16 = 3/128 |
(fb & 0xf0) == 0xe0 | 3 bytes | 16 | 4, 6, 6 | 0xe0..0xef | 2**16 = 65536 | 65536/2**24 = 1/256 |
(fb & 0xf8) == 0xf0 | 4 bytes | 21 | 3, 6, 6, 6 | 0xf0..0xf4 | 0x110000 - 0x10000 = 1048576 | 1048576/2**32 = 1/4096 |
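The last two columns can be checked with exact arithmetic; a small sketch using fractions.Fraction and the Total values from this table:

```python
from fractions import Fraction

# Each Total divided by the 2**(8*length) possible byte patterns of that length.
totals = [128, 1536, 65536, 1048576]
print([Fraction(t, 2 ** (8 * n)) for n, t in enumerate(totals, start=1)])
# [Fraction(1, 2), Fraction(3, 128), Fraction(1, 256), Fraction(1, 4096)]
```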
I have written a UTF-8 encoder and decoder as a programming challenge, and I reimplemented it as a UTF-8 sequence validator. Using absolute brute force, I was able to find the value of (valid UTF-8 sequences of length N) / (byte sequences of length N) for N = 1, 2, 3:
from itertools import product
# Maps each valid leading byte to (initial code bits, number of continuation bytes).
UTF8_PATTERN = {
    **{b: (b, 0) for b in range(0x00, 0x80)},         # 0xxxxxxx
    **{b: (b - 0xc0, 1) for b in range(0xc0, 0xd8)},  # 110xxxxx, leads 0xd8..0xdf excluded (see the correction at the end)
    **{b: (b - 0xe0, 2) for b in range(0xe0, 0xf0)},  # 1110xxxx
    **{b: (b - 0xf0, 3) for b in range(0xf0, 0xf5)},  # 11110xxx
}
class UTF8_Validator:
    def __init__(self, data: bytes) -> None:
        self.data = memoryview(data)
        self.length = len(data)
        self.pos = 0

    def _validate_char(self, pos: int) -> bool:
        # Validate one character starting at pos; advance self.pos past it on success.
        data = self.data
        if not (pattern := UTF8_PATTERN.get(data[pos])):
            return False
        code = pattern[0]
        length = self.length
        for _ in range(pattern[1]):
            pos += 1
            # Every continuation byte must match 10xxxxxx.
            if pos == length or ((byte := data[pos]) & 0xC0) != 0x80:
                return False
            code = (code << 6) | (byte & 0x3F)
        if code >= 0x110000:
            return False
        self.pos = pos + 1
        return True

    def validate(self) -> bool:
        length = self.length
        while self.pos < length:
            if not self._validate_char(self.pos):
                return False
        return True
def get_valid_UTF8_proportion(length: int) -> tuple[int, int]:
    count = sum(
        1
        for data in product(range(256), repeat=length)
        if UTF8_Validator(bytes(data)).validate()
    )
    # Reduce by the largest power of two dividing the count.
    trailing = (count & -count).bit_length() - 1
    return count >> trailing, 1 << (8 * length - trailing)
In [527]: get_valid_UTF8_proportion(1)
Out[527]: (1, 2)
In [528]: get_valid_UTF8_proportion(2)
Out[528]: (35, 128)
In [529]: get_valid_UTF8_proportion(3)
Out[529]: (39, 256)
For the case N = 3, it takes a very long time.
But that is okay: I have implemented an optimized solution using dynamic programming that lets me calculate the result for N in the thousands, although it is still brute force at its core.
First, an N-byte sequence can only be a valid UTF-8 sequence if it splits cleanly into valid UTF-8 characters. Since a single Unicode character can be encoded as 1, 2, 3, or 4 bytes, and a valid UTF-8 sequence can only consist of valid UTF-8 characters, to solve the problem we first need to find all the ways the number N can be written as an ordered sum of the numbers 1 to 4. I call this an integer partition problem, although since order matters these are really compositions.
Integer partition:
def partitions_quad(number: int) -> list[tuple[int, ...]]:
    # Enumerate all ordered ways to write `number` as a sum of 1..4.
    result = []
    stack = [(number, ())]
    while stack:
        remaining, path = stack.pop()
        if not remaining:
            result.append(path)
        else:
            stack.extend(
                (remaining - i, path + (i,)) for i in range(min(remaining, 4), 0, -1)
            )
    return result
Example output:
In [557]: partitions_quad(6)
Out[557]:
[(1, 1, 1, 1, 1, 1),
(1, 1, 1, 1, 2),
(1, 1, 1, 2, 1),
(1, 1, 1, 3),
(1, 1, 2, 1, 1),
(1, 1, 2, 2),
(1, 1, 3, 1),
(1, 1, 4),
(1, 2, 1, 1, 1),
(1, 2, 1, 2),
(1, 2, 2, 1),
(1, 2, 3),
(1, 3, 1, 1),
(1, 3, 2),
(1, 4, 1),
(2, 1, 1, 1, 1),
(2, 1, 1, 2),
(2, 1, 2, 1),
(2, 1, 3),
(2, 2, 1, 1),
(2, 2, 2),
(2, 3, 1),
(2, 4),
(3, 1, 1, 1),
(3, 1, 2),
(3, 2, 1),
(3, 3),
(4, 1, 1),
(4, 2)]
I call a tuple like (3, 2, 1) a run, and the probability of a random 6-byte sequence being a valid UTF-8 sequence is the sum of the probabilities of all these runs. To get a run's probability, we first replace each number in it with the corresponding probability from the list [(1, 2), (3, 128), (1, 256), (1, 4096)]. I have guessed, and experimentally confirmed, that the probability of a run is the product of its sub-probabilities.
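That claim can be checked with exact fractions; a minimal sketch, assuming partitions_quad from above is in scope:

```python
from fractions import Fraction
from math import prod

# Probability of a run = product of its per-character probabilities;
# probability of a valid sequence = sum over all runs.
probs = [Fraction(n, d) for n, d in [(1, 2), (3, 128), (1, 256), (1, 4096)]]
total = sum(prod(probs[s - 1] for s in run) for run in partitions_quad(2))
assert total == Fraction(35, 128)  # matches get_valid_UTF8_proportion(2)
```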
Using dynamic programming, bit hacking, and some basic arithmetic, I was able to implement a thoroughly optimized version of the brute-force solution:
VALID_UTF8 = [(1, 2), (3, 128), (1, 256), (1, 4096)]


def reduced_add(num1: int, den1: int, num2: int, den2: int) -> tuple[int, int]:
    # Add two fractions whose denominators are powers of two; the common
    # denominator is then simply the larger of the two.
    if den1 > den2:
        num1, den1, num2, den2 = num2, den2, num1, den1
    num = (num1 << (den2.bit_length() - den1.bit_length())) + num2
    return num, den2


def valid_UTF8_chance(length: int) -> tuple[int, int]:
    # At iteration i, stack[-k] holds the result for length i - k.
    stack = [(1, 1)]
    for i in range(1, length + 1):
        total_num, total_den = 0, 1
        for char_len in range(1, min(i + 1, 5)):
            last_num, last_den = stack[-char_len]
            run_num, run_den = VALID_UTF8[char_len - 1]
            total_num, total_den = reduced_add(
                total_num, total_den, last_num * run_num, last_den * run_den
            )
        stack.append((total_num, total_den))
        stack = stack[-4:]  # only the last four lengths are ever needed
    num, den = stack[-1]
    # Strip the largest shared power of two from numerator and denominator.
    shift = min((num & -num).bit_length() - 1, (den & -den).bit_length() - 1)
    return num >> shift, den >> shift
In [561]: [valid_UTF8_chance(i) for i in range(1, 26)]
Out[561]:
[(1, 2),
(35, 128),
(39, 256),
(1389, 16384),
(1545, 32768),
(54995, 2097152),
(61175, 4194304),
(2177581, 268435456),
(2422281, 536870912),
(86223315, 34359738368),
(95912439, 68719476736),
(3414091309, 4398046511104),
(3797741065, 8796093022208),
(135184079315, 562949953421312),
(150375043575, 1125899906842624),
(5352737711661, 72057594037927936),
(5954237885961, 144115188075855872),
(211946563197395, 9223372036854775808),
(235763514741239, 18446744073709551616),
(8392218724509229, 1180591620717411303424),
(9335272783474185, 2361183241434822606848),
(332297603969210835, 151115727451828646838272),
(369638695103107575, 302231454903657293676544),
(13157628659176285741, 19342813113834066795298816),
(14636183439588716041, 38685626227668133590597632)]
It uses very little memory, and it is reasonably fast:
In [562]: %timeit valid_UTF8_chance(16)
35.8 μs ± 994 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [563]: %timeit valid_UTF8_chance(32)
79.5 μs ± 652 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [564]: %timeit valid_UTF8_chance(64)
175 μs ± 1.08 μs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
In [565]: %timeit valid_UTF8_chance(128)
371 μs ± 7.84 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [566]: %timeit valid_UTF8_chance(256)
818 μs ± 19.8 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [567]: %timeit valid_UTF8_chance(512)
1.73 ms ± 25.9 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [568]: %timeit valid_UTF8_chance(1024)
3.96 ms ± 102 μs per loop (mean ± std. dev. of 7 runs, 100 loops each)
However, it is still inefficient, because it is essentially doing the following calculations, and there are lots of repeating patterns in them:
def valid_UTF8_chance_expansion(length: int) -> str:
    segments = partitions_quad(length)
    return " +\n".join(
        " * ".join("{0}/{1}".format(*VALID_UTF8[s - 1]) for s in segment)
        for segment in segments
    )
print(",\n\n".join(valid_UTF8_chance_expansion(i) for i in range(1, 7)))
But my math is very rusty, and I can't devise a better algorithm. Is there a better way?
Okay, so my understanding of the UTF-8 encoding was wrong, and consequently so was my implementation of UTF-8 encoding and decoding, hence the wrong numbers.
There are 0x80 - 0x00 = 128 one-byte UTF-8 characters.
There are 0x800 - 0x80 = 1920 two-byte UTF-8 characters. 512 of them have a first byte in range(0xd8, 0xe0), which are also the first bytes used by surrogate pairs. I had excluded them, because I thought every two-byte sequence with its first byte in the surrogate range must be part of a surrogate pair; I was wrong.
There are 0x10000 - 0x800 = 63488 Unicode characters that correspond to a three-byte sequence, but that range contains the surrogate code points range(0xd800, 0xe000), which we should exclude here: subtract 2048 from 63488 and we get 61440.
And finally, there are indeed 0x110000 - 0x10000 = 1048576 four-byte UTF-8 characters.
I feel I need to add this: my original function valid_UTF8_chance uses the correct method, but the initial values were wrong. Using VALID_UTF8 = [(1, 2), (15, 512), (15, 4096), (1, 4096)] I got the correct numbers. I have re-implemented my UTF-8 encoding and decoding functions and used itertools.product(range(256), repeat=length) to check all byte sequences of a given length; using absolute brute force and running the code on PyPy3, I was able to verify the results for lengths 1, 2, 3, 4 in under 2 hours.
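For reference, the corrected VALID_UTF8 values can be re-derived from these corrected counts with a couple of lines:

```python
from fractions import Fraction

counts = [128, 1920, 61440, 1048576]  # corrected single-character counts
print([Fraction(c, 256**n) for n, c in enumerate(counts, start=1)])
# [Fraction(1, 2), Fraction(15, 512), Fraction(15, 4096), Fraction(1, 4096)]
```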
I am currently working on implementing a brute-force checker in C++ with threading and asynchronous functions to verify the numbers for length = 5; I think I will need to utilize the GPU for this.
Honestly I couldn't quite follow this question all the way through, but here's the idea I think you want.
Judging by your last program, you figured out, more or less, that you can express the number of valid UTF-8 byte sequences of length N as a recurrence relation. I actually think you got the numbers wrong, though; I believe it should be this:
- n(0) = 1, since there is only one UTF-8 byte sequence with zero bytes
- n(1) = 128, for obvious reasons
- n(2) = 18304:
  - 128 * 128 sequences containing two characters encoded in one byte each
  - 1920 (that's 2**11 - 128) sequences containing one character encoded in two bytes
- n(3) = 2650112:
  - 128 * 128 * 128 sequences of three characters
  - 128 * 1920 * 2 sequences of two characters (since you have a two-byte character, 1920 possibilities, and a one-byte character, 128 possibilities, in either of two orders)
  - 61440 (that's 2**16 - 1920 - 128, and then minus another 2**11 for the surrogates) sequences of one three-byte character
- n(4) = 383270912:
  - 128 * 128 * 128 * 128 sequences of four characters
  - 128 * 128 * 1920 * 3 sequences of three characters (one two-byte character and two one-byte characters, with the two-byte character occupying any of three positions)
  - 1920 * 1920 sequences of two two-byte characters
  - 128 * 61440 * 2 sequences of a one-byte and a three-byte character, in either order
  - 1048576 sequences of one four-byte character

(But my main point here is to demonstrate the technique, so if I'm wrong about these numbers, adjust accordingly.) After that, the number of UTF-8 byte sequences of length N for N > 4 is determined by this recurrence relation:
n(N) = 128 * n(N-1) + 1920 * n(N-2) + 61440 * n(N-3) + 1048576 * n(N - 4)
reflecting the fact that you can form a UTF-8 sequence of length N > 4 by appending a one-byte character to any sequence of length N - 1, or appending a two-byte character to any sequence of length N - 2, and so on for 3- and 4-byte characters.
This formula is very straightforward to implement iteratively or recursively, and you can get the desired proportion by dividing the result by 256**N, the total number of N-byte sequences. For example:
import fractions, functools

@functools.cache
def numerator(N: int) -> int:
    if N < 4:  # base cases n(0)..n(3) from the case analysis above
        return (1, 128, 18304, 2650112)[N]
    return (128 * numerator(N - 1) + 1920 * numerator(N - 2)
            + 61440 * numerator(N - 3) + 1048576 * numerator(N - 4))

def valid_UTF8_chance_expansion(length: int) -> fractions.Fraction:
    return fractions.Fraction(numerator(length), 256**length)  # or convert to string if you like
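For instance, length 2 gives Fraction(143, 512), which is just 18304/65536 reduced:

```python
print(valid_UTF8_chance_expansion(2))  # Fraction(143, 512)
```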
If you want something better than that, you can convert the sequence into a closed-form expression by using a characteristic polynomial. The basic idea is that you take the recurrence relation from above and replace each n(k) with the k-th power of an unknown variable x to get
x**N = 128 * x**(N-1) + 1920 * x**(N-2) + 61440 * x**(N-3) + 1048576 * x**(N-4)
then divide through by x**(N-4) and move everything to one side to get
x**4 - 128 * x**3 - 1920 * x**2 - 61440 * x - 1048576 = 0
and find the roots of that polynomial; call them r_1 through r_4. (You can use SymPy or Wolfram Alpha or any computer algebra system or symbolic equation solver.) You can then write n(N) as a linear combination of N-th powers of the roots; that is:
n(N) = c_1 * r_1**N + c_2 * r_2**N + c_3 * r_3**N + c_4 * r_4**N
for constants c_k, which can be determined by matching to initial conditions - that is, write four copies of this formula for N=1 through N=4, set them equal to the numeric values you know those to have, and you have a system of four linear equations in four variables, which you can solve using standard techniques. You'll get a moderately complicated formula involving some roots and maybe some complex numbers, but with a bit of manipulation you can probably rearrange it in some way where you group the roots and the i's together, and then you can make a Python implementation that only uses integer arithmetic.
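Here is a minimal numeric sketch of that whole procedure; it uses NumPy (my choice here, not something from your post) with floating-point roots, so it illustrates the technique but would need the exact symbolic roots to stay correct for large N:

```python
import numpy as np

# Roots of the characteristic polynomial x**4 - 128*x**3 - 1920*x**2 - 61440*x - 1048576.
r = np.roots([1, -128, -1920, -61440, -1048576])

# Determine c_1..c_4 from the initial conditions n(1)..n(4).
A = np.array([[rk**k for rk in r] for k in range(1, 5)])
b = np.array([128, 18304, 2650112, 383270912], dtype=complex)
c = np.linalg.solve(A, b)

def n(N: int) -> int:
    # The imaginary parts cancel; round away the floating-point error.
    return round(sum(ck * rk**N for ck, rk in zip(c, r)).real)

# Cross-check against the recurrence itself.
assert n(5) == 128 * n(4) + 1920 * n(3) + 61440 * n(2) + 1048576 * n(1)
```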