There are 44 characters in this text: the quick brown fox jumped over the lazy dog
The same text can be represented with just 11 Unicode characters:
(These characters look the same as "[]" but they are all different characters!)
This is because the ASCII characters can range from 1-27 (base 27 if you only use the 27 characters in this character set abcdefghijklmnopqrstuvwxyz
) and Unicode characters range from 1-1114112, which means you can store multiple numbers in a bigger number if you do indices-related math.
For example, the text this
looks like [19, 7, 8, 18]
if you convert each character to their index in the above base 27 character set. If you do the calculation below:
19 x 27 ^ 0 +
7 x 27 ^ 1 +
8 x 27 ^ 2 +
18 x 27 ^ 3 = 360334
You will get a unique number 360334
, which happens to be within 1-1114112 so you can do chr(360334)
to get the Unicode character
. To go back, you do ord('')
to get 360334
, which you can continuously divmod to get back the numbers shown below:
360334 % 27 = 19
360334 // 27 = 13345
13345 % 27 = 7
13345 // 27 = 494
494 % 27 = 8
494 // 27 = 18
18 % 27 = 18
18 // 27 = 0 BREAK
The question is: How to make this as a convert and revert function in Python?
Here is my attempt:
def power_sum(values, base, offset = 0):
return sum(value * base ** (index + offset) for index, value in enumerate(values))
def convert_text(text, chars):
base = len(chars)
chars = {char : index for index, char in enumerate(chars)}
temp = []
result = ''
for index, char in enumerate(text):
value = chars[char] # indexerror = missing that char in char set
if power_sum(temp, base, 1) + value > 0x10FFFF: # U+10FFFF is max unicode code point
result += chr(power_sum(temp, base))
temp = [value]
else:
temp.append(value)
result += chr(power_sum(temp, base))
return result
def revert_text(text, chars):
base = len(chars)
chars = list(chars)
result = ''
for char in text:
value = ord(char)
while value:
result += chars[int(value % base)]
value //= base
return result
chars = 'abcdefghijklmnopqrstuvwxyz '
print('Base:', len(chars), end = '\n\n')
texts = [
'this',
'the quick brown fox jumped over the lazy dog',
'china'
]
for text in texts:
print('Start text ({}): {}'.format(len(text), text))
text = convert_text(text, chars)
print('Unicode text ({}): {}'.format(len(text), text))
text = revert_text(text, chars)
print('Revert text ({}): {}'.format(len(text), text), end = '\n\n')
Output:
Base: 27
Start text (4): this
Unicode text (1):
Revert text (4): this
Start text (44): the quick brown fox jumped over the lazy dog
Unicode text (11): 늺🖛
Revert text (44): the quick brown fox jumped over the lazy dog
Start text (5): china
Unicode text (2):
Revert text (4): chin
It fails with the string china
for some reason.
Thanks to chrslg's answer for showing that the code stores any number of trailing index 0 characters as 0
after the indices math. I fixed the code by shifting all the characters in the charset by 1, essentially making the 27 char set = base 28.
def power_sum(values, base, offset = 0):
return sum(value * base ** (index + offset) for index, value in enumerate(values))
def convert_text(text, chars):
base = len(chars) + 1
chars = {char : index + 1 for index, char in enumerate(chars)}
temp = []
result = ''
for char in text:
value = chars[char] # indexerror = missing that char in char set
if value * base ** len(temp) + power_sum(temp, base, 1) > 0x10FFFF:
# U+10FFFF is max unicode code point
result += chr(power_sum(temp, base))
temp = [value]
else:
temp.append(value)
result += chr(power_sum(temp, base))
return result
def revert_text(text, chars):
base = len(chars) + 1
chars = [None] + list(chars)
result = ''
for char in text:
value = ord(char)
while value:
result += chars[int(value % base)]
value //= base
return result
chars = 'abcdefghijklmnopqrstuvwxyz '
print('Base:', len(chars) + 1, end = '\n\n')
texts = [
'this',
'the quick brown fox jumped over the lazy dog',
'china',
'aaa',
'aaaaa',
'aaaaaaaa',
'aaaab'
]
for text in texts:
print('Start text ({}): {}'.format(len(text), text))
text = convert_text(text, chars)
print('Unicode text ({}): {}'.format(len(text), text))
text = revert_text(text, chars)
print('Revert text ({}): {}'.format(len(text), text), end = '\n\n')
Output:
Base: 28
Start text (4): this
Unicode text (1):
Revert text (4): this
Start text (44): the quick brown fox jumped over the lazy dog
Unicode text (12): 𑼭䬪
Revert text (44): the quick brown fox jumped over the lazy dog
Start text (5): china
Unicode text (2):
Revert text (5): china
Start text (3): aaa
Unicode text (1): ̭
Revert text (3): aaa
Start text (5): aaaaa
Unicode text (2): 壭
Revert text (5): aaaaa
Start text (8): aaaaaaaa
Unicode text (2): 壭壭
Revert text (8): aaaaaaaa
Start text (5): aaaab
Unicode text (2): 壭
Revert text (5): aaaab