unicode · utf-8 · nlp

Why is the vocab size of byte-level BPE smaller than Unicode's vocab size?


I recently read the GPT-2 paper, and it says:

This would result in a base vocabulary of over 130,000 before any multi-symbol tokens are added. This is prohibitively large compared to the 32,000 to 64,000 token vocabularies often used with BPE. In contrast, a byte-level version of BPE only requires a base vocabulary of size 256.

I really don't understand this passage. Unicode represents about 130K characters, so how can that be reduced to 256? Where did the remaining ~129K characters go? What am I missing? Does byte-level BPE allow different characters to share the same representation?

I don't understand the logic. Below are my questions:


Detail question

Thank you for your answer, but I still don't get it. Let's say we have 130K unique characters. What we want (and what BBPE does) is to reduce this base (unique) vocabulary. Each Unicode character can be converted to 1 to 4 bytes using UTF-8 encoding. The original BBPE paper (Neural Machine Translation with Byte-Level Subwords) says:

Representing text at the level of bytes and using the 256 bytes set as vocabulary is a potential solution to this issue.

Each byte can represent 256 values (2^8), so we would only need 17 bits (2^17 = 131,072) to cover all the unique Unicode characters. In that case, where do the 256 bytes in the original paper come from? I don't understand the logic or how to derive this result.

Let me restate my questions in more detail:

Since I have little knowledge of computer architecture and programming, please let me know if there's something I missed.

Sincerely, thank you.


Solution

  • Unicode code points are integers in the range 0..1,114,111 (0x10FFFF), of which roughly 130k are in use at the moment. Every Unicode code point corresponds to a character, like "a" or "λ" or "龙", which is handy to work with in many cases (but there are a lot of complicated details, e.g. combining marks).
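
    As a quick illustration (a minimal Python sketch, not tied to any tokenizer), the built-in ord() returns the code point of a character:

    # Code points are just integers assigned to characters.
    for ch in ["a", "λ", "龙"]:
        print(ch, ord(ch), hex(ord(ch)))
    # a 97 0x61
    # λ 955 0x3bb
    # 龙 40857 0x9f99

    print(0x10FFFF)  # 1114111, the largest valid code point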

    When you save text data to a file, you use one of the UTFs (UTF-8, UTF-16, UTF-32) to convert code points (integers) to bytes. For UTF-8 (the most popular file encoding), each character is represented by 1, 2, 3, or 4 bytes (there is some internal logic in the byte values to distinguish single-byte from multi-byte characters).
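
    For example, a minimal sketch with Python's str.encode (using characters that also appear in the example sentence below) shows how many UTF-8 bytes each character needs:

    # Each character becomes 1-4 bytes; every byte is a value in 0..255.
    for ch in ["T", "’", "👍"]:
        b = ch.encode("utf-8")
        print(ch, [hex(x) for x in b], len(b), "byte(s)")
    # T ['0x54'] 1 byte(s)
    # ’ ['0xe2', '0x80', '0x99'] 3 byte(s)
    # 👍 ['0xf0', '0x9f', '0x91', '0x8d'] 4 byte(s)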

    So when the base vocabulary consists of bytes, rare characters will be encoded as multiple BPE segments.

    Example

    Let's consider a short example sentence like “That’s great 👍”.

    With a base vocabulary of all Unicode characters, the BPE model starts off with something like this:

    T      54
    h      68
    a      61
    t      74
    ’    2019
    s      73
           20
    g      67
    r      72
    e      65
    a      61
    t      74
           20
    👍   1F44D
    

    (The first column is the character, the second its codepoint in hexadecimal notation.)
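
    A small Python sketch that reproduces this listing (assuming the same example sentence):

    sentence = "That’s great 👍"
    for ch in sentence:
        print(ch, format(ord(ch), "X"))  # character and its code point in hex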

    If you first encode this sentence with UTF-8, then this sequence of bytes is fed to BPE instead:

    T      54
    h      68
    a      61
    t      74
    �      e2
    �      80
    �      99
    s      73
           20
    g      67
    r      72
    e      65
    a      61
    t      74
           20
    �      f0
    �      9f
    �      91
    �      8d
    

    The typographic apostrophe "’" and the thumbs-up emoji are represented by multiple bytes.
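
    The byte listing above can be reproduced with a similar sketch (it prints only the hex column):

    sentence = "That’s great 👍"
    for byte in sentence.encode("utf-8"):
        print(format(byte, "02x"))  # one line per UTF-8 byte, in hex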

    With either input, the BPE segmentation (after training) may end up with something like this:

    Th|at|’s|great|👍
    

    (This is a hypothetical segmentation, but it's possible that capitalised “That” is too rare to be represented as a single segment.)

    The number of BPE operations is different though: to arrive at the segment ’s, only one merge step is required for code-point input, but three steps for byte input.
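
    To make the count concrete, here is a toy sketch that applies a list of merges in order; the merge tables themselves are hypothetical, picked only to show the one-step vs. three-step difference:

    def apply_merges(tokens, merges):
        # Replace every occurrence of each pair with its concatenation, merge by merge.
        for pair in merges:
            merged, out, i = "".join(pair), [], 0
            while i < len(tokens):
                if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                    out.append(merged)
                    i += 2
                else:
                    out.append(tokens[i])
                    i += 1
            tokens = out
        return tokens

    # Code-point input: one merge is enough to form ’s.
    print(apply_merges(["’", "s"], [("’", "s")]))  # ['’s']

    # Byte input (bytes written as hex strings): three merges are needed.
    print(apply_merges(["e2", "80", "99", "73"],
                       [("e2", "80"), ("e280", "99"), ("e28099", "73")]))  # ['e2809973']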

    With byte input, the BPE segmentation is likely to end up with sub-character segments for rare characters. The downstream language model will have to learn to deal with that kind of input.
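
    For instance, a segment that covers only part of the emoji's bytes does not even decode back to a character on its own (a minimal illustration):

    fragment = bytes([0xF0, 0x9F])  # only the first two of the emoji's four UTF-8 bytes
    print(fragment.decode("utf-8", errors="replace"))  # prints a replacement character, not 👍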