In the GPT-2 paper, under Section 2, Page 3 it says,
Since the supervised objective is the the same as the unsupervised objective but only evaluated on a subset of the sequence, the global minimum of the unsupervised objective is also the global minimum of the supervised objective.
I didn't follow this line of reasoning. What is the logic behind concluding this?
The underlying principle here is that if f is a function with domain D and S is a subset of D, then if d maximizes f over D and d happens to be in S, then d also maximizes f over S.
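In simpler words, "an optimum over the whole domain is also an optimum over any subset that contains it". The same holds for minima, which is the case the paper talks about.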
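To make the principle concrete, here is a minimal sketch in Python. The function f, the domain D, and the subset S are made-up toy values chosen only for illustration; they are not taken from the paper.

```python
def argmax_over(f, domain):
    """Return an element of `domain` at which f takes its largest value."""
    return max(domain, key=f)

# Hypothetical toy objective: a parabola with its peak at x = 3.
def f(x):
    return -(x - 3) ** 2

D = list(range(-10, 11))   # the full domain
S = [1, 3, 5, 7]           # a subset of D that happens to contain the maximizer

d = argmax_over(f, D)      # maximizer over the whole domain

# d lies in S, so nothing in S can beat it: every element of S is also in D.
d_is_in_S = d in S
d_maximizes_over_S = all(f(d) >= f(s) for s in S)
```

Note that the argument only runs in one direction: a maximizer over S need not maximize f over D, because D may contain better points outside S.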
Now how does this apply to GPT-2? Let's look at how GPT-2 is trained.
First step: GPT-2 uses unsupervised training to learn the distribution of the next token in a sequence by examining examples in a huge corpus of existing text. By the end of this step, it should be able to output valid words and complete things like "Hello ther" to "Hello there".
Second step: GPT-2 uses supervised training on specific tasks, such as answering questions posed to it. For example, the question "Who wrote the book the origin of species?" has the answer "Charles Darwin".
Question: Does the second step of supervised training undo general knowledge that GPT-2 learned in the first step?
Answer: No. The question-answer pair "Who wrote the book the origin of species? Charles Darwin." is itself valid English text that comes from the same distribution that the network is trying to learn in the first place. It may well even appear verbatim in the corpus of text from step 1. Therefore, these supervised examples are elements of the same domain (valid English text), and optimizing the loss function to get them correct works towards the same objective as optimizing the loss function to get the unsupervised examples correct.
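In simpler words, the supervised question-answer pairs or other specific tasks that GPT-2 was trained to do use examples from the same underlying distribution as the unsupervised corpus text. The supervised loss is the same next-token loss, just evaluated on a subset of the sequence (the answer), so a model at the global minimum of the unsupervised objective is also at a global minimum of the supervised objective.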
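The "same loss, evaluated on a subset of positions" idea can be sketched numerically. This is a toy illustration under stated assumptions, not GPT-2's actual training code: the 3-token vocabulary, the 4-position sequence, the probabilities in true_dists, and the helper names cross_entropy and objective are all hypothetical. The expected loss at each position is the cross-entropy between the true next-token distribution and the model's, which is smallest when the two match.

```python
import math

def cross_entropy(p, q):
    """Expected negative log-likelihood of model q when tokens are drawn from p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def objective(true_dists, model_dists, positions):
    """Sum of per-position cross-entropies over the chosen positions."""
    return sum(cross_entropy(true_dists[t], model_dists[t]) for t in positions)

# Hypothetical true next-token distributions (made-up numbers):
# 4 positions in the sequence, 3 tokens in the vocabulary.
true_dists = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
    [0.05, 0.05, 0.9],
]

all_positions = [0, 1, 2, 3]   # unsupervised objective: every position
answer_positions = [2, 3]      # supervised objective: only the "answer" tokens

uniform = [1 / 3, 1 / 3, 1 / 3]
perfect = true_dists                                   # matches the truth everywhere
partial = [uniform, uniform] + true_dists[2:]          # matches only on the answer
ignorant = [uniform] * 4                               # matches nowhere

for name, model in [("perfect", perfect), ("partial", partial), ("ignorant", ignorant)]:
    unsup = objective(true_dists, model, all_positions)
    sup = objective(true_dists, model, answer_positions)
    print(f"{name:9s} unsupervised={unsup:.4f} supervised={sup:.4f}")
```

In this sketch the "perfect" model attains the lowest value of both objectives. The "partial" model ties with it on the supervised objective but is worse on the unsupervised one, which shows that the implication only goes from the unsupervised minimum to the supervised minimum, not the other way round.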
Caveat: you can still accidentally end up in a local minimum due to (over)training on these supervised examples that you might not have run into otherwise. However, GPT-2 was revolutionary in its field, and whether or not this happened with GPT-2, it still made significant progress over the state of the art before it.