tl;dr: I can't find a comprehensive list of all tags used in the Google n-grams dataset besides that one, which only includes PoS tags and _START_, _ROOT_ and _END_.
What do tokens like ",_.", "._." and "_._" mean? Given their frequencies (see below), I'd strongly assume they're tags (they can't be proper tokens).
Context :
I am trying to extract information from Google's n-grams dataset and have trouble understanding some of its tags and how to take them into account.
Ultimately, I would like to approximate how likely a word will follow another one.
For example, calculating how likely the token "protection" will follow "equal" would roughly mean calculating count("equal protection") / count("equal *"), where * is the wildcard: any 1-gram in the corpus.
The tricky part is calculating that count("equal *").
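As a minimal sketch of that ratio in plain Python (not pyspark), assuming the counts already contain exactly one row per surface bigram. The function name follow_probability and the toy counts are made up for illustration and do not come from the dataset:

```python
def follow_probability(counts, first, second):
    """Estimate P(second | first) as count("first second") / count("first *")."""
    prefix = first + " "
    # count("first *"): sum over every bigram whose first word is `first`.
    denominator = sum(c for ngram, c in counts.items() if ngram.startswith(prefix))
    if denominator == 0:
        return 0.0
    return counts.get(prefix + second, 0) / denominator


# Toy counts, invented for illustration only.
toy_counts = {
    "equal protection": 2,
    "equal to": 5,
    "equal rights": 3,
    "more equal": 4,  # ignored: "equal" is not the first word
}

print(follow_probability(toy_counts, "equal", "protection"))
```

This only gives a sensible result if each bigram appears once in the counts, which is precisely what the real dataset does not guarantee.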
Indeed, the bigram "equal to", for example, is counted several times in the Google n-grams dataset:
equal to
equal to_PRT (disambiguated PoS version)
equal _PRT_ (aggregated over all PRT, i.e. particles that might follow "equal")
This shows when I compute it in pyspark:
>>> total = ggrams.filter(ggrams.ngram.startswith("equal ")).groupby("ngram") \
...     .sum("match_count")
>>> total.sort("sum(match_count)", ascending=False).show(n=15)
+------------+----------------+
|       ngram|sum(match_count)|
+------------+----------------+
|equal _NOUN_|        20130934|
| equal _PRT_|        16620727|
|    equal to|        16598291|
|equal to_PRT|        16598291|
|   equal _._|         5119672|
| equal _ADP_|         3037747|
|     equal ,|         2276119|
|   equal ,_.|         2276119|
|    equal in|         1682835|
|equal in_ADP|         1682176|
|     equal .|         1628257|
|   equal ._.|         1628257|
|equal _CONJ_|         1363739|
|         ...|             ...|
So, to avoid counting the same bigram multiple times, my idea was to sum the counts only for patterns like "equal <POS>", where <POS> is in the documented PoS set [_PRT_, _NOUN_, ...] (findable here).
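A minimal sketch of that idea in plain Python, assuming the rows have been collected into (ngram, count) pairs. The helper name aggregate_total and the list of tags passed in are hypothetical choices; the rows are copied from the displayed excerpt only, so the result is a partial total:

```python
def aggregate_total(rows, first, tags):
    """Sum the counts of rows shaped like '<first> _TAG_' for the given tags."""
    wanted = {f"{first} _{tag}_" for tag in tags}
    return sum(count for ngram, count in rows if ngram in wanted)


# Rows copied from the table above (displayed excerpt only, not the full data).
rows = [
    ("equal _NOUN_", 20130934),
    ("equal _PRT_", 16620727),
    ("equal to", 16598291),
    ("equal to_PRT", 16598291),
    ("equal _._", 5119672),
    ("equal _ADP_", 3037747),
    ("equal ,", 2276119),
    ("equal ,_.", 2276119),
    ("equal _CONJ_", 1363739),
]

# Assumption: only the tags visible in the excerpt; extend with the full PoS set.
pos_tags = ["NOUN", "PRT", "ADP", "CONJ"]
total = aggregate_total(rows, "equal", pos_tags)
print(total)
```

Whether "." should be added to pos_tags (which would also pick up the "equal _._" row) depends on what that token means.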
Doing this, I obtain totals that are one third of what I'd get from the dataframe displayed above, which strengthens my hypothesis that each occurrence is counted three times. But I can't convince myself that this is the best way to do it, especially given these weird tokens ",_.", "._." and "_._", whose meaning I have no clue about.
The list of POS tags given in the documentation does not mention two of the tags, but the 2012 paper Syntactic Annotations for the Google Books Ngram Corpus does:
‘.’ (punctuation marks)
X (a catch-all for other categories such as abbreviations or foreign words)
So the token ",_." is a comma with its POS tag appended, just like the token run_VERB. Similarly, "._." is a full stop with its POS tag appended. Finally, "_._" means any punctuation mark, just as _VERB_ means any verb.
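To make the three forms concrete, here is a small sketch that sorts a single token into one of them. The function name classify_token and the category labels are hypothetical; the tag set is the ten tags from the documentation plus "." and X, and the three markers are the ones mentioned in the question:

```python
# Ten documented tags plus the two extra ones ('.' and X) from the 2012 paper.
POS_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "PRON", "DET",
            "ADP", "NUM", "CONJ", "PRT", ".", "X"}
# Special markers mentioned in the question; they are not POS tags.
MARKERS = {"_START_", "_ROOT_", "_END_"}


def classify_token(token):
    """Return 'marker', 'aggregate', 'tagged' or 'plain' for one n-gram token."""
    if token in MARKERS:
        return "marker"
    # Aggregate placeholder such as _VERB_ or _._ (any token with that tag).
    if len(token) > 2 and token[0] == "_" and token[-1] == "_" and token[1:-1] in POS_TAGS:
        return "aggregate"
    # Surface form with its tag appended, such as run_VERB or ,_.
    word, sep, tag = token.rpartition("_")
    if sep and word and tag in POS_TAGS:
        return "tagged"
    return "plain"


for tok in [",", ",_.", "._.", "_._", "run_VERB", "_VERB_", "to", "_START_"]:
    print(repr(tok), classify_token(tok))
```

With such a classifier, a denominator like count("equal *") can be built from exactly one of the three forms (for example only the plain rows, or only the aggregate rows) so that no occurrence is counted more than once.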