java nlp stanford-nlp

Stanford CoreNLP and Emoji?


So far, when I run text containing emoji through the POS tagger, the emoji show up as unknown symbols (small boxes). Is there a way to get the POS tagger to work with emoji? I mean the Unicode versions, e.g. 😀.


Solution

  • Provided the character encoding is correct throughout your code, system and the Stanford CoreNLP code, emoji should be represented correctly. However, you'll have two more fundamental problems:

    First, a typical emoji such as 😀 is one character long (a single Unicode code point), and it is unlikely to be tagged as anything other than an indefinite article ('a' in English). A smart tokenizer might make better sense of emoji, but I doubt it.

    Second, and more importantly, POS taggers annotate parts of speech, and emoji are not a part of speech. At the very least, they are an independent, new class of tokens, but certainly not a grammatical one.

    All that said ... you know their character codes ... they're already tagged.
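
To make the "already tagged" point concrete, here is a rough Java sketch of tagging emoji by code point before (or instead of) running a POS tagger. Everything in it is an assumption for illustration: the class name `EmojiTagSketch`, the `EMOJI` tag, and the code point ranges are hypothetical and are not part of CoreNLP or any standard tag set. The ranges cover only a few common blocks, so treat the check as a heuristic. The sketch also iterates by code point rather than by `char`, because 😀 (U+1F600) takes two UTF-16 `char`s in a Java `String`, which is one place the encoding can go wrong.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;
import java.util.ArrayList;
import java.util.List;

final class EmojiTagSketch {

    // Hypothetical tag name; not part of the Penn Treebank tag set or CoreNLP.
    static final String EMOJI_TAG = "EMOJI";

    // Rough heuristic: a few common symbol and pictograph blocks only.
    // These ranges also contain some non-emoji symbols and miss some emoji.
    static boolean isLikelyEmoji(int codePoint) {
        return (codePoint >= 0x1F300 && codePoint <= 0x1FAFF)
            || (codePoint >= 0x2600 && codePoint <= 0x27BF);
    }

    // Returns one "symbol/EMOJI" entry per likely emoji found in the text.
    static List<String> tagEmoji(String text) {
        List<String> tagged = new ArrayList<>();
        // codePoints() keeps surrogate pairs together, unlike chars().
        text.codePoints()
            .filter(EmojiTagSketch::isLikelyEmoji)
            .forEach(cp -> tagged.add(new String(Character.toChars(cp)) + "/" + EMOJI_TAG));
        return tagged;
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Force UTF-8 on output; the console must also be able to render it.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        String text = "Nice work \uD83D\uDE00"; // the escape is U+1F600 as a surrogate pair
        out.println(text.length());                          // UTF-16 code units
        out.println(text.codePointCount(0, text.length()));  // Unicode code points
        out.println(tagEmoji(text));
    }
}
```

With this sketch, `tagEmoji("Nice work \uD83D\uDE00")` yields a single entry for the emoji, while plain ASCII text yields an empty list. Whether you then strip the emoji before tagging or keep them as separate tokens is a design choice that depends on what you do downstream.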