python spacy named-entity-recognition transliteration

Why does spaCy NER give different results for two seemingly identical strings?


I need to find Russian first names and surnames in English text. I tried spaCy NER, but it matches only English names (for instance, John Brandon) and not Russian ones (like Vitaliy Ivanov).

I use the transliterate Python library to transliterate the English text into Russian, and then apply a Russian spaCy model to extract the names with NER.

But I get different results for the string 'Иванов Виталий' typed directly in Cyrillic and for the same string obtained by transliterating 'Ivanov Vitaliy'.

The code is as follows:

from transliterate import translit, get_available_language_codes #pip install transliterate
from transliterate.discover import autodiscover
autodiscover()

from transliterate.base import TranslitLanguagePack, registry

class ExampleLanguagePack(TranslitLanguagePack):
    language_code = "example"
    language_name = "Example"
    mapping = (
             u'abvgdeziyklmnprstufhABVGDEZIYKLMNPRSTUFH', 
             u'абвгдезийклмнпрстуфхАБВГДЕЗИЙКЛМНПРСТУФХ',
             )
             
    pre_processor_mapping = {
             u"kh": u"x",
             u'ye': u'е',
             u'yo': u'ё',
             u'zh': u'ж', 
             u'ts': u'ц',
             u'ch': u'ч',
             u'sh': u'ш',
             u'shch': u'щ',
             u'sch': u'щ',
             #u'y': u'ы',
             u'e': u'э',
             u'kh': u'х',
             u'yu': u'ю',
             u'iu':u'ю',
             u'ya': u'я',
             u'ia': u'я',
             }     

registry.register(ExampleLanguagePack)
print(get_available_language_codes())

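import spacy

# Assumption: the original post does not show how nlp_ru was created;
# ru_core_news_sm is one of spaCy's Russian pipelines.
nlp_ru = spacy.load("ru_core_news_sm")

ru_trans = translit('Ivanov Vitaliy', 'example')
print(ru_trans)

name_ru = nlp_ru(ru_trans)

# use NER to extract names
person = [entity.text for entity in name_ru.ents if entity.label_ == "PER"]
print(person)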

This prints ['Виталий'].

In the other case, I pass the same name typed directly in Cyrillic:

name_ru = nlp_ru('Иванов Виталий')

# use NER to extract names
person = [entity.text for entity in name_ru.ents if entity.label_ == "PER"]
print(person)

The result is ['Иванов Виталий']

I checked the type of the object returned by translit:

type(ru_trans) 

str

As far as I can tell, the character codes of the letters in these two strings are the same.

What could be the reason for the different NER results on these two strings?


Solution

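  • You forgot to define the mapping for the letter o. Because it is missing, translit leaves the Latin o untouched, so the output is 'Иванoв Виталий' with a Latin o (U+006F) where the Cyrillic о (U+043E) should be. The two strings look identical on screen but are not equal, which is presumably why the Russian model does not recognize the mixed-script token as a surname.

    Here is the fix: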

    mapping = (
        'abvgdeziyklmnoprstufhABVGDEZIYKLMNOPRSTUFH', 
        'абвгдезийклмнопрстуфхАБВГДЕЗИЙКЛМНОПРСТУФХ',
    )
    

    Note that I added mappings for both the uppercase and the lowercase letter o.
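
    Two strings that look the same on screen can be compared character by character to spot this kind of problem. The following is only a minimal sketch using the standard library's unicodedata module; the helper name find_mismatches is made up for this illustration and is not part of transliterate or spaCy, and the two sample strings are written with explicit escapes so the difference is visible in the source.

```python
import unicodedata


def find_mismatches(a, b):
    """Return (index, char_a, name_a, char_b, name_b) for each position where a and b differ."""
    return [
        (i, ca, unicodedata.name(ca, "?"), cb, unicodedata.name(cb, "?"))
        for i, (ca, cb) in enumerate(zip(a, b))
        if ca != cb
    ]


# What the language pack produces without an "o" mapping:
# the fifth character is still a Latin "o" (U+006F).
transliterated = "Иван\u006fв Виталий"
# The same name typed directly in Cyrillic (U+043E).
typed = "Иван\u043eв Виталий"

for row in find_mismatches(transliterated, typed):
    print(row)
```

    For these two strings the only mismatch is at index 4, LATIN SMALL LETTER O versus CYRILLIC SMALL LETTER O. Running the same check on the real output of translit should show whether any other Latin letters slip through the mapping.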