I need to find Russian first names and surnames in English text. I tried spaCy NER, but it only recognizes English names (for instance, John Brandon) and not Russian ones (like Vitaliy Ivanov).
I use the transliterate Python library to transliterate the English text into Russian, and then apply a Russian spaCy model to extract the names with NER.
However, I get different results for the string 'Иванов Виталий' typed directly in Cyrillic and for the same string produced by transliterating 'Ivanov Vitaliy'.
The code is as follows:
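import spacy
from transliterate import translit, get_available_language_codes  # pip install transliterate
from transliterate.base import TranslitLanguagePack, registry
from transliterate.discover import autodiscover

# Assumption: nlp_ru is a Russian spaCy pipeline; "ru_core_news_sm" is one such model
nlp_ru = spacy.load("ru_core_news_sm")

autodiscover()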
class ExampleLanguagePack(TranslitLanguagePack):
    language_code = "example"
    language_name = "Example"
    # Single-character mapping: Latin letters and their Cyrillic counterparts
    mapping = (
        u'abvgdeziyklmnprstufhABVGDEZIYKLMNPRSTUFH',
        u'абвгдезийклмнпрстуфхАБВГДЕЗИЙКЛМНПРСТУФХ',
    )
    # Multi-character sequences handled before the single-character mapping
    pre_processor_mapping = {
        u'ye': u'е',
        u'yo': u'ё',
        u'zh': u'ж',
        u'ts': u'ц',
        u'ch': u'ч',
        u'sh': u'ш',
        u'shch': u'щ',
        u'sch': u'щ',
        #u'y': u'ы',
        u'e': u'э',
        u'kh': u'х',
        u'yu': u'ю',
        u'iu': u'ю',
        u'ya': u'я',
        u'ia': u'я',
    }
registry.register(ExampleLanguagePack)
print(get_available_language_codes())
ru_trans = translit('Ivanov Vitaliy', 'example')
print(ru_trans)
name_ru = nlp_ru(ru_trans)
# Use NER to extract person names
person = [entity.text for entity in name_ru.ents if entity.label_ == "PER"]
print(person)
This prints ['Виталий'].
Another case, with the same name typed directly in Cyrillic:
name_ru = nlp_ru('Иванов Виталий')
#name_ru = nlp_ru(ru_trans)
# Use NER to extract person names
person = [entity.text for entity in name_ru.ents if entity.label_ == "PER"]
print(person)
The result is ['Иванов Виталий']
I checked the type of the object returned by translit:
type(ru_trans)
str
As far as I can tell, the character codes of the letters in these two strings are the same.
What could be the reason for getting different NER results on these two strings?
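You forgot to define a mapping for the letter o. Characters without a mapping are left unchanged, so the transliterated surname still contains the Latin o (U+006F). It looks identical to the Cyrillic о (U+043E) but is a different character, so the two strings are not actually equal, and the model most likely does not recognize the mixed-script token as a surname.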
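One way to confirm this kind of problem is to list every letter in the transliterated string that is not Cyrillic. The sketch below uses only the standard library. `find_non_cyrillic_letters` is a hypothetical helper, not part of transliterate, and the `transliterated` literal is an assumption about what the original mapping produces for 'Ivanov Vitaliy' (the unmapped Latin o is written as an escape so that it is visible).

```python
import unicodedata


def find_non_cyrillic_letters(text):
    """Return (index, character, Unicode name) for each letter that is not Cyrillic."""
    return [
        (i, ch, unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("CYRILLIC")
    ]


# Assumed result of translit('Ivanov Vitaliy', 'example') with the original mapping:
# \u006f is the Latin letter o that was left untouched.
transliterated = "Иван\u006fв Виталий"
# The same name typed in Cyrillic: \u043e is the Cyrillic letter о.
typed = "Иван\u043eв Виталий"

print(find_non_cyrillic_letters(transliterated))  # reports the Latin o at index 4
print(find_non_cyrillic_letters(typed))           # empty list
print(transliterated == typed)                    # False
```

Running the same helper on the real `ru_trans` value should show whether any Latin letters survived the transliteration.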
Here is the fix:
mapping = (
    'abvgdeziyklmnoprstufhABVGDEZIYKLMNOPRSTUFH',
    'абвгдезийклмнопрстуфхАБВГДЕЗИЙКЛМНОПРСТУФХ',
)
Note that I added mappings for both the uppercase and the lowercase o.
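To catch similar gaps before running NER, the mapping can be compared against the full Latin alphabet. This is a minimal sketch under the assumption that only the single-character `mapping` tuple and single-character `pre_processor_mapping` keys cover individual letters; `unmapped_letters` is a hypothetical helper, not part of transliterate.

```python
import string


def unmapped_letters(mapping, pre_processor_mapping=None):
    """Return the ASCII letters that have no single-character mapping."""
    source, target = mapping
    if len(source) != len(target):
        raise ValueError("source and target strings must have the same length")
    covered = set(source)
    # Single-character pre-processor keys (such as 'e') also cover a letter.
    covered.update(k for k in (pre_processor_mapping or {}) if len(k) == 1)
    return sorted(set(string.ascii_letters) - covered)


original = (
    'abvgdeziyklmnprstufhABVGDEZIYKLMNPRSTUFH',
    'абвгдезийклмнпрстуфхАБВГДЕЗИЙКЛМНПРСТУФХ',
)
fixed = (
    'abvgdeziyklmnoprstufhABVGDEZIYKLMNOPRSTUFH',
    'абвгдезийклмнопрстуфхАБВГДЕЗИЙКЛМНОПРСТУФХ',
)

print(unmapped_letters(original))  # includes 'O' and 'o'
print(unmapped_letters(fixed))     # 'O' and 'o' are gone
```

Even with the fix, letters such as c, j, q, w and x have no single-character mapping, so names that contain them outside a multi-character sequence would still keep Latin characters.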