Tags: python, entity, stanford-nlp, spacy, information-extraction

Best Approach for Custom Information Extraction (NER)


I'm trying to extract locations from blobs of text (NER/IE) and have tried many solutions, including spaCy and Stanford's tools, all of which are far too inaccurate.

All of them are only about 80-90% accurate on my dataset (spaCy was around 70%). Another problem is that none of them gives me a meaningful probability for each entity, so I don't know how confident a prediction is and can't proceed accordingly.

I tried a super naive approach: splitting my blobs into single words, then extracting the surrounding context as features. I also used a placename lookup (30-40k location placenames) as a feature. Then I used a plain classifier (XGBoost), and the results were much better once I trained it on about 3k manually labelled datapoints (100k total, of which only 3k were locations): 95% precision for states/countries and about 85% for cities.
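For readers who want to picture the per-token setup described above, here is a minimal sketch. It is not my actual code: the function name `token_features`, the feature names, the window size, and the toy gazetteer are all made up for illustration.

```python
from typing import Dict, List, Set


def token_features(tokens: List[str], i: int, gazetteer: Set[str],
                   window: int = 2) -> Dict[str, object]:
    """Features for tokens[i]: the token, its neighbours, and a placename lookup."""
    tok = tokens[i]
    feats: Dict[str, object] = {
        "lower": tok.lower(),
        "is_title": tok.istitle(),
        "is_upper": tok.isupper(),
        # Hypothetical gazetteer: a set of lowercased single-word placename parts.
        "in_gazetteer": tok.lower() in gazetteer,
    }
    # Surrounding context, padded at the edges of the blob.
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = i + off
        feats[f"ctx{off:+d}"] = tokens[j].lower() if 0 <= j < len(tokens) else "<pad>"
    return feats


tokens = "Offices in San Francisco , CA".split()
gazetteer = {"san", "francisco", "ca"}
rows = [token_features(tokens, i, gazetteer) for i in range(len(tokens))]
```

Feature dicts like these could then be vectorized (for example with scikit-learn's `DictVectorizer`) and passed to a gradient-boosted classifier, whose per-class probabilities give a per-token confidence.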

This approach sucks, obviously, but why is it outperforming everything else I have tried? I think the black-box approach to NER just isn't working for my data. I tried spaCy custom training and it really didn't seem like it was going to work. Not having a confidence for each entity is also kind of a killer, as the probability they give you is almost meaningless.

Is there some way I can approach this problem a little better to improve my results even more? Shallow NLP over 2/3/4-grams, for example? Another problem with my approach is that the classifier's output isn't a sequential entity; it's literally just individually classified words, which somehow need to be clustered back into one entity. For example, San Francisco, CA comes out as 'city', 'city', '0', 'state', with no notion of those tokens being the same entity.

spacy example:

example blob:

About Us - Employment Opportunities Donate Donate Now The Power of Mushrooms Enhancing Response Where We Work Map Australia Africa Asia Pacific Our Work Agriculture Anti - Trafficking and Gender - based Violence Education Emergency Response Health and Nutrition Rural and Economic Development About Us Who We Are Annual Report Newsletters Employment Opportunities Video Library Contact Us Login My Profile Donate Join Our Email List Employment Opportunities Annual Report Newsletters Policies Video Library Contact Us Employment Opportunities Current Career Opportunity Internships Volunteer Who We Are Our History Employment Opportunities with World Hope International Working in Service to the Poor Are you a professional that wants a sense of satisfaction out of your job that goes beyond words of affirmation or a pat on the back ? You could be a part of a global community serving the poor in the name of Jesus Christ . You could use your talents and resources to make a significant difference to millions . Help World Hope International give a hand up rather than a hand out . Career opportunities . Internship opportunities . Volunteer Why We Work Here World Hope International envisions a world free of poverty . Where young girls aren ’ t sold into sexual slavery . Where every child has enough to eat . Where men and women can earn a fair and honest wage , and their children aren ’ t kept from an education . Where every community in Africa has clean water . As an employee of World Hope International , these are the people you will work for . Regardless of their religious beliefs , gender , race or ethnic background , you will help shine the light of hope into the darkness of poverty , injustice and oppression . Find out more by learning about the of World Hope International and reviewing a summary of our work in the most recent history annual report . 
Equal Opportunity Employer World Hope International is both an equal opportunity employer and a faith - based religious organization . We hire US employees without regard to race , color , ancestry , national origin , citizenship , age , sex , marital status , parental status , membership in any labor organization , political ideology or disability of an otherwise qualified individual . We hire national employees in our countries of operation pursuant to the law of the country where we hire the employees . The status of World Hope International as an equal opportunity employer does not prevent the organization from hiring US staff based on their religious beliefs so that all US staff share the same religious commitment . Pursuant to the United States Civil Rights Act of 1964 , Section 702 ( 42 U . S . C . 2000e 1 ( a ) ) , World Hope International has the right to , and does , hire only candidates whose beliefs align with the Apostle ’ s Creed . Apostle ’ s Creed : I believe in Jesus Christ , Gods only Son , our Lord , who was conceived by the Holy Spirit , born of the Virgin Mary , suffered under Pontius Pilate , was crucified , died , and was buried ; he descended to the dead . On the third day he rose again ; he ascended into heaven , he is seated at the right hand of the Father , and he will come again to judge the living and the dead . I believe in the Holy Spirit , the holy catholic church , the communion of saints , the forgiveness of sins , the resurrection of the body , and the life everlasting . AMEN . Christian Commitment All applicants will be screened for their Christian commitment . This process will include a discussion of : The applicant ’ s spiritual journey and relationship with Jesus Christ as indicated in their statement of faith The applicant ’ s understanding and acceptance of the Apostle ’ s Creed . Statement of Faith A statement of faith describes your faith and how you see it as relevant to your involvement with World Hope International . 
It must include , at a minimum , a description of your spiritual disciplines ( prayer , Bible study , etc . ) and your current fellowship or place of worship . Applicants can either incorporate their statement of faith into their cover letter content or submit it as a separate document . 519 Mt Petrie Road Mackenzie , Qld 4156 1 - 800 - 967 - 534 ( World Hope ) + 61 7 3624 9977 CHEQUE Donations World Hope International ATTN : Gift Processing 519 Mt Petrie Road Mackenzie , Qld 4156 Spread the Word Stay Informed Join Email List Focused on the Mission In fiscal year 2015 , 88 % of all expenditures went to program services . Find out more . Privacy Policy | Terms of Service World Hope Australia Overseas Aid Fund is registered with the ACNC and all donations over $ 2 are tax deductible . ABN : 64 983 196 241 © 2017 WORLD HOPE INTERNATIONAL . All rights reserved .'

and the results:

('US', 'GPE')
('US', 'GPE')
('US', 'GPE')
('the', 'GPE')
('United', 'GPE')
('States', 'GPE')
('Jesus', 'GPE')
('Christ', 'GPE')
('Pontius', 'GPE')
('Pilate', 'GPE')
('Faith', 'GPE')
('A', 'GPE')

Solution

  • Even the best deep learning based NER systems only achieve an F1 of 92.0 these days. Deep learning based systems (CNN-BiLSTM-CRF) should outperform Stanford CoreNLP's plain CRF sequence tagger. More recently there have been further advances from integrating language models. You might want to look at AllenNLP.

    But if you want super high accuracy like 99.0%, you're going to have to integrate rule-based approaches for the time being.

    I think rule-based processing could be helpful. For instance, you can write a pattern that says "city city O , state" should be merged together into one entity. Also, you might want to consider discarding entities that don't appear in your dictionary of locations/places, or discarding entities that aren't in a location dictionary but are in a dictionary of another type. I find it hard to believe that many unknown string sequences are location names you care about extracting; I would think people's names are the most likely to be outside of dictionaries.
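The merge rule above could be sketched roughly as follows. This is only an illustration: the label names, the `BRIDGE_TOKENS` set, and the function name `merge_location_tokens` are assumptions, and your classifier's output format may differ.

```python
from typing import List, Tuple

# Assumed label inventory; adjust to whatever your classifier emits.
LOCATION_LABELS = {"city", "state", "country"}
# Tokens labelled as non-entities that may still sit inside one location.
BRIDGE_TOKENS = {","}


def merge_location_tokens(tokens: List[str],
                          labels: List[str]) -> List[Tuple[str, List[str]]]:
    """Group consecutive location tokens into one entity.

    A bridge token (a comma) is absorbed only when it sits between
    two location-labelled tokens, as in "San Francisco , CA".
    """
    entities: List[Tuple[str, List[str]]] = []
    span_tokens: List[str] = []
    span_labels: List[str] = []

    def flush() -> None:
        if span_tokens:
            text = " ".join(span_tokens).replace(" ,", ",")
            entities.append((text, list(span_labels)))
            span_tokens.clear()
            span_labels.clear()

    for i, (tok, lab) in enumerate(zip(tokens, labels)):
        next_is_location = i + 1 < len(labels) and labels[i + 1] in LOCATION_LABELS
        if lab in LOCATION_LABELS:
            span_tokens.append(tok)
            span_labels.append(lab)
        elif tok in BRIDGE_TOKENS and span_tokens and next_is_location:
            span_tokens.append(tok)
        else:
            flush()
    flush()
    return entities


tokens = ["Offices", "in", "San", "Francisco", ",", "CA", "and", "Paris"]
labels = ["O", "O", "city", "city", "O", "state", "O", "city"]
entities = merge_location_tokens(tokens, labels)
```

One limitation of a rule this simple: a comma-separated list such as "Paris, London, Rome" would also be merged into a single entity, so you would probably want to restrict bridging to specific label transitions (for example city followed by state).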

    UIUC's NLP tools have some dictionaries in them if you download their software.

    When running StanfordCoreNLP, using the ner,regexner,entitymentions annotators will allow automatic grouping together of consecutive NE tags into entities. Full info on the pipeline here: https://stanfordnlp.github.io/CoreNLP/cmdline.html

    Also, remember the out-of-the-box versions of these systems are typically trained on news articles from the last 15 years. Retraining on data closer to your set is essential. Ultimately you might be best off just writing some rules that do dictionary based extraction.
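A dictionary-based extractor of the kind suggested above can be as simple as a greedy longest-match lookup over token n-grams. The following is a sketch under assumptions: the gazetteer is a hypothetical set of lowercased token tuples, and `max_len` caps entries at four tokens.

```python
from typing import List, Set, Tuple


def gazetteer_matches(tokens: List[str],
                      gazetteer: Set[Tuple[str, ...]],
                      max_len: int = 4) -> List[Tuple[int, int, str]]:
    """Scan left to right and return (start, end, text) for the longest
    gazetteer entry starting at each position, without overlaps."""
    matches: List[Tuple[int, int, str]] = []
    i = 0
    while i < len(tokens):
        found = False
        # Try the longest n-gram first so "New York" beats "York".
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in gazetteer:
                matches.append((i, i + n, " ".join(tokens[i:i + n])))
                i += n
                found = True
                break
        if not found:
            i += 1
    return matches


gazetteer = {("san", "francisco"), ("new", "york"), ("york",), ("australia",)}
tokens = "Flights from New York to San Francisco and Australia".split()
matches = gazetteer_matches(tokens, gazetteer)
```

Matches like these can be used directly as extractions, or as a filter on what a statistical tagger proposes.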

    You can look into Stanford CoreNLP's TokensRegex and RegexNER functionality to see how to use Stanford CoreNLP for that purpose.

    TokensRegex: https://nlp.stanford.edu/software/tokensregex.html

    RegexNER: https://nlp.stanford.edu/software/regexner.html
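If you go the RegexNER route, its mapping file is tab-separated: a sequence of space-separated token regexes, then the NER category. As a rough sketch, a placename list could be turned into such lines like this. The helper name `regexner_lines` and the `LOCATION` label are illustrative choices, and using Python's `re.escape` to escape tokens for Java regexes is an assumption that should hold for ordinary punctuation; check the RegexNER page above for the exact format your CoreNLP version expects.

```python
import re
from typing import Iterable, List


def regexner_lines(placenames: Iterable[str], label: str = "LOCATION") -> List[str]:
    """Build tab-separated RegexNER mapping lines from plain placenames.

    Each whitespace-separated token is escaped so it matches literally
    rather than being interpreted as a regular expression.
    """
    lines = []
    for name in placenames:
        pattern = " ".join(re.escape(tok) for tok in name.split())
        lines.append(f"{pattern}\t{label}")
    return lines


lines = regexner_lines(["San Francisco", "St. Louis"])
mapping_text = "\n".join(lines) + "\n"
```

The resulting text would be written to a file and handed to CoreNLP through the `regexner.mapping` property.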