I need to extract the name, gender, job title and employer/company name from newspaper articles, running the process on local hardware (no Cloud allowed) due to copyright reasons.
I've been playing around with Llama 3.1 but I'm finding I don't get useable results with the models smaller than 70B parameters, and at that size the models run much too slowly on the best hardware I have to throw at them.
Is there another, smaller LLM that might be good at this while using fewer processing resources?
Is there an NER I can use to extract all that data? The NERs I've looked into extract name but not gender. (I don't know if they extract the other data because gender is a showstopper for me.)
Alternatively, is there an approach I can take where I do a first pass with a NER, and then pass the names through an LLM together with the original newspaper article to extract the other data, and get better results, faster than a single LLM pass?
Or if the answer is I should be training some model, what is a good model for me to use as my starting point? I'm very much at the beginning of my machine learning journey and would love to be pointed in the right direction.
Thanks in advance!
Apart from your hardware limitations, I wouldn't recommend using an LLM like Llama 3.1 for this task. NER is one of the classic tasks of NLP, and there are smaller language models and tools you can use to achieve your goal. You can use NLTK or spaCy for this; my personal choice is spaCy. However, gender as you defined it is not a known named entity. You can see a list of named entities in this doc.
I guess what you mean by gender is the probable gender associated with the name of a PERSON mentioned in your articles. There are a few Python packages that you can use to look up genders; however, you should note that this can be very ambiguous, and there should be a substantial tolerance for error. You can use the gender-guesser package.
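To make that tolerance for error concrete: as far as I know, gender-guesser's get_gender returns one of six labels (male, female, mostly_male, mostly_female, andy for androgynous names, and unknown). Below is a minimal sketch of how you could collapse those into a guess plus a confidence flag. LABEL_MAP, interpret_label and needs_review are hypothetical names of mine, and the confidence levels are my own assumption, not something the package provides.

```python
# Labels that gender-guesser's Detector.get_gender() can return.
# The (gender, confidence) pairs are an illustrative assumption, not part of the library.
LABEL_MAP = {
    "male": ("male", "high"),
    "female": ("female", "high"),
    "mostly_male": ("male", "low"),
    "mostly_female": ("female", "low"),
    "andy": (None, "ambiguous"),   # androgynous name
    "unknown": (None, "unknown"),  # name not found in the package's dictionary
}


def interpret_label(label):
    """Map a raw gender-guesser label to a (gender, confidence) pair (hypothetical helper)."""
    return LABEL_MAP.get(label, (None, "unknown"))


def needs_review(label):
    """Return True when a guess should be checked by a person rather than trusted."""
    gender, confidence = interpret_label(label)
    return gender is None or confidence != "high"
```

Anything flagged by needs_review could go into a separate pile for manual checking instead of being written to your results as a fact.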
A possible solution would be like this:
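    import spacy
    import gender_guesser.detector as gender

    # Load the model and the detector once, not on every call.
    nlp = spacy.load("en_core_web_sm")
    gender_detector = gender.Detector()

    def extract_info(text):
        """Return (name, guessed gender) pairs for each PERSON entity in text."""
        doc = nlp(text)
        results = []
        for ent in doc.ents:
            if ent.label_ == "PERSON":
                # gender-guesser looks up a single first name, not a full name.
                first_name = ent.text.split()[0]
                results.append((ent.text, gender_detector.get_gender(first_name)))
        return results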
Note that en_core_web_sm is the small model available via spaCy; you can use the large model by specifying en_core_web_lg. Just make sure that the model is downloaded before running your code. Here's how you can download the model:
python -m spacy download en_core_web_sm
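One more practical detail: gender-guesser looks up a single first name, while a PERSON entity from spaCy can be a full name and may sometimes include a title ("Dr. Jane Smith"). Here is a rough sketch of a pre-processing step, assuming English honorifics. first_name_of and the HONORIFICS set are hypothetical and only illustrative, so extend them for your corpus.

```python
# Honorifics to skip. This set is an illustrative assumption, not an exhaustive list.
HONORIFICS = {"mr", "mrs", "ms", "miss", "dr", "prof"}


def first_name_of(full_name):
    """Return the likely first name in a PERSON string, or None (hypothetical helper)."""
    tokens = full_name.split()
    # Drop tokens such as "Dr." or "Mr" before looking for the first name.
    rest = [t for t in tokens if t.rstrip(".").lower() not in HONORIFICS]
    had_honorific = len(rest) < len(tokens)
    if not rest or (had_honorific and len(rest) == 1):
        # e.g. "Mr Smith": the only remaining token is probably a surname.
        return None
    return rest[0]
```

When this returns None, it is probably safer to record the gender as unknown than to guess from a surname.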