I am working on a project where I need to do coreference resolution on a lot of text. In doing so I've dipped my toe into the NLP world and found AllenNLP's coref model.
In general, I have a script where I use pandas to load a dataset of "articles" to be resolved and pass each article to a predictor created with Predictor.from_path().
Because of the large number of articles I want to resolve, I'm running this on a remote cluster (though I don't believe that is the source of the problem, since it also occurs when I run the script locally). My script looks something like this:
from allennlp.predictors.predictor import Predictor
import allennlp_models.tagging
import pandas as pd

print("HERE TEST")

def predictorFunc(article):
    predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2021.03.10.tar.gz")
    resolved_object = predictor.predict(document=article)
    ### Some other interrogation of the predicted clusters ###
    return resolved_object['document']

df = pd.read_csv('articles.csv')
### Some pandas magic ###
resolved_text = predictorFunc(article_pre_resolved)
When I execute the script, the following message is printed to my .log file before anything else (for example, the print("HERE TEST") that I included) -- even before the predictor object itself is called:
Some weights of BertModel were not initialized from the model checkpoint at SpanBERT/spanbert-large-cased and are newly initialized: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
I understand that this message itself is to be expected since I'm using a pre-trained model, but when it appears it completely locks up the .log file (nothing else gets printed until the script ends, at which point everything gets printed at once). This has been deeply problematic, as it makes it almost impossible to debug other parts of my script in a meaningful way. (It will also make tracking the final script's progress on a large dataset very difficult... :'( ) I would also very much like to know why the predictor object appears to load even before it gets called. Though I can't tell for sure, I suspect that whatever is causing this is also causing runaway memory use (even for toy examples of just a single 'article', a couple hundred words as a string).
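For what it's worth, the behaviour is the same even with a stripped-down version that skips pandas entirely and passes a single hard-coded string (the sentence below is just a placeholder for one of my couple-hundred-word toy articles):

from allennlp.predictors.predictor import Predictor
import allennlp_models.tagging

print("HERE TEST")

# Load the pre-trained SpanBERT coref model and resolve one toy "article".
predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/coref-spanbert-large-2021.03.10.tar.gz")
resolved = predictor.predict(document="Paul Allen was born on January 21, 1953. He grew up in Seattle, where he met his friend Bill Gates.")

# The output is a dict; 'document' holds the tokenised text and
# 'clusters' holds the predicted coreference clusters.
print(resolved['document'])
print(resolved['clusters'])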
Has anyone else had this problem/know why this happens? Thanks very much in advance!
I think I figured this out: there were two separate, unrelated problems in what I was doing. First, the out-of-order printing was down to SLURM buffering the job's output; using the --unbuffered option fixed the printing problem and made diagnosis much easier. The second problem (which looked like runaway memory usage) came from a very long article (approx. 10,000 words) that was just over the maximum input length the predictor can handle. I'm going to close this question now!
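For anyone who runs into the same two issues: besides passing --unbuffered to srun (or running python -u), the Python-side equivalents look roughly like this. MAX_WORDS below is a number I picked purely for illustration, not the model's documented limit, so tune it to whatever your setup tolerates:

# Flush prints immediately instead of letting them sit in the output buffer
# (the in-script counterpart of SLURM's --unbuffered / python -u).
print("HERE TEST", flush=True)

# Guard against articles that are too long for the predictor.
# MAX_WORDS is an illustrative cut-off, not the model's actual maximum.
MAX_WORDS = 5000

def truncate_article(article):
    words = article.split()
    return " ".join(words[:MAX_WORDS])

resolved_text = predictorFunc(truncate_article(article_pre_resolved))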