I'm using scispacy and trying out its Hearst Patterns feature, which returns spacy.tokens.span.Span objects. When I try to put the result into a pandas DataFrame I get an error: each Span is treated as a sequence of several words rather than as a single object.
Following the example -
import pandas as pd
import spacy
from scispacy.hyponym_detector import HyponymDetector
nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("hyponym_detector", last=True, config={"extended": False})
doc = nlp("Keystone plant species such as fig trees are good for the soil.")
print(doc._.hearst_patterns)
>>> [('such_as', Keystone plant species, fig trees)]
doc_hp = doc._.hearst_patterns
print(type(doc_hp[0][1]))
>>> <class 'spacy.tokens.span.Span'>
dict = {
    "hp_connector": doc_hp[0][0],
    "hp_entity_1": doc_hp[0][1],
    "hp_entity_2": doc_hp[0][2],
}
df = pd.DataFrame.from_dict(dict)
This throws an error:
Traceback (most recent call last):
File "extract_hearst_patterns.py", line 42, in <module>
df = pd.DataFrame.from_dict(dict)
File "/venv/lib/python3.9/site-packages/pandas/core/frame.py", line 1760, in from_dict
return cls(data, index=index, columns=columns, dtype=dtype)
File "/venv/lib/python3.9/site-packages/pandas/core/frame.py", line 709, in __init__
mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
File "/venv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 481, in dict_to_mgr
return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
File "/venv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 115, in arrays_to_mgr
index = _extract_index(arrays)
File "/venv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 655, in _extract_index
raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length
This ended up working for me -
doc_hp = doc._.hearst_patterns
list_of_pattern_dicts = []
for pattern in doc_hp:
    # Store the Span text as plain strings so pandas sees one value per cell
    pattern_dict = {
        "hp object": [pattern],
        "hp_connector": str(pattern[0]),
        "hp_entity_1": pattern[1].text,
        "hp_entity_2": pattern[2].text,
    }
    list_of_pattern_dicts.append(pattern_dict)
df = pd.DataFrame.from_dict(list_of_pattern_dicts)
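
As far as I can tell, this works because a Span is iterable and has a length (one item per token), so pandas appears to read each Span as a column of values. The two Spans above have 3 and 2 tokens, which would explain the "All arrays must be of the same length" error. Converting each Span to its .text first avoids that. Below is a minimal sketch of the conversion as a reusable helper. The names patterns_to_rows and FakeSpan are my own, not part of scispacy or spaCy; FakeSpan is only a stand-in so the snippet runs without a model installed.

```python
from typing import Any, Dict, Iterable, List, Tuple


def patterns_to_rows(patterns: Iterable[Tuple[str, Any, Any]]) -> List[Dict[str, str]]:
    """Turn (connector, entity_1, entity_2) triples into rows of plain strings."""
    rows = []
    for connector, entity_1, entity_2 in patterns:
        rows.append({
            "hp_connector": str(connector),
            # Use the Span's .text; fall back to str() if a plain string is passed
            "hp_entity_1": getattr(entity_1, "text", str(entity_1)),
            "hp_entity_2": getattr(entity_2, "text", str(entity_2)),
        })
    return rows


class FakeSpan:
    """Hypothetical stand-in for spacy.tokens.Span (assumption: only .text is needed)."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __len__(self) -> int:
        # Like a real Span, the length is a token count, not 1
        return len(self.text.split())


patterns = [("such_as", FakeSpan("Keystone plant species"), FakeSpan("fig trees"))]
rows = patterns_to_rows(patterns)
print(rows)
```

With real output, pd.DataFrame(patterns_to_rows(doc._.hearst_patterns)) should give one row per pattern, with every cell holding a plain string.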