Tags: python, dataframe, nlp, spacy, spacy-3

Error inserting spacy.tokens.span.Span into a pandas DataFrame


I am using scispacy and trying out its Hearst Patterns feature, which returns spacy.tokens.span.Span objects. When I try to put the result into a DataFrame I get an error: each Span is treated as several words rather than as a single object.

Following the example:

import pandas as pd
import spacy
from scispacy.hyponym_detector import HyponymDetector

nlp = spacy.load("en_core_sci_sm")
nlp.add_pipe("hyponym_detector", last=True, config={"extended": False})

doc = nlp("Keystone plant species such as fig trees are good for the soil.")

doc_hp = doc._.hearst_patterns
print(doc_hp)
# [('such_as', Keystone plant species, fig trees)]
print(type(doc_hp[0][1]))
# <class 'spacy.tokens.span.Span'>

# Build one row from the first pattern; the two entities are Span objects.
dict = {
    "hp_connector": doc_hp[0][0],
    "hp_entity_1": doc_hp[0][1],
    "hp_entity_2": doc_hp[0][2],
}

df = pd.DataFrame.from_dict(dict)

This throws an error:

Traceback (most recent call last):
  File "extract_hearst_patterns.py", line 42, in <module>
    df = pd.DataFrame.from_dict(dict)
  File "/venv/lib/python3.9/site-packages/pandas/core/frame.py", line 1760, in from_dict
    return cls(data, index=index, columns=columns, dtype=dtype)
  File "/venv/lib/python3.9/site-packages/pandas/core/frame.py", line 709, in __init__
    mgr = dict_to_mgr(data, index, columns, dtype=dtype, copy=copy, typ=manager)
  File "/venv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 481, in dict_to_mgr
    return arrays_to_mgr(arrays, columns, index, dtype=dtype, typ=typ, consolidate=copy)
  File "/venv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 115, in arrays_to_mgr
    index = _extract_index(arrays)
  File "/venv/lib/python3.9/site-packages/pandas/core/internals/construction.py", line 655, in _extract_index
    raise ValueError("All arrays must be of the same length")
ValueError: All arrays must be of the same length

Solution

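  • This ended up working for me: convert each Span to a plain string (via .text) before building the DataFrame, and collect one dict per pattern in a list.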

    doc_hp = doc._.hearst_patterns
    list_of_pattern_dicts = []
    for pattern in doc_hp:
        # Store the Spans as plain text so pandas sees scalars,
        # not token sequences of differing lengths.
        pattern_dict = {
            "hp object": [pattern],
            "hp_connector": str(pattern[0]),
            "hp_entity_1": pattern[1].text,
            "hp_entity_2": pattern[2].text,
        }
        list_of_pattern_dicts.append(pattern_dict)
    # One dict per pattern becomes one row per pattern.
    df = pd.DataFrame(list_of_pattern_dicts)
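
The likely reason the original attempt failed is that a Span behaves like a sequence of tokens and has a length, so pandas appears to treat each Span as a column of values. "Keystone plant species" has three tokens and "fig trees" has two, which would explain the length mismatch. The conversion step above can be sketched as a small helper. To keep it runnable without spaCy or the model installed, this sketch uses a hypothetical stand-in class, FakeSpan, which mimics only the .text attribute of a real Span. The helper name patterns_to_records is also made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class FakeSpan:
    """Hypothetical stand-in for spacy.tokens.span.Span; only .text is modelled."""
    text: str


def patterns_to_records(patterns):
    """Turn (connector, span, span) triples into dicts of plain strings."""
    records = []
    for connector, entity_1, entity_2 in patterns:
        records.append(
            {
                "hp_connector": str(connector),
                # .text yields a plain str, which is stored as a single cell value
                "hp_entity_1": entity_1.text,
                "hp_entity_2": entity_2.text,
            }
        )
    return records


# Same shape as doc._.hearst_patterns in the question, with stand-in spans.
patterns = [("such_as", FakeSpan("Keystone plant species"), FakeSpan("fig trees"))]
records = patterns_to_records(patterns)
print(records)
```

With real spaCy output, the resulting list can be passed straight to pd.DataFrame(records), giving one row per pattern and one string column per field.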