databricks, azure-databricks, spacy-3

Error while importing 'en_core_web_sm' for spacy in Azure Databricks


I am getting an error when loading spaCy's 'en_core_web_sm' model in a Databricks notebook. I have seen many other questions about the same error, but none of the answers helped.

The code is as follows:

    import spacy
    !python -m spacy download en_core_web_sm
    from spacy import displacy

    nlp = spacy.load("en_core_web_sm")
    # Process
    text = "This is a test document"
    doc = nlp(text)

I get the error "OSError: [E050] Can't find model 'en_core_web_sm'. It doesn't seem to be a Python package or a valid path to a data directory"

The details of installation are

  Python - 3.8.10
  spaCy version 3.3

It simply does not work. I also checked the installed model, which reported the following:

   ℹ spaCy installation:
   /databricks/python3/lib/python3.8/site-packages/spacy

   NAME             SPACY                 VERSION                            
   en_core_web_sm   >=2.2.2               3.3.0   ✔
   

But the error remains.

I am not sure whether this warning is relevant:

    /databricks/python3/lib/python3.8/site-packages/spacy/util.py:845: UserWarning: [W094] Model 'en_core_web_sm' (2.2.5) specifies an under-constrained spaCy version requirement: >=2.2.2. This can lead to compatibility problems with older versions, or as new spaCy versions are released, because the model may say it's compatible when it's not. Consider changing the "spacy_version" in your meta.json to a version range, with a lower and upper pin. For example: >=3.3.0,<3.4.0
      warnings.warn(warn_msg)

There is also this message when installing 'en_core_web_sm':

"Defaulting to user installation because normal site-packages is not writeable"

Any help will be appreciated

Ganesh


Solution

  • I suspect that you have a cluster with autoscaling enabled, and when it scaled up, the new nodes didn't have that module installed. Another possible reason is that a cluster node was terminated by the cloud provider and the cluster manager pulled in a new node.

    To prevent such situations, I recommend using a cluster init script, as described in the following answer. It guarantees that the module is installed even on new nodes. The content of the script is really simple:

    #!/bin/bash
    
    pip install spacy
    python -m spacy download en_core_web_sm
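
    If you also want a guard inside the notebook itself, something along these lines might help. It is only a rough sketch, not a replacement for the init script: it checks whether the model package is importable by the interpreter the notebook is actually running, and if not, runs the download through that same interpreter (`sys.executable`) rather than a bare `python`. The helper names `model_is_installed`, `download_command` and `ensure_model` are made up for illustration, and I am assuming that the downloaded package ends up somewhere on that interpreter's `sys.path`, which may not hold if pip falls back to a user installation.

```python
import importlib
import importlib.util
import subprocess
import sys


def model_is_installed(name: str) -> bool:
    """Return True if `name` is importable by the interpreter running this code."""
    return importlib.util.find_spec(name) is not None


def download_command(name: str) -> list:
    """Build a spaCy download command bound to the current interpreter.

    Using sys.executable instead of a bare `python` keeps the install in the
    same environment the notebook imports from.
    """
    return [sys.executable, "-m", "spacy", "download", name]


def ensure_model(name: str = "en_core_web_sm") -> bool:
    """Hypothetical helper: download the model only if it is missing.

    Returns True if a download was attempted, False if the model was
    already importable.
    """
    if model_is_installed(name):
        return False
    subprocess.run(download_command(name), check=True)
    # Let the import system notice packages installed after startup.
    importlib.invalidate_caches()
    return True
```

    You would call `ensure_model()` once before `spacy.load("en_core_web_sm")`. Note that this runs only in the process that executes the cell, so the init script is still what covers newly added nodes.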