java python apache-spark pyspark johnsnowlabs-spark-nlp

java.lang.ClassNotFoundException: com.johnsnowlabs.nlp.DocumentAssembler in Spark in PyCharm with conda env


I saved a pre-trained model from spark-nlp, and now I'm trying to load it from a Python script in PyCharm with an Anaconda env:

from pyspark.ml import PipelineModel

Model_path = "./xxx"
model = PipelineModel.load(Model_path)

But I get the following error. I tried pyspark 2.4.4 with spark-nlp 2.4.4, and pyspark 2.4.4 with spark-nlp 2.5.4; both fail the same way:

21/02/05 13:31:21 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).

Traceback (most recent call last):
  File "C:/Users/xxxx/xxxxx.py", line 381, in <module>
    model = PipelineModel.load(Model_path)
  File "C:\Users\xxxxxxxx\anaconda3\envs\python3.7\lib\site-packages\pyspark\ml\util.py", line 362, in load
    return cls.read().load(path)
  File "C:\Users\\xxxxxxxx\anaconda3\envs\python3.7\lib\site-packages\pyspark\ml\pipeline.py", line 242, in load
    return JavaMLReader(self.cls).load(path)
  File "C:\Users\xxxxxxxx\anaconda3\envs\python3.7\lib\site-packages\pyspark\ml\util.py", line 300, in load
    java_obj = self._jread.load(path)
  File "C:\Users\xxxxxxxx\anaconda3\envs\python3.7\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\Users\xxxxxxxx\anaconda3\envs\python3.7\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\Users\xxxxxxxx\anaconda3\envs\python3.7\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
    format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling o314.load.
: java.lang.ClassNotFoundException: com.johnsnowlabs.nlp.DocumentAssembler

I'm new to pyspark and spark-nlp; could someone help me out, please?


Solution

  • Some context first: the spark-nlp library depends on a jar file that needs to be present on the Spark classpath. There are three ways to provide this jar, depending on how you start the context in PySpark. a) When you start your Python app through the interpreter, you call sparknlp.start() and the jar is downloaded automatically; see the sketch just below.
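
    For example, a minimal sketch of option (a); sparknlp.start() creates the SparkSession with the jar on the classpath ("./xxx" is the placeholder path from your question):

    import sparknlp
    from pyspark.ml import PipelineModel

    # start() creates a SparkSession and downloads the matching spark-nlp jar
    spark = sparknlp.start()

    model = PipelineModel.load("./xxx")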

    b) You pass the jar to the pyspark command using the --jars switch. In this case you download the jar manually from the releases page; a PyCharm-friendly sketch follows.
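
    Since you're running from PyCharm rather than the pyspark command line, the equivalent of --jars is the spark.jars config when building the session. A minimal sketch; the jar path and filename here are illustrative, so use whatever you downloaded:

    from pyspark.sql import SparkSession

    # path is illustrative; point it at the jar you downloaded from the releases page
    spark = SparkSession.builder \
        .appName("spark-nlp") \
        .config("spark.jars", "/path/to/spark-nlp-assembly-2.7.5.jar") \
        .getOrCreate()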

    c) You start pyspark and pass --packages; here you need to pass a Maven coordinate, for example (a PyCharm equivalent is sketched after the command):

    pyspark --packages com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.5
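
    The PyCharm equivalent of --packages is the spark.jars.packages config; a sketch using the same Maven coordinate:

    from pyspark.sql import SparkSession

    # the _2.11 suffix is the Scala version; it must match your Spark build (2.11 for Spark 2.4.x)
    spark = SparkSession.builder \
        .appName("spark-nlp") \
        .config("spark.jars.packages", "com.johnsnowlabs.nlp:spark-nlp_2.11:2.7.5") \
        .getOrCreate()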
    

    Please check the documentation here,

    https://github.com/JohnSnowLabs/spark-nlp#usage

    and make sure you pick the version you want.