apache-spark pyspark johnsnowlabs-spark-nlp

How to install offline Spark NLP packages


How can I install offline Spark NLP packages without an internet connection? I've downloaded the package (recognize_entities_dl) and uploaded it to the cluster.

I installed Spark NLP with pip install spark-nlp==2.5.5. I'm using PySpark, and the cluster itself cannot download the packages.

I have already tried:

pipeline = PretrainedPipeline.from_disk('/path/to/recognize_entities_dl')
pipeline = PretrainedPipeline.load('/path/to/recognize_entities_dl')

Errors:

'PretrainedPipeline' has no attribute 'load'

Input path does not exist:
    hdfs://...../recognize_entities_dl_en_2.4.3_2.4_1584626752821/metatdata

Solution

  • Looking at your error:

     hdfs://...../recognize_entities_dl_en_2.4.3_2.4_1584626752821/metatdata
    

    The last path component is misspelled: metatdata should be metadata (remove the extra "t").

    Also, note the 2.4.3 in "recognize_entities_dl_en_2.4.3_2.4_1584626752821". It indicates that the pipeline was built for Spark NLP 2.4.3.

    In the question, however, you mention that you are using

    spark-nlp==2.5.5

    That is generally fine as long as the installed version is at least the version the pipeline was built for (here, 2.5.5 >= 2.4.3), but the mismatch sometimes causes issues.

    Likewise, the 2.4 in "recognize_entities_dl_en_2.4.3_2.4_1584626752821" indicates that the pipeline is for Apache Spark 2.4.

    The Spark NLP library is built and compiled against Apache Spark 2.4.x. That is why models and pipelines are only available for the 2.4.x versions.
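
The version checks described above can be sketched in plain Python. This is only an illustration, not part of Spark NLP: the helper names (parse_model_folder, is_compatible) are hypothetical, and the folder layout name_lang_sparknlpversion_sparkversion_timestamp is an assumption inferred from the single folder name in the question (the answer only confirms the meaning of the 2.4.3 and 2.4 fields).

```python
from typing import NamedTuple, Tuple


class ModelFolder(NamedTuple):
    name: str
    lang: str                           # assumed to be a language code, e.g. "en"
    spark_nlp_version: Tuple[int, ...]  # e.g. (2, 4, 3)
    spark_version: Tuple[int, ...]      # e.g. (2, 4)
    timestamp: str                      # assumed to be a build timestamp


def _version(text: str) -> Tuple[int, ...]:
    """Turn '2.4.3' into (2, 4, 3) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def parse_model_folder(folder: str) -> ModelFolder:
    """Split a downloaded pipeline folder name into its assumed fields."""
    # Split from the right, because the pipeline name itself contains underscores.
    name_and_lang, nlp_version, spark_version, timestamp = folder.rsplit("_", 3)
    name, lang = name_and_lang.rsplit("_", 1)
    return ModelFolder(name, lang, _version(nlp_version),
                       _version(spark_version), timestamp)


def is_compatible(folder: str, installed_spark_nlp: str, installed_spark: str) -> bool:
    """Apply the two rules from the answer:
    installed Spark NLP >= the pipeline's Spark NLP version, and
    installed Apache Spark is in the same release line (e.g. 2.4.x)."""
    info = parse_model_folder(folder)
    nlp_ok = _version(installed_spark_nlp) >= info.spark_nlp_version
    spark_prefix = _version(installed_spark)[:len(info.spark_version)]
    return nlp_ok and spark_prefix == info.spark_version


folder = "recognize_entities_dl_en_2.4.3_2.4_1584626752821"
print(parse_model_folder(folder))
print(is_compatible(folder, installed_spark_nlp="2.5.5", installed_spark="2.4.5"))
```

The installed Apache Spark version "2.4.5" in the last line is a made-up example value; substitute whatever your cluster reports. A passing check only says the versions line up on paper; as noted above, a newer Spark NLP release can still occasionally cause issues with an older pipeline.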