Unable to download the pipeline provided by the spark-nlp library

2024/10/15 15:28:21

I am unable to use the predefined pipeline "recognize_entities_dl" provided by the spark-nlp library.

I tried installing different versions of the pyspark and spark-nlp libraries.

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# create or get Spark Session
spark = sparknlp.start()

sparknlp.version()   # returns '2.1.0'
spark.version

# download, load, and annotate a text by pre-trained pipeline
pipeline = PretrainedPipeline('recognize_entities_dl', lang='en')
result = pipeline.annotate('Harry Potter is a great movie')
recognize_entities_dl download started this may take some time.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-b71a0f77e93a> in <module>
     11 # download, load, and annotate a text by pre-trained pipeline
     12
---> 13 pipeline = PretrainedPipeline('recognize_entities_dl', 'en')
     14 result = pipeline.annotate('Harry Potter is a great movie')

d:\python36\lib\site-packages\sparknlp\pretrained.py in __init__(self, name, lang, remote_loc)
     89
     90     def __init__(self, name, lang='en', remote_loc=None):
---> 91         self.model = ResourceDownloader().downloadPipeline(name, lang, remote_loc)
     92         self.light_model = LightPipeline(self.model)
     93

d:\python36\lib\site-packages\sparknlp\pretrained.py in downloadPipeline(name, language, remote_loc)
     50     def downloadPipeline(name, language, remote_loc=None):
     51         print(name + " download started this may take some time.")
---> 52         file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
     53         if file_size == "-1":
     54             print("Can not find the model to download please check the name!")

AttributeError: module 'sparknlp.internal' has no attribute '_GetResourceSize'
Answer

Thanks for confirming your Apache Spark version. The pre-trained pipelines and models depend on both the Apache Spark and Spark NLP versions. Apache Spark must be at least 2.4.x to be able to download the pre-trained models/pipelines; on any earlier version, you need to train your own models/pipelines instead.
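Since the mismatch comes down to version numbers, a quick sanity check like the following can tell you up front whether auto-download will work. The version strings here are hardcoded for illustration; in a live session you would read them from sparknlp.version() and spark.version:

```python
def version_tuple(v):
    """Parse a dotted version string like '2.3.2' into a comparable (major, minor) tuple."""
    return tuple(int(part) for part in v.split(".")[:2])

# Illustrative values; replace with sparknlp.version() and spark.version in a live session.
spark_nlp_version = "2.1.0"
spark_version = "2.3.2"

# Pretrained models/pipelines require Apache Spark 2.4.x or newer for auto-download.
if version_tuple(spark_version) < (2, 4):
    print("Apache Spark %s is too old for pretrained downloads; upgrade to 2.4.x"
          % spark_version)
```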

This is the list of all pipelines, and they are all for Apache Spark 2.4.x: https://nlp.johnsnowlabs.com/docs/en/pipelines

If you take a look at the URL of any models or pipelines you can see this information:

recognize_entities_dl_en_2.1.0_2.4_1562946909722.zip

  • Name: recognize_entities_dl
  • Lang: en
  • Spark NLP: must be equal to 2.1.0 or greater
  • Apache Spark: equal to 2.4.x or greater
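That naming convention can be unpacked programmatically. This sketch splits an archive name into its parts, assuming the field order shown above (name, language, Spark NLP version, Apache Spark version, timestamp):

```python
def parse_archive_name(filename):
    """Split a Spark NLP archive name into its conventional fields."""
    stem = filename[:-len(".zip")] if filename.endswith(".zip") else filename
    # Split from the right: the model name itself may contain underscores.
    name, lang, nlp_version, spark_version, timestamp = stem.rsplit("_", 4)
    return {"name": name, "lang": lang, "spark_nlp": nlp_version,
            "spark": spark_version, "timestamp": timestamp}

info = parse_archive_name("recognize_entities_dl_en_2.1.0_2.4_1562946909722.zip")
print(info)
```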

NOTE: The Spark NLP library is built and compiled against Apache Spark 2.4.x. That is why models and pipelines are only available for the 2.4.x version.

NOTE 2: Since you are using Windows, you need to use the _noncontrib models and pipelines, which are compatible with Windows (see: Do Spark-NLP pretrained pipelines only work on linux systems?).

I hope this answer helps and solves your issue.

UPDATE April 2020: Apparently the models and pipelines trained and uploaded on Apache Spark 2.4.x are compatible with Apache Spark 2.3.x as well. So if you are on Apache Spark 2.3.x, even though you cannot use pretrained() for auto-download, you can download the model manually and use .load() instead.
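The manual route can be sketched as follows. The local directory is a placeholder (named after the archive above, unzipped by hand from the spark-nlp-models list), and the Spark calls are shown commented out since they require a live session started via sparknlp.start():

```python
import os

# Hypothetical local path: the pipeline archive downloaded manually from the
# spark-nlp-models list and unzipped here (directory named after the archive).
local_dir = os.path.join("models", "recognize_entities_dl_en_2.1.0_2.4_1562946909722")

# With a Spark 2.3.x session already running, load the unzipped pipeline with
# .load() instead of the auto-downloading pretrained() helper:
# from pyspark.ml import PipelineModel
# from sparknlp.base import LightPipeline
# model = PipelineModel.load(local_dir)
# light = LightPipeline(model)
# print(light.annotate("Harry Potter is a great movie"))
print(local_dir)
```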

Full list of all models and pipelines with links to download: https://github.com/JohnSnowLabs/spark-nlp-models

Update: After the 2.4.0 release, all models and pipelines are cross-platform, and there is no need to choose a different model/pipeline for any specific OS: https://github.com/JohnSnowLabs/spark-nlp/releases/tag/2.4.0

For newer releases: https://github.com/JohnSnowLabs/spark-nlp/releases
