Unable to download the pipeline provided by the spark-nlp library

2024/10/15 15:28:21

I am unable to use the predefined pipeline "recognize_entities_dl" provided by the spark-nlp library.

I tried installing different versions of the pyspark and spark-nlp libraries.

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

# create or get Spark Session
spark = sparknlp.start()

sparknlp.version()   # returns '2.1.0'
spark.version

# download, load, and annotate a text by pre-trained pipeline
pipeline = PretrainedPipeline('recognize_entities_dl', lang='en')
result = pipeline.annotate('Harry Potter is a great movie')
recognize_entities_dl download started this may take some time.
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-13-b71a0f77e93a> in <module>
     11 # download, load, and annotate a text by pre-trained pipeline
     12
---> 13 pipeline = PretrainedPipeline('recognize_entities_dl', 'en')
     14 result = pipeline.annotate('Harry Potter is a great movie')

d:\python36\lib\site-packages\sparknlp\pretrained.py in __init__(self, name, lang, remote_loc)
     89
     90     def __init__(self, name, lang='en', remote_loc=None):
---> 91         self.model = ResourceDownloader().downloadPipeline(name, lang, remote_loc)
     92         self.light_model = LightPipeline(self.model)
     93

d:\python36\lib\site-packages\sparknlp\pretrained.py in downloadPipeline(name, language, remote_loc)
     50     def downloadPipeline(name, language, remote_loc=None):
     51         print(name + " download started this may take some time.")
---> 52         file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
     53         if file_size == "-1":
     54             print("Can not find the model to download please check the name!")

AttributeError: module 'sparknlp.internal' has no attribute '_GetResourceSize'
Answer

Thanks for confirming your Apache Spark version. The pre-trained pipelines and models depend on both the Apache Spark and Spark NLP versions. Apache Spark must be at least 2.4.x to be able to download the pre-trained models/pipelines; on any earlier version, you need to train your own models/pipelines instead.
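Since the mismatch comes down to version numbers, a quick sanity check like the following can tell you up front whether auto-download will work. The version strings here are hardcoded for illustration; in a live session you would read them from sparknlp.version() and spark.version:

```python
def version_tuple(v):
    """Parse a dotted version string like '2.3.2' into a comparable (major, minor) tuple."""
    return tuple(int(part) for part in v.split(".")[:2])

# Illustrative values; replace with sparknlp.version() and spark.version in a live session.
spark_nlp_version = "2.1.0"
spark_version = "2.3.2"

# Pretrained models/pipelines require Apache Spark 2.4.x or newer for auto-download.
if version_tuple(spark_version) < (2, 4):
    print("Apache Spark %s is too old for pretrained downloads; upgrade to 2.4.x"
          % spark_version)
```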

This is the list of all pipelines, and they are all for Apache Spark 2.4.x: https://nlp.johnsnowlabs.com/docs/en/pipelines

If you take a look at the URL of any models or pipelines you can see this information:

recognize_entities_dl_en_2.1.0_2.4_1562946909722.zip

  • Name: recognize_entities_dl
  • Lang: en
  • Spark NLP: must be equal to 2.1.0 or greater
  • Apache Spark: equal to 2.4.x or greater
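That naming convention can be unpacked programmatically. This sketch splits an archive name into its parts, assuming the field order shown above (name, language, Spark NLP version, Apache Spark version, timestamp):

```python
def parse_archive_name(filename):
    """Split a Spark NLP archive name into its conventional fields."""
    stem = filename[:-len(".zip")] if filename.endswith(".zip") else filename
    # Split from the right: the model name itself may contain underscores.
    name, lang, nlp_version, spark_version, timestamp = stem.rsplit("_", 4)
    return {"name": name, "lang": lang, "spark_nlp": nlp_version,
            "spark": spark_version, "timestamp": timestamp}

info = parse_archive_name("recognize_entities_dl_en_2.1.0_2.4_1562946909722.zip")
print(info)
```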

NOTE: The Spark NLP library is built and compiled against Apache Spark 2.4.x. That is why models and pipelines are only available for the 2.4.x version.

NOTE 2: Since you are using Windows, you need to use the _noncontrib models and pipelines, which are compatible with Windows (see: Do Spark-NLP pretrained pipelines only work on linux systems?).

I hope this answer helps and solves your issue.

UPDATE April 2020: Apparently the models and pipelines trained and uploaded on Apache Spark 2.4.x are compatible with Apache Spark 2.3.x as well. So if you are on Apache Spark 2.3.x, even though you cannot use pretrained() for auto-download, you can download the model manually and use .load() instead.
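The manual route can be sketched as follows. The local directory is a placeholder (named after the archive above, unzipped by hand from the spark-nlp-models list), and the Spark calls are shown commented out since they require a live session started via sparknlp.start():

```python
import os

# Hypothetical local path: the pipeline archive downloaded manually from the
# spark-nlp-models list and unzipped here (directory named after the archive).
local_dir = os.path.join("models", "recognize_entities_dl_en_2.1.0_2.4_1562946909722")

# With a Spark 2.3.x session already running, load the unzipped pipeline with
# .load() instead of the auto-downloading pretrained() helper:
# from pyspark.ml import PipelineModel
# from sparknlp.base import LightPipeline
# model = PipelineModel.load(local_dir)
# light = LightPipeline(model)
# print(light.annotate("Harry Potter is a great movie"))
print(local_dir)
```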

Full list of all models and pipelines with links to download: https://github.com/JohnSnowLabs/spark-nlp-models

Update: After the 2.4.0 release, all models and pipelines are cross-platform, and there is no need to choose a different model/pipeline for any specific OS: https://github.com/JohnSnowLabs/spark-nlp/releases/tag/2.4.0

For newer releases: https://github.com/JohnSnowLabs/spark-nlp/releases
