Combining Spark Streaming + MLlib

2024/10/5 1:13:51

I'm trying to use a Random Forest model to classify a stream of examples, but it appears that I cannot use the model that way. Here is the PySpark code:

sc = SparkContext(appName="App")
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={}, impurity='gini', numTrees=150)
ssc = StreamingContext(sc, 1)
lines = ssc.socketTextStream(hostname, int(port))
parsedLines = lines.map(parse)
parsedLines.pprint()
predictions = parsedLines.map(lambda event: model.predict(event.features))

and this is the error returned when running it on the cluster:

  Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Is there a way to use a model trained on static data to predict streaming examples?

Thanks guys, I really appreciate it!

Answer

Yes, you can use a model trained on static data. The problem you are experiencing is not related to streaming at all. You simply cannot use a JVM-based model inside an action or a transformation (see How to use Java/Scala function from an action or a transformation? for an explanation why). Instead, apply the predict method to a complete RDD, for example using transform on the DStream:

from pyspark.mllib.tree import RandomForest
from pyspark.mllib.util import MLUtils
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from operator import attrgetter

sc = SparkContext("local[2]", "foo")
ssc = StreamingContext(sc, 1)

data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
trainingData, testData = data.randomSplit([0.7, 0.3])

# categoricalFeaturesInfo is a required argument; {} treats every feature as continuous
model = RandomForest.trainClassifier(
    trainingData, numClasses=2, categoricalFeaturesInfo={}, numTrees=3
)

(ssc
    .queueStream([testData])
    # Extract features
    .map(attrgetter("features"))
    # Predict on the whole RDD of each batch; this call runs on the driver
    .transform(lambda _, rdd: model.predict(rdd))
    .pprint())

ssc.start()
ssc.awaitTerminationOrTimeout(10)
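
Applied to the socket stream from the question, the same idea might look like the sketch below. It assumes, as in the question, that parse returns objects with a features attribute and that the model's predict method accepts an RDD of feature vectors. The name predict_stream is a hypothetical helper chosen for illustration, not part of PySpark.

```python
def predict_stream(model, parsed_lines):
    """Return a DStream of predictions for a DStream of parsed events.

    Hypothetical helper (not a PySpark API). Assumes each event has a
    `features` attribute, as in the question, and that `model.predict`
    accepts an RDD of feature vectors.
    """
    def predict_batch(rdd):
        # Called once per batch on the driver. Only the small lambda below
        # is shipped to the workers, and it does not reference the model.
        features = rdd.map(lambda event: event.features)
        # The model sees the whole RDD instead of being called per record.
        return model.predict(features)

    return parsed_lines.transform(predict_batch)
```

With the names from the question, usage would be predictions = predict_stream(model, lines.map(parse)) followed by predictions.pprint(), both before ssc.start().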