PySpark reversing StringIndexer in nested array

2024/11/3 7:49:40

I'm using PySpark to do collaborative filtering with ALS. My original user and item IDs are strings, so I used StringIndexer to convert them to numeric indices (PySpark's ALS model requires numeric IDs).

After I've fitted the model, I can get the top 3 recommendations for each user like so:

recs = model.recommendForAllUsers(3)

The recs dataframe looks like so:

+-----------+--------------------+
|userIdIndex|     recommendations|
+-----------+--------------------+
|       1580|[[10096,3.6725707...|
|       4900|[[10096,3.0137873...|
|       5300|[[10096,2.7274625...|
|       6620|[[10096,2.4493625...|
|       7240|[[10096,2.4928937...|
+-----------+--------------------+
only showing top 5 rows

root
 |-- userIdIndex: integer (nullable = false)
 |-- recommendations: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- productIdIndex: integer (nullable = true)
 |    |    |-- rating: float (nullable = true)

I want to create a huge JSON dump of this dataframe, which I can do like so:

recs.toJSON().saveAsTextFile("name_i_must_hide.recs")

and a sample of these jsons is:

{"userIdIndex": 1580, "recommendations": [{"productIdIndex": 10096, "rating": 3.6725707}, {"productIdIndex": 10141, "rating": 3.61542}, {"productIdIndex": 11591, "rating": 3.536216}]}

The userIdIndex and productIdIndex keys are due to the StringIndexer transformation.

How can I get the original value of these columns back? I suspect I must use the IndexToString transformer, but I can't quite figure out how since the data is nested in an array inside the recs Dataframe.

I tried to use a Pipeline estimator (stages=[StringIndexer, ALS, IndexToString]), but it looks like it doesn't support these indexers.

Cheers!

Answer

In both cases you'll need access to the list of labels. This can be obtained from either a StringIndexerModel

user_indexer_model = ...  # type: StringIndexerModel
user_labels = user_indexer_model.labels

product_indexer_model = ...  # type: StringIndexerModel
product_labels = product_indexer_model.labels

or column metadata.
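A minimal sketch of the metadata route, assuming the column was produced by StringIndexer and still carries its ML attribute metadata (the "ml_attr" / "vals" keys). The helper names labels_from_metadata and labels_from_column are hypothetical, and indexed_df stands for whatever DataFrame came out of the indexer's transform; the recs DataFrame returned by ALS may not carry this metadata.

```python
def labels_from_metadata(metadata):
    """Return the label list stored under metadata["ml_attr"]["vals"].

    `metadata` is the plain dict found on a StructField of a column
    produced by StringIndexer (assumption: the nominal attribute is present).
    """
    try:
        return list(metadata["ml_attr"]["vals"])
    except KeyError:
        raise ValueError("column carries no nominal ML attribute metadata")


def labels_from_column(df, column):
    """Look up the StructField for `column` and read its labels."""
    # df.schema[column] is a StructField; its .metadata attribute is a dict.
    return labels_from_metadata(df.schema[column].metadata)
```

Usage would look like user_labels = labels_from_column(indexed_df, "userIdIndex"), where the position of each label in the list is the index that StringIndexer assigned to it.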

For userIdIndex you can just apply IndexToString:

from pyspark.ml.feature import IndexToString

user_id_to_label = IndexToString(
    inputCol="userIdIndex", outputCol="userId", labels=user_labels)
user_id_to_label.transform(recs)

For recommendations you'll need either a udf or an expression like this:

from pyspark.sql.functions import array, col, lit, struct

n = 3  # Same as numItems

product_labels_ = array(*[lit(x) for x in product_labels])
recommendations = array(*[struct(
    product_labels_[col("recommendations")[i]["productIdIndex"]].alias("productId"),
    col("recommendations")[i]["rating"].alias("rating")
) for i in range(n)])

recs.withColumn("recommendations", recommendations)
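For comparison, the udf route could be sketched as follows. The names relabel and make_relabel_udf are hypothetical, and the sketch assumes product_labels is small enough to ship inside the udf's closure. The mapping itself is plain Python, so it can be checked without a Spark session; only make_relabel_udf touches pyspark.

```python
def relabel(recommendations, labels):
    """Map each (productIdIndex, rating) entry to (productId, rating)."""
    if recommendations is None:  # the array column is nullable
        return None
    return [(labels[int(r["productIdIndex"])], float(r["rating"]))
            for r in recommendations]


def make_relabel_udf(labels):
    """Wrap relabel in a udf returning array<struct<productId, rating>>."""
    # Imported lazily so the pure-Python part above has no Spark dependency.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import (
        ArrayType, FloatType, StringType, StructField, StructType)

    schema = ArrayType(StructType([
        StructField("productId", StringType()),
        StructField("rating", FloatType()),
    ]))
    return udf(lambda recs: relabel(recs, labels), schema)
```

It would be applied as recs.withColumn("recommendations", make_relabel_udf(product_labels)("recommendations")). Unlike the expression above, this does not need n to match numItems, but a Python udf is typically slower than native column expressions.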