Count occurrences of a list of substrings in a pyspark df column

2024/11/14 13:55:45

I want to count the occurrences of list of substrings and create a column based on a column in the pyspark df which contains a long string.

Input:          ID    History1     USA|UK|IND|DEN|MAL|SWE|AUS2     USA|UK|PAK|NOR3     NOR|NZE4     IND|PAK|NORlst=['USA','IND','DEN']Output :ID    History                      Count1     USA|UK|IND|DEN|MAL|SWE|AUS    32     USA|UK|PAK|NOR                13     NOR|NZE                       04     IND|PAK|NOR                   1
Answer
# Importing requisite packages and creating a DataFrame
from pyspark.sql.functions import split, col, size, regexp_replace
values = [(1,'USA|UK|IND|DEN|MAL|SWE|AUS'),(2,'USA|UK|PAK|NOR'),(3,'NOR|NZE'),(4,'IND|PAK|NOR')]
df = sqlContext.createDataFrame(values,['ID','History'])
df.show(truncate=False)
+---+--------------------------+
|ID |History                   |
+---+--------------------------+
|1  |USA|UK|IND|DEN|MAL|SWE|AUS|
|2  |USA|UK|PAK|NOR            |
|3  |NOR|NZE                   |
|4  |IND|PAK|NOR               |
+---+--------------------------+

The idea is to split the string based on these three delimiters: lst=['USA','IND','DEN'] and then count the number of substrings produced.

For eg; the string USA|UK|IND|DEN|MAL|SWE|AUS gets split like - ,, |UK|, |, |MAL|SWE|AUS. Since, there were 4 substrings created and there were 3 delimiters matches, so 4-1 = 3 gives the count of these strings appearing in the column string.

I am not sure if multi character delimiters are supported in Spark, so as a first step, we replace any of these 3 sub-strings in the list ['USA','IND','DEN'] with a flag/dummy value %. You could use something else as well. The following code does this replacement -

df = df.withColumn('History_X',col('History'))
lst=['USA','IND','DEN']
for i in lst:df = df.withColumn('History_X', regexp_replace(col('History_X'), i, '%'))
df.show(truncate=False)
+---+--------------------------+--------------------+
|ID |History                   |History_X           |
+---+--------------------------+--------------------+
|1  |USA|UK|IND|DEN|MAL|SWE|AUS|%|UK|%|%|MAL|SWE|AUS|
|2  |USA|UK|PAK|NOR            |%|UK|PAK|NOR        |
|3  |NOR|NZE                   |NOR|NZE             |
|4  |IND|PAK|NOR               |%|PAK|NOR           |
+---+--------------------------+--------------------+

Finally, we count the number of substrings created by splitting it first with % being the delimiter, then counting the number of substrings created with size function and finally subtracting 1 from it.

df = df.withColumn('Count', size(split(col('History_X'), "%")) - 1).drop('History_X')
df.show(truncate=False)
+---+--------------------------+-----+
|ID |History                   |Count|
+---+--------------------------+-----+
|1  |USA|UK|IND|DEN|MAL|SWE|AUS|3    |
|2  |USA|UK|PAK|NOR            |1    |
|3  |NOR|NZE                   |0    |
|4  |IND|PAK|NOR               |1    |
+---+--------------------------+-----+
https://en.xdnf.cn/q/72090.html

Related Q&A

What are screen units in tkinter?

I was reading the response in the link below and ran into screen units but I couldnt find what exactly was referred to by screen units in Jim Denneys response. I know they are not pixels. How do I use …

Python SUMPRODUCT of elements in nested list

I have two nested lists: a = [[1,2,3],[2,4,2]] b = [[5,5,5],[1,1,1]]I want to multiply and SUMPRODUCT each group of elements to get c = [[30],[8]]Which result from = [[1*5+2*5+3*5],[2*1,4*1,2*1]] Ive t…

Modifying viridis colormap (replacing some colors)

Ive searched around and found things that came close to working but nothing exactly suiting what I need. Basically, I really like the viridis colormap as a starting point. However, I would like to repl…

gtk minimum size

Is there an easy way to request that a GTK widget have a minimum width/height? I know you can do it on the column of a TreeView, but is it available for general widgets?

How do I convert a json file to a python class?

Consider this json file named h.json I want to convert this into a python dataclass. {"acc1":{"email":"[email protected]","password":"acc1","name&…

PyTorch how to compute second order Jacobian?

I have a neural network thats computing a vector quantity u. Id like to compute first and second-order jacobians with respect to the input x, a single element. Would anybody know how to do that in PyTo…

Tensorflow setup on RStudio/ R | CentOS

For the last 5 days, I am trying to make Keras/Tensorflow packages work in R. I am using RStudio for installation and have used conda, miniconda, virtualenv but it crashes each time in the end. Install…

Cant import soundfile

Im using Anaconda and Im trying to import soundfile/pysoundfile. I installed the package by running conda install -c conda-forge pysoundfile and I think it succeeded because when I run conda list it sh…

Most efficient way to multiply a small matrix with a scalar in numpy

I have a program whose main performance bottleneck involves multiplying matrices which have one dimension of size 1 and another large dimension, e.g. 1000: large_dimension = 1000a = np.random.random((1…

MultiValueDictKeyError / request.POST

I think I hav a problem at request.POST[title]MultiValueDictKeyError at /blog/add/post/"title"Request Method: GETRequest URL: http://119.81.247.69:8000/blog/add/post/Django Version: 1.8.…