Question 1

I want to count the occurrences of list of substrings and create a column based on a column in the pyspark df which contains a long string.

Input:          ID    History1     USA|UK|IND|DEN|MAL|SWE|AUS2     USA|UK|PAK|NOR3     NOR|NZE4     IND|PAK|NORlst=['USA','IND','DEN']Output :ID    History                      Count1     USA|UK|IND|DEN|MAL|SWE|AUS    32     USA|UK|PAK|NOR                13     NOR|NZE                       04     IND|PAK|NOR                   1

Question 2

# Importing requisite packages and creating a DataFrame
from pyspark.sql.functions import split, col, size, regexp_replace
values = [(1,'USA|UK|IND|DEN|MAL|SWE|AUS'),(2,'USA|UK|PAK|NOR'),(3,'NOR|NZE'),(4,'IND|PAK|NOR')]
df = sqlContext.createDataFrame(values,['ID','History'])
df.show(truncate=False)
+---+--------------------------+
|ID |History                   |
+---+--------------------------+
|1  |USA|UK|IND|DEN|MAL|SWE|AUS|
|2  |USA|UK|PAK|NOR            |
|3  |NOR|NZE                   |
|4  |IND|PAK|NOR               |
+---+--------------------------+

The idea is to split the string based on these three delimiters: lst=['USA','IND','DEN'] and then count the number of substrings produced.

For eg; the string USA|UK|IND|DEN|MAL|SWE|AUS gets split like - ,, |UK|, |, |MAL|SWE|AUS. Since, there were 4 substrings created and there were 3 delimiters matches, so 4-1 = 3 gives the count of these strings appearing in the column string.

I am not sure if multi character delimiters are supported in Spark, so as a first step, we replace any of these 3 sub-strings in the list ['USA','IND','DEN'] with a flag/dummy value %. You could use something else as well. The following code does this replacement -

df = df.withColumn('History_X',col('History'))
lst=['USA','IND','DEN']
for i in lst:df = df.withColumn('History_X', regexp_replace(col('History_X'), i, '%'))
df.show(truncate=False)
+---+--------------------------+--------------------+
|ID |History                   |History_X           |
+---+--------------------------+--------------------+
|1  |USA|UK|IND|DEN|MAL|SWE|AUS|%|UK|%|%|MAL|SWE|AUS|
|2  |USA|UK|PAK|NOR            |%|UK|PAK|NOR        |
|3  |NOR|NZE                   |NOR|NZE             |
|4  |IND|PAK|NOR               |%|PAK|NOR           |
+---+--------------------------+--------------------+

Finally, we count the number of substrings created by splitting it first with % being the delimiter, then counting the number of substrings created with size function and finally subtracting 1 from it.

df = df.withColumn('Count', size(split(col('History_X'), "%")) - 1).drop('History_X')
df.show(truncate=False)
+---+--------------------------+-----+
|ID |History                   |Count|
+---+--------------------------+-----+
|1  |USA|UK|IND|DEN|MAL|SWE|AUS|3    |
|2  |USA|UK|PAK|NOR            |1    |
|3  |NOR|NZE                   |0    |
|4  |IND|PAK|NOR               |1    |
+---+--------------------------+-----+

Count occurrences of a list of substrings in a pyspark df column

Related Q&A

What are screen units in tkinter?

Python SUMPRODUCT of elements in nested list

Modifying viridis colormap (replacing some colors)

gtk minimum size

How do I convert a json file to a python class?

PyTorch how to compute second order Jacobian?

Tensorflow setup on RStudio/ R | CentOS

Cant import soundfile

Most efficient way to multiply a small matrix with a scalar in numpy

MultiValueDictKeyError / request.POST