Pyspark: Concat function generated columns into new dataframe

2024/10/14 11:19:31

I have a PySpark DataFrame (df) with n columns. I would like to generate another DataFrame with n columns, where each column records the percentage difference between consecutive rows in the corresponding column of the original df. The column headers in the new DataFrame should be the corresponding header in the old DataFrame plus "_diff". With the following code I can generate the percentage-change column for each column in the original df, but I am not able to put them in a new DataFrame with suitable column headers:

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as func

spark = (SparkSession.builder.appName('pct_change').enableHiveSupport().getOrCreate())

df = spark.createDataFrame([(1, 10, 11, 12), (2, 20, 22, 24), (3, 30, 33, 36)], ["index", "col1", "col2", "col3"])
w = Window.orderBy("index")

for i in range(1, len(df.columns)):
    col_pctChange = func.log(df[df.columns[i]]) - func.log(func.lag(df[df.columns[i]]).over(w))

Thanks

Answer

In this case, you can do a list comprehension inside of a call to select.

To make the code a little more compact, we can first get the columns we want to diff in a list:

diff_columns = [c for c in df.columns if c != 'index']

Next, select the index and iterate over diff_columns to compute the new columns. Use .alias() to rename each resulting column:

df_diff = df.select(
    'index',
    *[
        (func.log(func.col(c)) - func.log(func.lag(func.col(c)).over(w))).alias(c + "_diff")
        for c in diff_columns
    ]
)
df_diff.show()
#+-----+------------------+-------------------+-------------------+
#|index|         col1_diff|          col2_diff|          col3_diff|
#+-----+------------------+-------------------+-------------------+
#|    1|              null|               null|               null|
#|    2| 0.693147180559945| 0.6931471805599454| 0.6931471805599454|
#|    3|0.4054651081081646|0.40546510810816416|0.40546510810816416|
#+-----+------------------+-------------------+-------------------+
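
Note that the expression used here is a log difference, log(x) - log(lag(x)), which is close to the percentage change only when the change is small. The following plain-Python sketch is not part of the original answer; it compares the two quantities on col1 from the sample data, and the helper name lagged_pairs is hypothetical:

```python
import math

# Sample data from the question: (index, col1, col2, col3), already ordered by index.
rows = [(1, 10, 11, 12), (2, 20, 22, 24), (3, 30, 33, 36)]
col1 = [r[1] for r in rows]


def lagged_pairs(values):
    """Return (previous, current) pairs; the first row has no previous value."""
    return zip([None] + values[:-1], values)


# Log difference, mirroring log(x) - log(lag(x)) from the select above.
log_diff = [
    None if prev is None else math.log(cur) - math.log(prev)
    for prev, cur in lagged_pairs(col1)
]

# Simple fractional change: (x - lag(x)) / lag(x).
pct_change = [
    None if prev is None else (cur - prev) / prev
    for prev, cur in lagged_pairs(col1)
]

print(log_diff)
print(pct_change)  # [None, 1.0, 0.5]
```

If a true percentage change is what you need, one option (an assumption about the intent, not something the question confirms) is to swap the log expression for (func.col(c) - func.lag(func.col(c)).over(w)) / func.lag(func.col(c)).over(w) and keep the same .alias(c + "_diff").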