Un-persisting all dataframes in (py)spark

2024/11/20 16:33:24

I have a Spark application with several points where I would like to persist the current state. This is usually after a large step, or when caching a state that I would like to use multiple times. It appears that when I call cache on my dataframe a second time, a new copy is cached to memory. In my application, this leads to memory issues when scaling up. Even though a given dataframe is at most about 100 MB in my current tests, the cumulative size of the intermediate results grows beyond the allotted memory on the executor. See below for a small example that shows this behavior.

cache_test.py:

from pyspark import SparkContext
from pyspark.sql import HiveContext

spark_context = SparkContext(appName='cache_test')
hive_context = HiveContext(spark_context)

df = (hive_context.read
      .format('com.databricks.spark.csv')
      .load('simple_data.csv'))
df.cache()
df.show()

df = df.withColumn('C1+C2', df['C1'] + df['C2'])
df.cache()
df.show()

spark_context.stop()

simple_data.csv:

1,2,3
4,5,6
7,8,9

Looking at the application UI, there is a copy of the original dataframe in addition to the one with the new column. I can remove the original copy by calling df.unpersist() before the withColumn line. Is this the recommended way to remove cached intermediate results (i.e., call unpersist() before every cache())?
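For concreteness, the variant that avoids the duplicate looks like this (the same script as above, with an unpersist() added before the new column is derived):

df.cache()
df.show()

df.unpersist()  # drop the cached original before deriving the new dataframe
df = df.withColumn('C1+C2', df['C1'] + df['C2'])
df.cache()
df.show()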

Also, is it possible to purge all cached objects? In my application, there are natural breakpoints where I can simply purge all memory and move on to the next file. I would like to do this without creating a new Spark application for each input file.

Thank you in advance!

Answer

Spark 2.x

You can use Catalog.clearCache:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
...
spark.catalog.clearCache()
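For example, the per-file breakpoints described in the question could look like the sketch below (the file names are placeholders, and spark.read.csv stands in for whatever reader you use):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('cache_test').getOrCreate()

for path in ['file_1.csv', 'file_2.csv']:  # placeholder input files
    df = spark.read.csv(path)
    df.cache()
    # ... large steps that may cache further intermediate dataframes ...
    df.show()
    spark.catalog.clearCache()  # purge everything cached before the next file

spark.stop()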

Spark 1.x

You can use the SQLContext.clearCache method, which

Removes all cached tables from the in-memory cache.

from pyspark.sql import SQLContext
from pyspark import SparkContext

sqlContext = SQLContext.getOrCreate(SparkContext.getOrCreate())
...
sqlContext.clearCache()
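Either call removes everything cached in the current session/context, so at each natural breakpoint you can simply clear the cache and move on to the next input file without tearing down the SparkContext. Individual dataframes can still be dropped selectively with df.unpersist(), as you are doing now.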
