Spark 1.4 increase maxResultSize memory

2024/11/20 7:13:20

I am using Spark 1.4 for my research and struggling with the memory settings. My machine has 16 GB of memory, which should be plenty since my file is only 300 MB. However, when I try to convert a Spark RDD to a pandas DataFrame using the toPandas() function, I receive the following error:

serialized results of 9 tasks (1096.9 MB) is bigger than spark.driver.maxResultSize (1024.0 MB)

I tried to fix this by changing the spark-config file, but I still get the same error. I've heard that this is a problem with Spark 1.4 and am wondering if you know how to solve it. Any help is much appreciated.

Answer

You can set the spark.driver.maxResultSize parameter in the SparkConf object:

from pyspark import SparkConf, SparkContext

# In Jupyter you have to stop the current context first
sc.stop()

# Create new config
conf = SparkConf().set("spark.driver.maxResultSize", "2g")

# Create new context
sc = SparkContext(conf=conf)

You should probably create a new SQLContext as well:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
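
If you are unsure which value to use, one option is to read the reported size out of the error message and round up with some headroom. The sketch below is only an illustration: suggest_max_result_size is a hypothetical helper (not part of PySpark), the 1.5x headroom factor is an arbitrary assumption, and the regular expression assumes the message has the same shape as the one quoted in the question.

```python
import math
import re

# Multipliers to MB for the unit suffixes assumed to appear in the message.
_UNIT_TO_MB = {
    "B": 1 / (1024 * 1024),
    "KB": 1 / 1024,
    "MB": 1,
    "GB": 1024,
    "TB": 1024 * 1024,
}

# Matches the reported result size, e.g. "(1096.9 MB) is bigger than ...".
# The optional "i" also accepts suffixes such as "MiB" (an assumption).
_PATTERN = re.compile(
    r"\(([\d.]+)\s*([KMGT]?)i?B\)\s+is bigger than spark\.driver\.maxResultSize"
)


def suggest_max_result_size(message, headroom=1.5):
    """Hypothetical helper: return a size string such as "2g", or None."""
    match = _PATTERN.search(message)
    if match is None:
        return None
    value, prefix = float(match.group(1)), match.group(2)
    size_mb = value * _UNIT_TO_MB[prefix + "B"]
    # Add headroom, then round up to a whole number of gigabytes.
    return "{}g".format(max(1, math.ceil(size_mb * headroom / 1024)))


error = (
    "serialized results of 9 tasks (1096.9 MB) is bigger than "
    "spark.driver.maxResultSize (1024.0 MB)"
)
print(suggest_max_result_size(error))  # prints 2g
```

The returned string can be passed to SparkConf().set("spark.driver.maxResultSize", ...) as in the answer above. Keep in mind that the collected result still has to fit in the driver's memory, so raising this limit alone may not be enough for larger datasets.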
