PySpark: how to resolve the path of a resource file inside a dependency zip file

2024/10/11 15:21:57

I have a mapPartitions on an RDD, and within each partition a resource file has to be opened. The module containing the method invoked by mapPartitions, together with the resource file, is shipped to each executor as a zip file via the --py-files argument.

To make it clear:

rdd = rdd.mapPartitions(work_doing_method)

def work_doing_method(rows):
    for row in rows:
        resource_file_path = os.path.join(os.path.dirname(__file__), "resource.json")
        with open(resource_file_path) as f:
            resource = json.loads(f.read())
        ...

When I run this after passing the zip file (which includes all of the above) via the --py-files parameter to the spark-submit command,

I get IOError: [Errno 20] Not a directory: /full/path/to/the/file/within/zip/file

I do not understand how Spark uses the zip file to read the dependencies. The os.path.dirname utility returns a full path that includes the zip file itself, e.g. /spark/dir/my_dependency_file.zip/path/to/the/resource/file. I believe this is the problem. I have tried many combinations to resolve the path of the file. Any help is appreciated.
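For context, a path like the one above fails because open() cannot traverse into a zip archive. One workaround (a sketch only; the helper name load_resource and the deps.zip / pkg/resource.json layout are made up for illustration, not taken from the question) is to detect the ".zip" component in the path and read the member through the stdlib zipfile module:

```python
import json
import os
import tempfile
import zipfile


def load_resource(resource_path):
    """Load a JSON resource that may live inside a .zip on the path.

    When code is shipped with --py-files, __file__ on an executor points
    inside the archive (e.g. /spark/dir/deps.zip/pkg/resource.json), so a
    plain open() fails with ENOTDIR. Split the path at the ".zip"
    component and read the member through the zipfile module instead.
    """
    if ".zip" in resource_path:
        archive, _, member = resource_path.partition(".zip")
        archive += ".zip"
        # Zip member names always use forward slashes, regardless of OS.
        member = member.lstrip("/\\").replace(os.sep, "/")
        with zipfile.ZipFile(archive) as zf:
            with zf.open(member) as f:
                return json.load(f)
    with open(resource_path) as f:
        return json.load(f)


# Demo: build a zip that mimics the --py-files layout, then read from it.
tmp_dir = tempfile.mkdtemp()
zip_path = os.path.join(tmp_dir, "deps.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("pkg/resource.json", json.dumps({"key": "value"}))

path_inside_zip = os.path.join(zip_path, "pkg", "resource.json")
print(load_resource(path_inside_zip))  # {'key': 'value'}
```

This keeps the same os.path.dirname(__file__)-based call site working whether the module runs from a plain directory or from inside the shipped archive.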

Thanks!

Answer

I think when you add a file to a Spark job, it will be copied to the working directory of each executor. I've used the SparkFiles API to get absolute paths to files on the executors.

You can also use the --archives flag to pass in arbitrary data archives such as zip files. See: What's the difference between --archives, --files, py-files in pyspark job arguments
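As a sketch of what that looks like (the names app.py and deps.zip are placeholders, not from the question): spark-submit unpacks each archive into the executor working directory, and the optional alias after # names the directory it is unpacked into, so the resource can be opened with a plain relative path.

```shell
# deps.zip contains resource.json at its root; the "#deps" alias makes
# Spark unpack the archive into a directory named "deps" in each
# executor's working directory.
spark-submit \
  --archives deps.zip#deps \
  app.py

# Inside app.py, code running on an executor can then simply do:
#   with open("deps/resource.json") as f:
#       resource = json.load(f)
```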

