pyspark returns a no module named error for a custom module

2024/10/4 5:28:44

I would like to import a .py file that contains some modules. I have saved the files init.py and util_func.py under this folder:

/usr/local/lib/python3.4/site-packages/myutil

The util_func.py contains all the modules that i would like to use. I also need to create a pyspark udf so I can use it to transform my dataframe. My code looks like this:

import myutil
from myutil import util_func
myudf = pyspark.sql.functions.udf(util_func.ConvString, StringType())

somewhere down the code, I am using this to convert one of the columns in my dataframe:

df = df.withColumn("newcol", myudf(df["oldcol"]))

then I am trying to see if it converts it my using:

df.head()

It fails with an error "No module named myutil".

I am able to bring up the functions within ipython. Somehow the pyspark engined does not see the module. Any idea how to make sure that the pyspark engine picks up the module?

Answer

You must build a egg file of your package using setup tools and add the egg file to your application like below

sc.addFile('<path of the egg file>') 

here sc is the spark context variable.

https://en.xdnf.cn/q/70649.html

Related Q&A

Perform a conditional operation on a pandas column

I know that this should be simple, but I want to take a column from a pandas dataframe, and for only the entries which meet some condition (say less than 1), multiply by a scalar (say 2).For example, i…

How to programmatically get SVN revision number?

Like this question, but without the need to actually query the SVN server. This is a web-based project, so I figure Ill just use the repository as the public view (unless someone can advise me why this…

Convert fractional years to a real date in Python

How do I convert fractional years to a real date by using Python? E. g. I have an array [2012.343, 2012.444, 2012.509] containing fractional years and I would like to get "yyyy-mm-dd hh:mm".

Django template: Translate include with variable

I have a template in which you can pass a text variable. I want to include this template into another one but with a translated text as its variable. How can you achieve this?I would like something li…

Pandas - Creating a New Column

I have always made new columns in pandas using the following:df[new_column] = valueI am using this method, however, am receiving the warning for setting a copy.What is the way to make a new column with…

Adding an extra column to (big) SQLite database from Pandas dataframe

I feel like Im overlooking something really simple, but I cant make it work. Im using SQLite now, but a solution in SQLAlchemy would also be very helpful.Lets create our original dataset:### This is ju…

error inserting values to db with psycopg2 module [duplicate]

This question already has answers here:psycopg2: cant adapt type numpy.int64(4 answers)Inserting records into postgreSQL database in Python(3 answers)Closed 3 months ago.I am attempting to insert a dat…

NaN values in pivot_table index causes loss of data

Here is a simple DataFrame:> df = pd.DataFrame({a: [a1, a2, a3],b: [optional1, None, optional3],c: [c1, c2, c3],d: [1, 2, 3]}) > dfa b c d 0 a1 optional1 c1 1 1 a2 None c2…

ModuleNotFoundError in Docker

I have imported my entire project into docker, and I am getting a ModuleNotFoundErrorfrom one of the modules I have created.FROM python:3.8 WORKDIR /workspace/ COPY . /workspace/ RUN pip install pipenv…

Can I use md5 authentication with psycopg2?

After two hours of reading documentation, source code and help-threads, Im giving up. I cant get psycopg2 to authenticate with a md5-string. According to this thread I dont have to anything besides ena…