Where do you need to use lit() in Pyspark SQL?

2024/11/20 8:34:52

I'm trying to make sense of where you need to use a lit value, which is defined as a literal column in the documentation.

Take for example this udf, which returns the index of a SQL column array:

def find_index(column, index):return column[index]

If I were to pass an integer into this I would get an error. I would need to pass a lit(n) value into the udf to get the correct index of an array.

Is there a place I can better learn the hard and fast rules of when to use lit and possibly col as well?

Answer

To keep it simple you need a Column (can be a one created using lit but it is not the only option) when JVM counterpart expects a column and there is no internal conversion in a Python wrapper or you want to call a Column specific method.

In the first case the only strict rule is the on that applies to UDFs. UDF (Python or JVM) can be called only with arguments which are of Column type. It also typically applies to functions from pyspark.sql.functions. In other cases it is always best to check documentation and docs string firsts and if it is not sufficient docs of a corresponding Scala counterpart.

In the second case rules are simple. If you for example want to compare a column to a value then value has to be on the RHS:

col("foo") > 0  # OK

or value has to be wrapped with literal:

lit(0) < col("foo")  # OK

In Python many operators (<, ==, <=, &, |, + , -, *, /) can use non column object on the LHS:

0 < col("foo") 

but such applications are not supported in Scala.

It goes without saying that you have to use lit if you want to access any of the pyspark.sql.Column methods treating standard Python scalar as a constant column. For example you'll need

c = lit(1)

not

c = 1

to

c.between(0, 3)  # type: pyspark.sql.Column
https://en.xdnf.cn/q/26341.html

Related Q&A

Evaluate multiple scores on sklearn cross_val_score

Im trying to evaluate multiple machine learning algorithms with sklearn for a couple of metrics (accuracy, recall, precision and maybe more).For what I understood from the documentation here and from t…

Generate SQL statements from a Pandas Dataframe

I am loading data from various sources (csv, xls, json etc...) into Pandas dataframes and I would like to generate statements to create and fill a SQL database with this data. Does anyone know of a way…

How to translate a model label in Django Admin?

I could translate Django Admin except a model label because I dont know how to translate a model label in Django Admin. So, how can I translate a model label in Django Admin?

converty numpy array of arrays to 2d array

I have a pandas series features that has the following values (features.values)array([array([0, 0, 0, ..., 0, 0, 0]), array([0, 0, 0, ..., 0, 0, 0]),array([0, 0, 0, ..., 0, 0, 0]), ...,array([0, 0, 0, …

profiling a method of a class in Python using cProfile?

Id like to profile a method of a function in Python, using cProfile. I tried the following:import cProfile as profile# Inside the class method... profile.run("self.myMethod()", "output_f…

Installing h5py on an Ubuntu server

I was installing h5py on an Ubuntu server. However it seems to return an error that h5py.h is not found. It gives the same error message when I install it using pip or the setup.py file. What am I miss…

NLTK Named Entity Recognition with Custom Data

Im trying to extract named entities from my text using NLTK. I find that NLTK NER is not very accurate for my purpose and I want to add some more tags of my own as well. Ive been trying to find a way t…

How do I write to the console in Google App Engine?

Often when I am coding I just like to print little things (mostly the current value of variables) out to console. I dont see anything like this for Google App Engine, although I note that the Google Ap…

Does Google App Engine support Python 3?

I started learning Python 3.4 and would like to start using libraries as well as Google App Engine, but the majority of Python libraries only support Python 2.7 and the same with Google App Engine.Shou…

how to subquery in queryset in django?

how can i have a subquery in djangos queryset? for example if i have:select name, age from person, employee where person.id = employee.id and employee.id in (select id from employee where employee.com…