Escape New line character in Spark CSV read

2024/9/30 17:29:15

I'm working with Spark 2.2.1. With the Python code below I am able to escape special characters such as @. I also want to escape the newline (\n) and carriage return (\r) characters. I replaced the @ with \n, but it did not work. Any suggestions, please?

Working:

spark_df = spark.read.csv("file.csv", mode="DROPMALFORMED", inferSchema=True, header=True, escape="@")

Not Working:

spark_df = spark.read.csv("file.csv", mode="DROPMALFORMED", inferSchema=True, header=True, escape="\n")

Answer

If your goal is to read a CSV file whose text fields contain newlines, then the way to go is Spark's multiLine option.
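
To make the problem concrete: a quoted CSV field may contain line breaks, so one logical record can span several physical lines. The sketch below uses only Python's standard csv module, not Spark, and the sample data is made up for illustration. It shows a quote-aware parser recovering two data records from text that spans five physical lines.

```python
import csv
import io

# Made-up sample: the "comment" field of the first record contains two newlines.
raw = 'id,comment\n1,"first line\nsecond line\nthird line"\n2,"single line"\n'

# newline="" leaves line breaks untouched so the csv module can handle
# the ones that sit inside quoted fields.
rows = list(csv.reader(io.StringIO(raw, newline="")))

physical_lines = raw.count("\n")   # header line included
logical_records = len(rows) - 1    # header row excluded

print(physical_lines, logical_records)
print(repr(rows[1][1]))
```

A reader that splits the input on line breaks before parsing, which is how Spark's CSV source behaves by default, would see broken records here instead. The escape option does not change how lines are split, which is presumably why escape="\n" has no effect.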

I recently posted some Scala code for this:

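val df = spark.read
  .option("wholeFile", true)  // earlier name for this option; multiLine is the one documented for Spark 2.2+
  .option("multiline", true)  // let a quoted field span several lines
  .option("header", true)     // first line holds the column names
  .option("inferSchema", "true")
  .option("dateFormat", "yyyy-MM-dd")
  .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")
  .csv("test.csv")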

The Python syntax is slightly different but should work just as well.
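
A rough PySpark equivalent might look like the sketch below. It assumes an existing SparkSession named spark and a Spark version whose CSV reader accepts the multiLine keyword (2.2 or later). The names MULTILINE_CSV_OPTIONS and read_multiline_csv are made up for this illustration, and the file name is taken from the Scala snippet above.

```python
# Keyword arguments mirroring the Scala options above.
MULTILINE_CSV_OPTIONS = {
    "multiLine": True,       # let a quoted field span several lines
    "header": True,          # first line holds the column names
    "inferSchema": True,
    "dateFormat": "yyyy-MM-dd",
    "timestampFormat": "yyyy-MM-dd HH:mm:ss",
}


def read_multiline_csv(spark, path):
    """Read a CSV whose quoted fields may contain newlines.

    `spark` is assumed to be an existing SparkSession; `path` is the CSV location.
    """
    return spark.read.csv(path, **MULTILINE_CSV_OPTIONS)
```

It would be called as spark_df = read_multiline_csv(spark, "test.csv"). Note that this only helps when the fields containing newlines are enclosed in quotes; an unquoted line break is still treated as the end of a record.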
