Datatype for handling big numbers in PySpark

I am using Spark with Python. After uploading a CSV file, I needed to parse a column that contains numbers 22 digits long. To parse that column I used LongType(), and I used the map() function to define the columns. The following are my commands in pyspark.

>>> test=sc.textFile("test.csv")
>>> header=test.first()
>>> schemaString = header.replace('"','')
>>> testfields = [StructField(field_name, StringType(), True) for field_name in schemaString.split(',')]
>>> testfields[5].dataType = LongType()
>>> testschema = StructType(testfields)
>>> testHeader = test.filter(lambda l: "test_date" in l)
>>> testNoHeader = test.subtract(testHeader)
>>> test_temp = testNoHeader.map(lambda k: k.split(",")).map(lambda p: (p[0], p[1], p[2], p[3], p[4], float(p[5].strip('"')), p[6], p[7]))
>>> test_temp.top(2)

Note: I have also tried 'long' and 'bigint' in place of 'float' in my test_temp expression, but Spark reported a 'keyword not found' error. The following is the output:

[('2012-03-14', '7', '1698.00', 'XYZ02abc008793060653', 'II93', 8.27370028700801e+21, 'W0W0000000000007', '879870080088815007'), ('2002-03-14', '1', '999.00', 'ABC02E000050086941', 'II93', 8.37670028702205e+21, 'A0B0080000012523', '870870080000012421')]

The actual values in my CSV file are: 8.27370028700801e+21 should be 8273700287008010012345, and 8.37670028702205e+21 should be 8376700287022050054321.
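A quick check in plain Python (outside Spark) shows the precision is already lost at the float() conversion itself, since a double holds only about 15-16 significant digits:

>>> float("8273700287008010012345")
8.27370028700801e+21
>>> int(float("8273700287008010012345")) == 8273700287008010012345
False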

When I create a data frame out of it and then query it,

>>> test_df = sqlContext.createDataFrame(test_temp, testschema)
>>> test_df.registerTempTable("test")
>>> sqlContext.sql("SELECT test_column FROM test").show()

the test_column gives value 'null' for all the records.

So, how can I solve this problem of parsing big numbers in Spark? I would really appreciate your help.

Answer

Well, types matter. Since you convert your data to float you cannot use LongType in the DataFrame. It doesn't blow up only because PySpark is relatively forgiving when it comes to types.

Also, 8273700287008010012345 is too large to be represented as LongType, which can only represent values between -9223372036854775808 and 9223372036854775807.
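You can verify that directly in plain Python, where integers have arbitrary precision:

>>> 8273700287008010012345 > 2**63 - 1  # LongType's maximum value
True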

If you want to convert your data to a DataFrame you'll have to use DoubleType:

from pyspark.sql.types import *

rdd = sc.parallelize([(8.27370028700801e+21, )])
schema = StructType([StructField("x", DoubleType(), False)])
rdd.toDF(schema).show()

## +-------------------+
## |                  x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+

Typically it is a better idea to handle this with DataFrames directly:

from pyspark.sql.functions import col

str_df = sc.parallelize([("8273700287008010012345", )]).toDF(["x"])
str_df.select(col("x").cast("double")).show()

## +-------------------+
## |                  x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+
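For comparison, casting the same string to bigint overflows the 64-bit range, and under Spark's default (non-strict) cast semantics an overflowing cast yields null instead of raising an error. A minimal sketch, reusing str_df from above:

str_df.select(col("x").cast("bigint")).show()

## +----+
## |   x|
## +----+
## |null|
## +----+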

If you don't want to use Double you can cast to Decimal with specified precision:

str_df.select(col("x").cast(DecimalType(38))).show(1, False)

## +----------------------+
## |x                     |
## +----------------------+
## |8273700287008010012345|
## +----------------------+
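DecimalType(38) means precision 38 with scale 0, the maximum precision Spark supports, which comfortably covers a 22-digit integer. Applied to your original pipeline, this amounts to declaring the column as DecimalType in the schema and parsing the field with Python's decimal.Decimal instead of float. A sketch reusing your variable names, with the rest of the pipeline unchanged:

from decimal import Decimal
from pyspark.sql.types import StructField, StructType, DecimalType

# Declare the sixth column as an exact decimal instead of LongType
testfields[5] = StructField(testfields[5].name, DecimalType(38, 0), True)
testschema = StructType(testfields)

# Parse the quoted field with Decimal so there is no lossy float round-trip
test_temp = testNoHeader.map(lambda k: k.split(",")).map(
    lambda p: (p[0], p[1], p[2], p[3], p[4],
               Decimal(p[5].strip('"')),
               p[6], p[7]))

test_df = sqlContext.createDataFrame(test_temp, testschema)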