Datatype for handling big numbers in PySpark

I am using Spark with Python. After uploading a CSV file, I needed to parse a column that contains numbers 22 digits long. To parse that column I used LongType(), and I used the map() function to define the columns. The following are my commands in pyspark.

>>> test=sc.textFile("test.csv")
>>> header=test.first()
>>> schemaString = header.replace('"','')
>>> testfields = [StructField(field_name, StringType(), True) for field_name in schemaString.split(',')]
>>> testfields[5].dataType = LongType()
>>> testschema = StructType(testfields)
>>> testHeader = test.filter(lambda l: "test_date" in l)
>>> testNoHeader = test.subtract(testHeader)
>>> test_temp = testNoHeader.map(lambda k: k.split(",")).map(lambda p: (p[0], p[1], p[2], p[3], p[4], float(p[5].strip('"')), p[6], p[7]))
>>> test_temp.top(2)

Note: I have also tried 'long' and 'bigint' in place of 'float' in my test_temp expression, but Spark reported a 'keyword not found' error. The following is the output:

[('2012-03-14', '7', '1698.00', 'XYZ02abc008793060653', 'II93', 8.27370028700801e+21, 'W0W0000000000007', '879870080088815007'), ('2002-03-14', '1', '999.00', 'ABC02E000050086941', 'II93', 8.37670028702205e+21, 'A0B0080000012523', '870870080000012421')]

The actual values in my CSV file are: 8.27370028700801e+21 should be 8273700287008010012345, and 8.37670028702205e+21 should be 8376700287022050054321.
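A quick check in plain Python (outside Spark) shows the precision is already lost at the float() conversion itself, since a double holds only about 15-16 significant digits:

>>> float("8273700287008010012345")
8.27370028700801e+21
>>> int(float("8273700287008010012345")) == 8273700287008010012345
False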

When I create a data frame out of it and then query it,

>>> test_df = sqlContext.createDataFrame(test_temp, testschema)
>>> test_df.registerTempTable("test")
>>> sqlContext.sql("SELECT test_column FROM test").show()

the test_column gives value 'null' for all the records.

So, how can I solve this problem of parsing big numbers in Spark? I would really appreciate your help.

Answer

Well, types matter. Since you convert your data to float you cannot use LongType in the DataFrame. It doesn't blow up only because PySpark is relatively forgiving when it comes to types.

Also, 8273700287008010012345 is too large to be represented as LongType, which can only represent values between -9223372036854775808 and 9223372036854775807.
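You can verify that directly in plain Python, where integers have arbitrary precision:

>>> 8273700287008010012345 > 2**63 - 1  # LongType's maximum value
True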

If you want to convert your data to a DataFrame you'll have to use DoubleType:

from pyspark.sql.types import *

rdd = sc.parallelize([(8.27370028700801e+21, )])
schema = StructType([StructField("x", DoubleType(), False)])
rdd.toDF(schema).show()

## +-------------------+
## |                  x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+

Typically it is a better idea to handle this with DataFrames directly:

from pyspark.sql.functions import col

str_df = sc.parallelize([("8273700287008010012345", )]).toDF(["x"])
str_df.select(col("x").cast("double")).show()

## +-------------------+
## |                  x|
## +-------------------+
## |8.27370028700801E21|
## +-------------------+
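For comparison, casting the same string to bigint overflows the 64-bit range, and under Spark's default (non-strict) cast semantics an overflowing cast yields null instead of raising an error. A minimal sketch, reusing str_df from above:

str_df.select(col("x").cast("bigint")).show()

## +----+
## |   x|
## +----+
## |null|
## +----+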

If you don't want to use Double you can cast to Decimal with specified precision:

str_df.select(col("x").cast(DecimalType(38))).show(1, False)

## +----------------------+
## |x                     |
## +----------------------+
## |8273700287008010012345|
## +----------------------+
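DecimalType(38) means precision 38 with scale 0, the maximum precision Spark supports, which comfortably covers a 22-digit integer. Applied to your original pipeline, this amounts to declaring the column as DecimalType in the schema and parsing the field with Python's decimal.Decimal instead of float. A sketch reusing your variable names, with the rest of the pipeline unchanged:

from decimal import Decimal
from pyspark.sql.types import StructField, StructType, DecimalType

# Declare the sixth column as an exact decimal instead of LongType
testfields[5] = StructField(testfields[5].name, DecimalType(38, 0), True)
testschema = StructType(testfields)

# Parse the quoted field with Decimal so there is no lossy float round-trip
test_temp = testNoHeader.map(lambda k: k.split(",")).map(
    lambda p: (p[0], p[1], p[2], p[3], p[4],
               Decimal(p[5].strip('"')),
               p[6], p[7]))

test_df = sqlContext.createDataFrame(test_temp, testschema)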