Function should clean data to half the size, instead it enlarges it by an order of magnitude

2024/7/7 6:36:43

This has been driving me nuts all week weekend. I am trying merge data for different assets around a common timestamp. Each asset's data is a value in dictionary. The data of interest is stored in lists in one column, so this needs to be separated first. Here is a sample of the unprocessed df:

data['Dogecoin'].head()market_cap_by_available_supply  price_btc   price_usd   volume_usd
0   [1387118554000, 3488670]    [1387118554000, 6.58771e-07]    [1387118554000, 0.000558776]    [1387118554000, 0.0]
1   [1387243928000, 1619159]    [1387243928000, 3.18752e-07]    [1387243928000, 0.000218176]    [1387243928000, 0.0]
2   [1387336027000, 2191987]    [1387336027000, 4.10802e-07]    [1387336027000, 0.000267749]    [1387336027000, 0.0]

Then I apply this function to separate market_cap_by_available_supply, saving it's components into a new dataframe for that asset:

data2 = {}
#sorting function 
for coin in data:#seperates timestamp and marketcap from their respective list inside each elementTS = data[coin].market_cap_by_available_supply.map(lambda r: r[0])cap = data[coin].market_cap_by_available_supply.map(lambda r: r[1])#Creates DataFrame and stores timestamp and marketcap data into dictionairydf = DataFrame(columns=['timestamp','cap'])df.timestamp = TSdf.cap = capdf.columns = ['timestamp',str(coin)+'_cap']data2[coin] = df#converts timestamp into datetime 'yyy-mm-dd'data2[coin]['timestamp'] = pd.to_datetime(data2[coin]['timestamp'], unit='ms').dt.date

It seemed to work perfectly, producing the correct data, sample:

data2['Namecoin'].head()
timestamp   Namecoin_cap
0   2013-04-28  5969081
1   2013-04-29  7006114
2   2013-04-30  7049003
3   2013-05-01  6366350
4   2013-05-02  5848626 

However when I attempted to merge all dataframes, I got a memory error, I've spent hours trying to figure out the root and it seems like the 'sorting function' above is increasing the size of the dataframe from 12Mb to 131Mb! It should do this opposite. Any ideas ?

On a side note here is the data https://www.mediafire.com/?9pcwroe1x35nnwl

I open it with this pickle funtion

with open("CMC_no_tuple_data.pickle", "rb") as myFile:data = pickle.load(myFile)

EDIT: Sorry for the typo in the pickle file name. @Goyo to compute the size i simple saved data and data 2 via pickle.dump and looked at their respective sizes @ Padraic Cunningham you used the sort funtion i provided and it produced a smaller file ?? This is not my case and i get a memory error when trying to merge the dataframes

Answer

When you merge your dataframes, you are doing a join on values that are not unique. When you are joining all these dataframes together, you are getting many matches. As you add more and more currencies you are getting something similar to a Cartesian product rather than a join. In the snippet below, I added code to sort the values and then remove duplicates.

from pandas import Series, DataFrame
import pandas as pdcoins='''
Bitcoin
Ripple
Ethereum
Litecoin
Dogecoin
Dash
Peercoin
MaidSafeCoin
Stellar
Factom
Nxt
BitShares
'''
coins = coins.split('\n')
API = 'https://api.coinmarketcap.com/v1/datapoints/'
data = {}for coin in coins:print(coin)try:data[coin]=(pd.read_json(API + coin))except: pass
data2 = {}
for coin in data:TS = data[coin].market_cap_by_available_supply.map(lambda r: r[0])TS = pd.to_datetime(TS,unit='ms').dt.datecap = data[coin].market_cap_by_available_supply.map(lambda r: r[1])df = DataFrame(columns=['timestamp','cap'])df.timestamp = TSdf.cap = capdf.columns = ['timestamp',coin+'_cap']df.sort_values(by=['timestamp',coin+'_cap'])df= df.drop_duplicates(subset='timestamp',keep='last')data2[coin] = dfdf = data2['Bitcoin']
keys = data2.keys()
keys.remove('Bitcoin')
for coin in keys:df = pd.merge(left=df,right=data2[coin],left_on='timestamp', right_on='timestamp', how='left')print len(df),len(df.columns)
df.to_csv('caps.csv')

EDIT:I have added a table belowing showing how the size of the table grows as you do your join operation.

This table shows the number of rows after joining 5,10,15,20,25, and 30 currencies.

Rows,Columns
1015 5
1255 10
5095 15
132071 20
4195303 25
16778215 30

This table shows how removing duplicates makes your joins only match a single row.

Rows,Columns
1000 5
1000 10
1000 15
1000 20
1000 25
1000 30
https://en.xdnf.cn/q/120546.html

Related Q&A

is it possible to add colors to python output? [duplicate]

This question already has answers here:How do I print colored text to the terminal?(66 answers)Closed 10 years ago.so i made a small password strength tester for me, my friends and my family, as seen …

Tkinter Function attached to Button executed immediately [duplicate]

This question already has answers here:How can I pass arguments to Tkinter buttons callback command?(2 answers)Closed 8 years ago.What I need is to attach a function to a button that is called with a …

Error!!! cant concatenate the tuple to non float

stack = []closed = []currNode = problem.getStartState()stack.append(currNode)while (len(stack) != 0):node = stack.pop()if problem.isGoalState(node):print "true"closed.append(node)else:child =…

Using R to fit data from a csv with a gamma function?

Using the following data: time_stamp,secs,count 2013-04-30 23:58:55,1367366335,32 2013-04-30 23:58:56,1367366336,281 2013-04-30 23:58:57,1367366337,664 2013-04-30 23:58:58,1367366338,1255 2013-04-30…

regex multiple string match in a python list

I have a list of strings like this:["ra", "dec", "ra-error", "dec-error", "glat", "glon", "flux", "l", "b"]I ne…

python IndentationError: expected an indented block [closed]

Closed. This question is not reproducible or was caused by typos. It is not currently accepting answers.This question was caused by a typo or a problem that can no longer be reproduced. While similar q…

how to tell an infinite loop to end once one number repeats twice in a row (in python 3.4)

The title says it all. I have an infinite loop of randomly generated numbers from one to six that I need to end when 6 occurs twice in a row.

Is it possible to customize the random function to avoid too much repetition of words? [duplicate]

This question already exists:Customizing the random function without using append, or list, or other container?Closed last year.Theoretically, when randomization is used, the same word may be printed …

country convert to continent

def country_to_continent(country_name):country_alpha2 = pc.country_name_to_country_alpha2(country_name)country_continent_code = pc.country_alpha2_to_continent_code(country_alpha2)country_continent_name…

Iterate through a list and delete certain elements

Im working on an assignment in my computer class, and Im having a hard time with one section of the code. I will post the assignment guidelines (the bolded part is the code Im having issues with):You a…