I have dataframe eg. like below
Event['EVENT_ID'] = [ 4162, 4161, 4160, 4159,4158, 4157, 4156, 4155, 4154]
need to convert each row word to binary.
Event['b']=bin(Event['EVENT_ID']) doesn't work
TypeError: cannot convert the series to <class 'int'>
expected new column with binary, remove 0b and split the column to 16 separate column
bin(4162) = '0b1000001000010'
You can also work with numpy, much faster than pandas.
Edit: faster numpy using view
Couple of tricks here:
- Work only with the column of interest
- Convert the underlaying array to
uint16
, to ensure compatibility with any integer input
- Swapbytes to have a proper H,L order (at least on my architecture)
- Split H,L without actually moving any data with
view
- Run
unpackbits
and reshape accordingly
My machine requires a byteswap
to have the bytes of the uint16 in the proper place. Note that this aproach requires to have the data as int16
/uint16
, while the other one would work for int64 as well.
import pandas as pd
import numpy as npdf = pd.DataFrame({'EVENT_ID': [ 4162, 4161, 4160, 4159,4158, 4157, 4156, 4155, 4154]}, dtype='uint16')zz=np.unpackbits(df.EVENT_ID.values.astype('uint16').byteswap().view('uint8')).reshape(-1,16)
df3 = pd.concat([df,pd.DataFrame(zz)],axis=1)print(f"{df3 =}")df3 = EVENT_ID 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 4162 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0
1 4161 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1
2 4160 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0
3 4159 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 1
4 4158 0 0 0 1 0 0 0 0 0 0 1 1 1 1 1 0
5 4157 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 1
6 4156 0 0 0 1 0 0 0 0 0 0 1 1 1 1 0 0
7 4155 0 0 0 1 0 0 0 0 0 0 1 1 1 0 1 1
8 4154 0 0 0 1 0 0 0 0 0 0 1 1 1 0 1 0
older proposed method:
lh = np.unpackbits((df.values & 0xFF).astype('uint8')).reshape(-1,8)
uh = np.unpackbits((df.values >> 8).astype('uint8')).reshape(-1,8)df2 = pd.concat([df, pd.DataFrame(np.concatenate([uh,lh],axis=1),index=df.index)],axis=1)
Benchmark: numpy is orders of magnitude faster than pandas"
For 1M points:
- numpy view: 35ms for 1million
uint64
points
- numpy low/high: 50ms
- pandas list bin: 1.78s
- pandas apply format + list: 1.97s
- pandas apply lambda: 6.08s
df = pd.DataFrame({'EVENT_ID': (np.random.random(int(1e6))*65000).astype('uint16')})
pandas apply format list
In [13]: %timeit df2 = df.join(pd.DataFrame(df['EVENT_ID'].apply('{0:b}'.format).apply(list).tolist()))
1.97 s ± 42.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
pandas list bin
In [10]: %%timeit...: binary_values = pd.DataFrame([list(bin(x)[2:]) for x in df['EVENT_ID']])...: df2 = df.join(binary_values)...:...:
1.78 s ± 53.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
pandas3 apply lambda
In [5]: %%timeit...: for i in range(16):...: df[f"bit{i}"] = df["EVENT_ID"].apply(lambda x: x & 1 << i).astype(bool).astype(int)...:
6.08 s ± 65.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
numpy
In [14]: %%timeit...: lh = np.unpackbits((df.values & 0xFF).astype('uint8')).reshape(-1,8)...: uh = np.unpackbits((df.values >> 8).astype('uint8')).reshape(-1,8)...: df3=pd.concat([df, pd.DataFrame(np.concatenate([uh,lh],axis=1),index=df.index)],axis=1)...:...:
49.9 ms ± 232 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)