I have 2 columns from 2 different dataframes. I want to check if column 1 is a subset of column 2.
I was using the following code:
set(col1).issubset(set(col2))
The issue with this is that if col1 has only integers and col2 has both integers and strings, then this returns false. This happens because elements of col2 are coerced into strings. For example,
set([376, 264, 365, 302]) &
set(['302', 'water', 'nist1950', '264', '365', '376'])
I tried using isin
from pandas. But if col1 and col2 are series then this gives a series of Boolean values. I want True or False
.
How do I solve this? Is there a simpler function that I have missed?
Edit 1
Adding an example.
col1
0 365
1 376
2 302
3 264
Name: subject, dtype: int64col2
0 nist1950
1 nist1950
2 water
3 water
4 376
5 376
6 302
7 302
8 365
9 365
10 264
11 264
12 376
13 376
Name: subject, dtype: object
Edit 2
col1 and col2 can have integers, strings, floats etc. I would like to not make any prejudgement about what is in these columns.
You could use isin
with all
to check whether all of your col1
elements contains in col2
. For converting to numeric you could use pd.to_numeric
:
s1 = pd.Series([376, 264, 365, 302])
s2 = pd.Series(['302', 'water', 'nist1950', '264', '365', '376'])res = s1.isin(pd.to_numeric(s2, errors='coerce')).all()In [213]: res
Out[213]: True
More detailed:
In [214]: pd.to_numeric(s2, errors='coerce')
Out[214]:
0 302
1 NaN
2 NaN
3 264
4 365
5 376
dtype: float64In [215]: s1.isin(pd.to_numeric(s2, errors='coerce'))
Out[215]:
0 True
1 True
2 True
3 True
dtype: bool
Note pd.to_numeric
works with pandas version >=0.17.0
for previous you cound use convert_objects
with convert_numeric=True
EDIT
If you prefer solution with set
you could convert your first set to str
as well and then compare them with your code:
s3 = set(map(str, s1))In [234]: s3
Out[234]: {'264', '302', '365', '376'}
Then you could use issubset
for s2
:
In [235]: s3.issubset(s2)
Out[235]: True
or for set(s2)
:
In [236]: s3.issubset(set(s2))
Out[236]: True
EDIT2
s1 = pd.Series(['376', '264', '365', '302'])
s4 = pd.Series(['nist1950', 'nist1950', 'water', 'water', '376', '376', '302', '302', '365', '365', '264', '264', '376', '376'])In [263]: s1.astype(float).isin(pd.to_numeric(s4, errors='coerce')).all()
Out[263]: True