pandas - Merging on string columns not working (bug?)

2024/11/19 11:31:56

I'm trying to do a simple merge between two dataframes. These come from two different SQL tables, where the joining keys are strings:

>>> df1.col1.dtype
dtype('O')
>>> df2.col2.dtype
dtype('O')

I try to merge them using this:

>>> merge_res = pd.merge(df1, df2, left_on='col1', right_on='col2')

The result of the inner join is empty, which first prompted me that there might not be any entries in the intersection:

>>> merge_res.shape
(0, 19)

But when I try to match a single element, I see this really odd behavior.

# Pick random element in second dataframe
>>> df2.iloc[5,:].col2
'95498208100000'# Manually look for it in the first dataframe
>>> df1[df1.col1 == '95498208100000']
0 rows × 19 columns
# Empty, which makes sense given the above merge result# Now look for the same value as an integer
>>> df1[df1.col1 == 95498208100000]
1 rows × 19 columns
# FINDS THE ELEMENT!?!

So, the columns are defined with the 'object' dtype. Searching for them as strings don't yield any results. Searching for them as integers does return a result, and I think this is the reason why the merge doesn't work above..

Any ideas what's going on?

It's almost as thought Pandas converts df1.col1 to an integer just because it can, even though it should be treated as a string while matching.

(I tried to replicate this using sample dataframes, but for small examples, I don't see this behavior. Any suggestions on how I can find a more descriptive example would be appreciated as well.)

Answer

The issue was that the object dtype is misleading. I thought it mean that all items were strings. But apparently, while reading the file pandas was converting some elements to ints, and leaving the remainders as strings.

The solution was to make sure that every field is a string:

>>> df1.col1 = df1.col1.astype(str)
>>> df2.col2 = df2.col2.astype(str)

Then the merge works as expected.

(I wish there was a way of specifying a dtype of str...)

https://en.xdnf.cn/q/26444.html

Related Q&A

Making a chart bigger in size

Im trying to get a bigger chart. However, the figure method from matplotlib does not seem to be working properly. I get a message, which is not an error: <matplotlib.figure.Figure at 0xa25f7f0>My…

index of non NaN values in Pandas

From Pandas data frame, how to get index of non "NaN" values?My data frame isA b c 0 1 q1 1 1 2 NaN 3 2 3 q2 3 3 4 q1 NaN 4 5 q2 7And I want the…

How can I type-check variables in Python? [duplicate]

This question already has answers here:Whats the canonical way to check for type in Python?(16 answers)Closed 3 months ago.I have a Python function that takes a numeric argument that must be an intege…

Py_INCREF/DECREF: When

Is one correct in stating the following:If a Python object is created in a C function, but the function doesnt return it, no INCREF is needed, but a DECREF is. [false]If the function does return it, yo…

pop/remove items out of a python tuple

I am not sure if I can make myself clear but will try.I have a tuple in python which I go through as follows (see code below). While going through it, I maintain a counter (lets call it n) and pop item…

Difference between frompyfunc and vectorize in numpy

What is the difference between vectorize and frompyfunc in numpy?Both seem very similar. What is a typical use case for each of them?Edit: As JoshAdel indicates, the class vectorize seems to be built…

Jupyter notebook command does not work on Mac

I installed jupyter using pip on my macbook air. Upon trying to execute the command jupyter notebook, I get an error jupyter: notebook is not a Jupyter commandI used the --h option to get a listing of …

Recursively compare two directories to ensure they have the same files and subdirectories

From what I observe filecmp.dircmp is recursive, but inadequate for my needs, at least in py2. I want to compare two directories and all their contained files. Does this exist, or do I need to build …

Specific reasons to favor pip vs. conda when installing Python packages

I use miniconda as my default python installation. What is the current (2019) wisdom regarding when to install something with conda vs. pip?My usual behavior is to install everything with pip, and onl…

Insert a link inside a Pandas table

Id like to insert a link (to a web page) inside a Pandas table, so when it is displayed in an IPython notebook, I could press the link. I tried the following: In [1]: import pandas as pdIn [2]: df = pd…