Pandas populate new dataframe column based on matching columns in another dataframe

2024/11/19 6:23:40

I have a DataFrame df that holds my main data: one million rows and 30 columns. I want to add another column to df called category. The category is a column in df2, which has around 700 rows and two other columns that match two columns in df.

I begin by setting an index on df2 and df that matches between the frames; however, some of the index values in df2 don't exist in df.

The remaining columns in df2 are called AUTHOR_NAME and CATEGORY.

The relevant column in df is called AUTHOR_NAME.

Some of the AUTHOR_NAME values in df don't exist in df2, and vice versa.

The rule I want is: when the index in df matches the index in df2 and the title in df matches the title in df2, add the category to df; otherwise put NaN in category.

Example data:

df2

            AUTHOR_NAME             CATEGORY
Index
Pub1        author1                 main
Pub2        author1                 main
Pub3        author1                 main
Pub1        author2                 sub
Pub3        author2                 sub
Pub2        author4                 sub

df

            AUTHOR_NAME     ...n amount of other columns
Index
Pub1        author1
Pub2        author1
Pub1        author2
Pub1        author3
Pub2        author4

expected_result

            AUTHOR_NAME             CATEGORY   ...n amount of other columns
Index
Pub1        author1                 main
Pub2        author1                 main
Pub1        author2                 sub
Pub1        author3                 NaN
Pub2        author4                 sub

If I use df2.merge(df, left_index=True, right_index=True, how='left', on=['AUTHOR_NAME']), my df becomes three times bigger than it is supposed to be.
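
A likely cause of the growth, assuming the index values repeat the way they do in the example data: when the join key is only the index and it is not unique on either side, every row for a given publication in one frame pairs with every row for that publication in the other. The sketch below rebuilds the example frames from the tables above (the name `by_index_only` is only illustrative) and shows the row count growing.

```python
import pandas as pd

# Example frames from the question; "Index" is the (non-unique) index.
df2 = pd.DataFrame(
    {
        "AUTHOR_NAME": ["author1", "author1", "author1", "author2", "author2", "author4"],
        "CATEGORY": ["main", "main", "main", "sub", "sub", "sub"],
    },
    index=pd.Index(["Pub1", "Pub2", "Pub3", "Pub1", "Pub3", "Pub2"], name="Index"),
)
df = pd.DataFrame(
    {"AUTHOR_NAME": ["author1", "author1", "author2", "author3", "author4"]},
    index=pd.Index(["Pub1", "Pub2", "Pub1", "Pub1", "Pub2"], name="Index"),
)

# Joining on the index alone ignores AUTHOR_NAME as a key, so each Pub1 row
# in df pairs with both Pub1 rows in df2, and likewise for Pub2.
by_index_only = df.merge(df2, left_index=True, right_index=True, how="left")

print(len(df))             # 5 rows going in
print(len(by_index_only))  # 10 rows coming out (3*2 for Pub1 + 2*2 for Pub2)
```

Both key parts (the index and AUTHOR_NAME) have to take part in the match for each row of df to find at most one row of df2.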

So I thought maybe merging was the wrong way to go about this. What I am really trying to do is use df2 as a lookup table and then return type values to df depending on if certain conditions are met.

def calculate_category(df2, d):
    category_row = df2[(df2["Index"] == d["Index"]) & (df2["AUTHOR_NAME"] == d["AUTHOR_NAME"])]
    return str(category_row['CATEGORY'].iat[0])

df.apply(lambda d: calculate_category(df2, d), axis=1)

However, this raises an error:

IndexError: ('index out of bounds', u'occurred at index 7614')
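
The IndexError most likely comes from `.iat[0]` being called on an empty selection: for a row of df with no counterpart in df2, the boolean mask selects nothing, so there is no position 0 to read. Below is a minimal sketch of a guarded lookup. It assumes Index is an ordinary column in both frames (as the posted function implies) and uses the example data from above.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({
    "Index": ["Pub1", "Pub2", "Pub3", "Pub1", "Pub3", "Pub2"],
    "AUTHOR_NAME": ["author1", "author1", "author1", "author2", "author2", "author4"],
    "CATEGORY": ["main", "main", "main", "sub", "sub", "sub"],
})
df = pd.DataFrame({
    "Index": ["Pub1", "Pub2", "Pub1", "Pub1", "Pub2"],
    "AUTHOR_NAME": ["author1", "author1", "author2", "author3", "author4"],
})


def calculate_category(lookup, row):
    """Return the CATEGORY for this row's (Index, AUTHOR_NAME), or NaN if absent."""
    match = lookup[
        (lookup["Index"] == row["Index"])
        & (lookup["AUTHOR_NAME"] == row["AUTHOR_NAME"])
    ]
    if match.empty:
        # No matching row in the lookup table: .iat[0] would raise IndexError here.
        return np.nan
    return match["CATEGORY"].iat[0]


df["CATEGORY"] = df.apply(lambda row: calculate_category(df2, row), axis=1)
print(df)
```

This filters df2 once per row of df, so on a million rows it will be slow; it is shown only to explain the error.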

Answer

Consider the following dataframes df and df2:

import pandas as pd

df = pd.DataFrame(dict(
    AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
    title=list('zyxwvutsrqponml')))

df2 = pd.DataFrame(dict(
    AUTHOR_NAME=list('AABCCEGG'),
    title=list('zwvtrpml'),
    CATEGORY=list('11223344')))

option 1
merge

df.merge(df2, how='left')
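
With no `on` argument, merge joins on every column name the two frames share, which here is AUTHOR_NAME and title. For the frames in the question, a sketch could name the keys explicitly and ask pandas to check that the lookup side has no duplicate keys. It assumes the index is named Index and can be moved into a column with reset_index. With `validate="many_to_one"`, merge raises a MergeError if df2 has duplicate keys, instead of silently multiplying rows.

```python
import pandas as pd

# Example frames from the question, with "Index" as the index.
df2 = pd.DataFrame(
    {
        "AUTHOR_NAME": ["author1", "author1", "author1", "author2", "author2", "author4"],
        "CATEGORY": ["main", "main", "main", "sub", "sub", "sub"],
    },
    index=pd.Index(["Pub1", "Pub2", "Pub3", "Pub1", "Pub3", "Pub2"], name="Index"),
)
df = pd.DataFrame(
    {"AUTHOR_NAME": ["author1", "author1", "author2", "author3", "author4"]},
    index=pd.Index(["Pub1", "Pub2", "Pub1", "Pub1", "Pub2"], name="Index"),
)

keys = ["Index", "AUTHOR_NAME"]
result = (
    df.reset_index()                       # turn the index into a key column
      .merge(
          df2.reset_index(),
          on=keys,                         # match on both parts of the key
          how="left",                      # keep every row of df, NaN when no match
          validate="many_to_one",          # fail loudly if df2 has duplicate keys
      )
      .set_index("Index")                  # restore the original index
)
print(result)
```

The row count stays at five, and author3 on Pub1 gets NaN because that pair does not occur in df2.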

option 2
join

cols = ['AUTHOR_NAME', 'title']
df.join(df2.set_index(cols), on=cols)

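Both options yield the same result: df with a CATEGORY column added, filled from df2 where both AUTHOR_NAME and title match and NaN everywhere else.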
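
For reference, here is a self-contained run of both options on the sample frames from this answer. The names `merged` and `joined` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(dict(
    AUTHOR_NAME=list('AAABBCCCCDEEFGG'),
    title=list('zyxwvutsrqponml')))

df2 = pd.DataFrame(dict(
    AUTHOR_NAME=list('AABCCEGG'),
    title=list('zwvtrpml'),
    CATEGORY=list('11223344')))

# Option 1: left merge on the shared columns (AUTHOR_NAME and title).
merged = df.merge(df2, how='left')

# Option 2: join against df2 indexed by the same two columns.
cols = ['AUTHOR_NAME', 'title']
joined = df.join(df2.set_index(cols), on=cols)

# Both keep all 15 rows of df; 7 of them find a CATEGORY, the rest are NaN.
print(merged)
print(merged['CATEGORY'].notna().sum())  # 7
```

A pair such as ('B', 'w') stays NaN even though 'w' appears in df2, because there it belongs to author 'A'; both columns must match.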
