1 to 2 matching in two dataframes with different sizes in Python/R

2024/11/20 3:19:49

please help me with this problem I've been struggling all day lol, solution in either Python or R is fine! Please help I'm really stuck!!!

I have two dataframes - df1 has 44 rows, df2 has 100 rows, they both have these columns: ID, status (0,1), Age, Gender, Race, Ethnicity, Height, Weight

for each row in df1, I need to find an age match in df2:

  1. it can be exact age match, but the criteria should be used is - df2[age]-5 <= df1[age]<= df2[age]+5
  2. I need a list/dictionary to store which are the age matches for df1, and their IDs
  3. Then I need to randomly select 2 IDs from df2 as the final match for df1 age
  4. I also need to make sure the 2 df2 matches shares the same gender and race as df1

I have tried R and Python, and both stuck on the nested loops part. I'm not sure how to loop through each record both df1 and df2, compare df1 age with df2 age-5 and df2 age+5, and store the matches

Here are the sample data format for df1 and df2: | ID | sex | age | race | | -------- | -------------- |--------|-------| | 284336 | female | 42.8 | 2 | | 294123 | male | 48.5 | 1 |

Here is what I've attempted in R:

id_match <- NULL
for (i in 1:nrow(gwi_case)){age <- gwi_case$age[i]gender <- gwi_case$gender[i]ethnicity <- gwi_case$hispanic_non[i]race <- gwi_case$race[i]x <- which(gwi_control$gender==gender & gwi_control$age>=age-5 & gwi_control$age<=age+5 & gwi_control$hispanic_non==ethnicity & gwi_control$race==race)y <- sample(x, min(2, length(x)))id_match <- c(id_match, y)
}id_match <- id_match[!duplicated(id_match)]
length(id_match)
Answer

The question asks this:

  • for each row in df1, find an age match in df2 such that df2[age] - 5 <= df1[age] <= df2[age] + 5
  • create a list/dictionary to hold age matches and IDs for df1
  • randomly select 2 IDs from df2 as the final match for df1 age

Here is some Python code that:

  • uses the criteria to populate list of lists ageMatches with a list of unique df2 ages matching each unique df1 age
  • calls DataFrame.query() on df2 for each age in df1 to populate idMatches with a list of df2 IDs with age matching each unique df1 age
  • populates age1ToID2 with unique df1 age keys and with values that are lists of 2 (or fewer if available number < 2) randomly selected df2 IDs of matching age
  • adds a column to df1 containing the pair of selected df2 IDs corresponding to each row's age (i.e., the values in age1ToID2)
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'ID':list(range(101,145)), 'Age':[v % 11 + 21 for v in range(44)], 'Height':[67]*44})
df2 = pd.DataFrame({'ID':list(range(1,101)), 'Age':[v % 10 + 14 for v in range(50)] + [v % 20 + 25 for v in range(0,100,2)], 'Height':[67]*100})ages1 = np.sort(df1['Age'].unique())
ages2 = np.sort(df2['Age'].unique())
ageMatches = [[] for _ in ages1]
j1, j2 = 0, 0
for i, age1 in enumerate(ages1):while j1 < len(ages2) and ages2[j1] < age1 - 5:j1 += 1if j2 <= j1:j2 = j1 + 1while j2 < len(ages2) and ages2[j2] <= age1 + 5:j2 += 1ageMatches[i] += list(ages2[j1:j2])
idMatches = [df2.query('Age in @m')['ID'].to_list() for i, m in enumerate(ageMatches)]# select random pair of df2 IDs for each unique df1 age and put them into a new df1 column
from random import sample
age1ToID2 = {ages1[i]:m if len(m) < 2 else sample(m, 2) for i, m in enumerate(idMatches)}
df1['df2_matches'] = df1['Age'].apply(lambda x: age1ToID2[x])
print(df1)

Output:

     ID  Age  Height df2_matches
0   101   21      67    [24, 30]
1   102   22      67    [50, 72]
2   103   23      67    [10, 37]
3   104   24      67    [63, 83]
4   105   25      67    [83, 49]
5   106   26      67    [20, 52]
6   107   27      67    [49, 84]
7   108   28      67    [54, 55]
8   109   29      67    [91, 55]
9   110   30      67    [65, 51]
10  111   31      67    [75, 72]
11  112   21      67    [24, 30]
...
42  143   30      67    [65, 51]
43  144   31      67    [75, 72]

This hopefully provides the result and intermediate collections that OP is asking for, or something close enough to get to the desired result.

Alternatively, to have the random selection be different for each row in df1, we can do this:

# select random pair of df2 IDs for each df1 row and put them into a new df1 column
from random import sample
age1ToID2 = {ages1[i]:m for i, m in enumerate(idMatches)}
def foo(x):m = age1ToID2[x]return m if len(m) < 2 else sample(m, 2)
df1['df2_matches'] = df1['Age'].apply(foo)
print(df1)

Output:

     ID  Age  Height df2_matches
0   101   21      67    [71, 38]
1   102   22      67     [71, 5]
2   103   23      67     [9, 38]
3   104   24      67    [49, 61]
4   105   25      67    [27, 93]
5   106   26      67    [40, 20]
6   107   27      67     [9, 19]
7   108   28      67    [53, 72]
8   109   29      67    [82, 53]
9   110   30      67    [74, 62]
10  111   31      67    [52, 62]
11  112   21      67    [71, 39]
...
42  143   30      67    [96, 66]
43  144   31      67    [63, 83]
https://en.xdnf.cn/q/119873.html

Related Q&A

Can someone fix this? pip install intents

Im not sure why this is failing but I need help to fix it please Ive been trying to fix it for about 2 hours now and cant figure it out. I have tried to restart my computer, look it up on google etc. b…

Calculating Average Performance Measures in M/M/1 System Using SimPy

I am seeking to calculate the average waiting times, and average service times in the following M/M/1 queueing system, but I am not able to calculate the averages. It writes down to the console each cu…

Python: Whats the difference between import X and from X import *? [duplicate]

This question already has answers here:Use import module or from module import?(23 answers)Closed 7 years ago.I use to think both are equal until I tried this:$python Python 2.7.13 (default, Dec 17 20…

Python iterate through pixels of image

Im trying to iterate through pixels of an image. I set the size and then use a for loop, however I get a type error: object not iterable. I have imported PIL and Imagew=100 h=200 im=im.resize((w,h), Im…

Geocoding with Geopy and big data

I have this CSV file which im feeding with this python scriptimport csv from geopy.geocoders import OpenCagegeolocator = OpenCage() #here some parameters are needed with open(/Users/Ian/Desktop/Test02/…

an error in sending json data to flask server [duplicate]

This question already has answers here:How to get POSTed JSON in Flask?(13 answers)Closed 1 year ago.I have a json data as {"age":59.0,"bp":70.0,"sg":1.01,"al":…

terminate a python program when it hanged using subprocess python

I have a main.py which open a new cmd (subprocess) when another program (test.py, in same directory) is hanged. To determining test.py is hanged or not, I used latest modified time (os.path.getmtime(te…

causes of Python IOError: [Errno 13] Permission denied

When attempting to write a file, I can get this same error when any of following conditions applies:The file exists and is marked read-only. I dont have write permission for the folder and therefore ca…

Python Invalid Syntax IF statement

Im trying to make a quiz in python but I keep getting invalid syntax errors.#This is for addition questions.if (question=add) <---- That is where i get the error for i in range(0,10):first_number_a…

record a web page using python [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.Want to improve this question? Update the question so it focuses on one problem only by editing this post.Closed 3…