Pandas representative sampling across multiple columns

2024/9/28 21:19:30

I have a dataframe which represents a population, with each column denoting a different quality/ characteristic of that person. How can I get a sample of that dataframe/ population, which is representative of the population as a whole across all characteristics.

Suppose I have a dataframe which represents a workforce of 650 people as follows:

import pandas as pd
import numpy as np
c = np.random.choicecolours = ['blue', 'yellow', 'green', 'green... no, blue']
knights = ['Bedevere', 'Galahad', 'Arthur', 'Robin', 'Lancelot']
qualities = ['wise', 'brave', 'pure', 'not quite so brave']df = pd.DataFrame({'name_id':c(range(3000), 650, replace=False),'favourite_colour':c(colours, 650),'favourite_knight':c(knights, 650),'favourite_quality':c(qualities, 650)})

I can get a sample of the above that reflects the distribution of a single column as follows:

# Find the distribution of a particular column using value_counts and normalize:
knight_weight = df['favourite_knight'].value_counts(normalize=True)# Add this to my dataframe as a weights column:
df['knight_weight'] = df['favourite_knight'].apply(lambda x: knight_weight[x])# Then sample my dataframe using the weights column I just added as the 'weights' argument:
df_sample = df.sample(140, weights=df['knight_weight'])

This will return a sample dataframe (df_sample) such that:

df_sample['favourite_knight'].value_counts(normalize=True)
is approximately equal to
df['favourite_knight'].value_counts(normalize=True)

My question is this: How can I generate a sample dataframe (df_sample) such that the above i.e.:

df_sample[column].value_counts(normalize=True)
is approximately equal to
df[column].value_counts(normalize=True)

is true for all columns (except 'name_id') instead of just one of them? population of 650 with a sample size of 140 is approximately the sizes I'm working with so performance isn't too much of an issue. I'll happily accept solutions that take a couple of minutes to run as this will still be considerably faster than producing the above sample manually. Thank you for any help.

Answer

You create a combined feature column, weight that one and draw with it as weights:

df["combined"] = list(zip(df["favourite_colour"],df["favourite_knight"],df["favourite_quality"]))combined_weight = df['combined'].value_counts(normalize=True)df['combined_weight'] = df['combined'].apply(lambda x: combined_weight[x])df_sample = df.sample(140, weights=df['combined_weight'])

This will need an additional step of dividing by the count of the specific weight so sum up to 1 - see Ehsan Fathi post.

https://en.xdnf.cn/q/71289.html

Related Q&A

TensorFlow - Ignore infinite values when calculating the mean of a tensor

This is probably a basic question, but I cant find a solution:I need to calculate the mean of a tensor ignoring any non-finite values.For example mean([2.0, 3.0, inf, 5.0]) should return 3.333 and not …

encode unicode characters to unicode escape sequences

Ive a CSV file containing sites along with addresses. I need to work on this file to produce a json file that I will use in Django to load initial data to my database. To do that, I need to convert all…

Python: Regarding variable scope. Why dont I need to pass x to Y?

Consider the following code, why dont I need to pass x to Y?class X: def __init__(self):self.a = 1self.b = 2self.c = 3class Y: def A(self):print(x.a,x.b,x.c)x = X() y = Y() y.A()Thank you to…

Python/Pandas - partitioning a pandas DataFrame in 10 disjoint, equally-sized subsets

I want to partition a pandas DataFrame into ten disjoint, equally-sized, randomly composed subsets. I know I can randomly sample one tenth of the original pandas DataFrame using:partition_1 = pandas.Da…

How to fix pylint error Unnecessary use of a comprehension

With python 3.8.6 and pylint 2.4.4 the following code produces a pylint error (or recommendation) R1721: Unnecessary use of a comprehension (unnecessary-comprehension)This is the code: dict1 = {"A…

conv2d_transpose is dependent on batch_size when making predictions

I have a neural network currently implemented in tensorflow, but I am having a problem making predictions after training, because I have a conv2d_transpose operations, and the shapes of these ops are d…

How SelectKBest (chi2) calculates score?

I am trying to find the most valuable features by applying feature selection methods to my dataset. Im using the SelectKBest function for now. I can generate the score values and sort them as I want, b…

Refer to multiple Models in View/Template in Django

Im making my first steps with Python/Django and wrote an example application with multiple Django apps in one Django project. Now I added another app called "dashboard" where Id like to displ…

Can I use a machine learning model as the objective function in an optimization problem?

I have a data set for which I use Sklearn Decision Tree regression machine learning package to build a model for prediction purposes. Subsequently, I am trying to utilize scipy.optimize package to solv…

How to store data like Freebase does?

I admit that this is basically a duplicate question of Use freebase data on local server? but I need more detailed answers than have already been given thereIve fallen absolutely in love with Freebase…