Question 1

I have a dataframe which represents a population, with each column denoting a different quality/characteristic of a person. How can I get a sample of that dataframe/population that is representative of the population as a whole across all characteristics?

Suppose I have a dataframe which represents a workforce of 650 people as follows:

import pandas as pd
import numpy as np
c = np.random.choice
colours = ['blue', 'yellow', 'green', 'green... no, blue']
knights = ['Bedevere', 'Galahad', 'Arthur', 'Robin', 'Lancelot']
qualities = ['wise', 'brave', 'pure', 'not quite so brave']
df = pd.DataFrame({
    'name_id': c(range(3000), 650, replace=False),
    'favourite_colour': c(colours, 650),
    'favourite_knight': c(knights, 650),
    'favourite_quality': c(qualities, 650),
})

I can get a sample of the above that reflects the distribution of a single column as follows:

# Find the distribution of a particular column using value_counts and normalize:
knight_weight = df['favourite_knight'].value_counts(normalize=True)
# Add this to my dataframe as a weights column:
df['knight_weight'] = df['favourite_knight'].apply(lambda x: knight_weight[x])
# Then sample my dataframe using the weights column I just added as the 'weights' argument:
df_sample = df.sample(140, weights=df['knight_weight'])

This will return a sample dataframe (df_sample) such that:

df_sample['favourite_knight'].value_counts(normalize=True)
is approximately equal to
df['favourite_knight'].value_counts(normalize=True)

My question is this: how can I generate a sample dataframe (df_sample) such that the above, i.e.:

df_sample[column].value_counts(normalize=True)
is approximately equal to
df[column].value_counts(normalize=True)

is true for all columns (except 'name_id') instead of just one of them? A population of 650 with a sample size of 140 is approximately the scale I'm working with, so performance isn't much of an issue. I'll happily accept solutions that take a couple of minutes to run, as this will still be considerably faster than producing the above sample manually. Thank you for any help.
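The "approximately equal" condition can be made measurable, which helps when comparing candidate samples. A minimal sketch, assuming a hypothetical helper `proportion_gaps` (not part of pandas) that reports, per column, the largest absolute difference in category share between population and sample. The fixed seed and the plain uniform draw used as a baseline are assumptions for illustration:

```python
import numpy as np
import pandas as pd


def proportion_gaps(population, sample, columns):
    """Hypothetical helper: largest absolute gap in category share, per column."""
    gaps = {}
    for col in columns:
        pop_share = population[col].value_counts(normalize=True)
        # Categories absent from the sample count as a share of 0.
        sample_share = (
            sample[col]
            .value_counts(normalize=True)
            .reindex(pop_share.index, fill_value=0.0)
        )
        gaps[col] = float((pop_share - sample_share).abs().max())
    return pd.Series(gaps)


# Rebuild a frame shaped like the one in the question (the seed is an assumption).
rng = np.random.default_rng(0)
colours = ['blue', 'yellow', 'green', 'green... no, blue']
knights = ['Bedevere', 'Galahad', 'Arthur', 'Robin', 'Lancelot']
qualities = ['wise', 'brave', 'pure', 'not quite so brave']
df = pd.DataFrame({
    'name_id': rng.choice(3000, 650, replace=False),
    'favourite_colour': rng.choice(colours, 650),
    'favourite_knight': rng.choice(knights, 650),
    'favourite_quality': rng.choice(qualities, 650),
})

columns = ['favourite_colour', 'favourite_knight', 'favourite_quality']
df_sample = df.sample(140, random_state=0)  # plain uniform draw, used as a baseline
gaps = proportion_gaps(df, df_sample, columns)
print(gaps)
```

A gap of 0 means the sample reproduces that column's distribution exactly; smaller values across all three columns indicate a more representative sample.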

Answer

You can create a combined feature column, compute a weight for each combination, and sample using those weights:

df["combined"] = list(zip(df["favourite_colour"], df["favourite_knight"], df["favourite_quality"]))
combined_weight = df['combined'].value_counts(normalize=True)
df['combined_weight'] = df['combined'].apply(lambda x: combined_weight[x])
df_sample = df.sample(140, weights=df['combined_weight'])

This needs an additional step: dividing each weight by the count of its specific combination, so that the weights sum to 1 (see Ehsan Fathi's post).
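One possible reading of that extra step is sketched here; it is an interpretation, not necessarily what the referenced post does. Each row's combination share is divided by the number of rows that have the same combination, so the per-row weights add up to 1. The names `cols`, `group_size` and `row_weight` and the fixed seed are assumptions for illustration, and `groupby(...).transform('size')` is used in place of the tuple column to get per-row counts:

```python
import numpy as np
import pandas as pd

# Rebuild a frame shaped like the one in the question (the seed is an assumption).
rng = np.random.default_rng(0)
colours = ['blue', 'yellow', 'green', 'green... no, blue']
knights = ['Bedevere', 'Galahad', 'Arthur', 'Robin', 'Lancelot']
qualities = ['wise', 'brave', 'pure', 'not quite so brave']
df = pd.DataFrame({
    'name_id': rng.choice(3000, 650, replace=False),
    'favourite_colour': rng.choice(colours, 650),
    'favourite_knight': rng.choice(knights, 650),
    'favourite_quality': rng.choice(qualities, 650),
})

cols = ['favourite_colour', 'favourite_knight', 'favourite_quality']

# Number of rows sharing each row's (colour, knight, quality) combination.
group_size = df.groupby(cols)['name_id'].transform('size')

# Share of the population held by each row's combination
# (the per-row equivalent of value_counts(normalize=True) on the combined column).
combined_weight = group_size / len(df)

# The additional step: divide by the combination's count so the weights sum to 1.
row_weight = combined_weight / group_size

df_sample = df.sample(140, weights=row_weight, random_state=0)
print(row_weight.sum())
```

Under this reading every row ends up with the same weight, 1 / len(df), so the draw behaves like a plain `df.sample(140)`.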

Pandas representative sampling across multiple columns
