I would like to randomly select a value, taking weights into account, using Pandas.
df:
    0  1   2   3   4   5
0  40  5  20  10  35  25
1  24  3  12   6  21  15
2  72  9  36  18  63  45
3   8  1   4   2   7   5
4  16  2   8   4  14  10
5  48  6  24  12  42  30
I am aware of np.random.choice, e.g.:
x = np.random.choice(['0-0', '0-1', ...], 1, p=[0.4, 0.24, ...])
And so I would like to get an output from df in a similar style to np.random.choice (or by an alternative method), but using Pandas. I would like to do so more efficiently than by manually entering the values as I have done above.
With np.random.choice, I am aware that all probabilities must add up to 1. I'm not sure how to go about solving this, nor how to randomly select a value based on weights using Pandas.
As for the output: if the randomly selected weight were, for example, 40, then the output would be 0-0, since it is located in column 0, row 0, and so on.
Stack the DataFrame:
stacked = df.stack()
Normalize the weights (so that they add up to 1):
weights = stacked / stacked.sum()
# As GeoMatt22 pointed out, this part is not necessary. See the other comment.
And then use sample:
stacked.sample(1, weights=weights)
Out:
1  2    12
dtype: int64
# Or without normalization, stacked.sample(1, weights=stacked)
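The steps above can be put together into one runnable sketch that also produces the "row-column" label the question asks for. The DataFrame is rebuilt from the values in the question; the names picked and label, and the fixed random_state (used only so the sketch is repeatable), are illustrative assumptions rather than part of the original answer.

```python
import pandas as pd

# Weights from the question; rows and columns are both labeled 0..5.
df = pd.DataFrame([
    [40, 5, 20, 10, 35, 25],
    [24, 3, 12, 6, 21, 15],
    [72, 9, 36, 18, 63, 45],
    [8, 1, 4, 2, 7, 5],
    [16, 2, 8, 4, 14, 10],
    [48, 6, 24, 12, 42, 30],
])

# Series with a (row, column) MultiIndex, one entry per cell.
stacked = df.stack()

# The cell values double as weights; sample normalizes them itself.
# random_state is fixed here only to make the sketch repeatable.
picked = stacked.sample(1, weights=stacked, random_state=0)

# Turn the (row, column) index of the sampled cell into a "row-col" label.
row, col = picked.index[0]
label = f"{row}-{col}"
print(label, picked.iloc[0])
```

Dropping random_state gives a fresh draw on every call.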
The DataFrame.sample method allows you to sample either from rows or from columns. Consider this:
df.sample(1, weights=[0.4, 0.3, 0.1, 0.1, 0.05, 0.05])
Out:
    0  1   2  3   4   5
1  24  3  12  6  21  15
It selects one row (the first row with a 40% chance, the second with a 30% chance, etc.).
This is also possible:
df.sample(1, weights=[0.4, 0.3, 0.1, 0.1, 0.05, 0.05], axis=1)
Out:
   1
0  5
1  3
2  9
3  1
4  2
5  6
This is the same process, but the 40% chance is now associated with the first column, and we are selecting from columns. However, your question seems to imply that you don't want to select rows or columns; you want to select the cells inside. Therefore, I changed the dimension from 2D to 1D.
df.stack()
Out:
0  0    40
   1     5
   2    20
   3    10
   4    35
   5    25
1  0    24
   1     3
   2    12
   3     6
   4    21
   5    15
2  0    72
   1     9
   2    36
   3    18
   4    63
   5    45
3  0     8
   1     1
   2     4
   3     2
   4     7
   5     5
4  0    16
   1     2
   2     8
   3     4
   4    14
   5    10
5  0    48
   1     6
   2    24
   3    12
   4    42
   5    30
dtype: int64
So if I now sample from this, I will both sample a row and a column. For example:
df.stack().sample()
Out:
1 0 24
dtype: int64
This selects row 1 and column 0.
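For comparison, here is a rough NumPy equivalent in the style of np.random.choice from the question. It is only a sketch of an alternative, not the method used in this answer: it flattens the weights, normalizes them explicitly (NumPy requires probabilities that sum to 1), draws a flat position, and maps it back to a row and column. The seed and the variable names are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Same weights as in the question.
df = pd.DataFrame([
    [40, 5, 20, 10, 35, 25],
    [24, 3, 12, 6, 21, 15],
    [72, 9, 36, 18, 63, 45],
    [8, 1, 4, 2, 7, 5],
    [16, 2, 8, 4, 14, 10],
    [48, 6, 24, 12, 42, 30],
])

# Flatten row by row, then normalize so the probabilities sum to 1.
weights = df.to_numpy().ravel()
p = weights / weights.sum()

# Seeded generator, only so the sketch is repeatable.
rng = np.random.default_rng(0)

# Draw one flat position, then convert it back to (row, column) positions.
flat = rng.choice(weights.size, p=p)
i, j = np.unravel_index(flat, df.shape)

label = f"{df.index[i]}-{df.columns[j]}"
print(label, df.iloc[i, j])
```

The pandas version needs less bookkeeping because the stacked index already carries the row and column labels.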