pandas, dataframe, groupby, std

2024/10/3 21:28:16

New to pandas here. A (trivial) problem: hosts, operations, execution times. I want to group by host, then by host+operation, calculate std deviation for execution time per host, then by host+operation pair. Seems simple?

It works for grouping by a single column:

df
Out[360]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 132564 entries, 0 to 132563
Data columns (total 9 columns):
datespecial    132564  non-null values
host           132564  non-null values
idnum          132564  non-null values
operation      132564  non-null values
time           132564  non-null values
...
dtypes: float32(1), int64(2), object(6)byhost = df.groupby('host')byhost.std()
Out[362]:datespecial         idnum      time
host
ahost1.test  11946.961952  40367.033852  0.003699
host1.test   15484.975077  38206.578115  0.008800
host10.test           NaN  37644.137631  0.018001
...

Nice. Now:

byhostandop = df.groupby(['host', 'operation'])byhostandop.std()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-364-2c2566b866c4> in <module>()
----> 1 byhostandop.std()/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in std(self, ddof)386         # todo, implement at cython level?387         if ddof == 1:
--> 388             return self._cython_agg_general('std')389         else:390             f = lambda x: x.std(ddof=ddof)/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_general(self, how, numeric_only)16151616     def _cython_agg_general(self, how, numeric_only=True):
-> 1617         new_blocks = self._cython_agg_blocks(how, numeric_only=numeric_only)1618         return self._wrap_agged_blocks(new_blocks)1619/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_blocks(self, how, numeric_only)1653                 values = com.ensure_float(values)1654
-> 1655             result, _ = self.grouper.aggregate(values, how, axis=agg_axis)16561657             # see if we can cast the block back to the original dtype/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, values, how, axis)838                 if is_numeric:839                     result = lib.row_bool_subset(result,
--> 840                                                  (counts > 0).view(np.uint8))841                 else:842                     result = lib.row_bool_subset_object(result,/home/username/anaconda/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.row_bool_subset (pandas/lib.c:16540)()ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'

Huh?? Why do I get this exception?

More questions:

  • how do I calculate std deviation on dataframe.groupby([several columns])?

  • how can I limit calculation to a selected column? E.g. it obviously doesn't make sense to calculate std dev on dates/timestamps here.

Answer

It's important to know your version of Pandas / Python. Looks like this exception could arise in Pandas version < 0.10 (see ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'). To avoid this, you can cast your float columns to float64:

df.astype('float64')

To calculate std() on selected columns, just select columns :)

>>> df = pd.DataFrame({'a':range(10), 'b':range(10,20), 'c':list('abcdefghij'), 'g':[1]*3 + [2]*3 + [3]*4})
>>> dfa   b  c  g
0  0  10  a  1
1  1  11  b  1
2  2  12  c  1
3  3  13  d  2
4  4  14  e  2
5  5  15  f  2
6  6  16  g  3
7  7  17  h  3
8  8  18  i  3
9  9  19  j  3
>>> df.groupby('g')[['a', 'b']].std()a         b
g                    
1  1.000000  1.000000
2  1.000000  1.000000
3  1.290994  1.290994

update

As far as it goes, it looks like std() is calling aggregation() on the groupby result, and a subtle bug (see here - Python Pandas: Using Aggregate vs Apply to define new columns). To avoid this, you can use apply():

byhostandop['time'].apply(lambda x: x.std())
https://en.xdnf.cn/q/70684.html

Related Q&A

Count occurrences of a couple of specific words

I have a list of words, lets say: ["foo", "bar", "baz"] and a large string in which these words may occur. I now use for every word in the list the "string".coun…

numpy: how to fill multiple fields in a structured array at once

Very simple question: I have a structured array with multiple columns and Id like to fill only some of them (but more than one) with another preexisting array.This is what Im trying:strc = np.zeros(4, …

Combine date column and time column into datetime column

I have a Pandas dataframe like this; (obtained by parsing an excel file)| | COMPANY NAME | MEETING DATE | MEETING TIME| --------------------------------------------------------…

Matplotlib Plot Lines Above Each Bar

I would like to plot a horizontal line above each bar in this chart. The y-axis location of each bar depends on the variable target. I want to use axhline, if possible, or Line2D because I need to be …

Flask-SQLAlchemy Lower Case Index - skipping functional, not supported by SQLAlchemy reflection

First off. Apologies if this has been answered but I can not find the answer any where.I need to define a lowercase index on a Flask-SQLAlchemy object.The problem I have is I need a models username and…

pip listing global packages in active virtualenv

After upgrading pip from 1.4.x to 1.5 pip freeze outputs a list of my globally installed (system) packages instead of the ones installed inside of my virtualenv. Ive tried downgrading to 1.4 again but …

Scrapy Crawler in python cannot follow links?

I wrote a crawler in python using the scrapy tool of python. The following is the python code:from scrapy.contrib.spiders import CrawlSpider, Rule from scrapy.contrib.linkextractors.sgml import SgmlLin…

Remove commas in a string, surrounded by a comma and double quotes / Python

Ive found some similar themes on stackoverflow, but Im newbie to Python and Reg Exps.I have a string,"Completely renovated in 2009, the 2-star Superior Hotel Ibis BerlinMesse, with its 168 air-con…

I need help making a discord py temp mute command in discord py

I got my discord bot to have a mute command but you have to unmute the user yourself at a later time, I want to have another command called "tempmute" that mutes a member for a certain number…

How to clip polar plot in pylab/pyplot

I have a polar plot where theta varies from 0 to pi/2, so the whole plot lies in the first quater, like this:%pylab inline X=linspace(0,pi/2) polar(X,cos(6*X)**2)(source: schurov.com) Is it possible b…