Correct way to obtain confidence interval with scipy

2024/11/20 17:22:23

I have a 1-dimensional array of data:

a = np.array([1,2,3,4,4,4,5,5,5,5,4,4,4,6,7,8])

for which I want to obtain the 68% confidence interval (ie: the 1 sigma).

The first comment in this answer states that this can be achieved using scipy.stats.norm.interval from the scipy.stats.norm function, via:

from scipy import stats
import numpy as np
mean, sigma = np.mean(a), np.std(a)conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma)

But a comment in this post states that the actual correct way of obtaining the confidence interval is:

conf_int = stats.norm.interval(0.68, loc=mean, scale=sigma / np.sqrt(len(a)))

that is, sigma is divided by the square-root of the sample size: np.sqrt(len(a)).

The question is: which version is the correct one?

Answer

The 68% confidence interval for a single draw from a normal distribution withmean mu and std deviation sigma is

stats.norm.interval(0.68, loc=mu, scale=sigma)

The 68% confidence interval for the mean of N draws from a normal distributionwith mean mu and std deviation sigma is

stats.norm.interval(0.68, loc=mu, scale=sigma/sqrt(N))

Intuitively, these formulas make sense, since if you hold up a jar of jelly beans and ask a large number of people to guess the number of jelly beans, each individual may be off by a lot -- the same std deviation sigma -- but the average of the guesses will do a remarkably fine job of estimating the actual number and this is reflected by the standard deviation of the mean shrinking by a factor of 1/sqrt(N).


If a single draw has variance sigma**2, then by the Bienaymé formula, the sum of N uncorrelated draws has variance N*sigma**2.

The mean is equal to the sum divided by N. When you multiply a random variable (like the sum) by a constant, the variance is multiplied by the constant squared. That is

Var(cX) = c**2 * Var(X)

So the variance of the mean equals

(variance of the sum)/N**2 = N * sigma**2 / N**2 = sigma**2 / N

and so the standard deviation of the mean (which is the square root of the variance) equals

sigma/sqrt(N).

This is the origin of the sqrt(N) in the denominator.


Here is some example code, based on Tom's code, which demonstrates the claims made above:

import numpy as np
from scipy import statsN = 10000
a = np.random.normal(0, 1, N)
mean, sigma = a.mean(), a.std(ddof=1)
conf_int_a = stats.norm.interval(0.68, loc=mean, scale=sigma)print('{:0.2%} of the single draws are in conf_int_a'.format(((a >= conf_int_a[0]) & (a < conf_int_a[1])).sum() / float(N)))M = 1000
b = np.random.normal(0, 1, (N, M)).mean(axis=1)
conf_int_b = stats.norm.interval(0.68, loc=0, scale=1 / np.sqrt(M))
print('{:0.2%} of the means are in conf_int_b'.format(((b >= conf_int_b[0]) & (b < conf_int_b[1])).sum() / float(N)))

prints

68.03% of the single draws are in conf_int_a
67.78% of the means are in conf_int_b

Beware that if you define conf_int_b with the estimates for mean and sigma based on the sample a, the mean may not fall in conf_int_b with the desired frequency.


If you take a sample from a distribution and compute the sample mean and std deviation,

mean, sigma = a.mean(), a.std()

be careful to note that there is no guarantee that these will equal the population mean and standard deviation and that we are assuming the population is normally distributed -- those are not automatic givens!

If you take a sample and want to estimate the population mean and standard deviation, you should use

mean, sigma = a.mean(), a.std(ddof=1)

since this value for sigma is the unbiased estimator for the population standard deviation.

https://en.xdnf.cn/q/26297.html

Related Q&A

How to supply a mock class method for python unit test?

Lets say I have a class like this. class SomeProductionProcess(CustomCachedSingleTon):@classmethoddef loaddata(cls):"""Uses an iterator over a large file in Production for the Data pipel…

View pdf image in an iPython Notebook

The following code allows me to view a png image in an iPython notebook. Is there a way to view pdf image? I dont need to use IPython.display necessarily. I am looking for a way to print a pdf image i…

Is the use of del bad?

I commonly use del in my code to delete objects:>>> array = [4, 6, 7, hello, 8] >>> del(array[array.index(hello)]) >>> array [4, 6, 7, 8] >>> But I have heard many …

Find how many lines in string

I am creating a python movie player/maker, and I want to find the number of lines in a multiple line string. I was wondering if there was any built in function or function I could code to do this:x = &…

AttributeError: Cant get attribute new_block on module pandas.core.internals.blocks

I was using pyspark on AWS EMR (4 r5.xlarge as 4 workers, each has one executor and 4 cores), and I got AttributeError: Cant get attribute new_block on <module pandas.core.internals.blocks. Below is…

Disable python import sorting in VSCode

I am trying to disable vscode from formatting my python imports when I save my file. I have some code that must run in between various imports so order is important, but every time I save it just shove…

Log-log lmplot with seaborn

Can Seaborns lmplot plot on log-log scale? This is lmplot with linear axes: import numpy as np import pandas as pd import seaborn as sns x = 10**arange(1, 10) y = 10** arange(1,10)*2 df1 = pd.DataFra…

Django on IronPython

I am interested in getting an install of Django running on IronPython, has anyone had any success getting this running with some level of success? If so can you please tell of your experiences, perfo…

How to create a DataFrame while preserving order of the columns?

How can I create a DataFrame from multiple numpy arrays, Pandas Series, or Pandas DataFrames while preserving the order of the columns?For example, I have these two numpy arrays and I want to combine …

Dynamically limiting queryset of related field

Using Django REST Framework, I want to limit which values can be used in a related field in a creation. For example consider this example (based on the filtering example on https://web.archive.org/web/…