Sunday, March 17, 2019

SciPy - 8 (scipy.stats module and descriptive statistics)

The scipy.stats module contains a large number of statistical distributions, statistical functions and tests. A complete listing of these functions can be obtained using info(stats), and the list of available random variables can be found in the docstring of the stats sub-package. Before any SciPy sub-package can be used, it must be explicitly imported. For example, to use functions in the scipy.stats package, we must execute:

from scipy import stats

In some cases, though, we assume that individual objects are imported directly, as in:

from scipy.stats import norm
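
As a quick check of the two import styles, and of the list of available random variables, here is a minimal sketch. It assumes that every continuous distribution in scipy.stats is an instance of stats.rv_continuous and every discrete one an instance of stats.rv_discrete; the variable names are my own.

```python
from scipy import stats
from scipy.stats import norm

# Both import styles give access to the same distribution object.
same_object = stats.norm is norm
print("stats.norm is norm:", same_object)

# Collect the names of the distribution objects exposed by scipy.stats.
continuous = [name for name in dir(stats)
              if isinstance(getattr(stats, name), stats.rv_continuous)]
discrete = [name for name in dir(stats)
            if isinstance(getattr(stats, name), stats.rv_discrete)]

print("Continuous distributions:", len(continuous))
print("Discrete distributions:", len(discrete))
print("First few continuous names:", continuous[:5])
```

The exact counts depend on the installed SciPy version, so none are quoted here.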

Descriptive statistics

To illustrate the basic functions I'll use some pseudo-random numbers from a Gaussian (normal) distribution. The function numpy.random.randn generates random numbers from a standard Gaussian. Older SciPy releases exposed the same function as scipy.randn, but that alias was later deprecated and is not available in recent releases, so it is safer to call the NumPy function directly.

s = np.random.randn(100)

We have generated 100 random numbers from a standard Gaussian distribution. Since the value returned is a NumPy array, we can use its methods to find descriptive statistics for the data (mean, min, max, var, std). See the following program:

import scipy as sp
import numpy as np
import matplotlib as mpl
from matplotlib import pyplot as plt
from scipy import stats

s = np.random.randn(100)  # 100 random numbers from a standard Gaussian
print("Mean : {0:8.6f}".format(s.mean()))
print("Minimum : {0:8.6f}".format(s.min()))
print("Maximum : {0:8.6f}".format(s.max()))
print("Variance : {0:8.6f}".format(s.var()))
print("Std. deviation : {0:8.6f}".format(s.std()))

The output of the program is as follows:

Mean : 0.057854
Minimum : -2.143875
Maximum : 3.021479
Variance : 0.987173
Std. deviation : 0.993566
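
Each run draws a new sample, so the numbers above change every time the program is executed. If reproducible output is wanted, one option is to seed NumPy's random generator. The sketch below uses numpy.random.default_rng with an arbitrary seed of 42 (my choice, not part of the original program) and then computes the same statistics.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed, chosen only for illustration
s = rng.standard_normal(100)      # 100 draws from a standard Gaussian

print("Mean : {0:8.6f}".format(s.mean()))
print("Minimum : {0:8.6f}".format(s.min()))
print("Maximum : {0:8.6f}".format(s.max()))
print("Variance : {0:8.6f}".format(s.var()))
print("Std. deviation : {0:8.6f}".format(s.std()))

# A second generator created with the same seed reproduces the same sample.
s_again = np.random.default_rng(42).standard_normal(100)
print("Same sample:", np.array_equal(s, s_again))
```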


We can also use NumPy functions to perform the same calculations, as shown in the next program:

print("Mean : {0:8.6f}".format(np.mean(s)))
print("Variance : {0:8.6f}".format(np.var(s)))
print("Std. deviation : {0:8.6f}".format(np.std(s)))
print("Median : {0:8.6f}".format(np.median(s)))

The output of the program is as follows (the sample is random, so the values differ from the previous run):

Mean : 0.225148
Variance : 1.022019
Std. deviation : 1.010950
Median : 0.179047

The variance and standard deviation computed above use N in the denominator, i.e., they are biased estimators of the variance and standard deviation of the parent distribution. When we are merely describing the data, however, these are the appropriate formulas to use.
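
To see the difference between the two denominators, here is a small sketch using the ddof ("delta degrees of freedom") argument of numpy.var. The sample and variable names are illustrative only.

```python
import numpy as np

s = np.random.randn(100)  # a fresh sample, used only for illustration
n = s.size

var_biased = np.var(s)            # divides by N (the default, ddof=0)
var_unbiased = np.var(s, ddof=1)  # divides by N - 1

print("Variance with N     : {0:8.6f}".format(var_biased))
print("Variance with N - 1 : {0:8.6f}".format(var_unbiased))

# The two estimates differ exactly by the factor N / (N - 1).
print("Ratio    : {0:8.6f}".format(var_unbiased / var_biased))
print("N/(N - 1): {0:8.6f}".format(n / (n - 1)))
```

For 100 points the factor is 100/99, so the two values differ by about one percent.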

The scipy.stats sub-package has a function, describe, that provides most of the above numbers in a single call. In this case, the variance has N - 1 in the denominator. The following program prints these numbers using the describe function:

n, min_max, mean, var, skew, kurt = stats.describe(s)
print("Number of elements: {0:d}".format(n))
print("Minimum: {0:8.6f} Maximum: {1:8.6f}".format(min_max[0], min_max[1]))
print("Mean: {0:8.6f}".format(mean))
print("Variance: {0:8.6f}".format(var))
print("Skew : {0:8.6f}".format(skew))
print("Kurtosis: {0:8.6f}".format(kurt))

The output of the program is as follows:

Number of elements: 100
Minimum: -1.581945 Maximum: 2.144670
Mean: 0.081402
Variance: 0.765518
Skew : 0.220977
Kurtosis: -0.681634
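
The result of describe can also be read by field name instead of being unpacked. The sketch below, on a fresh illustrative sample, checks that the reported variance matches numpy.var with ddof=1 and that the skew and kurtosis match stats.skew and stats.kurtosis with their default settings. The kurtosis reported is the excess (Fisher) kurtosis, which is zero for a Gaussian, so small values on either side of zero are expected here.

```python
import numpy as np
from scipy import stats

s = np.random.randn(100)  # a fresh sample, used only for illustration
result = stats.describe(s)

# Fields of the returned result can be accessed by name.
print("Number of elements:", result.nobs)
print("Minimum, maximum  :", result.minmax)
print("Mean              :", result.mean)
print("Variance          :", result.variance)

# Cross-check the fields against the individual functions.
variance_matches = np.isclose(result.variance, np.var(s, ddof=1))
skew_matches = np.isclose(result.skewness, stats.skew(s))
kurtosis_matches = np.isclose(result.kurtosis, stats.kurtosis(s))

print("Variance uses N - 1       :", variance_matches)
print("Skew matches stats.skew   :", skew_matches)
print("Kurtosis matches (excess) :", kurtosis_matches)
```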


I am ending this post here. In the next post we'll discuss SciPy functions that deal with several common probability distributions. Till we meet next, keep practicing and learning Python, as Python is easy to learn!







