This is the seventh featured recipe from the IPython Cookbook, the definitive guide to high-performance scientific computing and data science in Python.

In Chapter 7, Statistical Data Analysis, we introduce statistical methods for data analysis. In addition to covering statistical packages such as pandas, statsmodels, and PyMC, we explain the basics of the underlying mathematical principles. Therefore, this chapter will be most profitable if you have basic experience with probability theory and calculus.

The next chapter, Chapter 8, Machine Learning, is closely related; the underlying mathematics is very similar, but the goals are slightly different. While in the present chapter, we show how to gain insight into real-world data and how to make informed decisions in the presence of uncertainty, in the next chapter the goal is to learn from data, that is, to generalize and to predict outcomes from partial observations.

In this featured recipe, we give a broad, high-level overview of the methods we will see in Chapter 7, and we give an introduction to frequentist and Bayesian statistics.

The full chapter is available in the book, and covers many more statistical methods.

What is statistical data analysis?

The goal of statistical data analysis is to understand a complex, real-world phenomenon from partial and uncertain observations. The uncertainty in the data results in uncertainty in the knowledge we get about the phenomenon. A major goal of the theory is to quantify this uncertainty.

It is important to make the distinction between the mathematical theory underlying statistical data analysis, and the decisions made after conducting an analysis. The former is perfectly rigorous; perhaps surprisingly, mathematicians were able to build an exact mathematical framework to deal with uncertainty. Nevertheless, there is a subjective part in the way statistical analysis yields actual human decisions. Understanding the risk and the uncertainty behind statistical results is critical in the decision-making process.

In this chapter, we will see the basic notions, principles, and theories behind statistical data analysis, covering in particular how to make decisions with a quantified risk. Of course, we will always show how to implement these methods with Python.

A bit of vocabulary

There are many terms that need introduction before we get started with the recipes. These notions allow us to classify statistical techniques within multiple axes.

Exploration, inference, decision, and prediction

Exploratory methods allow us to get a preliminary look at a dataset through basic statistical aggregates and interactive visualization. We covered these basic methods in the first chapter of this book and in our previous book Learning IPython for Interactive Computing and Data Visualization, Packt Publishing.

Statistical inference consists of getting information about an unknown process through partial and uncertain observations. In particular, estimation entails obtaining approximate quantities for the mathematical variables describing this process.

Decision theory allows us to make decisions about an unknown process from random observations, with a controlled risk.

Prediction consists of learning from data, that is, predicting the outcomes of a random process based on a limited number of observations. This is the topic of the next chapter, Chapter 8, Machine Learning.

Univariate and multivariate methods

In most cases, you can consider two dimensions in your data:

  • Observations (or samples, for machine learning people)
  • Variables (or features)

Typically, observations are independent realizations of the same random process. Each observation is made of one or several variables. Most of the time, variables are either numbers, or elements belonging to a finite set (that is, taking a finite number of values). The first step in an analysis is to understand what your observations and variables are.

Your problem is univariate if you have one variable. It is bivariate if you have two variables and multivariate if you have at least two variables. Univariate methods are typically simpler. That being said, univariate methods may be used on multivariate data, using one dimension at a time. Although interactions between variables cannot be explored in that case, it is often an interesting first approach.

Frequentist and Bayesian methods

There are at least two different ways of considering uncertainty, resulting in two different classes of methods for inference, decision, and other statistical questions. These are called frequentist and Bayesian methods. Some people prefer frequentist methods, while others prefer Bayesian methods.

Frequentists interpret a probability as a statistical average across many independent realizations (law of large numbers). Bayesians interpret it as a degree of belief (no need for many realizations). The Bayesian interpretation is very useful when only a single trial is considered. In addition, Bayesian theory takes into account our prior knowledge about a random process. This prior probability distribution is updated into a posterior distribution as we get more and more data.

Both frequentist and Bayesian methods have their advantages and disadvantages. For instance, one could say that frequentist methods might be easier to apply than Bayesian methods, but more difficult to interpret. For classic misuses of frequentist methods, see this reference.

In any case, if you are a beginner in statistical data analysis, you probably want to learn the basics of both approaches before choosing sides. This chapter introduces you to both types of methods.

Jake Vanderplas has written several blog posts about frequentism and Bayesianism, with examples in Python. The first post of the series is available here.

Parametric and nonparametric inference methods

In many cases, you base your analysis on a probabilistic model. This model describes how your data is generated. A probabilistic model has no reality; it is only a mathematical object that guides you in your analysis. A good model can be helpful, whereas a bad model may misguide you.

With a parametric method, you assume that your model belongs to a known family of probability distributions. The model has one or multiple numerical parameters that you can estimate.

With a nonparametric model, you do not make such an assumption in your model. This gives you more flexibility. However, these methods are typically more complicated to implement and to interpret.

This chapter only gives you an idea of the wide range of possibilities that Python offers for statistical data analysis. You can find many books and online courses that cover statistical methods in much greater detail, such as:

In this featured recipe, we give an introduction to frequentist and Bayesian methods.

Getting started with statistical hypothesis testing – a simple z-test

Statistical hypothesis testing allows us to make decisions in presence of incomplete data. By definition, these decisions are uncertain. Statisticians have developed rigorous methods to evaluate this risk. Nevertheless, a part of subjectivity is always involved in the decision-making process. The theory is just a tool that helps making decisions in an uncertain world.

Here, we introduce the most basic ideas behind statistical hypothesis testing, following an extremely simple example: coin tossing. More precisely, we will show how to perform a z-test, and we will briefly explain the mathematical ideas underlying it. This kind of method (also called frequentist method), although widely used in science, is subject to many criticisms. We will show later a more modern approach based on Bayesian theory. It is very helpful to understand both approaches, because many studies and publications still follow frequentist methods.

How to do it...

Many frequentist methods for hypothesis testing roughly involve the following steps:

  1. Writing down the hypotheses, notably the null hypothesis which is the opposite of the hypothesis we want to prove (with a certain degree of confidence).
  2. Computing a test statistics, a mathematical formula depending on the test type, the model, the hypotheses, and the data.
  3. Using the computed value to accept the hypothesis, reject it, or fail to conclude.

Here, we flip a coin \(n\) times and we observe \(h\) heads. We want to know whether the coin is fair (null hypothesis). This example is extremely simple yet quite good for pedagogical purposes. Besides, it is the basis of many more complex methods.

We denote by \(\mathcal B(q)\) the Bernoulli distribution with unknown parameter \(q\). A Bernoulli variable:

  • is 0 (tail) with probability \(1-q\),
  • is 1 (head) with probability \(q\).

Here are the steps required to conduct a simple statistical z-test:

  1. Let's suppose that, after \(n=100\) flips, we get \(h=61\) heads. We choose a significance level of 0.05: is the coin fair or not? Our null hypothesis is: the coin is fair (\(q = 1/2\)).
import numpy as np
import scipy.stats as st
import scipy.special as sp
n = 100  # number of coin flips
h = 61  # number of heads
q = .5  # null-hypothesis of fair coin
  1. Let's compute the z-score, which is defined by the following formula (xbar is the estimated average of the distribution). We will explain this formula in the next section How it works...
xbar = float(h)/n
z = (xbar - q) * np.sqrt(n / (q*(1-q))); z
  1. Now, from the z-score, we can compute the \(p\)-value as follows:
pval = 2 * (1 - st.norm.cdf(z)); pval
  1. This \(p\)-value is less than 0.05, so we reject the null hypothesis and conclude that the coin is probably not fair.

How it works...

The coin tossing experiment is modeled as a sequence of \(n\) independent random variables \(x_i \in \{0,1\}\) following the Bernoulli distribution \(\mathcal B(q)\). Each \(x_i\) represents one coin flip. After our experiment, we get actual values (samples) for these variables. A different notation is sometimes used to distinguish between the random variables (probabilistic objects) and the actual values (samples).

The following formula gives the sample mean (proportion of heads here):

$$\overline x = \frac{1}{n} \sum_i x_i$$

Knowing the expectancy \(\mu=q\) and variance \(\sigma^2=q(1-q)\) of the distribution \(\mathcal B(q)\), we compute:

\begin{align*} E\left[\overline x\right] &= \mu = q\\ \mathrm{var}\left(\overline x\right) &= \frac{\sigma^2}{n} = \frac{q(1-q)}{n} \end{align*}

The \(z\)-test is the normalized version of \(\overline x\) (we remove its mean, and divide by the standard deviation, so that we get a variable with 0 mean and a standard deviation of 1).

$$z = \frac{\overline x - E[\overline x]}{\mathrm{std}(\overline x)} = (\overline x - q) \sqrt{\frac{n}{q(1-q)}}$$

Under the null-hypothesis, what is the probability of obtaining a \(z\)-test higher than some quantity \(z_0\)? This probability is called the (two-sided) \(p\)-value. According to the central limit theorem, the \(z\)-test approximately follows a standard Gaussian distribution \(\mathcal N(0,1)\) for large \(n\), so we get:

$$p = P[\left|z\right|>z_0] = 2P[z>z_0] \simeq 2(1-\Phi(z_0))$$

The following diagram illustrates the \(z\)-score and the \(p\)-value:

Illustration of the z-score and the p-value.

In this formula, \(\Phi\) is the cumulative distribution function of a standard normal distribution. In SciPy, we can get it with scipy.stats.norm.cdf. So, given the \(z\)-test computed from the data, we compute the \(p\)-value: the probability of observing a \(z\)-test more extreme than the observed test, under the null hypothesis.

If the \(p\)-value is less than 5% (the significance level chosen at the beginning of the experiment), we conclude that either:

  1. The null hypothesis is false: we conclude that the coin is unfair.
  2. The null hypothesis is true, and it is just bad luck if we obtained these values. We cannot conclude.

We cannot disambiguate between these two options in this framework, but typically the first option is chosen. We hit the limits of statistics... although there are ways to mitigate this problem (for example, by conducting several independent studies and looking at all their conclusions).

There's more...

Many statistical tests following this pattern and testing various properties of the data exist. Reviewing these tests is beyond the scope of this book, but you can take a look at this reference.

As a \(p\)-value is not easy to interpret, it can lead to wrong conclusions, even in peer-reviewed scientific publications. See this reference for an in-depth treatment of the subject.

Getting started with Bayesian methods

In the last recipe, we used a frequentist method to test a hypothesis on incomplete data. Here, we will see an alternative approach based on Bayesian theory. The main idea is to consider that unknown parameters are random variables, just like the variables describing the experiment. Prior knowledge about the parameters is integrated in the model. This knowledge is updated as more and more data is observed.

Frequentists and Bayesians interpret probabilities differently. Frequentists interpret a probability as a limit of frequencies when the number of samples tends to infinity. Bayesians interpret it as a belief, which is updated as more and more data is observed.

Here, we revisit the previous coin flipping example with a Bayesian approach. This example is sufficiently simple to admit an analytical treatment. In general, as we will see later in this chapter, analytical results cannot be obtained and numerical methods become essential.

How to do it...

Let \(q\) be the probability of obtaining a head. Whereas \(q\) was just a fixed number in the previous recipe, we consider here that it is a random variable. Initially, this variable follows a distribution called the prior distribution. It represents our knowledge about \(q\) before we start flipping the coin. We will update this distribution after each trial (posterior distribution).

  1. First, we assume that \(q\) is a uniform random variable on the interval \([0, 1]\). That's our prior distribution: for all \(q\), \(P(q)=1\).
  2. Then, we flip our coin \(n\) times. We note \(x_i\) the outcome of the \(i\)-th flip (\(0\) for tail, \(1\) for head).
  3. What is the probability distribution of \(q\) knowing the observations \(x_i\)? Bayes' formula allows us to compute the posterior distribution analytically (see the next section for the mathematical details):

$$P(q | \{x_i\}) = \frac{P(\{x_i\} | q) P(q)}{\displaystyle\int_0^1 P(\{x_i\} | q) P(q) dq} = (n+1)\binom n h q^h (1-q)^{n-h}$$

  1. We define in Python the posterior distribution according to the mathematical formula above. We remark this this expression is \((n+1)\) times the probability mass function (PMF) of the binomial distribution, which is directly available in scipy.stats.
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
posterior = lambda n, h, q: (n+1) * st.binom(n, q).pmf(h)
  1. Let's plot this distribution for an observation of \(h=61\) heads and \(n=100\) total flips.
n = 100
h = 61
q = np.linspace(0., 1., 1000)
d = posterior(n, h, q)
plt.plot(q, d, '-k')
plt.xlabel('q parameter')
plt.ylabel('Posterior distribution')
plt.ylim(0, d.max()+1)

This curve represents our belief about the parameter \(q\) after we have observed 61 heads.

How it works...

In this section, we explain Bayes' theorem, and we give the mathematical details underlying this example.

Bayes' theorem

There is a very general idea in data science that consists of explaining data with a mathematical model. This is formalized with a one-way process model → data.

Once this process is formalized, the task of the data scientist is to exploit the data to recover information about the model. In other words, we want to invert the original process and get data → model.

In a probabilistic setting, the direct process is represented as a conditional probability distribution \(P(\textrm{data} \mid \textrm{model})\). This is the probability of observing the data when the model is entirely specified.

Similarly, the inverse process is \(P(\textrm{model} \mid \textrm{data})\). It gives us information about the model (what we're looking for), knowing the observations (what we have).

Bayes' theorem is at the core of a general framework for inverting a probabilistic process model → data. It can be stated as follows:

$$P\left(\textrm{model} \mid \textrm{data}\right) = \frac{P\left(\textrm{data} \mid \textrm{model}\right) \, P\left(\textrm{model}\right)}{P\left(\textrm{data}\right)}$$

This equation gives us information about our model, knowing the observed data. Bayes' equation is widely used in signal processing, statistics, machine learning, inverse problems, and in many other scientific applications.

In Bayes' equation, \(P(\textrm{model})\) reflects our prior knowledge about the model. Also, \(P(\textrm{data})\) is the distribution of the data. It is generally expressed as an integral of \(P(\textrm{data} \mid \textrm{model}) \, P(\textrm{model})\).

In conclusion, Bayes' equation gives us a general roadmap for data inference:

  1. Specify a mathematical model for the direct process model → data (\(P(\textrm{data} \mid \textrm{model})\) term).
  2. Specify a prior for the model (\(P(\textrm{model})\) term).
  3. Perform analytical or numerical calculations for solving this equation.

Computation of the posterior distribution

In this recipe's example, we found the posterior distribution with the following equation (deriving directly from Bayes' theorem):

$$P\left(q \mid \{x_i\}\right) = \frac{P\left(\{x_i\} \mid q\right) P(q)}{\displaystyle\int_0^1 P\left(\{x_i\} \mid q\right) \, P(q) \, dq}$$

Knowing that the \(x_i\) are independent, we get (\(h\) being the number of heads):

$$P\left(\{x_i\} \mid q \right) = \prod_{i=1}^n P\left(x_i \mid q\right) = q^h (1-q)^{n-h}$$

In addition, we can compute analytically the following integral (using an integration by parts and an induction):

$$\int_0^1 P\left(\{x_i\} \mid q \right) P(q) \, dq = \int_0^1 q^h (1-q)^{n-h} dq = \frac{1}{\displaystyle(n+1)\binom n h}$$

Finally, we get:

$$P\left(q \mid \{x_i\} \right) = \frac{P\left(\{x_i\} \mid q\right) P(q)}{\displaystyle\int_0^1 P\left(\{x_i\} \mid q \right) P(q) \, dq} = (n+1)\binom n h q^h (1-q)^{n-h}$$

Maximum a posteriori estimation

We can get a point estimate from the posterior distribution. For example, the maximum a posteriori (MAP) estimation consists of considering the maximum of this distribution as an estimate for \(q\). We can find this maximum analytically or numerically.

Here, we can get this estimate analytically by deriving the posterior distribution with respect to \(q\). We get (assuming \(1 \leq h \leq n-1\)):

$$\frac{d P\left(q \mid \{x_i\}\right)}{dq} = (n+1)\frac{n!}{(n-h)! \, h!} \left( h q^{h-1} (1-q)^{n-h} - (n-h)q^h (1-q)^{n-h-1} \right)$$

This expression is equal to zero when \(q = h/n\). This is the MAP estimate of the parameter \(q\). This value happens to be the proportion of heads obtained in the experiment.

There's more...

In this recipe, we showed a few basic notions in Bayesian theory. We illustrated them with a simple example. The fact that we were able to derive the posterior distribution analytically is not very common in real-world applications. This example is nevertheless informative because it explains the core mathematical ideas behind the complex numerical methods we will see later.

Credible interval

The posterior distribution indicates the plausible values for \(q\) given the observations. We could use it to derive a credible interval, likely to contain the actual value. Credible intervals are the Bayesian analog to confidence intervals in frequentist statistics.

Conjugate distributions

In this recipe, the prior and posterior distributions are conjugate, meaning that they belong to the same family (the beta distribution). For this reason, we were able to compute the posterior distribution analytically.

Non-informative priors

We chose a uniform distribution as prior distribution for the unknown parameter \(q\). It is a simple choice and it leads to tractable computations. It reflects the intuitive fact that we do not favor any particular value a priori. However, there are rigorous ways of choosing completely non-informative priors. An example is the Jeffreys prior, based on the idea that the prior distribution should not depend on the parametrization of the parameters. In our example, the Jeffreys prior is:

$$P(q) = \frac{1}{\sqrt{q(1-q)}}$$

What's next

The rest of the chapter contains the following recipes:

You'll find the rest of the chapter in the full version of the IPython Cookbook, by Cyrille Rossant, Packt Publishing, 2014.