Tuesday, January 4, 2022

Statistical foundations for data analysis

When we want to make observations about the data we are analyzing, we almost always turn to statistics in some fashion. The data we have is referred to as the sample, which was observed from (and is a subset of) the population.

Two broad categories of statistics are descriptive and inferential statistics. With descriptive statistics, as the name implies, we are looking to describe the sample. Inferential statistics involves using the sample statistics to infer, or deduce, something about the population, such as the underlying distribution.

Often, the goal of an analysis is to create a story for the data; unfortunately, it is very easy to misuse statistics. This is especially true of inferential statistics, which is used in many scientific studies and papers to show the significance of the researchers' findings. We will focus on descriptive statistics to help explain the data we are analyzing.

There's an important thing to remember before we attempt any analysis: our sample must be a random sample that is representative of the population. This means that the data must be sampled without bias (for example, if we are asking people whether they like a certain sports team, we can't only ask fans of the team) and that we should have (ideally) members of all distinct groups from the population in our sample (in the sports team example, we can't just ask men).

When we discuss machine learning in the future, we will need to sample our data, which is itself a sample to begin with. This is called resampling. The right sampling method depends on the data. Often, our best bet is a simple random sample: we use a random number generator to pick rows at random. When we have distinct groups in the data, we want our sample to be a stratified random sample, which will preserve the proportion of the groups in the data.
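To make the difference between the two strategies concrete, here is a minimal sketch using only Python's standard library. The rows, the "group" column, and the stratified_sample helper are all made up for illustration; they are assumptions, not part of any particular library.

```python
import random
from collections import defaultdict

# Hypothetical data: 100 rows, 80 in group "A" and 20 in group "B".
rows = [{"id": i, "group": "A" if i < 80 else "B"} for i in range(100)]

rng = random.Random(0)  # seeded so the sketch is reproducible

# Simple random sample: pick 10 rows at random, without replacement.
simple_sample = rng.sample(rows, k=10)


def stratified_sample(rows, key, fraction, rng):
    """Sample the same fraction from every group (illustrative helper)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    sample = []
    for members in groups.values():
        # Rounding can shift the proportions slightly for small groups.
        k = round(len(members) * fraction)
        sample.extend(rng.sample(members, k))
    return sample


# Stratified random sample: 10% of each group, so the 80/20 split is kept.
stratified = stratified_sample(rows, key="group", fraction=0.1, rng=rng)
```

The simple random sample gives every row the same chance of being picked, so the share of each group in it can drift from 80/20 by chance. The stratified version samples within each group separately, which is what keeps the proportions fixed.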

In some cases, we don't have enough data for the aforementioned sampling strategies, so we may turn to random sampling with replacement (bootstrapping); the result is called a bootstrap sample. Note that our underlying sample needs to have been a random sample or we risk increasing the bias of the estimator: if it was a convenience sample, certain rows could be picked more often simply because they are overrepresented in the data, even though they aren't as prevalent in the true population.
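As a rough sketch of what sampling with replacement looks like in practice, the snippet below draws bootstrap samples from a small, made-up list of observations and recomputes the mean on each one. The data values and the bootstrap_sample helper are assumptions for illustration only.

```python
import random
import statistics

rng = random.Random(0)  # seeded so the sketch is reproducible

# Hypothetical small sample of observations.
data = [2.1, 3.4, 1.9, 5.6, 4.2, 3.8, 2.7, 4.9]


def bootstrap_sample(data, rng):
    """Draw len(data) values from data at random, with replacement."""
    return rng.choices(data, k=len(data))


# Recompute the mean on many bootstrap samples to see how much it varies.
boot_means = [statistics.mean(bootstrap_sample(data, rng)) for _ in range(1000)]
```

Because the draws are made with replacement, a single bootstrap sample can contain some observations several times and others not at all. This is also why the quality of the original sample matters: values that are overrepresented in the data will be overrepresented in the bootstrap samples too.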

We will continue our discussion of descriptive statistics in the next post.
