Let us begin our discussion of descriptive statistics with univariate statistics; univariate simply means that these statistics are calculated from one (uni) variable. Everything in this section can be extended to the whole dataset, but the statistics will be calculated per variable we are recording (meaning that if we had 100 observations of speed and distance pairs, we could calculate the averages across the dataset, which would give us the average speed and average distance statistics).
Descriptive statistics are used to describe and/or summarize the data we are working with. We can start our summarization of the data with a measure of central tendency, which describes where most of the data is centered around, and a measure of spread or dispersion, which indicates how far apart values are.
Measures of central tendency describe the center of our distribution of data. There are three common statistics that are used as measures of center: mean, median, and mode. Each has its own strengths, depending on the data we are working with.
1. Mean
Perhaps the most common statistic for summarizing data is the average, or mean. The population mean is denoted by μ (the Greek letter mu), and the sample mean is written as x̄ (pronounced X-bar). The sample mean is calculated by summing all the values and dividing by the count of values; for example, the mean of the numbers 0, 1, 1, 2, and 9 is 2.6 ((0 + 1 + 1 + 2 +9)/5):
We use x to represent the i observation of the variable X. Note how the variable as a whole is represented with a capital letter, while the specific observation is lowercase. Σ (the Greek capital letter sigma) is used to represent a summation, which, in the equation for the mean, goes from 1 to n, which is
the number of observations.
One important thing to note about the mean is that it is very sensitive to outliers (values created by a different generative process than our distribution). In the previous example, we were dealing with only five values; nevertheless, the 9 is much larger than the other numbers and pulled the mean higher than all but the 9. In cases where we suspect outliers to be present in our data, we may want to instead use the median as our measure of central tendency.
Median and mode will be topic of discussion of our next post
0 comments:
Post a Comment