Multivariate statistics ~ Python is easy to learn

With multivariate statistics, we seek to quantify relationships between variables and attempt to make predictions for future behavior. The covariance is a statistic for quantifying the relationship between variables by showing how one variable changes with respect to another (also referred to as their joint variance):

E[X] is a new notation for us. It is read as the expected value of X or the expectation of X, and it is calculated by summing all the possible values of X multiplied by their probability—it's the long-run average of X.

The magnitude of the covariance isn't easy to interpret, but its sign tells us whether the variables are positively or negatively correlated. However, we would also like to quantify how strong the relationship is between the variables, which brings us to correlation. Correlation tells us how variables change together both in direction (same or opposite) and magnitude (strength of the relationship). To find the correlation, we calculate the Pearson correlation coefficient, symbolized by ρ (the Greek letter rho), by dividing the covariance by the product of the standard deviations of the variables:

This normalizes the covariance and results in a statistic bounded between -1 and 1, making it easy to describe both the direction of the correlation (sign) and the strength of it (magnitude). Correlations of 1 are said to be perfect positive (linear) correlations, while those of -1 are perfect negative correlations. Values near 0 aren't correlated. If correlation coefficients are near 1 in absolute value, then the variables are said to be strongly correlated; those closer to 0.5 are said to be weakly correlated.

Let's look at some examples using scatter plots. In the leftmost subplot of Figure 1.12 (ρ = 0.11), we see that there is no correlation between the variables: they appear to be random noise with no pattern. The next plot with ρ = -0.52 has a weak negative correlation: we can see that the variables appear to move together with the x variable increasing, while the y variable decreases, but there is still a bit of randomness. In the third plot from the left (ρ = 0.87), there is a strong positive correlation: x and y are increasing together. The rightmost plot with ρ = -0.99 has a near-perfect negative correlation: as x increases, y decreases. We can also see how the points form a line:

To quickly eyeball the strength and direction of the relationship between two variables (and see whether there even seems to be one), we will often use scatter plots rather than calculating the exact correlation coefficient. This is for a couple of reasons:

1. It's easier to find patterns in visualizations, but it's more work to arrive at the same conclusion by looking at numbers and tables.

2. We might see that the variables seem related, but they may not be linearly related. Looking at a visual representation will make it easy to see if our data is actually quadratic, exponential, logarithmic, or some other non-linear function.

Both of the following plots depict data with strong positive correlations, but it's pretty obvious when looking at the scatter plots that these are not linear. The one on the left is logarithmic, while the one on the right is exponential:

It's very important to remember that while we may find a correlation between X and Y, it doesn't mean that X causes Y or that Y causes X. There could be some Z that actually causes both; perhaps X causes some intermediary event that causes Y, or it is actually just a coincidence. Keep in mind that we often don't have enough information to report causation—correlation does not imply causation.

Next we will take up pitfalls of summary statistics. See you in the next post

Python is easy to learn

Sunday, January 16, 2022

Multivariate statistics

0 comments:

Post a Comment