Friday, January 14, 2022

Kernel density estimate

KDEs are similar to histograms, except rather than creating bins for the data, they draw a smoothed curve, which is an estimate of the distribution's probability density function (PDF). The PDF is for continuous variables and tells us how probability is distributed over the values. Higher values for the PDF indicate higher likelihoods:


When the distribution starts to get a little lopsided with long tails on one side, the mean measure of center can easily get pulled to that side. Distributions that aren't symmetric have some skew to them. A left (negative) skewed distribution has a long tail on the left-hand side; a right (positive) skewed distribution has a long tail on the right-hand side. In the presence of negative skew, the mean will be less than the median, while the reverse happens with a positive skew. When there is no skew, both will be equal:


There is also another statistic called kurtosis, which compares the density of the center of the distribution with the density at the tails. Both skewness and kurtosis can be calculated with the SciPy package.

Each column in our data is a random variable, because every time we observe it, we get a value according to the underlying distribution—it's not static. When we are interested in the probability of getting a value of x or less, we use the cumulative distribution function (CDF), which is the integral (area under the curve) of the PDF:



The probability of the random variable X being less than or equal to the specific value of x is denoted as P(X ≤ x). With a continuous variable, the probability of getting exactly x is 0. This is because the probability will be the integral of the PDF from x to x (area under a curve with zero width), which is 0:

In order to visualize this, we can find an estimate of the CDF from the sample, called the empirical cumulative distribution function (ECDF). Since this is cumulative, at the point where the value on the x-axis is equal to x, the y value is the cumulative probability of P(X ≤ x). Let's visualize P(X ≤ 50), P(X = 50), and P(X > 50) as an example:


In addition to examining the distribution of our data, we may find the need to utilize probability distributions for uses such as simulation or hypothesis testing. In the next posts we will take a look at a few distributions that we are likely to come across.

Share:

0 comments:

Post a Comment