Friday, December 13, 2019

Data Imputation

Imputation is a cleaning process that lets you keep valuable data in your DataFrames even when some values are null. This matters when dropping every row that contains a null value would remove a large share of your dataset. Instead of discarding those rows, you can fill each null value with the mean or median of its column.
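
As a rough illustration of that trade-off, the sketch below compares dropping rows with filling in the column median. The DataFrame name toy_df, the club labels, and the numbers are made up for illustration; only the earnings_billions column name comes from the example in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two of five clubs have no recorded earnings.
toy_df = pd.DataFrame({
    "club": ["A", "B", "C", "D", "E"],
    "earnings_billions": [1.0, np.nan, 3.0, np.nan, 8.0],
})

# Option 1: drop every row that contains a null value.
dropped = toy_df.dropna()

# Option 2: keep every row and fill the gaps with the column median.
# median() skips null values by default.
median_value = toy_df["earnings_billions"].median()
imputed = toy_df.assign(
    earnings_billions=toy_df["earnings_billions"].fillna(median_value)
)

# Compare how many rows survive each approach.
print(len(dropped), len(imputed))
```

Dropping keeps only the complete rows, while imputing keeps every club at the cost of replacing the unknown values with an estimate.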

Using the example above, assume a new column, earnings_billions, that holds each club's earnings from gate receipts over the season, and that some values in that column are missing. To begin, extract the column and store it in a variable, as shown below:


earnings = squad_df['earnings_billions']

Take note that when you select a column from a DataFrame this way, you enclose its name in square brackets, as shown above. To handle the missing values, we can use the mean as follows:

earnings_mean = earnings.mean()
earnings_mean

The output is the mean of the non-null values in the column, since mean() skips null values by default. Once you have it, use fillna() to substitute it for the null values, as shown below:

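earnings.fillna(earnings_mean, inplace=True)

This replaces all the null values in the earnings column with the mean of that column. The argument inplace=True modifies earnings directly instead of returning a new Series. Whether that change also shows up in the original squad_df depends on your pandas version: with copy-on-write enabled (the default from pandas 3.0), earnings behaves as a copy, so you need to assign the result back with squad_df['earnings_billions'] = earnings.fillna(earnings_mean).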
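
One way to make sure the filled values land in squad_df itself is to assign the filled column back to the DataFrame. The following is a minimal sketch of the steps above under that assumption; the small squad_df built here is a made-up stand-in, and its club labels and numbers are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for squad_df; the values are invented.
squad_df = pd.DataFrame({
    "club": ["A", "B", "C", "D"],
    "earnings_billions": [2.0, np.nan, 4.0, 6.0],
})

# Extract the column and compute the mean of its non-null values.
earnings = squad_df["earnings_billions"]
earnings_mean = earnings.mean()

# Assign the filled column back so squad_df is updated whether the
# selected column behaves as a view or as a copy.
squad_df["earnings_billions"] = earnings.fillna(earnings_mean)

# Count the null values that remain in the column.
print(squad_df["earnings_billions"].isnull().sum())
```

Assigning back avoids relying on inplace=True reaching through to the parent DataFrame, which is the part that varies between pandas versions.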