Thursday, December 12, 2019

Computation with Missing Values

One thing you can be certain about as a data analyst is that you will not always work with complete datasets. Data is collected by different people, who rarely follow the same conventions you would prefer, so you should expect to run into missing values regularly.

In Python, missing data typically shows up as None, or as np.nan from NumPy. Since you still need to proceed with your analysis, you must learn how to handle these cases. You have two options: replace the null values with non-null values, or drop the rows and columns that contain them.
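The examples below work on a DataFrame called squad_df, which is not defined in this post. If you want to follow along, here is a minimal, hypothetical stand-in with a few missing values (the names and numbers are made up purely for illustration):

import numpy as np
import pandas as pd

# Hypothetical stand-in for squad_df: a small squad sheet with missing values.
squad_df = pd.DataFrame({
    "player": ["Ann", "Ben", None, "Dee", "Eli"],
    "goals": [3, np.nan, 7, np.nan, 2],
})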

First, determine how many null values are present in each column of your dataset. You can do this with the syntax below:

squad_df.isnull()

The result is a DataFrame with True or False in each cell, indicating whether that cell is null. From here, you can count the number of null values in every column with an aggregate sum, as shown below:

squad_df.isnull().sum()

The result lists every column together with its number of null values. Be careful when deleting null values from your data: only do so if you understand why the values are missing, and only if the amount missing is small enough that removing it will not have a noteworthy effect on your analysis. The following syntax removes null data from your work:

squad_df.dropna()

The syntax above removes every row that contains at least one null value. Note that it returns a new DataFrame and leaves the original DataFrame unchanged.
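Because dropna() returns a new DataFrame rather than modifying squad_df in place, you need to capture the result if you want to keep it. A minimal sketch:

squad_no_nulls = squad_df.dropna()            # new DataFrame containing only complete rows
print(squad_df.shape, squad_no_nulls.shape)   # the original shape is unchanged

If you want the change to stick, assign the result back to squad_df (or pass inplace=True).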

The problem with this operation is that it discards entire rows, even though the other columns in those rows might still hold useful information. To circumvent this, we must learn how to perform imputation on such datasets.
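As a taste of imputation, one common approach is to fill numeric nulls with a summary statistic such as the column mean using fillna(). This is only a sketch, assuming the goals column from the stand-in DataFrame above:

goals_mean = squad_df["goals"].mean()                     # mean() skips NaN values by default
squad_df["goals"] = squad_df["goals"].fillna(goals_mean)  # replace nulls with the column mean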

Instead of dropping rows, you can also choose to drop the columns that contain null values. This is done by passing the axis=1 argument, as shown below:

squad_df.dropna(axis=1)

What is the explanation behind the axis=1 argument? Why does it have to be 1 in order to work on columns? To understand this, take a closer look at the .shape output discussed earlier.

squad_df.shape

Output:
(20, 2)

In the example above, the shape is returned as a tuple: 20 rows and 2 columns. In this tuple, rows sit at index zero and columns at index one, which is why axis=1 operates on columns.
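To tie this together, you can compare the shapes before and after dropping along each axis. A quick sketch (the exact numbers depend on your own squad_df):

print(squad_df.shape)                 # (rows, columns), e.g. (20, 2) for the post's squad_df
print(squad_df.dropna(axis=0).shape)  # axis=0 (the default): drop rows containing nulls
print(squad_df.dropna(axis=1).shape)  # axis=1: drop columns containing nulls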
