Tuesday, December 10, 2019

Dealing with Duplicates

The example we used in the previous post did not contain any duplicate rows, but real datasets often do, so we need to know how to identify and remove duplicates to keep our computations accurate. To create some duplicates to work with, we can append the squad DataFrame from the previous post to itself, doubling it as shown:

# Note: DataFrame.append() was removed in pandas 2.0; there, use pd.concat([squad_df, squad_df])
temp_df = squad_df.append(squad_df)
temp_df.shape

Our output will be as follows:

(40, 2)

The append() method returns a copy of the data without altering the original DataFrame. Because we do not want to touch the real data in this example, we store the result in temp_df. To remove the duplicates, we can use the following method:

temp_df = temp_df.drop_duplicates()
temp_df.shape

Our output will be as follows:

(20, 2)

Like append(), the drop_duplicates() method returns a copy of the DataFrame rather than modifying it. Instead of doubling the DataFrame, however, it returns a fresh copy with the duplicates removed. Calling .shape confirms that we are back to the 20 rows present in the original data.
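To make the copy behaviour concrete, here is a minimal sketch. The three-row squad_df below is a hypothetical stand-in for the DataFrame from the previous post, and pd.concat() is used in place of append() so that the snippet also runs on pandas 2.0 and later.

```python
import pandas as pd

# Hypothetical stand-in for squad_df (the real one has 20 rows)
squad_df = pd.DataFrame({"player": ["Ann", "Ben", "Cal"],
                         "number": [7, 9, 11]})

# Double the DataFrame; squad_df itself is left untouched
temp_df = pd.concat([squad_df, squad_df])

# drop_duplicates() returns a new DataFrame and leaves temp_df as it was
deduped = temp_df.drop_duplicates()

print(squad_df.shape)  # (3, 2)
print(temp_df.shape)   # (6, 2)
print(deduped.shape)   # (3, 2)
```

Note that temp_df still holds six rows after the call, which is why the result has to be assigned to a name if we want to keep it.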

In pandas, the inplace keyword argument tells a method to modify the DataFrame object itself instead of returning a new copy, as shown below:

temp_df.drop_duplicates(inplace=True)

The call above changes temp_df directly, so there is no need to reassign the result. The drop_duplicates() method also accepts a keep argument, which takes one of the following values:

● False – drops all duplicates.
● 'last' – drops all duplicates other than the last one.
● 'first' – drops all duplicates other than the first one.

In the examples we used above, the keep argument was not specified. When it is omitted, keep defaults to 'first'. What this means is that if you have two duplicate rows, pandas will keep the first one and drop the second.
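The difference between keeping the first and the last occurrence is easiest to see from the index labels of the surviving rows. The sketch below again assumes a small hypothetical squad_df, and passes ignore_index=True to pd.concat() so that every row in the doubled DataFrame gets a unique label.

```python
import pandas as pd

# Hypothetical stand-in for squad_df
squad_df = pd.DataFrame({"player": ["Ann", "Ben", "Cal"],
                         "number": [7, 9, 11]})

# Rows 0-2 are the originals, rows 3-5 are their duplicates
temp_df = pd.concat([squad_df, squad_df], ignore_index=True)

default_kept = temp_df.drop_duplicates()            # keep not specified
first_kept = temp_df.drop_duplicates(keep="first")  # explicit default
last_kept = temp_df.drop_duplicates(keep="last")

print(default_kept.equals(first_kept))  # True
print(list(first_kept.index))           # [0, 1, 2]
print(list(last_kept.index))            # [3, 4, 5]
```

The row contents are the same in both results; only which copy survives, and therefore which index labels remain, is different.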

If you use 'last', pandas will drop the first row but keep the second one. Using keep=False, however, will eliminate all the duplicates: if two rows are identical, both of them are dropped. Let’s look at an example using temp_df below:

temp_df = squad_df.append(squad_df) # generate a fresh copy
temp_df.drop_duplicates(inplace=True, keep=False)
temp_df.shape

We will have the output below:

(0, 2)

In the above example, we appended the squad DataFrame to itself, so every row had a duplicate. As a result, keep=False eliminated all the rows, leaving us with zero rows. This might sound absurd, but it is actually a useful technique: only rows that appear exactly once survive, so comparing the row count before and after tells you how many rows in your dataset are involved in duplication.
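If the goal is to inspect the duplicated rows rather than drop them, pandas also provides the duplicated() method, which accepts the same keep argument and returns a boolean mask. Here is a minimal sketch with hypothetical data in which only some of the rows are repeated:

```python
import pandas as pd

# Hypothetical data: Ann and Ben appear twice, Cal and Dee appear once
temp_df = pd.DataFrame({
    "player": ["Ann", "Ben", "Cal", "Ann", "Ben", "Dee"],
    "number": [7, 9, 11, 7, 9, 4],
})

# keep=False marks every occurrence of a repeated row as True
mask = temp_df.duplicated(keep=False)
duplicates = temp_df[mask]

# drop_duplicates(keep=False) keeps only the rows that appear exactly once
unique_only = temp_df.drop_duplicates(keep=False)

print(duplicates.shape)                # (4, 2)
print(unique_only["player"].tolist())  # ['Cal', 'Dee']
```

Together the two results split the DataFrame into the rows that are repeated and the rows that are not.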
