Thursday, December 19, 2019

How to Clean Data

Having gone through the procedures described in the previous post and identified unclean data, your next challenge is to clean it so that your analysis rests on accurate data.


You have five possible alternatives for handling such a situation:

● Data imputation

If you are unable to find the correct values, you can impute them by filling in the gaps left by missing or inaccurate entries. Imputation is essentially an educated guess at the missing values, made through a data-driven, systematic procedure. Techniques you can use include stratification and statistical indicators such as the mode, mean and median. If you have studied the data and identified clear patterns, you can impute missing values based on the strata they fall into. For example, men are generally taller than women, so a missing height can be filled in using the average height of the corresponding group in the data you already have.

The most important thing, however, is to seek a second opinion on the data before imputing new values. Some datasets are very sensitive, and imputation can introduce personal bias that eventually affects the outcome.
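As an illustration, here is a minimal sketch of simple and stratified imputation using pandas; the DataFrame and its column names ("sex", "height_cm") are hypothetical examples, not drawn from any particular dataset.

import pandas as pd
import numpy as np

# Hypothetical sample with two missing heights.
df = pd.DataFrame({
    "sex": ["M", "F", "M", "F", "M"],
    "height_cm": [178.0, 162.0, np.nan, 165.0, np.nan],
})

# Simple imputation: replace every missing height with the overall mean.
df["height_simple"] = df["height_cm"].fillna(df["height_cm"].mean())

# Stratified imputation: fill each missing height with the mean for that
# person's group, using the pattern that heights differ between groups.
group_means = df.groupby("sex")["height_cm"].transform("mean")
df["height_stratified"] = df["height_cm"].fillna(group_means)

print(df)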

● Data scaling

Data scaling is the process of adjusting the range of your data so that variables sit within comparable ranges. Without it, some algorithms will give undue prominence to variables that simply have larger values. For example, the ages of a sample population fall within a much smaller range than the populations of cities. Some algorithms will give the population variable priority over age, and might ignore the age variable altogether.

By scaling such entries, you maintain a proportional relationship between the variables while keeping them within a similar range. A simple way of doing this is to express large values relative to a baseline, or to convert the variables to percentages.
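If you work in Python, one common way to achieve this is min-max scaling, which maps every variable onto the same 0 to 1 range. The sketch below assumes a hypothetical DataFrame with "age" and "city_population" columns.

import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 47, 61],
    "city_population": [120_000, 850_000, 2_300_000, 95_000],
})

def min_max_scale(column):
    # Map the smallest value in the column to 0 and the largest to 1.
    return (column - column.min()) / (column.max() - column.min())

scaled = df.apply(min_max_scale)
print(scaled)

After scaling, a large population and a small age are compared on the same footing, so neither variable dominates simply because of its units.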

● Correcting data


Correcting data is a far better alternative than removing it. It relies on intuition and clarification. If you are concerned about the accuracy of some entries, going back to the source for clarification can resolve your doubts. With the new information, you can fix the problems you identified and carry out your analysis on data you are confident about.

● Data removal

One of the first options you might consider is eliminating the missing entries from your dataset. Before you do this, it is advisable to investigate why the entries are missing. In some cases, the best option is to remove the data from your analysis altogether: if, for example, more than 80% of the entries in a row are missing and you cannot replace them from any other source, that row will not be useful to your analysis, and it makes sense to remove it.

Data removal comes with caveats. If you eliminate any data from your analysis, you must give a reason for the decision in a report accompanying your analysis; this safeguards you from claims of manipulating or doctoring data to suit a narrative. Some types of data are irreplaceable, so consult experts in the relevant fields before removing them. Most often, data removal is applied when you identify duplicates, especially where removing the duplicates does not affect the outcome of your analysis.
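A minimal sketch of both cases with pandas might look like the following; the column names and the exact 80% threshold are illustrative assumptions.

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "id":     [1.0, np.nan, 3.0, 3.0],
    "score":  [88.0, np.nan, 75.0, 75.0],
    "city":   ["Nairobi", np.nan, "Kisumu", "Kisumu"],
    "age":    [34.0, np.nan, 29.0, 29.0],
    "income": [52_000.0, np.nan, 48_000.0, 48_000.0],
})

# Drop rows where more than 80% of the fields are missing:
# keep a row only if at least 20% of its values are present.
n_cols = df.shape[1]
df = df.dropna(thresh=int(np.ceil(0.2 * n_cols)))

# Remove exact duplicate rows, keeping the first occurrence.
df = df.drop_duplicates()

print(df)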

● Flagging data

There are situations where columns are missing some values, but you cannot afford to eliminate them all. If you are working with numeric data, one option is to introduce a new column that flags all the missing values, so that the algorithm you are using can identify them as such. If the flagged values are needed in your analysis, you can impute them or find a better way to correct them before using them. If that is not possible, make sure you highlight it in your report.
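For instance, a simple way to flag missing values in pandas is to add an indicator column before imputing; the "income" column below is a hypothetical example.

import pandas as pd
import numpy as np

df = pd.DataFrame({"income": [52_000.0, np.nan, 61_000.0, np.nan]})

# Flag column: 1 where the value was missing, 0 where it was observed.
df["income_missing"] = df["income"].isna().astype(int)

# The gap itself can then be imputed (here with the median), while the flag
# preserves the fact that the value was originally absent.
df["income"] = df["income"].fillna(df["income"].median())

print(df)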

Cleaning erroneous data can be a difficult process. Many data scientists would rather avoid it, especially since it is time-consuming. However, it is a necessary step that brings you closer to using appropriate data for your analysis. Remember that the main objective is to use clean data that gives you the closest reflection of the true picture of events.
