Wednesday, December 18, 2019

Data Cleaning

Data cleaning is one of the most important procedures you should learn in data analysis. You will constantly work with different sets of data, and their accuracy or completeness is never guaranteed. For this reason, you should learn how to handle such data and make sure any gaps or errors present do not affect the final outcome.


Why should you clean data, especially if you did not produce it in the first place?

Using unclean data is a sure way to get poor results. You might be using a very powerful computer capable of performing calculations at very high speed, but computers lack intuition. Without it, someone must make a judgement call each time a set of data is processed, and that someone is you. In data analysis, your final presentation should reflect the reality in the data you use. For this reason, you must eliminate any erroneous entries.

Possible Causes of Unclean Data

One of the most expensive overheads in many organizations is data cleaning. Unclean data is present in different forms. Your company might suffer from omissions and errors in the master data you rely on for analytical purposes. Since this data is used in important decision-making processes, the effects are costly. By understanding the different ways dirty data finds its way into your organization, you can find ways of preventing it, thereby improving the quality of the data you use.

In most instances, automation is applied in data collection. Because of this, you might experience challenges with the quality or consistency of the data collected. Since data is often obtained from different sources, it must be collated into one file before processing, and it is during this process that concerns about the integrity of the data tend to arise. The following are some explanations as to why you have unclean data:

● Incomplete data

The problem of incomplete data is very common in most organizations. When using incomplete data, you end up with many important parts of the dataset blank. For example, if you have not yet categorized your customers by target industry, it is impossible to segment your sales report by industry classification. An important part of your data analysis will be missing, so your efforts will either be futile or expensive in terms of the time and resources invested before you obtain complete and appropriate data.
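As a rough illustration, here is a minimal pandas sketch (the column names and values are hypothetical) that counts the blanks in each column and flags the customers that cannot yet be segmented by industry:

import pandas as pd

# Hypothetical sales data; the "industry" column is only partially filled in.
customers = pd.DataFrame({
    "customer": ["Acme", "Globex", "Initech", "Umbrella"],
    "industry": ["Manufacturing", None, "Software", None],
    "revenue":  [120000, 85000, 97000, 40000],
})

# Count the blanks per column to see how incomplete the data is.
print(customers.isna().sum())

# Flag the rows that cannot be segmented by industry yet.
missing_industry = customers[customers["industry"].isna()]
print(missing_industry)

Surfacing the gaps first tells you whether to chase the missing values or drop that segment from the report.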

● Errors at input

Most of the mistakes that lead to erroneous data happen at data entry points. The individual in charge might enter the wrong data, use the wrong formula, misread the data, or simply mistype an entry. In open-ended instruments like questionnaires, respondents might introduce typos or use words and phrases that computers cannot decipher appropriately. Human error at input points is always the biggest challenge to data accuracy.
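A small sketch of the kind of defensive check this calls for, assuming hypothetical questionnaire responses with a known expected vocabulary:

import pandas as pd

# Hypothetical questionnaire responses with typos and inconsistent spellings.
responses = pd.Series(["Yes", "yes ", "YSE", "no", "N0", "No"])

# Basic normalization catches casing and whitespace slips at the entry point.
cleaned = responses.str.strip().str.lower()

# Anything outside the expected vocabulary is flagged for manual review
# rather than silently kept.
valid = {"yes", "no"}
suspect = cleaned[~cleaned.isin(valid)]
print(suspect)   # "yse" and "n0" need a human judgement call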

● Data inaccuracies

Inaccurate data is, in most cases, a matter of context. You could have the correct data, but for the wrong purpose. Using such data can have far-reaching effects, most of which are very costly in the long run. Think of a data analyst preparing a delivery schedule for clients where the addresses are inaccurate: the company ends up delivering products to its customers, but with the wrong address details. In context, the company does have the correct addresses for its clients; they are simply not matched correctly.
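One way to catch such mismatches before they become deliveries is to validate the join between clients and addresses. A sketch using pandas merge validation, with hypothetical tables:

import pandas as pd

clients = pd.DataFrame({"client_id": [1, 2, 3],
                        "name": ["Acme", "Globex", "Initech"]})
addresses = pd.DataFrame({"client_id": [1, 2, 2, 3],
                          "address": ["5 Main St", "9 Oak Ave",
                                      "9 Oak Avenue", "22 Pine Rd"]})

# validate="one_to_one" raises a MergeError if any client maps to more
# than one address, so mismatches surface before the schedule goes out.
try:
    schedule = clients.merge(addresses, on="client_id", validate="one_to_one")
except pd.errors.MergeError as err:
    print(f"Address data needs cleaning first: {err}")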


● Duplicate data

In cases where you collect data from different sources, there is always a high chance of data duplication, so you must have checks in place to ensure that duplicates are identified. For example, one report might list student scores under Results, while another lists them under Performance. The data under these tags will be the same, but your systems will treat them as two independent entities.
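A minimal sketch of one such check, assuming two hypothetical reports that use the Results and Performance tags: map both onto one canonical column before combining, so true duplicates become visible:

import pandas as pd

# Two hypothetical exports of the same scores under different headings.
report_a = pd.DataFrame({"student": ["Ann", "Ben"], "Results": [72, 65]})
report_b = pd.DataFrame({"student": ["Ann", "Cleo"], "Performance": [72, 88]})

# Map both tags onto one canonical column name before combining.
report_a = report_a.rename(columns={"Results": "score"})
report_b = report_b.rename(columns={"Performance": "score"})

combined = pd.concat([report_a, report_b], ignore_index=True)

# With a shared schema, the duplicated Ann row can now be detected and dropped.
combined = combined.drop_duplicates(subset=["student", "score"])
print(combined)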

● Problematic sensors

Unless you are using a machine that periodically checks for errors and corrects them or alerts you, it is possible to encounter errors as a result of problematic sensors. Machines can be faulty or break down too, which increases the likelihood of problematic data entries.

● Incorrect data entries

An incorrect entry will always deliver the wrong result. Incorrect entries happen when your dataset includes values that are not within the acceptable range. For example, the day of the month in February should range from 1 to 28, or 29 in a leap year. If you have February data with days running up to 31, there is definitely an error in your entries.
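A simple range check catches this class of error. A sketch with hypothetical February readings:

import pandas as pd

# Hypothetical daily readings; a "day" of 31 cannot exist in February.
feb = pd.DataFrame({"day": [27, 28, 29, 31],
                    "reading": [10.2, 9.8, 11.1, 8.4]})

# 29 is only valid in a leap year; adjust the upper bound accordingly.
year = 2020
max_day = 29 if (year % 4 == 0 and (year % 100 != 0 or year % 400 == 0)) else 28

out_of_range = feb[(feb["day"] < 1) | (feb["day"] > max_day)]
print(out_of_range)   # the day-31 row is flagged as an impossible entry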

● Data mangling

If the machine at your data entry point has problematic sensors, it is possible to record erroneous values: you might be recording people's ages, and the machine inputs a negative figure. In other cases, the machine records correct data, but between the input point and the data collection point the data gets mangled, hence the erroneous results. If you are transferring data over a public internet connection, a network outage during transmission might also affect the integrity of the data.
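Two lightweight defences help here: a sanity check on the recorded values, and a checksum computed at the input point and verified at the collection point. A sketch with hypothetical ages and payloads:

import hashlib

# Sanity check: ages below 0 or implausibly high point to a faulty sensor
# or mangled transmission, not a real observation.
ages = [34, 27, -3, 41, 250]
suspect = [a for a in ages if not 0 <= a <= 120]
print(suspect)   # [-3, 250]

# A checksum computed before transmission and re-checked on arrival
# detects corruption introduced in transit.
payload = b"34,27,41"
digest_sent = hashlib.sha256(payload).hexdigest()

received = b"34,27,41"   # whatever actually arrives at the collection point
digest_received = hashlib.sha256(received).hexdigest()
print(digest_sent == digest_received)   # False would mean the data was mangled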

● Standardization concerns

For data obtained from different sources, one of the main concerns is how to standardize it. You should have a system or method in place to identify similar data and represent it consistently. Unfortunately, this level of standardization is not easy to manage, and as a result you end up with erroneous entries. Even data obtained from a single source can pose challenges: everyone inputs data in their own way, and this can cause problems during analysis.
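As an illustration, here is a sketch that normalizes the same date written three different ways by hypothetical sources into one canonical representation:

import pandas as pd

# Hypothetical exports with the same date written three different ways.
raw = pd.Series(["2019-12-18", "18/12/2019", "Dec 18, 2019"])

# Try each known format in turn, so every source ends up in one
# canonical representation; unparseable entries stay marked for review.
def standardize(value):
    for fmt in ("%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y"):
        try:
            return pd.to_datetime(value, format=fmt)
        except ValueError:
            continue
    return pd.NaT

print(raw.map(standardize))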
