Thursday, December 19, 2019

Identify Inaccurate Data

More often than not, you need to make a judgment call to decide whether the data you are working with is accurate.


As you go through data, you must make a logical decision based on what you see. The following are some factors you should think about:

● Study the range

First, check the range of the data. This is usually one of the easiest problems to identify. Let’s say you are working on data for primary school kids. You know the expected age bracket for the students. If you find age entries that are too young or too old for primary school, you need to investigate further.

Essentially, what you are doing here is a max-min check. With the expected range in mind, you can skim through the data and identify erroneous entries. Skimming is easy if you are working with a few entries. If you have thousands or millions of entries, a short function that compares each value against the minimum and maximum can flag the wrong entries in an instant. You can also plot the data on a graph and visually detect the values that don’t fall within the expected distribution pattern.
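The max-min check described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the sample ages, the 5 to 12 bracket, and the helper name `out_of_range` are all assumptions made for the example.

```python
# Hypothetical sample: ages recorded for primary school pupils.
ages = [6, 7, 9, 11, 34, 8, 2, 10]

# Assumed valid bracket; adjust it to the school system in question.
MIN_AGE, MAX_AGE = 5, 12


def out_of_range(values, low, high):
    """Return (index, value) pairs that fall outside the range [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not low <= v <= high]


# A quick look at the extremes already hints at a problem.
print("min:", min(ages), "max:", max(ages))

# The helper pinpoints which entries need investigating.
suspects = out_of_range(ages, MIN_AGE, MAX_AGE)
print("suspect entries:", suspects)
```

Returning the index along with the value makes it easier to go back to the original record and investigate, rather than simply dropping it.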

● Investigate the categories

How many categories of data do you expect? This is another important factor that will help you determine whether your data is accurate or not. If you expect a dataset with nine categories, finding fewer is acceptable, but finding more is not. If you have more than nine categories, you should investigate to determine the legitimacy of the additional ones. Say you are working with data on marital status, and your expected options are single, married, divorced, or widowed. If the data has six categories, you should investigate to determine why there are two more.
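As a rough sketch of the category check, the snippet below compares the labels found in a column against the expected set. The sample records and the helper name `unexpected_categories` are hypothetical; the four expected statuses come from the example above.

```python
from collections import Counter

# Expected options, taken from the marital status example.
EXPECTED_STATUSES = {"single", "married", "divorced", "widowed"}

# Hypothetical column values containing two unexpected labels.
statuses = ["single", "married", "Married", "divorced", "widowed", "n/a", "single"]


def unexpected_categories(values, expected):
    """Return a dict of labels not in `expected`, with how often each occurs."""
    counts = Counter(values)
    return {label: n for label, n in counts.items() if label not in expected}


print("distinct categories found:", len(set(statuses)))
extra = unexpected_categories(statuses, EXPECTED_STATUSES)
print("categories to investigate:", extra)
```

In this made-up sample the two extra categories turn out to be a capitalization variant and a placeholder, which is exactly the kind of thing the investigation is meant to uncover.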

● Data consistency

Look at the data in question and ensure all entries follow the same conventions. In some cases, inaccuracies are the result of inconsistency. This is common when working with percentages, which can be entered on different scales, for example as basis points or as decimals. If a data set mixes both kinds of entries, the values are not comparable until they are converted to a common scale.
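One way to spot mixed scales is a simple heuristic: if a column is supposed to hold fractions between 0 and 1, any value with a magnitude above 1 was probably entered on a larger scale. The sketch below assumes exactly that; the sample rates, the threshold of 1.0, and the helper name `split_by_scale` are illustrative assumptions, and the heuristic cannot catch a small value entered on the larger scale.

```python
# Hypothetical rates: most are decimal fractions, two look like a larger scale.
rates = [0.12, 0.08, 15.0, 0.2, 7.5]


def split_by_scale(values, threshold=1.0):
    """Split values into those within [-threshold, threshold] and the rest.

    Assumption: legitimate decimal fractions never exceed `threshold`.
    """
    fractions = [v for v in values if abs(v) <= threshold]
    larger = [v for v in values if abs(v) > threshold]
    return fractions, larger


fractions, larger = split_by_scale(rates)

# Both groups being non-empty suggests the column mixes two conventions.
mixed = bool(fractions) and bool(larger)
print("likely fractions:", fractions)
print("likely larger scale:", larger)
print("mixed scales suspected:", mixed)
```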

● Inaccuracies across multiple fields

This is perhaps one of the most difficult challenges you will face when cleaning inaccurate data. The following entries, for example, are valid individually. A 4-year-old girl is a valid age entry. 5 children is also a valid entry. However, a data point that depicts Grace as a 4-year-old girl with 5 children is absurd. You need to check for inconsistencies and inaccuracies across several rows and columns, not just within a single field.
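A cross-field check like the one in the Grace example can be expressed as a rule that looks at two columns at once. The sketch below is only an illustration: the extra records, the `MIN_PARENT_AGE` cutoff, and the helper name `implausible_parents` are assumptions for the example, and a real data set would need rules chosen for its own fields.

```python
# Hypothetical records; only Grace comes from the example in the text.
people = [
    {"name": "Grace", "age": 4, "children": 5},
    {"name": "Ann", "age": 41, "children": 2},
    {"name": "Tom", "age": 9, "children": 0},
]

# Assumed cutoff below which having children is treated as implausible.
MIN_PARENT_AGE = 12


def implausible_parents(rows, min_parent_age=MIN_PARENT_AGE):
    """Return rows where each field is valid alone but the combination is not."""
    return [
        row for row in rows
        if row["children"] > 0 and row["age"] < min_parent_age
    ]


flagged = implausible_parents(people)
print("records to investigate:", [row["name"] for row in flagged])
```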

● Data visualization

Plotting data in visual form is one of the easiest ways of identifying abnormal distributions or other errors in the data. Say you are working with data whose visualization should result in a bimodal distribution, but when you plot the data you end up with a normal distribution. This would immediately alert you that something is not right, and that you need to check your data for accuracy.
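Even without a plotting library, a rough text histogram can reveal the shape of a distribution. The sketch below bins hypothetical integer values and prints one bar per bin; the sample values, the bin width of 2, and the helper name `histogram_counts` are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical integer measurements expected to cluster around two values.
values = [1, 2, 2, 3, 2, 1, 8, 9, 9, 10, 9, 8, 3, 9]


def histogram_counts(data, bin_width=2):
    """Return (bin_start, count) pairs, including empty bins in between."""
    bins = Counter((v // bin_width) * bin_width for v in data)
    starts = range(min(bins), max(bins) + bin_width, bin_width)
    return [(start, bins[start]) for start in starts]


counts = histogram_counts(values)

# Print one bar per bin; two separate clusters of bars suggest two modes.
for start, count in counts:
    print(f"{start:>3} to {start + 1:<3} {'#' * count}")
```

Here the bars form two clusters separated by empty bins, which is what a bimodal distribution would look like. A single cluster in data that should have two would be the warning sign described above.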

● Number of errors in your data set

Having identified the distinct errors in the data set, you must count them. The count will help you make a final decision on whether, and how, to use the data. How many errors are there? If more than half of the entries are inaccurate, any presentation built on them will be greatly flawed. In that case, follow up with the individuals who prepared the data for clarification, or find an alternative source.
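Counting errors can be as simple as flagging each entry and computing the share of flagged entries. The sketch below reuses a range rule purely as an example; the sample ages, the 5 to 12 bracket, and the helper name `error_rate` are assumptions, while the one-half threshold mirrors the rule of thumb in the text.

```python
# Hypothetical ages and an assumed valid bracket of 5 to 12.
ages = [6, 7, 9, 11, 34, 8, 2, 10]

# One boolean flag per entry: True means the entry failed the check.
flags = [not 5 <= age <= 12 for age in ages]


def error_rate(error_flags):
    """Return the fraction of entries flagged as errors."""
    if not error_flags:
        raise ValueError("cannot compute an error rate for an empty data set")
    return sum(error_flags) / len(error_flags)


error_count = sum(flags)
rate = error_rate(flags)

# Rule of thumb from the text: more than half inaccurate is too many.
too_many_errors = rate > 0.5
print(f"{error_count} errors out of {len(ages)} entries ({rate:.0%})")
print("follow up with the data owners:", too_many_errors)
```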

● Missing entries

A common concern for data analysts is working with datasets that are missing some entries. How much this matters is relative. If you are missing two or three entries, this should not be a big issue. However, if your data set is missing many entries, you have to find out why. Missing entries usually arise when you are collating data from multiple sources, and in the process some of the data is deleted, overwritten, or skipped. You must investigate the missing entries because the answer will help you determine whether you are missing only a few entries that are insignificant going forward, or important entries whose absence affects the outcome.
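A first step in investigating missing entries is to count them per field. The sketch below treats `None` and the empty string as missing, which is an assumption; real data sets may use other placeholders. The sample rows and the helper name `missing_by_field` are hypothetical.

```python
# Hypothetical rows collated from several sources, with some gaps.
rows = [
    {"name": "Grace", "age": 9, "grade": "4"},
    {"name": "Ann", "age": None, "grade": "5"},
    {"name": "", "age": 10, "grade": None},
    {"name": "Tom", "age": 11},  # the "grade" key is absent altogether
]


def missing_by_field(records, fields, missing=(None, "")):
    """Count, per field, how many records have a missing value or no key."""
    return {
        field: sum(1 for record in records if record.get(field) in missing)
        for field in fields
    }


gaps = missing_by_field(rows, ["name", "age", "grade"])
print("missing entries per field:", gaps)
```

A field with far more gaps than the others is a good place to start asking which source or processing step dropped the data.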