Sunday, January 2, 2022

Data wrangling

Data wrangling is the process of preparing the data and getting it into a format that can be used for analysis. The unfortunate reality of data is that it is often dirty, meaning that it requires cleaning (preparation) before it can be used. The following are some issues we may encounter with our data:

Human errors: Data is recorded (or even collected) incorrectly, such as entering 100 instead of 1000, or making typos. In addition, the same entry may be recorded in multiple ways, such as New York City, NYC, and nyc.

Computer error: Perhaps we weren't recording entries for a while (missing data).

Unexpected values: Maybe whoever was recording the data decided to use a question mark for a missing value in a numeric column, so now all the entries in the column will be treated as text instead of numeric values.

Incomplete information: Think of a survey with optional questions; not everyone will answer them, so we will have missing data, but not due to computer or human error.

Resolution: The data may have been collected per second, while we need hourly data for our analysis.

Relevance of the fields: Often, data is collected or generated as a product of some process rather than explicitly for our analysis, so not every field will be relevant to us. In order to get it to a usable state, we will have to clean it up.

Format of the data: Data may be recorded in a format that isn't conducive to analysis, which will require us to reshape it.

Misconfigurations in the data-recording process: Data coming from sources such as misconfigured trackers or webhooks may be missing fields, or the fields may be passed in the wrong order.
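Two of the issues listed above, inconsistent labels and a placeholder character in a numeric column, are easy to illustrate. The following is only a minimal sketch using the Python standard library; the alias table, the helper names (normalize_city, parse_number), and the sample rows are all made up for illustration, and a real project would more likely lean on a data analysis library.

```python
# Hypothetical alias table: maps known lowercase spellings to one canonical label.
CITY_ALIASES = {
    "new york city": "New York City",
    "nyc": "New York City",
}


def normalize_city(raw):
    """Return the canonical name for a known alias, else the trimmed input."""
    trimmed = raw.strip()
    return CITY_ALIASES.get(trimmed.lower(), trimmed)


def parse_number(raw, missing_markers=("?", "")):
    """Convert a text cell to float, mapping placeholder markers to None."""
    text = raw.strip()
    if text in missing_markers:
        return None  # treat the placeholder as a missing value
    return float(text)


# Made-up sample rows showing both problems at once.
rows = [
    {"city": "New York City", "sales": "1000"},
    {"city": "NYC", "sales": "?"},
    {"city": " nyc ", "sales": "250.5"},
]

cleaned = [
    {"city": normalize_city(r["city"]), "sales": parse_number(r["sales"])}
    for r in rows
]
```

After this step every row carries the same city label, and the sales column holds either a float or None, so it can be treated as numeric again. Note that this sketch deliberately raises an error on any other non-numeric text, which is one way to surface unexpected values rather than hide them.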

Most of these data quality issues can be remedied, but some cannot, such as when the data is collected daily and we need it at an hourly resolution: we can aggregate fine-grained data into coarser intervals, but we cannot recover detail that was never recorded. It is our responsibility to carefully examine our data and handle any issues so that our analysis doesn't get distorted.
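To make the resolution point concrete, here is a rough sketch of going from per-second readings to hourly averages with the Python standard library. The function name hourly_means and the sample readings are assumptions made for this example, and averaging is just one possible way to aggregate.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def hourly_means(readings):
    """Group (timestamp, value) pairs by hour and average each group."""
    buckets = defaultdict(list)
    for ts, value in readings:
        # Truncate the timestamp to the start of its hour.
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


# Made-up data: one reading per second for two hours,
# 1.0 during the first hour and 3.0 during the second.
start = datetime(2022, 1, 2, 0, 0, 0)
readings = [
    (start + timedelta(seconds=i), 1.0 if i < 3600 else 3.0)
    for i in range(7200)
]

hourly = hourly_means(readings)
```

Here 7,200 per-second readings collapse into two hourly values. The reverse direction has no equivalent: given only those two hourly values, nothing tells us what happened in any individual second.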

Once we have performed an initial cleaning of the data, we are ready for exploratory data analysis (EDA). Note that during EDA, we may need some additional data wrangling: these two steps are highly intertwined. We will look at EDA in the next post.
