During EDA, we use visualizations and summary statistics to get a better understanding of the data. Since the human brain excels at picking out visual patterns, data visualization is essential to any analysis. In fact, some characteristics of the data can only be observed in a plot. Depending on our data, we may create plots to see how a variable of interest has evolved over time, compare how many observations belong to each category, find outliers, look at distributions of continuous and discrete variables, and much more.
In the workflow diagram shown below, EDA and data wrangling share a box because they are closely tied:
- Data needs to be prepped before EDA.
- Visualizations that are created during EDA may indicate the need for additional data cleaning.
- Data wrangling uses summary statistics to look for potential data issues, while EDA uses them to understand the data. Improper cleaning will distort the findings when we're conducting EDA. In addition, data wrangling skills will be required to get summary statistics across subsets of the data.
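As a rough illustration of getting summary statistics across subsets of the data, here is a minimal sketch that uses only the Python standard library. The `records` list and the `summarize_by_group` helper are hypothetical names invented for this example, and the values are made up.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (category, value) records, invented purely for illustration.
records = [
    ("north", 10.0),
    ("north", 14.0),
    ("south", 3.0),
    ("south", 5.0),
    ("south", 100.0),  # a suspicious value that deserves a second look
]


def summarize_by_group(rows):
    """Group the values by category and summarize each subset."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {
        group: {
            "count": len(values),
            "mean": mean(values),
            "median": median(values),
            "min": min(values),
            "max": max(values),
        }
        for group, values in groups.items()
    }


summary = summarize_by_group(records)
for group, stats in summary.items():
    print(group, stats)
```

In the "south" subset, the mean sits far above the median. That kind of gap is a hint that a value may need cleaning before we trust any findings from EDA.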
When calculating summary statistics, we must keep the type of data we collected in mind. Data can be quantitative (measurable quantities) or categorical (descriptions, groupings, or categories). Within these classes of data, we have further subdivisions that let us know what types of operations we can perform on them.
For example, categorical data can be nominal, where the levels have no inherent order even if we assign a numeric value to each one, such as on = 1/off = 0. Note that the fact that on is greater than off is meaningless because we arbitrarily chose those numbers to represent the states on and off. When there is a ranking among the categories, they are ordinal, meaning that we can order the levels (for instance, we can have low < medium < high).
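The difference between nominal and ordinal data can be sketched in a few lines of Python. This is only an illustration: the `observations` sample is made up, and `LEVEL_ORDER` and `rank` are hypothetical names chosen for this example.

```python
# Nominal: the codes are arbitrary labels, so 1 > 0 carries no meaning here.
switch_codes = {"off": 0, "on": 1}

# Ordinal: the levels have a real ranking, which we state explicitly.
LEVEL_ORDER = ["low", "medium", "high"]
rank = {level: position for position, level in enumerate(LEVEL_ORDER)}

# Made-up sample of ordinal observations.
observations = ["high", "low", "medium", "low", "high"]

# Sorting by rank respects the ranking; a plain sort would be alphabetical.
sorted_observations = sorted(observations, key=rank.__getitem__)
alphabetical = sorted(observations)

# Order-based summaries such as the maximum or the middle value make sense
# for ordinal data (the sample has an odd length, so the middle is unique).
highest = max(observations, key=rank.__getitem__)
median_level = sorted_observations[len(sorted_observations) // 2]

print(sorted_observations)
print(alphabetical)
print(highest, median_level)
```

Ordering the levels is meaningful only because we supplied the ranking ourselves. Applying the same operations to the on/off codes would produce numbers, but not insight.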
Quantitative data can use an interval scale or a ratio scale. The interval scale includes things such as temperature. We can measure temperatures in Celsius and compare the temperatures of two cities, but it doesn't mean anything to say one city is twice as hot as the other, because 0°C is an arbitrary reference point rather than an absence of temperature. Therefore, interval scale values can be meaningfully compared using addition/subtraction, but not multiplication/division. The ratio scale, then, covers values that can be meaningfully compared with ratios (using multiplication and division).
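To make the distinction concrete, the sketch below compares two made-up city temperatures and two made-up prices. The conversion to Kelvin is an addition for illustration only (it is not part of the discussion above); it is used because Kelvin has a true zero, so ratios of Kelvin values are meaningful.

```python
def celsius_to_kelvin(celsius):
    """Convert Celsius to Kelvin, a temperature scale with a true zero."""
    return celsius + 273.15


# Made-up temperatures for two hypothetical cities, in degrees Celsius.
city_a_c, city_b_c = 20.0, 10.0

# Differences are meaningful on an interval scale.
difference_c = city_a_c - city_b_c

# This ratio is computable, but it depends on where the scale puts its zero.
naive_ratio = city_a_c / city_b_c

# The same two temperatures on a scale with a true zero.
kelvin_ratio = celsius_to_kelvin(city_a_c) / celsius_to_kelvin(city_b_c)

# Made-up prices: a ratio scale, so "twice as expensive" is a valid claim.
price_a, price_b = 20.0, 10.0
price_ratio = price_a / price_b

print(difference_c, naive_ratio, round(kelvin_ratio, 3), price_ratio)
```

The Celsius ratio comes out as 2, yet the Kelvin ratio for the same pair of temperatures is only slightly above 1, which is why "twice as hot" is not a valid reading. The price ratio of 2 needs no such caveat.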
Examples of the ratio scale include prices, sizes, and counts.
After we have collected the data for our analysis, cleaned it up, and performed some thorough EDA, it is time to draw conclusions. This is where we summarize our findings from EDA and decide on the next steps:
- Did we notice any patterns or relationships when visualizing the data?
- Does it look like we can make accurate predictions from our data? Does it make sense to move to modeling the data?
- Should we handle missing data points? How?
- How is the data distributed?
- Does the data help us answer the questions we have or give insight into the problem we are investigating?
- Do we need to collect new or additional data?
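Two of the questions above (missing data and how the data is distributed) can be given a first, rough answer with a few standard-library calls. This is a sketch under stated assumptions: the `values` list is invented, missing entries are represented as `None`, and the 1.5 × IQR outlier rule is one common convention chosen for the example rather than something the text prescribes.

```python
from statistics import mean, median, quantiles

# Invented sample; None marks a missing data point (an assumption of this sketch).
values = [4.0, None, 5.0, 6.0, None, 5.5, 4.5, 30.0, 5.0, 6.5]

# How much of the data is missing?
present = [v for v in values if v is not None]
missing_fraction = (len(values) - len(present)) / len(values)

# How is the data distributed? Quartiles give a compact picture.
q1, q2, q3 = quantiles(present, n=4)
iqr = q3 - q1

# Flag values far outside the middle half of the data (1.5 * IQR rule).
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in present if v < lower_fence or v > upper_fence]

print("missing fraction:", missing_fraction)
print("quartiles:", q1, q2, q3)
print("mean vs median:", mean(present), median(present))
print("possible outliers:", outliers)
```

A mean well above the median, as in this sample, suggests a long right tail or a few extreme values. Whether to drop, correct, or keep those values is a judgment call that depends on the problem being investigated.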
If we decide to model the data, this falls under machine learning and statistics.