Data analysis is a highly iterative process involving collection, preparation (wrangling), exploratory data analysis (EDA), and drawing conclusions. During an analysis, we will frequently revisit each of these steps. The following diagram depicts a generalized workflow:
Soon we will get an overview of each of these steps, starting with data collection. In practice, this process is heavily skewed toward the data preparation side. Surveys have found that although data scientists enjoy the data preparation side of their job the least, it makes up 80% of their work. This data preparation step is where pandas does wonders. Let's start by discussing data collection.
Data collection is the natural first step for any data analysis—we can't analyze data we don't have. In reality, our analysis can begin even before we have the data. When we decide what we want to investigate or analyze, we have to think about what kind of data we can collect that will be useful for our analysis. While data can come from anywhere, the following are the main sources:
- Web scraping to extract data from a website's HTML (often with Python packages such as selenium, requests, scrapy, and beautifulsoup)
- Application programming interfaces (APIs) for web services from which we can collect data with HTTP requests (perhaps using cURL or the requests Python package); a small sketch of this approach follows the list
- Databases (data can be extracted with SQL or another database-querying language)
- Internet resources that provide data for download, such as government websites or Yahoo! Finance
- Log files
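As a minimal sketch of the API route, the snippet below requests daily observations from a hypothetical JSON endpoint and loads them into a pandas DataFrame. The URL, query parameters, and field layout are placeholders for illustration only, not a real service.

```python
import pandas as pd
import requests

# Hypothetical endpoint and parameters -- substitute an API you actually have access to.
url = "https://api.example.com/v1/daily-weather"
params = {"city": "New York", "start": "2023-01-01", "end": "2023-12-31"}

response = requests.get(url, params=params)
response.raise_for_status()  # stop early if the request failed

# Assuming the API returns a JSON list of records (one per day),
# pandas can turn it straight into a DataFrame for analysis.
data = response.json()
df = pd.DataFrame(data)

print(df.head())
```

The same pattern applies to the other sources: whether the data comes from a scraper, a SQL query, or a downloaded file, the goal of the collection step is to get it into a structure we can work with.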
We are surrounded by data, so the possibilities are limitless. It is important, however, to make sure that we are collecting data that will help us draw conclusions. For example, if we are trying to determine whether hot chocolate sales are higher when the temperature is lower, we should collect data on the
amount of hot chocolate sold and the temperatures each day. While it might be interesting to see how far people traveled to get the hot chocolate, it's not relevant to our analysis.
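To make that example concrete, the data we collect might end up as a small table of daily observations like the one sketched below. The numbers are made up purely for illustration; with sales and temperature side by side, we can check whether they move in opposite directions.

```python
import pandas as pd

# Made-up daily observations: cups of hot chocolate sold and the day's high temperature (°F).
df = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=5, freq="D"),
    "cups_sold": [140, 130, 95, 80, 60],
    "high_temp_f": [28, 31, 40, 47, 55],
})

# A negative correlation would support the idea that colder days mean more hot chocolate sales.
print(df["cups_sold"].corr(df["high_temp_f"]))
```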
Don't worry too much about finding the perfect data before beginning an analysis. Odds are, there will always be something we want to add to or remove from the initial dataset, reformat, merge with other data, or change in some way. This is where data wrangling comes into play, which we will discuss in the next post. See ya!