Let's take a conceptual look at the steps involved in data processing, also known as the data processing pipeline. The usual steps applied to the data are:
1. Acquisition
2. Cleansing
3. Transformation
4. Analysis
5. Storage
You'll notice that these steps aren't always clear-cut. In some applications you'll be able to combine multiple steps into one or omit some altogether. Now let's explore each of these steps in more detail.
Acquisition
Before you can do anything with data, you need to acquire it. That’s why data acquisition is the first step in any data processing pipeline. We just learned about the most common types of data sources. Some of those sources allow you to load only the required portion of the data in accordance with your request. For example, a request to the Yahoo Finance API requires you to specify the ticker of a company and a period of time over which to retrieve stock prices for that company. Similarly, the News API, which allows you to retrieve news articles, can process a number of parameters to narrow down the list of articles being requested, including the source and date of publication. Despite these qualifying parameters, however, the retrieved list may still need to be filtered further. That is, the data may require cleansing.
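For instance, here is a minimal sketch of such a request made with Python's requests library against the News API's /v2/everything endpoint. The query values and API key are placeholders rather than working credentials, and the parameters shown are only a subset of what the API accepts.

import requests

# A minimal sketch of a News API request; the key and query values are placeholders.
params = {
    "q": "GoodComp",           # keyword to search for
    "sources": "bbc-news",     # restrict results to a particular source
    "from": "2021-01-01",      # start of the publication period
    "to": "2021-01-31",        # end of the publication period
    "apiKey": "YOUR_API_KEY",
}
response = requests.get("https://newsapi.org/v2/everything", params=params)
articles = response.json().get("articles", [])
print(len(articles), "articles retrieved")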
Cleansing
Data cleansing is the process of detecting and correcting corrupt or inaccurate data, or removing unnecessary data. In some cases this step isn't required, and the acquired data is immediately ready for analysis. For example, the yfinance library (a Python wrapper for the Yahoo Finance API) returns stock data as a readily usable pandas DataFrame object, which usually allows you to skip the cleansing and transformation steps and move straight to data analysis.
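For instance, here is a minimal sketch of loading stock prices with yfinance; the ticker and dates are arbitrary examples.

import yfinance as yf

# The result comes back as a pandas DataFrame with columns such as Open,
# High, Low, Close, and Volume, so no separate cleansing pass is needed.
df = yf.download("TSLA", start="2021-01-04", end="2021-01-08")
print(df.head())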
However, if your acquisition tool is a web scraper, the data certainly will need cleansing because fragments of HTML markup will probably be included along with the payload data, as shown here:
6.\tThe development shall comply with the requirements of
DCCâ\x80\x99s Drainage Division as follows\r\n\r\n
After cleansing, this text fragment should look like this:
6. The development shall comply with the requirements of
DCC's Drainage Division as follows
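As a rough illustration, here is a minimal cleansing sketch for fragments like the one above: it repairs the mis-encoded apostrophe and strips the tab and carriage-return artifacts. It is not a general-purpose cleaner.

import re

def clean_fragment(text):
    # Repair mojibake such as 'â\x80\x99', a UTF-8 apostrophe misread as Latin-1.
    try:
        text = text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        pass  # the text was already decoded correctly
    text = text.replace("\u2019", "'")               # normalize the curly apostrophe
    text = text.replace("\t", " ").replace("\r\n", " ")
    return re.sub(r" {2,}", " ", text).strip()       # collapse repeated spaces

raw = "6.\tThe development shall comply with the requirements of DCCâ\x80\x99s Drainage Division as follows\r\n\r\n"
print(clean_fragment(raw))
# 6. The development shall comply with the requirements of DCC's Drainage Division as follows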
Besides the HTML markup, the scraped text may include other unwanted text, as in the following example, where the phrase "A View full text" is simply hyperlink text. You might need to open this link to access the text within it:
Permission for proposed amendments to planning permission
received on the 30th A View full text
You can also use a data cleansing step to filter out specific entities. After requesting a set of articles from the News API, for example, you may need to select only those articles in the specified period where the titles include a money or percent phrase. This filter can be considered a data cleansing operation because the goal is to remove unnecessary data and prepare for the data transformation and data analysis operations.
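Here is a sketch of such a filter: it keeps only the articles whose titles match a simple money-or-percent regular expression. The sample titles are invented for illustration, and a production filter would likely need a richer pattern.

import re

# Keep only articles whose titles mention a money or percent phrase.
articles = [
    {"title": "GoodComp shares soared as much as 8.2% after trial results"},
    {"title": "New office opens downtown"},
    {"title": "Retailer reports $2.5 billion in quarterly revenue"},
]

money_or_percent = re.compile(
    r"\$\d[\d,.]*(\s*(million|billion|trillion))?"  # money phrases such as $2.5 billion
    r"|\d+(\.\d+)?\s*(%|percent)",                  # percent phrases such as 8.2%
    re.IGNORECASE,
)

filtered = [a for a in articles if money_or_percent.search(a["title"])]
print([a["title"] for a in filtered])  # the first and third titles pass the filter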
Transformation
Data transformation is the process of changing the format or structure of data in preparation for analysis. For example, to extract the information from our GoodComp unstructured text data as we did in “Structured Data,” you might shred it into individual words or tokens so that a named entity recognition (NER) tool can look for the desired information. In information extraction, a named entity typically represents a real-world object, such as a person, an organization, or a product, that can be identified by a proper noun. There are also named entities that represent dates, percentages, financial terms, and more.
Many NLP tools can handle this kind of transformation for you automatically. After such a transformation, the shredded GoodComp data would look like this:
['GoodComp', 'shares', 'soared', 'as', 'much', 'as', '8.2%', 'on',
 '2021-01-07', 'after', 'the', 'company', 'announced', 'positive',
 'early-stage', 'trial', 'results', 'for', 'its', 'vaccine']
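As an illustration, a plain whitespace split reproduces the token list above. In practice you would usually hand the text to an NLP library such as spaCy or NLTK, whose tokenizers may split items like '8.2%' differently and whose NER models can then label entities such as the percentage and the date.

text = ("GoodComp shares soared as much as 8.2% on 2021-01-07 after the "
        "company announced positive early-stage trial results for its vaccine")
tokens = text.split()   # a naive shredding of the text into individual words
print(tokens)

# With spaCy installed, an NER model could then label the entities, e.g.:
# import spacy
# nlp = spacy.load("en_core_web_sm")
# print([(ent.text, ent.label_) for ent in nlp(text).ents])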
Other forms of data transformation are deeper, with text data being converted into numerical data. For example, if we’ve gathered a collection of news articles, we might transform them by performing sentiment analysis, a text processing technique that generates a number representing the emotions expressed within a text.
Sentiment analysis can be implemented with tools such as the SentimentIntensityAnalyzer class in the nltk.sentiment package. A typical analysis output might look like this:
Sentiment URL
--------- --------------------------------------------------------------
0.9313    https://mashable.com/uk/shopping/amazon-face-mask-store-july-28/
0.9387    https://skillet.lifehacker.com/save-those-crustacean-shells-to-make-a-sauce-base-1844520024
Each entry in our dataset now includes a number, such as 0.9313, representing the sentiment expressed within the corresponding article. With the sentiment of each article expressed numerically, we can calculate the average sentiment of the entire dataset, allowing us to determine the overall sentiment toward an object of interest, such as a certain company or product.
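Here is a sketch of that calculation using NLTK's VADER-based SentimentIntensityAnalyzer, which assigns each text a compound score between -1 (most negative) and 1 (most positive). The article texts below are placeholders rather than the articles behind the URLs above.

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # lexicon used by the analyzer
sia = SentimentIntensityAnalyzer()

articles = [
    "The new face mask store makes shopping quick and pleasant.",
    "Saving crustacean shells is a brilliant way to build a rich sauce base.",
]
scores = [sia.polarity_scores(text)["compound"] for text in articles]
print(scores)                         # one sentiment number per article
print(sum(scores) / len(scores))      # average sentiment across the dataset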
We'll continue our discussion of the remaining steps in the next post.