Structured data has a predefined format that specifies how the data is organized. Such data is usually stored in a repository like a relational database or just a .csv (comma-separated values) file. Each unit of data fed into such a repository is called a record, and the information in it is organized in fields that must arrive in a sequence matching the expected structure. Within a database, records of the same structure are logically grouped in a container called a table. A database may contain various tables, with each table having a set structure of fields.
There are two basic types of structured data: numerical and categorical. Categorical data sorts items into groups on the basis of shared characteristics; cars, for example, might be categorized by make and model. Numerical data, on the other hand, expresses information in numerical form, allowing you to perform mathematical operations on it.
Keep in mind that categorical data can sometimes take on numerical values. For example, consider ZIP codes or phone numbers. Although they are expressed with numbers, it wouldn’t make any sense to perform math operations on them, such as finding the median ZIP code or average phone number.
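The distinction can be sketched in a few lines of Python. The records below are hypothetical, and the field names (make, zip_code, price) are assumptions made for illustration only. The ZIP codes are stored as strings so that nobody is tempted to average them, while the prices are stored as numbers.

```python
from collections import Counter
from statistics import mean

# Hypothetical sample records; field names and values are illustrative only.
cars = [
    {"make": "Ford", "zip_code": "94103", "price": 21000.0},
    {"make": "Ford", "zip_code": "10001", "price": 25000.0},
    {"make": "Toyota", "zip_code": "94103", "price": 23000.0},
]

# Numerical field: arithmetic such as averaging is meaningful.
average_price = mean(car["price"] for car in cars)

# Categorical fields: counting and grouping are meaningful, arithmetic is not.
# ZIP codes look numeric but are kept as strings because they are labels.
cars_per_make = Counter(car["make"] for car in cars)
cars_per_zip = Counter(car["zip_code"] for car in cars)

print(average_price)
print(cars_per_make)
print(cars_per_zip)
```

Storing label-like values as strings is one simple way to keep numerical operations from being applied to them by accident.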
How can we organize the text sample introduced in the previous section into structured data? We’re interested in specific information in this text, such as company names, dates, and stock prices. We want to present that information in fields in the following format, ready for insertion into a database:
Company: ABC
Date: yyyy-mm-dd
Stock: nnnnn
Using techniques of natural language processing (NLP), a discipline that trains machines to understand human-readable text, we can extract information appropriate for these fields. For example, we look for a company name by recognizing a categorical data variable that can only be one of many preset values, such as Google, Apple, or GoodComp.
Likewise, we can recognize a date by matching it against one of a set of known date formats, such as yyyy-mm-dd. In our example, we recognize, extract, and present our data in the predefined format like this:
Company: GoodComp
Date: 2021-01-07
Stock: +8.2%
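As a rough illustration of this kind of extraction, here is a minimal sketch that uses plain pattern matching rather than a real NLP library. The sample sentence is a hypothetical stand-in for the original text sample, and the function name extract_record, the list of known companies, and the word lists for rising and falling prices are all assumptions made for this example.

```python
import re

# Hypothetical stand-in for the text sample; the exact wording is assumed.
text = (
    "On 2021-01-07, GoodComp shares soared as much as 8.2% after "
    "the company announced positive early-stage trial results for its vaccine."
)

# Assumed preset values for the categorical Company field.
KNOWN_COMPANIES = ["Google", "Apple", "GoodComp"]
# Assumed cue words used to decide the sign of the price change.
RISE_WORDS = ("soared", "rose", "gained")
FALL_WORDS = ("fell", "dropped", "plunged")


def extract_record(sentence):
    """Return a Company/Date/Stock record, with None for anything not found."""
    company = next((c for c in KNOWN_COMPANIES if c in sentence), None)

    # A date is recognized by matching the yyyy-mm-dd format.
    date_match = re.search(r"\b\d{4}-\d{2}-\d{2}\b", sentence)

    # The price change is the first number followed by a percent sign.
    change_match = re.search(r"(\d+(?:\.\d+)?)%", sentence)
    stock = None
    if change_match:
        sign = ""
        if any(word in sentence for word in RISE_WORDS):
            sign = "+"
        elif any(word in sentence for word in FALL_WORDS):
            sign = "-"
        stock = f"{sign}{change_match.group(1)}%"

    return {
        "Company": company,
        "Date": date_match.group(0) if date_match else None,
        "Stock": stock,
    }


record = extract_record(text)
for field, value in record.items():
    print(f"{field}: {value}")
```

A real pipeline would typically rely on named entity recognition instead of a fixed list of company names, but the idea of mapping recognized pieces of text to predefined fields is the same.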
To store this record in a database, it’s better to present it as a row-like sequence of fields. We therefore might reorganize the record as a rectangular data object, or a 2D matrix:
Company  | Date       | Stock
---------------------------------
GoodComp | 2021-01-07 | +8.2%
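In code, such a rectangular object might be sketched as a list of rows under a shared list of field names. The example below writes it out as .csv text with Python's standard csv module. The variable names are assumptions for illustration, and an in-memory buffer is used in place of a real file.

```python
import csv
import io

# Field names form the header; each inner list is one record (row).
fields = ["Company", "Date", "Stock"]
rows = [
    ["GoodComp", "2021-01-07", "+8.2%"],
]

# Write the header and rows to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(fields)
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Every row must supply its values in the same order as the header, which is what makes the data structured.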
The information you choose to extract from the same unstructured data source depends on your requirements. Our example statement not only contains the change in GoodComp’s stock value for a certain date but also indicates the reason for that change, in the phrase “the company announced positive early-stage trial results for its vaccine.” Taking the statement from this angle, you might create a record with these fields:
Company: GoodComp
Date: 2021-01-07
Product: vaccine
Stage: early-stage trial
Compare this to the first record we extracted:
Company: GoodComp
Date: 2021-01-07
Stock: +8.2%
Notice that these two records contain different fields and therefore have different structures. As a result, they must be stored in two different database tables.
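To make this concrete, here is a minimal sketch using Python's built-in sqlite3 module and an in-memory database. The table names (stock_changes, trial_announcements) and column names are assumptions chosen for this example, not names taken from the text.

```python
import sqlite3

# An in-memory database keeps the example self-contained.
conn = sqlite3.connect(":memory:")

# Each table has its own fixed set of fields.
conn.execute(
    "CREATE TABLE stock_changes (company TEXT, event_date TEXT, stock TEXT)"
)
conn.execute(
    "CREATE TABLE trial_announcements "
    "(company TEXT, event_date TEXT, product TEXT, stage TEXT)"
)

# The two records have different structures, so each goes into its own table.
conn.execute(
    "INSERT INTO stock_changes VALUES (?, ?, ?)",
    ("GoodComp", "2021-01-07", "+8.2%"),
)
conn.execute(
    "INSERT INTO trial_announcements VALUES (?, ?, ?, ?)",
    ("GoodComp", "2021-01-07", "vaccine", "early-stage trial"),
)

stock_rows = conn.execute("SELECT * FROM stock_changes").fetchall()
trial_rows = conn.execute("SELECT * FROM trial_announcements").fetchall()
conn.close()

print(stock_rows)
print(trial_rows)
```

Because both tables share the company and date fields, the two records could later be related to each other through those fields.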