Wednesday, June 22, 2022

Semistructured Data


When information doesn’t conform to strict formatting requirements, we may need semistructured data formats, which allow records of different structures to share the same container (a database table or a document). Like unstructured data, semistructured data isn’t tied to a predefined organizational schema; unlike unstructured data, however, it does exhibit some degree of structure, usually in the form of self-describing tags or other markers.

The most common semistructured data formats include XML and JSON. This is what our financial statement might look like in JSON format:

{
  "Company": "GoodComp",
  "Date": "2021-01-07",
  "Stock": 8.2,
  "Details": "the company announced positive early-stage trial results for its vaccine."
}

Here you can recognize the key information that we previously extracted from the statement. Each piece of information is paired with a descriptive tag, such as "Company" or "Date". Thanks to the tags, the information is organized similarly to how it appeared in the previous section, but now we have a fourth tag, "Details", paired with an entire fragment of the original statement, which looks unstructured. This example shows how semistructured data formats can accommodate both structured and unstructured pieces of data within a single record.
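To see how the tags make such a record easy to work with, here is a minimal sketch that parses it with Python's standard json module. The variable names (statement_json, record) are illustrative choices, not part of the original example.

```python
import json

# The financial statement record, written as a JSON string.
statement_json = """
{
  "Company": "GoodComp",
  "Date": "2021-01-07",
  "Stock": 8.2,
  "Details": "the company announced positive early-stage trial results for its vaccine."
}
"""

# json.loads converts the JSON object into a Python dictionary.
record = json.loads(statement_json)

# The structured fields can be read directly by their tags.
print(record["Company"], record["Date"], record["Stock"])

# The "Details" field is still free text, so it needs further processing,
# such as a simple substring check.
print("vaccine" in record["Details"])
```

The structured fields come back as ready-to-use Python values (strings and a float), while the "Details" text remains an ordinary string that you would have to analyze separately.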

Moreover, you can put multiple records of unequal structure into the same container. Here, we store the two different records derived from our example financial statement in the same JSON document:

[
  {
    "Company": "GoodComp",
    "Date": "2021-01-07",
    "Stock": 8.2
  },
  {
    "Company": "GoodComp",
    "Date": "2021-01-07",
    "Product": "vaccine",
    "Stage": "early-stage trial"
  }
]
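As a rough sketch of how you might read such a mixed document in Python, the snippet below loads both records with the standard json module and inspects the tags of each one. The names records_json, records, and stocks are assumptions made for illustration.

```python
import json

# A JSON document holding two records with different sets of tags.
records_json = """
[
  {"Company": "GoodComp", "Date": "2021-01-07", "Stock": 8.2},
  {"Company": "GoodComp", "Date": "2021-01-07",
   "Product": "vaccine", "Stage": "early-stage trial"}
]
"""

# A JSON array becomes a Python list of dictionaries.
records = json.loads(records_json)

# Each record carries its own tags, so we inspect them record by record.
for record in records:
    print(sorted(record.keys()))

# dict.get returns None for a tag that a record lacks,
# instead of raising KeyError.
stocks = [record.get("Stock") for record in records]
print(stocks)
```

Because the records don't share a fixed set of tags, code that reads them has to allow for missing fields; using dict.get is one simple way to do that.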

Recall from the discussion in the previous post that a relational database, being a rigidly structured data repository, cannot accommodate records of varying structures in the same table.
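The contrast can be illustrated with a small sketch that uses Python's built-in sqlite3 module as a stand-in for a relational database. The table name statements and its columns are hypothetical; the point is only that a row whose fields don't match the table's columns is rejected unless the schema itself is changed.

```python
import sqlite3

# An in-memory database with a table whose columns are fixed up front.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE statements (Company TEXT, Date TEXT, Stock REAL)")

# The first record matches the table's columns, so the insert succeeds.
conn.execute(
    "INSERT INTO statements (Company, Date, Stock) VALUES (?, ?, ?)",
    ("GoodComp", "2021-01-07", 8.2),
)

# The second record uses tags the table has no columns for.
try:
    conn.execute(
        "INSERT INTO statements (Company, Date, Product, Stage) "
        "VALUES (?, ?, ?, ?)",
        ("GoodComp", "2021-01-07", "vaccine", "early-stage trial"),
    )
    rejected = False
except sqlite3.OperationalError as err:
    # SQLite reports that the table has no such column.
    rejected = True
    print("Insert failed:", err)

# Only the record that fit the schema was stored.
row_count = conn.execute("SELECT COUNT(*) FROM statements").fetchone()[0]
print(row_count)
conn.close()
```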
