Wednesday, June 29, 2022

Web Pages

Web pages can be static or generated on the fly in response to a user’s interaction, in which case they may contain information from many different sources. In either case, a program can read a web page and extract parts of it. This technique is called web scraping, and it is generally acceptable when the page is publicly available, subject to the site’s terms of use.

A typical scraping scenario in Python involves two libraries: Requests and BeautifulSoup. Requests fetches the source code of the page, and then BeautifulSoup creates a parse tree for the page, which is a hierarchical representation of the page’s content. You can search the parse tree and extract data from it using Pythonic idioms. For example, the following fragment of a parse tree:

[<td title="03/01/2020 00:00:00"><a href="Download.aspx?ID=630751" id="lnkDownload630751" target="_blank">03/01/2020</a></td>,
 <td title="03/01/2020 00:00:00"><a href="Download.aspx?ID=630753" id="lnkDownload630753" target="_blank">03/01/2020</a></td>,
 <td title="03/01/2020 00:00:00"><a href="Download.aspx?ID=630755" id="lnkDownload630755" target="_blank">03/01/2020</a></td>]

can be easily transformed into the following list of items within a for loop in your Python script:

[
 {'Document_Reference': '630751', 'Document_Date': '03/01/2020',
  'link': 'http://www.dummy.com/Download.aspx?ID=630751'},
 {'Document_Reference': '630753', 'Document_Date': '03/01/2020',
  'link': 'http://www.dummy.com/Download.aspx?ID=630753'},
 {'Document_Reference': '630755', 'Document_Date': '03/01/2020',
  'link': 'http://www.dummy.com/Download.aspx?ID=630755'}
]
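
The for loop itself might look something like the sketch below. It is an illustration under stated assumptions, not the definitive way to do it: the function name extract_documents, the BASE_URL constant, and the table markup wrapped around the cells are all hypothetical, and the page source is an inline string so the example runs without a network connection. In a real script that string would typically come from requests.get(url).text.

```python
from urllib.parse import parse_qs, urljoin, urlparse

from bs4 import BeautifulSoup

# Hypothetical base URL, matching the placeholder domain in the list above.
BASE_URL = "http://www.dummy.com/"

# Inline stand-in for the page source (assumed markup around the cells).
HTML = """
<table><tr>
<td title="03/01/2020 00:00:00"><a href="Download.aspx?ID=630751" id="lnkDownload630751" target="_blank">03/01/2020</a></td>
<td title="03/01/2020 00:00:00"><a href="Download.aspx?ID=630753" id="lnkDownload630753" target="_blank">03/01/2020</a></td>
<td title="03/01/2020 00:00:00"><a href="Download.aspx?ID=630755" id="lnkDownload630755" target="_blank">03/01/2020</a></td>
</tr></table>
"""


def extract_documents(html, base_url=BASE_URL):
    """Turn each <td> that holds a download link into a dictionary."""
    soup = BeautifulSoup(html, "html.parser")  # build the parse tree
    documents = []
    for cell in soup.find_all("td"):
        anchor = cell.find("a")
        if anchor is None or not anchor.get("href"):
            continue  # skip cells without a usable link
        # Resolve the relative href against the base URL.
        link = urljoin(base_url, anchor["href"])
        # Read the document reference from the ID query parameter.
        query = parse_qs(urlparse(link).query)
        documents.append({
            "Document_Reference": query.get("ID", [""])[0],
            "Document_Date": anchor.get_text(strip=True),
            "link": link,
        })
    return documents


documents = extract_documents(HTML)
for doc in documents:
    print(doc)
```

Reading the reference from the link's query string, rather than slicing the id attribute, is just one reasonable choice here; either would work for the fragment shown.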

This is an example of transforming semistructured data into structured data. In the next post we'll cover Databases.
