Wednesday, June 29, 2022

Web Pages

Web pages can be static or generated on the fly in response to a user’s interaction, in which case they may contain information from many different sources. In either case, a program can read a web page and extract parts of it. This technique, called web scraping, is generally permitted when the page is publicly available, although a site’s terms of service may restrict it.

A typical scraping scenario in Python involves two libraries: Requests and BeautifulSoup. Requests fetches the source code of the page, and then BeautifulSoup creates a parse tree for the page, which is a hierarchical representation of the page’s content. You can search the parse tree and extract data from it using Pythonic idioms. For example, the following fragment of a parse tree:

[<td title="03/01/2020 00:00:00"><a href="Download.aspx?ID=630751" id="lnkDownload630751"
<td title="03/01/2020 00:00:00"><a href="Download.aspx?ID=630753" id="lnkDownload630753"
<td title="03/01/2020 00:00:00"><a href="Download.aspx?ID=630755" id="lnkDownload630755"

can be easily transformed into the following list of items within a for loop in your Python script:


{'Document_Reference': '630751', 'Document_Date':
'link': ''}
{'Document_Reference': '630753', 'Document_Date':
'link': ''}
{'Document_Reference': '630755', 'Document_Date':
'link': ''}
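A loop of this kind might look roughly like the sketch below. It is only an illustration under stated assumptions: the sample markup is hypothetical (the table wrapper, the link text, and the closing tags are invented to make the fragment parseable), the date is taken from the first part of each cell's title attribute, and the link field simply reuses the raw href value. To keep the example self-contained, the HTML is an inline string; in a real script it would typically come from Requests.

```python
from urllib.parse import parse_qs, urlparse

from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the fragment above. The <table>
# wrapper, link text, and closing tags are assumptions for illustration.
# In a real script the HTML would usually be fetched with Requests, e.g.:
#     html = requests.get(url, timeout=10).text
html = """
<table>
  <tr><td title="03/01/2020 00:00:00"><a href="Download.aspx?ID=630751" id="lnkDownload630751">Doc</a></td></tr>
  <tr><td title="03/01/2020 00:00:00"><a href="Download.aspx?ID=630753" id="lnkDownload630753">Doc</a></td></tr>
  <tr><td title="03/01/2020 00:00:00"><a href="Download.aspx?ID=630755" id="lnkDownload630755">Doc</a></td></tr>
</table>
"""

# Build the parse tree with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

documents = []
# Visit every <td> that carries a title attribute.
for cell in soup.find_all("td", title=True):
    anchor = cell.find("a", href=True)
    if anchor is None:
        continue  # skip cells without a download link
    href = anchor["href"]
    # Read the document reference from the ID query parameter of the link.
    reference = parse_qs(urlparse(href).query)["ID"][0]
    documents.append({
        "Document_Reference": reference,
        # Assumption: keep only the date part of "03/01/2020 00:00:00".
        "Document_Date": cell["title"].split()[0],
        # Assumption: store the relative href as found in the page.
        "link": href,
    })

for doc in documents:
    print(doc)
```

Each dictionary has the same keys, so the resulting list can be loaded directly into a table-like structure or written out as rows.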


This is an example of transforming semistructured data into structured data. In the next post we'll cover Databases.


