Web pages can be static or generated on the fly in response to a user's interaction, in which case they may combine information from many different sources. Either way, a program can read a web page and extract parts of it, a practice known as web scraping, which is generally legal as long as the page is publicly available.
A typical scraping scenario in Python involves two libraries: Requests and BeautifulSoup. Requests fetches the page's source code, and BeautifulSoup then builds a parse tree for it, a hierarchical representation of the page's content. You can search the parse tree and extract data from it using Pythonic idioms. For example, the following fragment of a parse tree:
[<td title="03/01/2020 00:00:00"><a href="Download.aspx?ID=630751" id="lnkDownload630751" target="_blank">03/01/2020</a></td>,
 <td title="03/01/2020 00:00:00"><a href="Download.aspx?ID=630753" id="lnkDownload630753" target="_blank">03/01/2020</a></td>,
 <td title="03/01/2020 00:00:00"><a href="Download.aspx?ID=630755" id="lnkDownload630755" target="_blank">03/01/2020</a></td>]
can be transformed, within a for loop in your Python script, into the following list of dictionaries:
[
 {'Document_Reference': '630751', 'Document_Date': '03/01/2020',
  'link': 'http://www.dummy.com/Download.aspx?ID=630751'},
 {'Document_Reference': '630753', 'Document_Date': '03/01/2020',
  'link': 'http://www.dummy.com/Download.aspx?ID=630753'},
 {'Document_Reference': '630755', 'Document_Date': '03/01/2020',
  'link': 'http://www.dummy.com/Download.aspx?ID=630755'}
]
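A minimal sketch of that loop, assuming the `<td>` fragment above is the HTML being parsed. The base URL http://www.dummy.com/ and the dictionary keys are taken from the example output; in a real scrape, the `html` string would come from something like `requests.get(url).text` rather than a literal.

```python
from bs4 import BeautifulSoup

# Stand-in for the HTML fetched with Requests; it reproduces the fragment above.
html = """
<table><tr>
<td title="03/01/2020 00:00:00"><a href="Download.aspx?ID=630751" id="lnkDownload630751" target="_blank">03/01/2020</a></td>
<td title="03/01/2020 00:00:00"><a href="Download.aspx?ID=630753" id="lnkDownload630753" target="_blank">03/01/2020</a></td>
<td title="03/01/2020 00:00:00"><a href="Download.aspx?ID=630755" id="lnkDownload630755" target="_blank">03/01/2020</a></td>
</tr></table>
"""

soup = BeautifulSoup(html, "html.parser")

documents = []
for td in soup.find_all("td"):
    a = td.find("a")
    href = a["href"]                     # e.g. "Download.aspx?ID=630751"
    doc_ref = href.split("ID=")[1]       # the numeric reference after "ID="
    documents.append({
        "Document_Reference": doc_ref,
        "Document_Date": a.get_text(),   # the visible link text, here the date
        "link": "http://www.dummy.com/" + href,
    })

print(documents)
```

Each `<td>` yields one dictionary: the reference comes from the `href` query string, the date from the link's text, and the full link from prefixing the relative `href` with the site's base URL.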
This is an example of transforming semi-structured data into structured data. In the next post we'll cover databases.