Sunday, December 16, 2018

Web scraping in Python (using BeautifulSoup)

Beautiful Soup is a Python library for pulling data out of HTML and XML files. The BeautifulSoup module's name is bs4 (for Beautiful Soup, version 4). To install it, run pip install beautifulsoup4 from the command line as shown below:


C:\Users\Python>pip install beautifulsoup4
Collecting beautifulsoup4
  Downloading https://files.pythonhosted.org/packages/21/0a/47fdf541c97fd9b6a610
cb5fd518175308a7cc60569962e776ac52420387/beautifulsoup4-4.6.3-py3-none-any.whl (
90kB)
    100% |████████████████████████████████| 92kB 1.2MB/s
Installing collected packages: beautifulsoup4
Successfully installed beautifulsoup4-4.6.3

C:\Users\Python>

Now we are ready to use beautifulsoup4, which we do by importing the bs4 module in our programs. Let's make a program that will parse an HTML file. Create a file bs_example.html and enter the following code:

<html>
<head>
<title>Beautiful Soup Example </title>
</head>
<body>
<p>Read my <strong>Python</strong> posts <a href="http://pythoniseasytolearn.blogspot.com">my website</a>.</p>
<p class="slogan">Python with Vee </p>
<p>By <span id="author">Veevaeck Swami</span></p>

</body>
</html>


Now let's start using beautifulsoup4. Our first step will be to create a BeautifulSoup object from HTML. To do so, the bs4.BeautifulSoup() function needs to be called with a string containing the HTML it will parse. The bs4.BeautifulSoup() function returns a BeautifulSoup object. Enter the following in a new program, bs_parsing.py:

import requests, bs4

response = requests.get("http://pythoniseasytolearn.blogspot.com")

try:
    response.raise_for_status()
except Exception as exp:
    print('There was a problem: %s' % exp)
else:
    mySoup = bs4.BeautifulSoup(response.text)
    value = type(mySoup)
    print(value)


The output of the program is shown below:
bs_parsing.py:15: UserWarning: No parser was explicitly specified, so I'm using
the best available HTML parser for this system ("html.parser"). This usually isn
't a problem, but if you run this code on another system, or in a different virt
ual environment, it may use a different parser and behave differently.

The code that caused this warning is on line 15 of the file bs_parsing.py. To ge
t rid of this warning, pass the additional argument 'features="html.parser"' to
the BeautifulSoup constructor.

  mySoup = bs4.BeautifulSoup(response.text)
<class 'bs4.BeautifulSoup'>


------------------
(program exited with code: 0)

Press any key to continue . . .

As per the warning, I'll pass the additional argument 'features="html.parser"' to the BeautifulSoup constructor as shown:

  mySoup = bs4.BeautifulSoup(response.text, 'html.parser')

Now if I run the program, the output is a single line:

<class 'bs4.BeautifulSoup'>

------------------
(program exited with code: 0)

Press any key to continue . . .

My program uses requests.get() to download the main page of pythoniseasytolearn.blogspot.com and then passes the text attribute of the response to bs4.BeautifulSoup(). The BeautifulSoup object that it returns is stored in a variable named mySoup.

We can also load an HTML file from our hard drive by passing a File object to bs4.BeautifulSoup(). Let's use the bs_example.html file we created earlier in a new program, bs_example_parsing.py:

import bs4

myFile = open('bs_example.html')
myExampleSoup = bs4.BeautifulSoup(myFile, 'html.parser')
myFile.close()
value = type(myExampleSoup)
print(value)

The output is shown below:

<class 'bs4.BeautifulSoup'>

------------------
(program exited with code: 0)

Press any key to continue . . .

This indicates we have a BeautifulSoup object ready to be used for parsing. In fact, we use the BeautifulSoup object's methods to parse the HTML document.

The next step is to retrieve a web page element, which is usually done with the select() method. We call the select() method on a BeautifulSoup object and pass it a string containing a CSS selector for the element we are looking for.

Selectors are like regular expressions: They specify a pattern to look for, in this case, in HTML pages instead of general text strings. Some of the frequently used CSS selector patterns are:

soup.select('div')                    All elements named <div>

soup.select('#author')                The element with an id attribute of author

soup.select('.notice')                All elements that use a CSS class attribute named notice

soup.select('div span')               All elements named <span> that are within an element named <div>

soup.select('div > span')             All elements named <span> that are directly within an element named <div>, with no other element in between

soup.select('input[name]')            All elements named <input> that have a name attribute with any value

soup.select('input[type="button"]')   All elements named <input> that have an attribute named type with value button

The selector patterns can be combined to make sophisticated matches. For example, soup.select('p #author') will match any element that has an id attribute of author, as long as it is also inside a <p> element.
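
To see a few of these patterns in action, here is a small sketch (assuming the bs_example.html file we created earlier is in the same folder) that tries some of the selectors listed above, including the combined 'p #author' pattern:

import bs4

# Try a few of the selector patterns listed above
# against our bs_example.html file.
myFile = open('bs_example.html')
myExampleSoup = bs4.BeautifulSoup(myFile, 'html.parser')
myFile.close()

print(myExampleSoup.select('p'))           # all <p> elements
print(myExampleSoup.select('.slogan'))     # elements with class="slogan"
print(myExampleSoup.select('p #author'))   # id="author" inside a <p> element
print(myExampleSoup.select('p > strong'))  # <strong> directly inside a <p>

Each call returns a list of the matching elements (or an empty list if nothing matches).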

The select() method will return a list of Tag objects, which is how Beautiful Soup represents an HTML element. The list will contain one Tag object for every match in the BeautifulSoup object’s HTML. Tag values can be passed to the str() function to show the HTML tags they represent. Tag values also have an attrs attribute that shows all the HTML attributes of the tag as a dictionary.

Now let's use our bs_example.html file with the selector pattern soup.select('#author'). See the code below:

import bs4

myFile = open('bs_example.html')
myExampleSoup = bs4.BeautifulSoup(myFile,'html.parser')

elements = myExampleSoup.select('#author')

value = elements[0].getText()

print(value)

print(str(elements[0]))

print(elements[0].attrs)

Run this program and your output should be as shown below:

Veevaeck Swami
<span id="author">Veevaeck Swami</span>
{'id': 'author'}

------------------
(program exited with code: 0)

Press any key to continue . . .


This code pulls the element with id="author" out of our example HTML. We use select('#author') to return a list of all the elements with id="author" and store this list of Tag objects in the variable elements; the list contains exactly one Tag object, because there was one match. Calling getText() on the element returns the element's text, or inner HTML. The text of an element is the content between the opening and closing tags: in this case, 'Veevaeck Swami'.

Passing the element to str() returns a string with the starting and closing tags and the element’s text. Finally, attrs gives us a dictionary with the element’s attribute, 'id', and the value of the id attribute, 'author'. 

Instead of pulling just one element, we can also pull all the <p> elements from the BeautifulSoup object, as shown in the following program:

import bs4

myFile = open('bs_example.html')
myExampleSoup = bs4.BeautifulSoup(myFile,'html.parser')

all_elements = myExampleSoup.select('p')

print(str(all_elements[0]))

value = all_elements[0].getText()

print(value +'\n')


print(str(all_elements[1]))

value = all_elements[1].getText()

print(value +'\n')


print(str(all_elements[2]))

value = all_elements[2].getText()

print(value +'\n')


Run this program and your output should be as shown below:

<p>Read my <strong>Python</strong> posts <a href="http://pythoniseasytolearn.blogspot.com">my website</a>.</p>
Read my Python posts my website.

<p class="slogan">Python with Vee </p>
Python with Vee

<p>By <span id="author">Veevaeck Swami</span></p>
By Veevaeck Swami


------------------
(program exited with code: 0)

Press any key to continue . . .


In this program select() gives us a list of three matches, which we store in all_elements. Using str() on all_elements[0], all_elements[1], and all_elements[2] shows us each element as a string, and using getText() on each element shows us its text.
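
Since select() returns an ordinary Python list, we could also iterate over all the matches with a for loop instead of indexing each one separately. Here is a minimal sketch, assuming the same bs_example.html file:

import bs4

myFile = open('bs_example.html')
myExampleSoup = bs4.BeautifulSoup(myFile, 'html.parser')
myFile.close()

# Loop over every <p> Tag object returned by select()
for element in myExampleSoup.select('p'):
    print(str(element))
    print(element.getText() + '\n')

This prints the same output as the previous program without needing a separate pair of print() calls for each match.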

It is also possible to access an element's attribute values using the get() method of Tag objects. We pass the attribute name to get() as a string, and the method returns that attribute's value. See the program:

import bs4

myFile = open('bs_example.html')
myExampleSoup = bs4.BeautifulSoup(myFile,'html.parser')

mySpanElement = myExampleSoup.select('span')[0]

print(str(mySpanElement))

myID = mySpanElement.get('id')

print('\n' +myID +'\n')

print(mySpanElement.attrs)

The output of this program is shown below:

<span id="author">Veevaeck Swami</span>

author

{'id': 'author'}

------------------
(program exited with code: 0)

Press any key to continue . . .

In this program we use select() to find any <span> elements and then store the first matched element in mySpanElement. Passing the attribute name 'id' to get() returns the attribute's value, 'author'.
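
Like a dictionary's get() method, a Tag object's get() returns None when the requested attribute does not exist, so it is a convenient way to check for an attribute without raising an error. A small sketch:

import bs4

myFile = open('bs_example.html')
myExampleSoup = bs4.BeautifulSoup(myFile, 'html.parser')
myFile.close()

mySpanElement = myExampleSoup.select('span')[0]

print(mySpanElement.get('id'))              # author
print(mySpanElement.get('class'))           # None, our <span> has no class attribute
print(mySpanElement.get('class') is None)   # True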

BeautifulSoup is a powerful approach to web scraping and is well worth exploring further. Try some programs on your own and see if you can do something unique using this technique. In the next post we shall discuss the selenium module, which is also used for web scraping. So till we meet next, keep practicing and learning Python, as Python is easy to learn!













