Monday, December 17, 2018

Web scraping in Python (using Selenium)

The previously discussed web scraping methods using the requests and BeautifulSoup modules are great as long as you can figure out the URL you need to pass to requests.get(). However, sometimes this isn’t so easy to find, and sometimes the website you want your program to navigate requires you to log in first. The selenium module offers another web scraping method that gives our programs the power to perform such sophisticated tasks.

Selenium allows you to interact with web pages in a much more advanced way than Requests and Beautiful Soup. It lets Python directly control the browser by programmatically clicking links and filling in login information, almost as though a human user were interacting with the page. But because it launches a web browser, it is a bit slower and harder to run in the background if, say, you just need to download some files from the Web.

The first step in using Selenium is to install it with pip install selenium. Next, download the Selenium driver required for your browser. I am using Google Chrome, so I've downloaded chromedriver_win32 from http://chromedriver.chromium.org/downloads. If you use another browser, download the appropriate driver. Make sure the folder containing this driver is included in your PATH environment variable.
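
Before launching a browser, it can help to confirm that the driver really is reachable. The following is a minimal sketch using only the standard library. The helper name locate_driver is made up for illustration, and it assumes the executable is called chromedriver (chromedriver.exe on Windows).

```python
import shutil


def locate_driver(name='chromedriver'):
    """Return the full path to the driver executable, or None if it is not on PATH."""
    # shutil.which searches the directories listed in the PATH environment variable.
    return shutil.which(name)


driver_path = locate_driver()
if driver_path is None:
    print('chromedriver was not found on PATH')
else:
    print('chromedriver found at', driver_path)
```

If the first message appears, the folder holding the driver probably still needs to be added to PATH.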

Now let us make a program that opens www.covrisolutions.com in a new browser window when it runs. See the following code:


from selenium import webdriver

browser = webdriver.Chrome()

mybrowser = type(browser)
print(mybrowser)

browser.get('http://www.covrisolutions.com/')

Importing the modules for Selenium is slightly different from what we normally do. Instead of import selenium, you need to run from selenium import webdriver. When webdriver.Chrome() is called, a Chrome browser starts up. In the program we have called type() on the value browser to find out its type; this is not required and may be skipped. Finally, calling browser.get('http://www.covrisolutions.com/') directs the started browser to http://www.covrisolutions.com. The output window will show the following:


DevTools listening on ws://127.0.0.1:50408/devtools/browser/8ad9bb13-aece-463d-9
512-96055ecc41a1
<class 'selenium.webdriver.chrome.webdriver.WebDriver'>
------------------
(program exited with code: 0)

Press any key to continue . . .

The main use of the WebDriver object is finding elements on a page. It provides predefined methods for this, divided into the find_element_* and find_elements_* methods. The find_element_* methods return a single WebElement object, representing the first element on the page that matches your query. The find_elements_* methods return a list of WebElement objects, one for every matching element on the page. Some of these methods are:

1. browser.find_element_by_class_name(name) and browser.find_elements_by_class_name(name), which return elements that use the CSS class name

2. browser.find_element_by_css_selector(selector) and browser.find_elements_by_css_selector(selector), which return elements that match the CSS selector

3. browser.find_element_by_id(id) and browser.find_elements_by_id(id), which return elements with a matching id attribute value

4. browser.find_element_by_link_text(text) and browser.find_elements_by_link_text(text), which return <a> elements that completely match the text provided

5. browser.find_element_by_partial_link_text(text) and browser.find_elements_by_partial_link_text(text), which return <a> elements that contain the text provided

6. browser.find_element_by_name(name) and browser.find_elements_by_name(name), which return elements with a matching name attribute value

7. browser.find_element_by_tag_name(name) and browser.find_elements_by_tag_name(name), which return elements with a matching tag name (case insensitive; an <a> element is matched by 'a' and 'A')

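Except for the *_by_tag_name() methods, the arguments to all the methods are case sensitive. If no element on the page matches what the method is looking for, the find_element_* methods raise a NoSuchElementException, while the find_elements_* methods return an empty list. Note that these helper methods were deprecated in Selenium 4 and removed in version 4.3, which uses calls such as browser.find_element(By.ID, id) instead.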
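
For readers on Selenium 4, here is a minimal sketch of the By-based lookup style, where the strategy constants come from selenium.webdriver.common.by. The helper first_or_none is a hypothetical name, not part of Selenium; it relies on find_elements returning an empty list when nothing matches. The demo function is only defined, not run, because it needs Selenium 4 and a Chrome driver installed.

```python
def first_or_none(driver, by, value):
    """Return the first element matching the locator, or None when nothing matches."""
    # find_elements returns an empty list instead of raising an exception.
    matches = driver.find_elements(by, value)
    return matches[0] if matches else None


def demo():
    """Look up the 'dropdown' element in a real browser (call this manually)."""
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    browser = webdriver.Chrome()
    try:
        browser.get('http://www.covrisolutions.com/')
        element = first_or_none(browser, By.CLASS_NAME, 'dropdown')
        if element is None:
            print('element not found')
        else:
            print('Found <%s> element with that class name!' % element.tag_name)
    finally:
        browser.quit()  # always close the browser window
```

Returning None keeps the calling code free of try/except blocks when a missing element is an expected case rather than an error.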

Once you have a WebElement object, you can find out more about it by reading its attributes or calling its methods. Some of the WebElement attributes and methods are:

tag_name             The tag name, such as 'a' for an <a> element
get_attribute(name)  The value of the element’s attribute called name, such as the URL for get_attribute('href')
text                 The text within the element, such as 'hello' in <span>hello</span>
clear()              For text field or text area elements, clears the text typed into it
is_displayed()       Returns True if the element is visible; otherwise returns False
is_enabled()         For input elements, returns True if the element is enabled; otherwise returns False
is_selected()        For checkbox or radio button elements, returns True if the element is selected; otherwise returns False
location             A dictionary with keys 'x' and 'y' for the position of the element in the page
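
To see several of these attributes and methods at once, a small helper can collect them into a dictionary. This is only a sketch: describe_element is a hypothetical name, and it works with any object that offers the WebElement members listed above.

```python
def describe_element(element):
    """Summarize a WebElement-like object as a plain dictionary."""
    return {
        'tag': element.tag_name,            # e.g. 'a' for an <a> element
        'text': element.text,               # visible text inside the element
        'visible': element.is_displayed(),  # True if shown on the page
        'enabled': element.is_enabled(),    # meaningful for input elements
        # location is a dictionary with the keys 'x' and 'y'
        'position': (element.location['x'], element.location['y']),
    }
```

With a real browser, the argument would be whatever one of the find_element_* methods returned, for example describe_element(browser.find_element_by_tag_name('html')).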


Let's write a program that uses the WebDriver object to find an element with the find_element_by_class_name() method.


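from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException

browser = webdriver.Chrome()
browser.get('http://www.covrisolutions.com/')

try:
    # Look up the first element that uses the CSS class 'dropdown'
    element = browser.find_element_by_class_name('dropdown')
    print('Found <%s> element with that class name!' % (element.tag_name))
except NoSuchElementException:
    # Raised when no element on the page matches
    print('element not found')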


The output window shows the following:

DevTools listening on ws://127.0.0.1:51031/devtools/browser/18da9c80-1a46-453e-a
5d6-5e6d33f53085
Found <li> element with that class name!


------------------
(program exited with code: 0)

Press any key to continue . . .

On the specified page we wanted to find an element with the class name 'dropdown'. If the element exists, its tag name is printed; otherwise 'element not found' is printed. In our program we found an element with the class name 'dropdown' and the tag name 'li'.

The click() method

WebElement objects returned from the find_element_* and find_elements_* methods have a click() method that simulates a mouse click on that element. This method can be used to follow a link, make a selection on a radio button, click a Submit button, or trigger whatever else might happen when the element is clicked by the mouse. See the following example:


from selenium import webdriver

browser = webdriver.Chrome()
browser.get('http://www.covrisolutions.com/')
link_element = browser.find_element_by_link_text('SERVICES')
link_element.click()


This program opens http://www.covrisolutions.com in Chrome, gets the WebElement object for the <a> element with the text SERVICES, and then simulates clicking that <a> element. It’s just as if you had clicked the link yourself; the browser then follows that link and opens this page:

http://www.covrisolutions.com/services.html

The send_keys() method

Sending keystrokes to text fields on a web page is a matter of finding the <input> or <textarea> element for that text field and then calling the send_keys() method. See the following example:


from selenium import webdriver

browser = webdriver.Chrome()

browser.get('http://yahoo.com/')
email_element = browser.find_element_by_id('Email')
email_element.send_keys('xyz_123@yahoo.com')
pwd_element = browser.find_element_by_id('Passwd')
pwd_element.send_keys('123456789')
pwd_element.submit()




When we run this code, it fills in the username and password text fields with the provided text, assuming the page contains fields with the ids 'Email' and 'Passwd'. Calling the submit() method on any element has the same result as clicking the Submit button for the form that element is in.
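
The same find, type, and submit pattern can be wrapped in a reusable helper. The sketch below is an illustration rather than part of Selenium: fill_and_submit is a hypothetical name, and it assumes the driver offers the generic find_element(by, value) method, with 'id' being the string behind Selenium's By.ID constant.

```python
def fill_and_submit(driver, values_by_id):
    """Type each value into the field with the given id, then submit the form."""
    last_field = None
    for field_id, text in values_by_id.items():
        # Look the field up by its id attribute.
        last_field = driver.find_element('id', field_id)
        last_field.clear()           # remove any pre-filled text first
        last_field.send_keys(text)   # simulate typing into the field
    if last_field is not None:
        last_field.submit()          # submits the form that contains the field
    return last_field
```

For the example above, the call would be fill_and_submit(browser, {'Email': 'xyz_123@yahoo.com', 'Passwd': '123456789'}). Fields are filled in the order the dictionary lists them, and the form is submitted from the last one.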

Selenium also has a module for keyboard keys that are impossible to type into a string value, which function much like escape characters. These values are stored in attributes in the selenium.webdriver.common.keys module. Since that is such a long module name, it’s much easier to run from selenium.webdriver.common.keys import Keys at the top of your program; if you do, then you can simply write Keys anywhere you’d normally have to write selenium.webdriver.common.keys.Keys.

Some of the commonly used variables in the selenium.webdriver.common.keys module are:

Keys.DOWN, Keys.UP, Keys.LEFT, Keys.RIGHT            The keyboard arrow keys
Keys.ENTER, Keys.RETURN                              The enter and return keys
Keys.HOME, Keys.END, Keys.PAGE_DOWN, Keys.PAGE_UP    The home, end, pagedown, and pageup keys
Keys.ESCAPE, Keys.BACK_SPACE, Keys.DELETE            The esc, backspace, and delete keys
Keys.F1, Keys.F2, . . . , Keys.F12                   The F1 to F12 keys at the top of the keyboard
Keys.TAB                                             The tab key


If the cursor is not currently in a text field, pressing the Home and End keys scrolls the browser to the top and bottom of the page, respectively. The next program sends these keys to the page. See the following code:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

browser = webdriver.Chrome()

browser.get('http://covrisolutions.com/')
htmlElem = browser.find_element_by_tag_name('html')
htmlElem.send_keys(Keys.END)
htmlElem.send_keys(Keys.HOME)


Run this program and it opens covrisolutions.com in Chrome, scrolls to the bottom of the page, and then scrolls back to the top, just as if you had pressed the End key and then the Home key yourself.

Viewing page source

It is also possible to get the page source through the WebDriver object, without going to the website and manually clicking to view the page source. See the following code:


from selenium import webdriver

browser = webdriver.Chrome()

browser.get('http://covrisolutions.com/')

html = browser.page_source
print(html)


Run this program and it opens covrisolutions.com in Chrome. In the output window we can see the page source; an excerpt is shown below:

                <span>Client Focus</span>
                <span>Quality Assurance</span>
                <span>Diverse Client Base</span>
            </div> Us:
        </div>

    </div>

    <a href="about.html" class="animate" data-anim-type="fadeInRight">Read More!
</a>

.....

......

<!-- lightbox -->
<script type="text/javascript" src="js/lightbox/jquery.fancybox.js"></script>
<script type="text/javascript" src="js/lightbox/custom.js"></script>

</body></html>


------------------
(program exited with code: 0)

Press any key to continue . . . 


As an exercise, rather than printing the page source, save it to a file.
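
One possible solution to this exercise is sketched below. The function name save_page_source and the default filename are assumptions made for illustration; the only input it needs is the string that browser.page_source returns.

```python
from pathlib import Path


def save_page_source(html, filename='page_source.html'):
    """Write the HTML string to a file and return the path that was written."""
    path = Path(filename)
    # An explicit encoding avoids errors when the page contains non-ASCII text.
    path.write_text(html, encoding='utf-8')
    return path
```

In the program above, the print(html) line would become save_page_source(html).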

Extract links from a web page

Selenium automates browsers. The selenium module can make the browser do almost anything you want, including automated testing, automating web tasks, and data extraction. In this program we’ll use it for data extraction, pulling the links out of a web page. See the code below:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('http://covrisolutions.com/')
for a in browser.find_elements_by_xpath('.//a'):
    print(a.get_attribute('href'))

Run this program and check the output window. It should contain the links found in the page http://covrisolutions.com. The output is shown below:

DevTools listening on ws://127.0.0.1:51590/devtools/browser/4deb006c-d524-423b-9
682-0a0f1bbf0437
mailto:contact@covrisolutions.com
http://covrisolutions.com/#
http://covrisolutions.com/#
http://covrisolutions.com/#
http://covrisolutions.com/#
http://covrisolutions.com/index.html
http://covrisolutions.com/index.html
http://covrisolutions.com/about.html
http://covrisolutions.com/services.html
http://covrisolutions.com/services.html
http://covrisolutions.com/services.html
http://covrisolutions.com/services.html
http://covrisolutions.com/services.html
http://covrisolutions.com/services.html
http://covrisolutions.com/services.html
http://covrisolutions.com/portfolio.html
http://covrisolutions.com/careers.html
http://covrisolutions.com/contact.html
http://covrisolutions.com/portfolio.html
http://covrisolutions.com/about.html
http://covrisolutions.com/services.html
http://store.covrisolutions.com/Covri_Cascaded_Lookup.aspx
http://store.covrisolutions.com/Covri_CrossSite_Lookup.aspx
http://store.covrisolutions.com/Covri_ParentSelector.aspx
http://store.covrisolutions.com/Font_Size_Change_WebPart.aspx
http://store.covrisolutions.com/Covri_ImageUploadColumn.aspx
http://store.covrisolutions.com/Covri_VideoColumnWebpart.aspx
http://store.covrisolutions.com/Password_Change_WebPart.aspx
http://store.covrisolutions.com/Password_Reset_WebPart.aspx
http://covrisolutions.com/college_app.html
http://covrisolutions.com/hospital_app.html
http://covrisolutions.com/portfolio.html
http://covrisolutions.com/about.html
http://covrisolutions.com/images/illis_1.png
http://covrisolutions.com/images/pandit.png
http://covrisolutions.com/images/mandir.png
http://covrisolutions.com/images/dargah.png
http://covrisolutions.com/images/emrc.png
http://covrisolutions.com/images/pbgm.png
http://covrisolutions.com/images/mindfulness.png
http://covrisolutions.com/images/bholekuti.png
http://covrisolutions.com/about.html
http://covrisolutions.com/about.html
http://covrisolutions.com/index.html
http://covrisolutions.com/portfolio.html
http://covrisolutions.com/#
http://covrisolutions.com/services.html
http://covrisolutions.com/careers.html
http://covrisolutions.com/services.html
http://covrisolutions.com/services.html
http://covrisolutions.com/services.html
http://covrisolutions.com/services.html
http://covrisolutions.com/services.html
mailto:contact@covrisolutions.com
http://covrisolutions.com/#
http://covrisolutions.com/#
http://covrisolutions.com/#
http://covrisolutions.com/#
http://covrisolutions.com/#
http://covrisolutions.com/#
http://covrisolutions.com/#


------------------
(program exited with code: 0)

Press any key to continue . . .
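
The list above contains many repeats, along with mailto: addresses and '#' placeholders. A minimal sketch of a clean-up step follows. The function name unique_page_links is hypothetical, and the filtering rules are just one reasonable choice.

```python
def unique_page_links(hrefs):
    """Return http(s) links in their original order, without duplicates."""
    seen = set()
    result = []
    for href in hrefs:
        # get_attribute('href') can return None for <a> tags without an href.
        if not href or not href.startswith(('http://', 'https://')):
            continue
        # Skip placeholder links such as http://covrisolutions.com/#
        if href.endswith('#'):
            continue
        if href not in seen:
            seen.add(href)
            result.append(href)
    return result
```

In the program above, the collected values could be passed in as unique_page_links(a.get_attribute('href') for a in browser.find_elements_by_xpath('.//a')).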

Selenium can do much more beyond the functions described here. It can modify your browser’s cookies, take screenshots of web pages, and run custom JavaScript.

To take a screenshot, use this code:

from selenium import webdriver

browser = webdriver.Chrome()
browser.get('http://covrisolutions.com/')
browser.save_screenshot("my_screenshot.png")

The program opens the browser and takes a screenshot, which is saved as my_screenshot.png in the current working directory (normally the directory the program is run from).
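
The cookie and JavaScript features mentioned earlier can be sketched in the same spirit. The helper names below are made up for illustration; they rely on the WebDriver methods get_cookies() and execute_script(), and accept any browser object that provides them.

```python
def cookie_names(driver):
    """Return the names of all cookies visible to the current page."""
    # get_cookies() returns a list of dictionaries, one per cookie.
    return [cookie['name'] for cookie in driver.get_cookies()]


def page_title_via_js(driver):
    """Ask the page for its title by running a line of JavaScript."""
    # execute_script() runs the script in the page and hands back its return value.
    return driver.execute_script('return document.title;')


def scroll_to_bottom(driver):
    """Scroll the page to the bottom using JavaScript instead of the End key."""
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
```

With a real browser, these would be called after browser.get(...), for example print(cookie_names(browser)).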

We can do a lot more with Selenium, and in a less complicated way than with the previously discussed approaches. Write more programs that use the other methods and attributes to get a clearer understanding. So till we meet next, keep practicing and learning Python, as Python is easy to learn!
