Wednesday, December 12, 2018

Web scraping with Python (using webbrowser module)

Web scraping is the term for using a program to download and process content from the Web. For example, Google runs many web scraping programs to index web pages for its search engine. Python has several modules that make it easy to scrape web pages. Some of these modules are:

  • webbrowser      Comes with Python and opens a browser to a specific page.
  • Requests            Downloads files and web pages from the Internet.
  • Beautiful Soup  Parses HTML, the format that web pages are written in.
  • Selenium            Launches and controls a web browser. Selenium is able to fill in forms and simulate mouse clicks in this browser.

Let's use the webbrowser module to write a few programs as a first step toward web scraping. This module is typically used to launch a browser at a specified URL, which is done through the webbrowser module’s open() function. Try this code:

import webbrowser

webbrowser.open('http://covrisolutions.com/')

When we run this program, a web browser tab will open to the URL http://covrisolutions.com. Now let's make another program that automatically launches a map in your browser using an address taken from the command line or from your clipboard. This way, you only have to copy the address to the clipboard and run the script, and the map will be loaded for you.

Make sure your PATH environment variable is configured so that you can run this program from the command line as well as from your IDE, which in my case is Geany. I'll use the address of Covri Comunication and Management Solutions, which is Chapel Rd, Bagher Complex, Fateh Maidan, Abids, Hyderabad, Telangana 500001. You may use your own address or continue with mine. What I intend to do is type this:

C:\Users\Python>mapit1.py Chapel Rd, Bagher Complex, Fateh Maidan, Abids, Hyderabad, Telangana 500001

in the command prompt, and my program should open a browser showing the Google Maps page for this address. See the code below:

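#! python3
# mapit1.py - Launches a map in the browser using an address from the
# command line.

import webbrowser, sys

# sys.argv[0] is the script name; anything after it is the address.
if len(sys.argv) > 1:
    address = ' '.join(sys.argv[1:])
    webbrowser.open('https://www.google.com/maps/place/' + address)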



The first line is the program’s #! shebang line, a directive that tells your command line interpreter how it should execute the script. Next we import the webbrowser module for launching the browser and the sys module for reading any command line arguments. The sys.argv variable stores a list containing the program’s filename followed by the command line arguments. If this list has more than just the filename in it, then len(sys.argv) evaluates to an integer greater than 1, meaning that command line arguments have indeed been provided.

Command line arguments are usually separated by spaces, but in this case you want to interpret all of the arguments as a single string. Since sys.argv is a list of strings, you can pass it to the join() method, which returns a single string value. You don’t want the program name in this string, so instead of sys.argv you should pass sys.argv[1:] to chop off the first element of the list. The final string that this expression evaluates to is stored in the address variable.
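To see what the slice and the join do, here is a small sketch that works on a hand-written list in place of the real sys.argv. The names sample_argv and address_from_args are made up for illustration; the list mimics what Python would receive for a shortened version of the command shown earlier.

```python
# Hypothetical stand-in for sys.argv: the script name followed by the
# space-separated pieces of the address typed on the command line.
sample_argv = ['mapit1.py', 'Chapel', 'Rd,', 'Bagher', 'Complex,', 'Abids']


def address_from_args(argv):
    """Return the arguments after the script name joined by spaces, or None."""
    if len(argv) > 1:
        return ' '.join(argv[1:])
    return None


print(address_from_args(sample_argv))    # Chapel Rd, Bagher Complex, Abids
print(address_from_args(['mapit1.py']))  # None
```

The second call shows why the len() check matters: with only the filename in the list there is no address to build.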

The last line calls webbrowser.open() to open a browser with the resulting URL.
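The program above places the address into the URL exactly as typed, spaces and commas included, and relies on the browser to tidy it up. A slightly more defensive variant, sketched here as an optional extra and not part of the original script, encodes the address first with urllib.parse.quote_plus() from the standard library. The names MAPS_PREFIX and build_map_url are assumptions made for this example.

```python
from urllib.parse import quote_plus

MAPS_PREFIX = 'https://www.google.com/maps/place/'


def build_map_url(address):
    """Return the maps URL with the address percent-encoded."""
    # quote_plus() turns spaces into '+' and escapes characters such as commas.
    return MAPS_PREFIX + quote_plus(address)


print(build_map_url('Chapel Rd, Abids'))
# https://www.google.com/maps/place/Chapel+Rd%2C+Abids
```

The returned string can be handed to webbrowser.open() in the same way as the plain concatenated one.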

Now run the program by entering this into the command line . . .

mapit1.py Chapel Rd, Bagher Complex, Fateh Maidan, Abids, Hyderabad, Telangana 500001

A new page opens showing Covri Comunication and Management Solutions on Google Maps, as seen in the screenshot below:



Now let's consider a scenario where there are no command line arguments. In that case, our program will assume that the address is stored on the clipboard. See the modified program below:

#! python3
# mapIt.py - Launches a map in the browser using an address from the
# command line or clipboard.

import webbrowser, sys, pyperclip

if len(sys.argv) > 1:
    # Get the address from the command line.
    address = ' '.join(sys.argv[1:])
else:
    # Get the address from the clipboard.
    address = pyperclip.paste()

webbrowser.open('https://www.google.com/maps/place/' + address)



When no command line arguments are given, the program assumes that the address is stored on the clipboard. We get the clipboard content with pyperclip.paste() and store it in a variable named address. Notice that we have imported the pyperclip module to use its paste() function. pyperclip is a third-party module, so if it is not present, install it first (for example with pip install pyperclip); otherwise this program won't work.
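The choice between the two sources can also be written as a small function, which makes it easy to try out without touching the real clipboard. This is only a sketch: get_address, its paste parameter, and fake_paste are names invented for this example, and the strip() call is an extra not found in the script above. In the real script you would pass pyperclip.paste as the paste argument.

```python
def get_address(argv, paste):
    """Pick the address from the command line, falling back to the clipboard.

    argv  - a list shaped like sys.argv (script name first)
    paste - a function returning the clipboard text, e.g. pyperclip.paste
    """
    if len(argv) > 1:
        return ' '.join(argv[1:])
    # strip() removes stray whitespace or newlines copied along with the text
    # (an addition for this sketch, not part of the original program).
    return paste().strip()


# A fake clipboard stands in for pyperclip.paste in this demonstration.
def fake_paste():
    return 'Chapel Rd, Abids\n'


print(get_address(['mapIt.py', 'Fateh', 'Maidan'], fake_paste))  # Fateh Maidan
print(get_address(['mapIt.py'], fake_paste))                     # Chapel Rd, Abids
```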

Finally, to launch a web browser with the Google Maps URL, call webbrowser.open(). Now copy the address to the clipboard and run this program. Again, a new page opens showing Covri Comunication and Management Solutions on Google Maps.

The webbrowser module lets users cut out the step of opening the browser and directing themselves to a website. Other programs could use this functionality to do the following:

• Open all links on a page in separate browser tabs.
• Open the browser to the URL for your local weather.
• Open several social network sites that you regularly check.
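As a starting point for the third idea, here is a minimal sketch that opens a list of pages one after another. The function name open_sites, its opener parameter, and the example.com addresses are placeholders chosen for illustration; webbrowser.open_new_tab() is the standard library call that asks the browser for a new tab where possible.

```python
import webbrowser


def open_sites(urls, opener=webbrowser.open_new_tab):
    """Call opener on every URL and return how many were requested."""
    count = 0
    for url in urls:
        opener(url)
        count += 1
    return count


# Placeholder addresses; replace them with the sites you check regularly.
my_sites = ['https://example.com/news', 'https://example.com/mail']

# To open them for real, uncomment the next line:
# open_sites(my_sites)
```

Passing the opener as a parameter is a design choice for this sketch: it lets you swap in print while experimenting, so nothing is launched until you are happy with the list.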

Try to write programs that implement the functionality listed above. Once again, remember to set up the PATH variable so that you can run the programs from the command prompt. This ends today's discussion. In the next post we shall look into the requests module, so until we meet next, keep practicing and learning Python, as Python is easy to learn!
















