Thursday, December 13, 2018

Web scraping in Python (using requests module)

Requests is a Python HTTP library. It provides methods for accessing Web resources via HTTP and lets you download files from the Web without having to deal with low-level details such as connection handling and data compression. The requests module doesn’t come with Python, so you’ll have to install it first.

From the command line, run pip install requests:

C:\Users\Python>pip install requests

Once you execute this command, requests will be installed in your environment. We can also verify this by checking the version of requests with the help of a program. See the code below:

import requests

print(requests.__version__)
print(requests.__copyright__)

This program prints the version and copyright details as shown below:

2.21.0
Copyright 2018 Kenneth Reitz

------------------
(program exited with code: 0)

Press any key to continue . . .

Reading a web page

Using the requests.get() function we can read the content of a web page. The get() function issues a GET request and fetches the document identified by the given URL. See the example below:

#! python3
import requests as req

resp = req.get("http://www.covrisolutions.com")

print(resp.text)

The script grabs the content of the www.covrisolutions.com web page.

resp = req.get("http://www.covrisolutions.com")

The get() function returns a Response object.

print(resp.text)

The text attribute contains the content of the response as a Unicode string. The last part of the program's output is shown below:

                <span>Client Focus</span>
                <span>Quality Assurance</span>
                <span>Diverse Client Base</span>
            </div> Us:
        </div>

    </div>

    <a href="about.html" class="animate" data-anim-type="fadeInRight">Read More!
</a>

</div>
</div><!-- end features section2 -->

<div class="divider_line1"></div>

<div class="clearfix"></div>

<div class="footer1">
<div class="container">
<div class="margin_top3"></div>

        <!--<div class="one_half animate" data-anim-type="fadeInLeft">
        <div class="twitter_feeds_two">

                <div class="left"><i class="fa fa-twitter"></i> <h5 class="white
">Twitter Feeds</h5></div>

            <div class="right">gsrthemes9: Avira - Responsive html5 Professional
 and Brand New Look Template on ThemeForest. <em>.9 days ago .<a href="#">reply<
/a> .<a href="#">retweet</a> .<a href="#">favorite</a></em></div>

        </div>
    </div>--><!-- end twitter feeds -->

    <!--<div class="one_half last animate" data-anim-type="fadeInRight">
        <div class="newsletter_two">

        <div class="left"><i class="fa fa-envelope"></i> <h5 class="white">Sign
up Newsletter</h5></div>

        <div class="right">
        <form method="get" action="index.html">
        <input class="enter_email_input" name="samplees" id="samplees" value="Pl
ease enter your Email Address" onFocus="if(this.value == 'Please enter your Emai
l Address') {this.value = '';}" onBlur="if (this.value == '') {this.value = 'Ple
ase enter your Email Address';}" type="text">
            <div class="clearfix"></div>
            <input name="" value="Subscribe Now!" class="input_submit" type="sub
mit">
        </form>
        </div>

        </div>
        </div>--><!-- end newsletter sign up -->

    <!--<div class="clearfix divider_dashed1"></div>-->
    <div class="clearfix"></div>

    <div class="one_fourth animate" data-anim-type="fadeInUp">
    <div class="siteinfo">

        <h4 class="lmb">About COVRI</h4>

        <p>COVRI is a team of young professionals providing IT solutions for ent
erprises. We are based in Hyderabad, the prime location for IT in India. COVRI w
as established in 2010 with the goal of providing solid and inexpensive SharePoi
nt 2010Ar solutions.</p>
        <br />
        <a href="about.html">Read more <i class="fa fa-long-arrow-right"></i></a
>

        </div>
    </div><!-- end site info -->


    <div class="one_fourth animate" data-anim-type="fadeInUp">
    <div class="qlinks">

        <h4 class="lmb">Quick Links</h4>

        <ul>
            <li><a href="index.html"><i class="fa fa-angle-right"></i> Home</a><
/li>
            <li><a href="portfolio.html"><i class="fa fa-angle-right"></i> Key P
roducts</a></li>
            <li><a href="#"><i class="fa fa-angle-right"></i> Media </a></li>
            <li><a href="services.html"><i class="fa fa-angle-right"></i> Servic
es</a></li>
            <li><a href="careers.html"><i class="fa fa-angle-right"></i> Careers
</a></li>

        </ul>

    </div>
        </div><!-- end links -->


    <div class="one_fourth animate" data-anim-type="fadeInUp">
    <div class="qlinks">

        <h4 class="lmb">Services</h4>

        <ul>
            <li><a href="services.html"><i class="fa fa-angle-right"></i> Applic
ation Development</a></li>
            <li><a href="services.html"><i class="fa fa-angle-right"></i> Sharep
oint Solutions</a></li>
            <li><a href="services.html"><i class="fa fa-angle-right"></i> Mobile
 Apps</a></li>
            <li><a href="services.html"><i class="fa fa-angle-right"></i> e-Cont
ent</a></li>
            <li><a href="services.html"><i class="fa fa-angle-right"></i> Web De
velopment</a></li>

        </ul>

    </div>
        </div><!-- end links -->


    <div class="one_fourth last animate" data-anim-type="fadeInUp">
    <h4 class="lmb">Reach Us</h4>
        <ul class="faddress">
            <!--<li><img src="images/logo_1.png" alt="" /></li>-->
            <li><i class="fa fa-map-marker fa-lg"></i>&nbsp; 1st Floor, Baquer C
omplex Phase-II,<br> 5-9-165/A,&nbsp &nbsp Near Sujatha School,<br> Chapel Road,
 Hyderabad.</li>
            <li><i class="fa fa-phone"></i>&nbsp; +91 - 40 - 64645477</li>
            <!--<li><i class="fa fa-print"></i>&nbsp; 1 -234 -456 -7890</li>-->
            <li><a href="mailto:contact@covrisolutions.com"><i class="fa fa-enve
lope"></i> contact@covrisolutions.com</a></li>
            <li><img src="images/footer-wmap.png" alt="" /></li>
        </ul>
        </div><!-- end address -->

</div>
</div><!-- end footer -->

<div class="clearfix"></div>

<div class="copyright_info">
<div class="container">

        <div class="clearfix divider_dashed10"></div>

    <div class="one_half animate" data-anim-type="fadeInRight">

        Copyright Ac 2016 covri.com. All rights reserved.  <a href="#">Terms of
Use</a> | <a href="#">Privacy Policy</a>

    </div>

    <div class="one_half last">

        <ul class="footer_social_links">
            <li class="animate" data-anim-type="zoomIn"><a href="#"><i class="fa
 fa-facebook"></i></a></li>
            <li class="animate" data-anim-type="zoomIn"><a href="#"><i class="fa
 fa-twitter"></i></a></li>
            <li class="animate" data-anim-type="zoomIn"><a href="#"><i class="fa
 fa-google-plus"></i></a></li>
            <li class="animate" data-anim-type="zoomIn"><a href="#"><i class="fa
 fa-linkedin"></i></a></li>
            <!--<li class="animate" data-anim-type="zoomIn"><a href="#"><i class
="fa fa-skype"></i></a></li>
            <li class="animate" data-anim-type="zoomIn"><a href="#"><i class="fa
 fa-flickr"></i></a></li>
            <li class="animate" data-anim-type="zoomIn"><a href="#"><i class="fa
 fa-html5"></i></a></li>
            <li class="animate" data-anim-type="zoomIn"><a href="#"><i class="fa
 fa-youtube"></i></a></li>
            <li class="animate" data-anim-type="zoomIn"><a href="#"><i class="fa
 fa-rss"></i></a></li>-->
        </ul>

    </div>

</div>
</div><!-- end copyright info -->


<a href="#" class="scrollup">Scroll</a><!-- end scroll to top of the page-->

</div>-->


<!-- ######### JS FILES ######### -->
<!-- get jQuery from the google apis -->
<script type="text/javascript" src="js/universal/jquery.js"></script>

<!-- style switcher -->
<script src="js/style-switcher/jquery-1.js"></script>
<script src="js/style-switcher/styleselector.js"></script>

<!-- animations -->
<script src="js/animations/js/animations.min.js" type="text/javascript"></script
>


<!-- slide panel -->
<script type="text/javascript" src="js/slidepanel/slidepanel.js"></script>

<!-- Master Slider -->
<script src="js/masterslider/jquery.easing.min.js"></script>
<script src="js/masterslider/masterslider.min.js"></script>
<script type="text/javascript">
(function($) {
 "use strict";

var slider = new MasterSlider();
 slider.setup('masterslider' , {
     width: 1400,    // slider standard width
     height:580,   // slider standard height
     space:0,
         speed:45,
     fullwidth:true,
     loop:true,
     preload:0,
     autoplay:true,
         view:"basic"
});
// adds Arrows navigation control to the slider.
slider.control('arrows');
slider.control('bullets');

})(jQuery);
</script>

<!-- mega menu -->
<script src="js/mainmenu/bootstrap.min.js"></script>
<script src="js/mainmenu/customeUI.js"></script>

<!-- jquery jcarousel -->
<script type="text/javascript" src="js/carousel/jquery.jcarousel.min.js"></scrip
t>

<!-- scroll up -->
<script src="js/scrolltotop/totop.js" type="text/javascript"></script>

<!-- tabs -->
<script src="js/tabs/assets/js/responsive-tabs.min.js" type="text/javascript"></
script>

<!-- jquery jcarousel -->
<script type="text/javascript">
(function($) {
 "use strict";

        jQuery(document).ready(function() {
                        jQuery('#mycarouselthree').jcarousel();
        });

})(jQuery);
</script>


<!-- accordion -->
<script type="text/javascript" src="js/accordion/custom.js"></script>

<!-- sticky menu -->
<script type="text/javascript" src="js/mainmenu/sticky.js"></script>
<script type="text/javascript" src="js/mainmenu/modernizr.custom.75180.js"></scr
ipt>

<!-- cubeportfolio -->
<script type="text/javascript" src="js/cubeportfolio/jquery.cubeportfolio.min.js
"></script>
<script type="text/javascript" src="js/cubeportfolio/main.js"></script>
<script type="text/javascript" src="js/cubeportfolio/main5.js"></script>
<script type="text/javascript" src="js/cubeportfolio/main6.js"></script>

<!-- carousel -->
<script defer src="js/carousel/jquery.flexslider.js"></script>
<script defer src="js/carousel/custom.js"></script>

<!-- lightbox -->
<script type="text/javascript" src="js/lightbox/jquery.fancybox.js"></script>
<script type="text/javascript" src="js/lightbox/custom.js"></script>

</body>
</html>



------------------
(program exited with code: 0)

Press any key to continue . . .
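
To make the phrase "issues a GET request" a little more concrete, here is a small sketch that builds the request get() would issue, without actually sending it. It relies on requests.Request and its prepare() method, which are part of the requests library but are not used anywhere else in this post, so treat it as an illustration only:

```python
import requests

# Build the request that get() would issue, without sending it.
# Request and prepare() belong to requests, but this post's own
# examples do not use them; they appear here only for illustration.
request = requests.Request("GET", "http://www.covrisolutions.com")
prepared = request.prepare()

print(prepared.method)  # the HTTP verb that would be used
print(prepared.url)     # the URL as it would be sent to the server
```

Nothing goes over the network here, so the snippet runs even when you are offline.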


We can also strip the HTML tags from our web page as shown in the following program:

#! python3
import requests as req
import re

resp = req.get("http://www.covrisolutions.com")

content = resp.text

stripped = re.sub('<[^<]+?>', '', content)
print(stripped)

The output is shown below:

                                    DARGAH INFO
                                    Devotional
            Join with

                Extensive Experience
                Client Focus
                Quality Assurance
                Diverse Client Base
             Us:

    Read More!

        <!--


                 Twitter Feeds

            gsrthemes9: Avira - Responsive html5 Professional and Brand New Look
 Template on ThemeForest. .9 days ago .reply .retweet .favorite

    -->

    <!--
         Sign up Newsletter
        -->

    <!---->
        About COVRI

        COVRI is a team of young professionals providing IT solutions for enterp
rises. We are based in Hyderabad, the prime location for IT in India. COVRI was
established in 2010 with the goal of providing solid and inexpensive SharePoint
2010Ar solutions.

        Read more
        Quick Links

             Home
             Key Products
             Media
             Services
             Careers

        Services

             Application Development
             Sharepoint Solutions
             Mobile Apps
             e-Content
             Web Development

    Reach Us

            <!---->
            &nbsp; 1st Floor, Baquer Complex Phase-II, 5-9-165/A,&nbsp &nbsp Nea
r Sujatha School, Chapel Road, Hyderabad.
            &nbsp; +91 - 40 - 64645477
            <!--&nbsp; 1 -234 -456 -7890-->
             contact@covrisolutions.com

        Copyright Ac 2016 covri.com. All rights reserved.  Terms of Use | Privacy Policy
            <!--

            -->

Scroll
-->
(function($) {
 "use strict";

var slider = new MasterSlider();
 slider.setup('masterslider' , {
     width: 1400,    // slider standard width
     height:580,   // slider standard height
     space:0,
         speed:45,
     fullwidth:true,
     loop:true,
     preload:0,
     autoplay:true,
         view:"basic"
});
// adds Arrows navigation control to the slider.
slider.control('arrows');
slider.control('bullets');

})(jQuery);

(function($) {
 "use strict";

        jQuery(document).ready(function() {
                        jQuery('#mycarouselthree').jcarousel();
        });

})(jQuery);
------------------
(program exited with code: 0)

Press any key to continue . . .

The script strips the HTML tags of the www.covrisolutions.com web page:

stripped = re.sub('<[^<]+?>', '', content)
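
The output above still contains comment markers and JavaScript code. A small offline sketch shows why: the pattern removes only the tags themselves, not the text between them. The sample strings below are made up for the demonstration and are not taken from the real page:

```python
import re

# Same pattern as in the program above, compiled once for reuse.
TAG_PATTERN = re.compile(r"<[^<]+?>")

# Hypothetical sample markup, written by hand for this demonstration.
sample = (
    "<div class='siteinfo'><h4>About COVRI</h4>"
    "<p>IT solutions</p></div>"
    "<script>var x = 1;</script>"
)
commented = "<!--<li>hidden</li>-->"

stripped = TAG_PATTERN.sub("", sample)
stripped_comment = TAG_PATTERN.sub("", commented)

print(stripped)          # About COVRIIT solutionsvar x = 1;
print(stripped_comment)  # <!--hidden-->
```

The script body survives because it is plain text between two tags, and the comment markers survive because the pattern cannot match across the second "<". For quick experiments the regular expression is fine, but a real HTML parser is the more reliable tool.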


The output so far floods the console window, so it helps to print only part of the page. In the next program we'll see how to do this and also introduce the type() function and the status_code attribute.

By calling type() on requests.get()’s return value, you can see that it returns a Response object, which contains the response that the web server gave for your request. The Response object has a status_code attribute that can be checked against requests.codes.ok to see whether the request for the web page succeeded.

Let's make another program and implement what we just discussed:

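#! python3
import requests as req

resp = req.get("http://store.covrisolutions.com/Covri_Cascaded_Lookup.aspx")

# In a script this expression is evaluated and discarded; wrap it in
# print() to display <class 'requests.models.Response'>.
type(resp)

# Only read the body if the server answered 200 OK.
if resp.status_code == req.codes.ok:
    length = len(resp.text)
    print(length)             # total number of characters in the page
    print(resp.text[:1000])   # first 1000 characters only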

The output of the program is shown below:

41658

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.or
g/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="EN" lang="EN" dir="ltr">
<head profile="http://gmpg.org/xfn/11">
    <title>Covri Cascaded Lookup | Font Size Zoom Web Part | Text Size Zoom Web
Part
        | Text Size Change Web Part| SharePoint Web Parts | SharePoint 2010 | Mo
bile Apps
        | Android Apps | iPhone Apps</title>
    <meta charset="utf-8">
    <meta name="resource-type" content="document" />
    <meta http-equiv="pragma" content="no-cache" />
    <meta name="classification" content="SharePoint Web Parts, SharePoint 2010,
Mobile Apps, Android Apps, iPhone Apps, Hospital App, University App" />
    <meta name="description" content="COVRI is a team of young professionals pro
viding IT solutions for enterprises. We are based in Hyderabad, the prime locati
on for IT in India. COVRI was established in 2010 with the goal of providing sol
id


------------------
(program exited with code: 0)

Press any key to continue . . .

If the request succeeded, i.e. if the status_code is equal to the value of requests.codes.ok, then the downloaded web page is stored as a string in the Response object’s text attribute. This attribute holds the entire page as one large string; the call to len(resp.text) shows that it is 41658 characters long. Finally, calling print(resp.text[:1000]) displays only the first 1000 characters of the text.
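
The comparison against requests.codes.ok works because it is simply a readable name for the integer 200. The sketch below shows this without touching the network; the helper name download_succeeded is made up for this example:

```python
import requests

# requests.codes maps readable names to HTTP status numbers.
print(requests.codes.ok)         # 200
print(requests.codes.not_found)  # 404


def download_succeeded(status_code):
    """Hypothetical helper: True only for a plain 200 OK answer."""
    return status_code == requests.codes.ok


print(download_succeeded(200))  # True
print(download_succeeded(404))  # False
```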

There is another way to check for request success: call the raise_for_status() method on the Response object. Let's see how this works:

#! python3
import requests

response = requests.get('http://www.covrisolutions.com/contactus.html')

try:
    # Raises an exception for 4xx/5xx responses; does nothing otherwise.
    response.raise_for_status()
except Exception as exp:
    print('There was a problem: %s' % (exp))

The output of the program is shown below:

There was a problem: 404 Client Error: Not Found for url: http://www.covrisoluti
ons.com/contactus.html

------------------
(program exited with code: 0)

Press any key to continue . . .


The call to the raise_for_status() method on the Response object raised an exception because there was an error downloading the file, so the message was printed.

The raise_for_status() method is a good way to ensure that a program halts if a bad download occurs: it raises an exception if there was an error downloading the file and does nothing if the download succeeded. Always call raise_for_status() after calling requests.get(), so that you can be sure the download has actually worked before the program continues.
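
To try both outcomes without depending on a live web server, the sketch below builds Response objects by hand and sets their status_code, reason and url attributes. This is only a stand-in for demonstration purposes; in a real program the object always comes from requests.get(). The helper name check is made up for this example, and it catches requests.exceptions.HTTPError, the specific exception that raise_for_status() raises:

```python
import requests


def check(status_code, reason, url):
    """Hypothetical helper: report what raise_for_status() does."""
    # Hand-built Response, used here only so the demo runs offline.
    resp = requests.Response()
    resp.status_code = status_code
    resp.reason = reason
    resp.url = url
    try:
        resp.raise_for_status()
    except requests.exceptions.HTTPError as exc:
        return "There was a problem: %s" % exc
    return "OK"


print(check(200, "OK", "http://www.covrisolutions.com/about.html"))
print(check(404, "Not Found", "http://www.covrisolutions.com/contactus.html"))
```

The first call prints OK because nothing is raised for a 200 response; the second prints the same kind of 404 message we saw in the program output earlier.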

It is also possible to save the downloaded content into a file rather than printing it. Here we'll use the open() and write() methods we learnt in the file handling post. The only requirement is that the file be opened in write-binary mode ('wb'), even if the web page is plain text, because the downloaded data is written as bytes.

Let's modify our program as shown below:

#! python3
import requests

response = requests.get('http://www.covrisolutions.com/about.html')

try:
    response.raise_for_status()
except Exception as exp:
    print('There was a problem: %s' % (exp))
else:
    # Binary mode ('wb') is required because iter_content() yields bytes.
    # The with statement closes the file once all chunks are written.
    with open('covri_content.txt', 'wb') as mycontent:
        for mytext in response.iter_content(100000):
            mycontent.write(mytext)


When you run the program, the output is stored in the file covri_content.txt, which is created in your working directory if it is not already present (an existing file with that name is overwritten).

Here we have used the iter_content() method which returns “chunks” of the content on each iteration through the loop. Each chunk is of the bytes data type, and you get to specify how many bytes each chunk will contain. One hundred thousand bytes is generally a good size, so pass 100000 as the argument to iter_content().
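
The chunk-by-chunk writing can be tried without downloading anything. In the sketch below, save_chunks and fake_chunks are made-up helper names: fake_chunks merely imitates what iter_content() hands to the loop (a series of bytes objects), and the file is written to the system's temporary directory so nothing in your working directory is touched:

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of bytes objects to path; return bytes written."""
    total = 0
    with open(path, "wb") as out_file:
        for chunk in chunks:
            total += out_file.write(chunk)
    return total


def fake_chunks(data, chunk_size):
    """Stand-in for iter_content(): yield data in chunk_size pieces."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


# Hypothetical page body, 263 bytes long, split into 100-byte chunks.
data = b"<html>" + b"x" * 250 + b"</html>"
path = os.path.join(tempfile.gettempdir(), "covri_chunk_demo.txt")

written = save_chunks(fake_chunks(data, 100), path)
print(written, len(data))  # 263 263
```

With a real download you would pass response.iter_content(100000) as the first argument instead of fake_chunks(...).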

Now we have seen how the requests module handles downloading the contents of web pages. Once the page is downloaded, it is simply data in your program: even if you were to lose your Internet connection after downloading the web page, all the page data would still be on your computer.

Here we end today's discussion. In the next post we shall look into the Beautiful Soup library, so till we meet next, keep practicing and learning Python, as Python is easy to learn!










