Sunday, April 14, 2019

Pandas - 14 (Reading and Writing HTML Files)

pandas provides a pair of I/O API functions for the HTML format that convert data structures such as dataframes directly into HTML tables and back:

• read_html()
• to_html()

The read_html() function is needed because data on the Internet does not always come “ready to use,” that is, packaged in a TXT or CSV file. Very often the data appear as part of the text of web pages, so a function that reads them directly can prove really useful.

This activity is so widespread that it has its own name: web scraping. It is becoming a fundamental part of the first stage of data analysis: data mining and data preparation.

Writing into an HTML table

Using the to_html() function, we can directly convert a dataframe into an HTML table. The internal structure of the dataframe is automatically converted into nested <TH>, <TR>, and <TD> tags, retaining any internal hierarchies. Because data structures such as the dataframe can be quite complex and large, it’s great to have a function like this when you need to develop web pages. In the following program we'll create a dataframe and convert it into an HTML table:


import pandas as pd
import numpy as np

frame = pd.DataFrame(np.arange(4).reshape(2,2))
print(frame)
print("\nConverted dataframe into an HTML table\n")
print(frame.to_html())


The output of the program is shown below:

   0  1
0  0  1
1  2  3

Converted dataframe into an HTML table

<table border="1" class="dataframe">
  <thead>
    <tr style="text-align: right;">
      <th></th>
      <th>0</th>
      <th>1</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th>0</th>
      <td>0</td>
      <td>1</td>
    </tr>
    <tr>
      <th>1</th>
      <td>2</td>
      <td>3</td>
    </tr>
  </tbody>
</table>


As the output shows, to_html() generated the whole structure of HTML tags needed for a table that respects the internal structure of the dataframe. Let's write another program with a more complex dataframe:

import pandas as pd
import numpy as np

frame = pd.DataFrame(np.random.random((4,4)),
                     index = ['white','black','red','blue'],
                     columns = ['up','down','right','left'])
                    
print(frame)
s = ['<HTML>']
s.append('<HEAD><TITLE>My DataFrame</TITLE></HEAD>')
s.append('<BODY>')
s.append(frame.to_html())
s.append('</BODY></HTML>')
html = ''.join(s)
# The with block closes the file even if write() raises an error
with open('myFrame.html', 'w') as html_file:
    html_file.write(html)


We start by creating a dataframe with labeled rows and columns. Next we build the HTML page as a string. The code s = ['<HTML>'] creates a list that will hold the pieces of the page. Then we append the other HTML tags, and the dataframe converted to HTML, to that list.

The line html = ''.join(s) joins those pieces into a single string and stores it in the html variable. Finally we open an HTML file and write the content of the html variable into it.

The output of the program is shown below:

             up      down     right      left
white  0.175439  0.281051  0.197299  0.363159
black  0.060482  0.191544  0.065633  0.380458
red    0.440554  0.419918  0.906410  0.126702
blue   0.884483  0.463891  0.673394  0.094335


A new file, myFrame.html, is created in the project directory. Opened in a browser, it displays the following table:


             up      down     right      left
white  0.175439  0.281051  0.197299  0.363159
black  0.060482  0.191544  0.065633  0.380458
red    0.440554  0.419918  0.906410  0.126702
blue   0.884483  0.463891  0.673394  0.094335

Thus you can see that the dataframe has been converted into an HTML table.
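
To illustrate the earlier remark that to_html() retains internal hierarchies, here is a minimal sketch that uses a dataframe with a two-level row index. The labels, the values, and the output file name myFrame_hierarchical.html are made up for this illustration. The sketch also shows that to_html() can write the file itself when given a path, as an alternative to calling open() and write() by hand.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical two-level row index: a color, then an item.
index = pd.MultiIndex.from_product(
    [["white", "black"], ["ball", "pen"]], names=["color", "item"]
)
frame = pd.DataFrame(
    np.arange(8).reshape(4, 2) / 3, index=index, columns=["up", "down"]
)

# float_format is a one-argument function applied to every float cell.
html = frame.to_html(float_format=lambda x: f"{x:.2f}")

# Each outer label is written once and spans the rows of its inner labels.
print('rowspan="2"' in html)

# Passing a path as the first argument (buf) writes the table to that file.
# The temporary directory is used here only to avoid cluttering the project.
out_path = Path(tempfile.gettempdir()) / "myFrame_hierarchical.html"
frame.to_html(out_path, float_format=lambda x: f"{x:.2f}")
print(out_path.read_text()[:6])
```

Note that the file written this way contains only the table markup, not the surrounding HTML, HEAD and BODY tags that we added by hand in the previous program.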

Reading Data from an HTML File

The read_html() function parses an HTML page looking for HTML tables. Each table it finds is converted into a dataframe ready to be used in our data analysis. The function returns a list of dataframes even if there is only one table, and raises a ValueError if the page contains no table at all.

In the following program we'll parse the HTML file we created in the previous example:

import pandas as pd
import numpy as np

web_frames = pd.read_html('myFrame.html')

print(web_frames[0])


The output of the program is shown below:

  Unnamed: 0        up      down     right      left
0      white  0.175439  0.281051  0.197299  0.363159
1      black  0.060482  0.191544  0.065633  0.380458
2        red  0.440554  0.419918  0.906410  0.126702
3       blue  0.884483  0.463891  0.673394  0.094335


As the output shows, everything in the page outside the HTML table is ignored. The row labels came back as an ordinary column named Unnamed: 0 because their header cell in the HTML table is empty. Note that web_frames is a list of dataframes, even though in our case it holds only one. We select the item we want by its index; since there is only one item, the index is 0.
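
If we want the row labels back as the index instead of as a regular column, read_html() accepts an index_col argument. The sketch below is only an illustration: it parses a small hand-written table held in a string (a stand-in for myFrame.html, with fewer rows and columns) so that it runs on its own. Keep in mind that read_html() needs an HTML parser library installed, such as lxml, or BeautifulSoup4 together with html5lib.

```python
from io import StringIO

import pandas as pd

# Stand-in for myFrame.html: the first header cell is empty, which is
# the shape to_html() produces for a dataframe with row labels.
html = """
<table>
  <thead>
    <tr><th></th><th>up</th><th>down</th></tr>
  </thead>
  <tbody>
    <tr><th>white</th><td>0.175439</td><td>0.281051</td></tr>
    <tr><th>black</th><td>0.060482</td><td>0.191544</td></tr>
  </tbody>
</table>
"""

# index_col=0 tells the parser to use the first column as the row index.
frames = pd.read_html(StringIO(html), index_col=0)
frame = frames[0]

print(len(frames))
print(frame)
```

With the real file, the equivalent call would be pd.read_html('myFrame.html', index_col=0).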

In the next program we'll use another mode of read_html(): parsing a URL directly. In this mode the web page is downloaded and parsed, and the tables in it are extracted. We will call a web page (http://www.worldometers.info/world-population/india-population) that contains HTML tables showing the population of India.

See the following program:

import pandas as pd
import numpy as np

population = pd.read_html('http://www.worldometers.info/world-population/india-population/')

for p in population:
    print(p)


The output of the program is shown below:

                                   0
0     India Population (1950 - 2019)
1  Yearly Population Growth Rate (%)
    Year  Population  ... World Population  IndiaGlobal Rank
0   2019  1368737513  ...       7714576923                 2
1   2018  1354051854  ...       7632819325                 2
2   2017  1339180127  ...       7550262101                 2
3   2016  1324171354  ...       7466964280                 2
4   2015  1309053980  ...       7383008820                 2
5   2010  1230980691  ...       6958169159                 2
6   2005  1144118674  ...       6542159383                 2
7   2000  1053050912  ...       6145006989                 2
8   1995   960482795  ...       5751474416                 2
9   1990   870133480  ...       5330943460                 2
10  1985   781666671  ...       4873781796                 2
11  1980   696783517  ...       4458411534                 2
12  1975   621301720  ...       4079087198                 2
13  1970   553578513  ...       3700577650                 2
14  1965   497702365  ...       3339592688                 2
15  1960   449480608  ...       3033212527                 2
16  1955   409269055  ...       2772242535                 2

[17 rows x 13 columns]


     Year    Population  ... World Population  IndiaGlobal Rank
0     NaN           NaN  ...              NaN               NaN
1  2020.0  1.383198e+09  ...     7.795482e+09               2.0
2  2025.0  1.451829e+09  ...     8.185614e+09               1.0
3  2030.0  1.512985e+09  ...     8.551199e+09               1.0
4  2035.0  1.564570e+09  ...     8.892702e+09               1.0
5  2040.0  1.605356e+09  ...     9.210337e+09               1.0
6  2045.0  1.636496e+09  ...     9.504210e+09               1.0
7  2050.0  1.658978e+09  ...     9.771823e+09               1.0

[8 rows x 13 columns]
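
A real page often contains several tables, as we have just seen, and usually we need only some of them. One option is the match argument of read_html(), which keeps only the tables containing a given text. The sketch below does not download anything: the page is a hypothetical string with two small tables, and the two population figures are copied from the output above.

```python
from io import StringIO

import pandas as pd

# Hypothetical page: a navigation table and a population table.
page = """
<html><body>
<table>
  <thead><tr><th>Menu</th></tr></thead>
  <tbody><tr><td>Home</td></tr></tbody>
</table>
<table>
  <thead><tr><th>Year</th><th>Population</th></tr></thead>
  <tbody>
    <tr><td>2019</td><td>1368737513</td></tr>
    <tr><td>2018</td><td>1354051854</td></tr>
  </tbody>
</table>
</body></html>
"""

# match keeps only the tables whose text matches the given string or regex.
tables = pd.read_html(StringIO(page), match="Population")
population = tables[0]

print(len(tables))
print(population)
```

The result is still a list, so we select the dataframe by its index just as before.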




Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!