pandas provides a pair of I/O API functions for the HTML format. They make it possible to convert complex data structures such as dataframes directly into HTML tables, and to read HTML tables back into dataframes:
• read_html()
• to_html()
The read_html() function is needed because data on the Internet is not always available in a “ready to use” form, that is, packaged in a TXT or CSV file. Very often the data is reported as part of the text of a web page, so having a function that can read it directly is really useful.
Extracting data from web pages in this way is so widespread that it has its own name: web scraping. It is becoming a fundamental part of the first stage of data analysis: data mining and data preparation.
Writing into an HTML table
Using the to_html() function, we can directly convert a dataframe into an HTML table. The internal structure of the dataframe is automatically converted into nested <th>, <tr>, and <td> tags, retaining any internal hierarchies. Because data structures such as the dataframe can be quite large and complex, a function like this is very handy when you need to develop web pages. In the following program we'll create a dataframe and convert it into an HTML table:
import pandas as pd
import numpy as np
frame = pd.DataFrame(np.arange(4).reshape(2,2))
print(frame)
print("\nConverted dataframe into an HTML table\n")
print(frame.to_html())
The output of the program is shown below:
   0  1
0  0  1
1  2  3
Converted dataframe into an HTML table
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>0</td>
<td>1</td>
</tr>
<tr>
<th>1</th>
<td>2</td>
<td>3</td>
</tr>
</tbody>
</table>
------------------
(program exited with code: 0)
Press any key to continue . . .
As the output shows, all the HTML tags needed to build the table were generated correctly and reflect the internal structure of the dataframe. Let's write another program with a more complex dataframe. See the following program:
import pandas as pd
import numpy as np
# Build a 4x4 dataframe of random values with labeled rows and columns
frame = pd.DataFrame(np.random.random((4,4)),
                     index = ['white','black','red','blue'],
                     columns = ['up','down','right','left'])
print(frame)
# Collect the pieces of the HTML page in a list
s = ['<HTML>']
s.append('<HEAD><TITLE>My DataFrame</TITLE></HEAD>')
s.append('<BODY>')
s.append(frame.to_html())
s.append('</BODY></HTML>')
# Join the pieces into a single string and write it to a file
html = ''.join(s)
html_file = open('myFrame.html','w')
html_file.write(html)
html_file.close()
We start by creating a dataframe with labeled rows and columns. Next we build an HTML page as a string. The code s = ['<HTML>'] creates a list whose first element is the opening tag of the HTML page. Then we append the other HTML tags and the dataframe converted to HTML to this list.
The line html = ''.join(s) joins the elements of the list into a single string holding the complete HTML page and stores it in the html variable. Finally we open an HTML file and write the content of the html variable into it.
The output of the program is shown below:
             up      down     right      left
white  0.175439  0.281051  0.197299  0.363159
black  0.060482  0.191544  0.065633  0.380458
red    0.440554  0.419918  0.906410  0.126702
blue   0.884483  0.463891  0.673394  0.094335
------------------
(program exited with code: 0)
Press any key to continue . . .
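In the project directory a new file, myFrame.html, will be created. When opened in a browser it displays the following table:
| | up | down | right | left |
|---|---|---|---|---|
| white | 0.175439 | 0.281051 | 0.197299 | 0.363159 |
| black | 0.060482 | 0.191544 | 0.065633 | 0.380458 |
| red | 0.440554 | 0.419918 | 0.906410 | 0.126702 |
| blue | 0.884483 | 0.463891 | 0.673394 | 0.094335 |
Thus you can see that the dataframe has been converted into an HTML table.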
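The to_html() function also accepts optional arguments that control the generated markup. The following is only a minimal sketch: it assumes a frame with fixed values instead of random ones so that the result is reproducible, and the CSS class name my-table and the file name myFrame_formatted.html are made up for illustration.

```python
import numpy as np
import pandas as pd

# Fixed values (0.0, 0.125, ..., 1.875) so the output is reproducible
frame = pd.DataFrame(np.arange(16).reshape(4, 4) / 8,
                     index=['white', 'black', 'red', 'blue'],
                     columns=['up', 'down', 'right', 'left'])

# classes adds an extra CSS class next to the default "dataframe" class;
# float_format takes a function that turns each float into a string
table_html = frame.to_html(classes='my-table',
                           float_format='{:.2f}'.format)

# Wrap the table in the same page skeleton used earlier
page = '\n'.join(['<HTML>',
                  '<HEAD><TITLE>My DataFrame</TITLE></HEAD>',
                  '<BODY>',
                  table_html,
                  '</BODY></HTML>'])

# The with statement closes the file even if write() raises an error
with open('myFrame_formatted.html', 'w') as html_file:
    html_file.write(page)
```

Opening the file with a with statement is a small design choice: it replaces the explicit close() call, so the file is not left open if something goes wrong while writing.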
Reading Data from an HTML File
The read_html() function parses an HTML page looking for HTML tables. Each table it finds is converted into a dataframe ready to be used in our data analysis. The read_html() function returns a list of dataframes even if there is only one table, and raises a ValueError if it finds no table at all.
In the following program we'll parse the HTML file we created in the previous example:
import pandas as pd
import numpy as np
web_frames = pd.read_html('myFrame.html')
print(web_frames[0])
The output of the program is shown below:
  Unnamed: 0        up      down     right      left
0      white  0.175439  0.281051  0.197299  0.363159
1      black  0.060482  0.191544  0.065633  0.380458
2        red  0.440554  0.419918  0.906410  0.126702
3       blue  0.884483  0.463891  0.673394  0.094335
------------------
(program exited with code: 0)
Press any key to continue . . .
As the output shows, the tags that do not belong to the HTML table are ignored completely. Note that the row labels written by to_html() come back as an ordinary column named Unnamed: 0, because their header cell in the table is empty. Also, web_frames is a list of dataframes, even though in our case only one dataframe is extracted. We select the item we want by its index in the list; since there is only one item, the index is 0.
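The first column of the parsed table holds the row labels of the original dataframe. As a hedged sketch, the index_col parameter of read_html() can turn that column back into the index. The example assumes that an HTML parser such as lxml is installed, uses integer values so that the round trip is exact, and wraps the HTML in StringIO so that no file is needed:

```python
from io import StringIO

import numpy as np
import pandas as pd

# Integer values survive the conversion to HTML text and back unchanged
frame = pd.DataFrame(np.arange(16).reshape(4, 4),
                     index=['white', 'black', 'red', 'blue'],
                     columns=['up', 'down', 'right', 'left'])

html = frame.to_html()

# read_html() always returns a list, so take the first (and only) table;
# index_col=0 uses the first column of the table as the row labels
restored = pd.read_html(StringIO(html), index_col=0)[0]
print(restored)
```

With float data the values read back may differ slightly from the originals, since to_html() writes them with a limited number of digits.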
In the next program we'll use read_html() in another mode: parsing a URL on the Web directly. In this mode the web page is downloaded and parsed, and the tables it contains are extracted. We will use a web page (http://www.worldometers.info/world-population/india-population) that contains HTML tables showing the population of India.
See the following program:
import pandas as pd
import numpy as np
population = pd.read_html('http://www.worldometers.info/world-population/india-population/')
for p in population:
    print(p)
The output of the program is shown below:
0
0 India Population (1950 - 2019)
1 Yearly Population Growth Rate (%)
Year Population ... World Population IndiaGlobal Rank
0 2019 1368737513 ... 7714576923 2
1 2018 1354051854 ... 7632819325 2
2 2017 1339180127 ... 7550262101 2
3 2016 1324171354 ... 7466964280 2
4 2015 1309053980 ... 7383008820 2
5 2010 1230980691 ... 6958169159 2
6 2005 1144118674 ... 6542159383 2
7 2000 1053050912 ... 6145006989 2
8 1995 960482795 ... 5751474416 2
9 1990 870133480 ... 5330943460 2
10 1985 781666671 ... 4873781796 2
11 1980 696783517 ... 4458411534 2
12 1975 621301720 ... 4079087198 2
13 1970 553578513 ... 3700577650 2
14 1965 497702365 ... 3339592688 2
15 1960 449480608 ... 3033212527 2
16 1955 409269055 ... 2772242535 2
[17 rows x 13 columns]
Year Population ... World Population IndiaGlobal Rank
0 NaN NaN ... NaN NaN
1 2020.0 1.383198e+09 ... 7.795482e+09 2.0
2 2025.0 1.451829e+09 ... 8.185614e+09 1.0
3 2030.0 1.512985e+09 ... 8.551199e+09 1.0
4 2035.0 1.564570e+09 ... 8.892702e+09 1.0
5 2040.0 1.605356e+09 ... 9.210337e+09 1.0
6 2045.0 1.636496e+09 ... 9.504210e+09 1.0
7 2050.0 1.658978e+09 ... 9.771823e+09 1.0
[8 rows x 13 columns]
------------------
(program exited with code: 0)
Press any key to continue . . .
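When a page contains several tables, as in the output above, read_html() can be asked to keep only the tables whose text matches a pattern. The sketch below uses the match parameter on a small hand-written page instead of the live site, so it runs without network access. The page markup is an assumption made for illustration, while the figures are copied from the output above. It again assumes that an HTML parser such as lxml is installed:

```python
from io import StringIO

import pandas as pd

# Hand-written stand-in for a page with two tables (not the real site markup)
html = """
<html><body>
<table>
  <thead><tr><th>Year</th><th>Global Rank</th></tr></thead>
  <tbody>
    <tr><td>2019</td><td>2</td></tr>
    <tr><td>2018</td><td>2</td></tr>
  </tbody>
</table>
<table>
  <thead><tr><th>Year</th><th>Population</th></tr></thead>
  <tbody>
    <tr><td>2019</td><td>1368737513</td></tr>
    <tr><td>2018</td><td>1354051854</td></tr>
  </tbody>
</table>
</body></html>
"""

# match keeps only the tables containing text that matches the pattern
tables = pd.read_html(StringIO(html), match='Population')

print(len(tables))
print(tables[0])
```

Only the second table contains the word Population, so the returned list holds a single dataframe.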
Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!