The list of pandas I/O API functions has traditionally had no specific tool for the XML (Extensible Markup Language) format; a read_xml() function was only added in pandas 1.3. Python does, however, have many other libraries (besides pandas) that manage the reading and writing of data in XML format.
One of them, the lxml library, is well known for its excellent performance when parsing very large files. In the following program we'll learn how to use this module to parse an XML file and how to integrate it with pandas to get a dataframe containing the requested data. The examples assume an XML file named books.xml in the working directory; we take the data structure described in that file and convert it into a dataframe.
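The post does not reproduce books.xml itself, so here is a minimal sketch of a file that would be consistent with the output shown further down. The root tag name Catalog is purely an assumption (any root name would work with the programs here); the Book fields and their values are taken from the printed output.

```python
from pathlib import Path

# Hypothetical contents for books.xml. The root tag <Catalog> is an
# assumption; the <Book> fields and values mirror the output in this post.
BOOKS_XML = """<?xml version="1.0"?>
<Catalog>
  <Book>
    <Author>Swami, Vivek</Author>
    <Title>Python with Vee</Title>
    <Genre>Computer</Genre>
    <Price>23.56</Price>
    <PublishDate>2014-22-01</PublishDate>
  </Book>
  <Book>
    <Author>Swami, Veevaeck</Author>
    <Title>Python is easy to learn</Title>
    <Genre>Computer</Genre>
    <Price>35.95</Price>
    <PublishDate>2014-12-16</PublishDate>
  </Book>
</Catalog>
"""

# Write the sample next to the script so objectify.parse('books.xml') finds it.
Path('books.xml').write_text(BOOKS_XML, encoding='utf-8')
```

Running this once creates the file in the current working directory, which is where the programs below look for it.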
from lxml import objectify
xml = objectify.parse('books.xml')
root = xml.getroot()
print(root.Book.Author)
print(root.Book.PublishDate)
print([child.tag for child in root.Book.getchildren()])
print([child.text for child in root.Book.getchildren()])
The first step is to import the objectify sub-module of the lxml library and parse the XML file with its parse() function. The result of parsing is an object tree, which is an internal data structure of the lxml module.
To navigate this tree and select elements one by one, we first obtain the root with the getroot() function. From the root we can reach the other nodes of the tree, each of which corresponds to a tag in the original XML file and carries the same name as that tag. To select a node, we write the tag names separated by dots, which mirrors the hierarchy of nodes in the tree; root.Book.Author, for example, is the Author node inside the first Book node.
Once we know how to access nodes individually, we can access several elements at once with getchildren(), which returns all the child nodes of the element it is called on.
The output of the program is shown below:
Swami, Vivek
2014-22-01
['Author', 'Title', 'Genre', 'Price', 'PublishDate']
['Swami, Vivek', 'Python with Vee', 'Computer', '23.56', '2014-22-01']
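Note that root.Book on its own only reaches the first Book node. As a rough sketch of how the remaining nodes can be reached, the snippet below indexes and iterates over root.Book, and uses iterchildren(), which lxml offers as a non-deprecated alternative to getchildren(). The inline XML is a shortened, hypothetical sample (the root tag Catalog is an assumption) so that the snippet runs without books.xml.

```python
from lxml import objectify

# Shortened inline sample; the root tag <Catalog> is an assumption and only
# two of the Book fields from the post are included.
XML = b"""<Catalog>
  <Book><Author>Swami, Vivek</Author><Title>Python with Vee</Title></Book>
  <Book><Author>Swami, Veevaeck</Author><Title>Python is easy to learn</Title></Book>
</Catalog>"""

root = objectify.fromstring(XML)

# root.Book refers to the first <Book>; indexing reaches its same-tag siblings.
first_author = root.Book.Author.text
second_author = root.Book[1].Author.text

# Iterating over root.Book visits every <Book> sibling in document order.
titles = [book.Title.text for book in root.Book]

# iterchildren() yields the same child nodes as getchildren(), lazily.
tags = [child.tag for child in root.Book.iterchildren()]

print(first_author)
print(second_author)
print(titles)
print(tags)
```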
Now that we can move through the lxml.etree tree structure, the next step is to convert it into a dataframe. See the following program:
from lxml import objectify
import pandas as pd

xml = objectify.parse('books.xml')
root = xml.getroot()

def etree2df(root):
    # Column names come from the tags of the first record's children.
    records = root.getchildren()
    column_names = [child.tag for child in records[0].getchildren()]
    # Collect one dict per record and build the dataframe in a single step
    # (DataFrame.append was removed in pandas 2.0).
    rows = []
    for record in records:
        texts = [child.text for child in record.getchildren()]
        rows.append(dict(zip(column_names, texts)))
    return pd.DataFrame(rows, columns=column_names)

print(etree2df(root))
The output of the program is shown below:
            Author                    Title     Genre  Price PublishDate
0     Swami, Vivek          Python with Vee  Computer  23.56  2014-22-01
1  Swami, Veevaeck  Python is easy to learn  Computer  35.95  2014-12-16
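Because every value is taken from the .text of a node, all columns of the resulting dataframe hold strings, including Price. A minimal sketch of converting that column to numbers follows; the frame is rebuilt by hand from the output above (an assumption made only so the snippet runs on its own, in practice you would use the result of etree2df(root)).

```python
import pandas as pd

# Stand-in for the frame returned by etree2df(root); the values are copied
# from the output shown in this post and are all strings.
df = pd.DataFrame({
    'Author': ['Swami, Vivek', 'Swami, Veevaeck'],
    'Title': ['Python with Vee', 'Python is easy to learn'],
    'Genre': ['Computer', 'Computer'],
    'Price': ['23.56', '35.95'],
    'PublishDate': ['2014-22-01', '2014-12-16'],
})

# Convert the Price strings to floats so arithmetic works on the column.
df['Price'] = pd.to_numeric(df['Price'])

print(df.dtypes)
print(df['Price'].mean())
```

After the conversion, numeric operations such as mean() or sum() work on Price, while the other columns remain plain strings.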
Here I am ending today’s post. Until we meet again, keep practicing and learning Python, as Python is easy to learn!