Monday, April 15, 2019

Pandas - 15 (Reading data from XML File)

At the time of writing, the list of pandas I/O API functions has no specific tool for the XML (Extensible Markup Language) format. However, Python has many other libraries (besides pandas) that manage the reading and writing of data in XML format.


The lxml library is well known for its excellent performance when parsing very large files. The following program shows how to use this module to parse an XML file and how to integrate it with pandas to get a dataframe containing the requested data. First we need an XML file, books.xml, whose data structure we will convert directly into a dataframe.
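The contents of books.xml are not reproduced in this post, so here is a minimal sketch that writes a file consistent with the output shown further down. The root tag Catalog and the overall layout are assumptions; only the tag names and field values come from the program output in this post.

```python
# Write a small books.xml for the programs in this post.
# Assumption: the root tag <Catalog> is a placeholder, since the post does not
# show the real file. Field values are taken from the output shown in the post.
books_xml = """<Catalog>
  <Book>
    <Author>Swami, Vivek</Author>
    <Title>Python with Vee</Title>
    <Genre>Computer</Genre>
    <Price>23.56</Price>
    <PublishDate>2014-22-01</PublishDate>
  </Book>
  <Book>
    <Author>Swami, Veevaeck</Author>
    <Title>Python is easy to learn</Title>
    <Genre>Computer</Genre>
    <Price>35.95</Price>
    <PublishDate>2014-12-16</PublishDate>
  </Book>
</Catalog>
"""

# Save the text next to the script so that objectify.parse('books.xml') can find it
with open('books.xml', 'w', encoding='utf-8') as f:
    f.write(books_xml)
```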

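from lxml import objectify

# Parse the file into an lxml object tree
xml = objectify.parse('books.xml')

# The root element is the entry point for navigation
root = xml.getroot()

# Dot notation follows the tag hierarchy: root -> first Book -> Author
print(root.Book.Author)
print(root.Book.PublishDate)

# Tags and text of every child node of the first Book
print([child.tag for child in root.Book.getchildren()])
print([child.text for child in root.Book.getchildren()])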


The first step is to import the objectify sub-module of the lxml library and parse the XML file with its parse() function. The result of parsing is an object tree, which is an internal data structure of the lxml module.

To navigate this tree structure and select element by element, we first obtain the root with the getroot() function. Once we have the root, we can access the various nodes of the tree, each corresponding to a tag in the original XML file. The items have the same names as the corresponding tags, so to select them we write the tags separated by dots, which mirrors the hierarchy of nodes in the tree.
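As a small illustration of dot navigation, the sketch below parses an inline sample instead of the file. The root tag Catalog is an assumption; the values are the ones that appear in this post's output. It also shows indexing to reach a second element with the same tag, and the pyval attribute that objectify provides for converted values.

```python
from lxml import objectify

# Inline sample; the root tag <Catalog> is an assumption, the values come
# from the output shown in this post.
root = objectify.fromstring("""<Catalog>
  <Book>
    <Author>Swami, Vivek</Author>
    <Title>Python with Vee</Title>
    <Price>23.56</Price>
  </Book>
  <Book>
    <Author>Swami, Veevaeck</Author>
    <Title>Python is easy to learn</Title>
    <Price>35.95</Price>
  </Book>
</Catalog>""")

first = root.Book       # dot access returns the first <Book> child
second = root.Book[1]   # indexing selects the next sibling with the same tag

print(len(root.Book))       # number of <Book> elements under the root
print(first.Title.text)     # element text as a plain string
print(second.Author.text)
print(first.Price.pyval)    # objectify converts numeric-looking text to a number
```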

Once we know how to access nodes individually, we can access several elements at the same time using getchildren(). This function returns all the child nodes of the reference element.
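The child nodes can also be collected into a dictionary that maps each tag to its text, which is a convenient shape for building table rows later. This is a sketch on an inline sample with an assumed root tag Catalog; it uses iterchildren(), which walks the same child nodes one at a time.

```python
from lxml import objectify

# Inline sample with a single record; <Catalog> is an assumed root tag and
# the values are taken from the output shown in this post.
root = objectify.fromstring("""<Catalog>
  <Book>
    <Author>Swami, Vivek</Author>
    <Title>Python with Vee</Title>
    <Genre>Computer</Genre>
    <Price>23.56</Price>
    <PublishDate>2014-22-01</PublishDate>
  </Book>
</Catalog>""")

# Map every child tag of the first Book to its text content
record = {child.tag: child.text for child in root.Book.iterchildren()}
print(record)
```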

The output of the program is shown below:

Swami, Vivek
2014-22-01
['Author', 'Title', 'Genre', 'Price', 'PublishDate']
['Swami, Vivek', 'Python with Vee', 'Computer', '23.56', '2014-22-01']


Now that we can move through the lxml.etree tree structure, we need to convert it into a dataframe. See the following program:

from lxml import objectify
import pandas as pd

xml = objectify.parse('books.xml')
root = xml.getroot()

def etree2df(root):
    # Each child of the root is one record (one row of the dataframe)
    records = root.getchildren()
    # Column names come from the tags of the first record's children
    column_names = [child.tag for child in records[0].getchildren()]
    rows = []
    for record in records:
        # Text of every field in this record, in document order
        texts = [child.text for child in record.getchildren()]
        rows.append(dict(zip(column_names, texts)))
    # Build the dataframe in one step; DataFrame.append was removed in pandas 2.0
    return pd.DataFrame(rows, columns=column_names)

print(etree2df(root))



The output of the program is shown below:

            Author                    Title     Genre  Price PublishDate
0     Swami, Vivek          Python with Vee  Computer  23.56  2014-22-01
1  Swami, Veevaeck  Python is easy to learn  Computer  35.95  2014-12-16
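
Newer pandas releases (1.3 and later) also include a read_xml function, so for a flat file like this one the hand-written conversion is optional. The following is a minimal sketch under that version assumption. The inline XML, including the root tag Catalog, is a stand-in for books.xml, and the default lxml parser is assumed to be installed.

```python
from io import StringIO

import pandas as pd

# Stand-in for books.xml; the root tag <Catalog> is an assumption and the
# values are taken from the output shown in this post.
xml_text = """<Catalog>
  <Book>
    <Author>Swami, Vivek</Author>
    <Title>Python with Vee</Title>
    <Genre>Computer</Genre>
    <Price>23.56</Price>
    <PublishDate>2014-22-01</PublishDate>
  </Book>
  <Book>
    <Author>Swami, Veevaeck</Author>
    <Title>Python is easy to learn</Title>
    <Genre>Computer</Genre>
    <Price>35.95</Price>
    <PublishDate>2014-12-16</PublishDate>
  </Book>
</Catalog>"""

# Requires pandas >= 1.3. By default every child of the root becomes a row
# and its child elements become the columns.
df = pd.read_xml(StringIO(xml_text))
print(df)
print(df.dtypes)
```

Unlike etree2df, which keeps every value as text, read_xml infers column types, so Price comes back as a numeric column.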



Here I am ending today’s post. Until we meet again, keep practicing and learning Python, as Python is easy to learn!