Monday, April 8, 2019

Pandas - 10 (Reading and Writing Data)

In this post we will see the tools provided by pandas for reading data stored in different types of media (such as files and databases). We will also see how to write data structures directly on these  formats, without bothering about the technologies used.

Our prime focus will be on a series of I/O API functions that pandas provides to read and write data directly as dataframe objects. Let's start by looking at text files, then move gradually to more complex binary formats. As this step is very important for data analysis and therefore a specific tool for this purpose must be present in the library pandas—a set of functions called I/O API. These
functions are divided into two main categories: readers and writers.

I/O API Tools

pandas is a library specialized for data analysis thus, it is mainly focused on calculation and data processing. The processes of writing and reading data from/to external files can be considered part of data processing.We can perform some operations in order to prepare the incoming data for manipulation.



CSV and Textual Files

Usually data are generally reported in tabular form. If the values in a row are separated by commas, we have the CSV (comma-separated values) format, the best-known and most popular format.

Other forms of tabular data can be separated by spaces or tabs and are typically contained in text files of various types (generally with the .txt extension). This type of file is the most common source of data and is easier to transcribe and interpret. In this regard, pandas provides a set of functions specific for this type of file, which are:

• read_csv
• read_table
• to_csv

a. Reading Data

Reading Data is the most common operation in data analysis. The data can be read from a CSV file or a text file. In the following program we'll create a CSV file 'mydata.csv' an read it's content.

mydata.csv

red,blue,yellow,green,stationary
1,5,2,3,pen
2,7,8,5,pencil
3,3,6,7,paper
2,2,8,3,eraser
4,4,2,1,sharpner



See the following program :

import pandas as pd
import numpy as np


frame1 = pd.read_csv('mydata.csv')

print('\nThe dataframe\n')
print(frame1)


The output of the program is shown below:

The dataframe

   red  blue  yellow  green    stationary
0    1     5      2            3              pen
1    2     7      8            5              pencil
2    3     3      6            7              paper
3    2     2      8            3              eraser
4    4     4      2            1              sharpner
------------------
(program exited with code: 0)

Press any key to continue . . .


In the above program we have used the read_csv() function to read the content of CSV file and convert it to a dataframe object. Since CSV files are considered text files, we can also use the read_table() function, but specify the delimiter. See the following program :

import pandas as pd
import numpy as np


frame1 = pd.read_table('mydata.csv',sep=',')

print('\nThe dataframe\n')
print(frame1)



The output of the program is shown below:

The dataframe

   red  blue  yellow  green    stationary
0    1     5      2            3              pen
1    2     7      8            5              pencil
2    3     3      6            7              paper
3    2     2      8            3              eraser
4    4     4      2            1              sharpner
------------------
(program exited with code: 0)

Press any key to continue . . .


We can see from the output that in the CSV file, headers that identify all the columns are in the first row. But this is not a general case; it often happens that the tabulated data begin directly in the first line as shown in the following CSV file 'mydata1.csv':

1,5,2,3,pen
2,7,8,5,pencil
3,3,6,7,paper
2,2,8,3,eraser
4,4,2,1,sharpner


See the following program :

import pandas as pd
import numpy as np


frame1 = pd.read_csv('mydata1.csv')

print('\nThe dataframe\n')
print(frame1)


The output of the program is shown below: 

The dataframe

    1  5  2  3       pen
0  2  7  8  5    pencil
1  3  3  6  7     paper
2  2  2  8  3    eraser
3  4  4  2  1  sharpner
------------------
(program exited with code: 0)

Press any key to continue . . . 


In the output shown above you may have noticed the difference between dataframe created for 'mydata.csv' and 'mydata1.csv' files (the header).  In such cases it's better to allow pandas to assign the default names to the columns by setting the header option to None. The following program does this:

import pandas as pd
import numpy as np


frame1 = pd.read_csv('mydata1.csv')

print('\nThe dataframe\n')
print(frame1)

frame1 = pd.read_csv('mydata1.csv',header=None)

print('\nThe new dataframe with header=None\n')
print(frame1)


The output of the program is shown below:

The dataframe

    1  5  2  3       pen
0  2  7  8  5    pencil
1  3  3  6  7     paper
2  2  2  8  3    eraser
3  4  4  2  1  sharpner

The new dataframe with header=None

     0  1  2  3         4
0  1  5  2  3       pen
1  2  7  8  5    pencil
2  3  3  6  7     paper
3  2  2  8  3    eraser
4  4  4  2  1  sharpner
------------------
(program exited with code: 0)

Press any key to continue . . .
  

Notice the colum names in the output shown above. We can specify the names directly by assigning a list of labels to the names option as shown in the next program:

import pandas as pd
import numpy as np


frame1 = pd.read_csv('mydata1.csv')

print('\nThe dataframe\n')
print(frame1)

frame1 = pd.read_csv('mydata1.csv',header=None)

print('\nThe new dataframe with header=None\n')
print(frame1)

frame1 = pd.read_csv('mydata1.csv',names=['red','blue','yellow','green','stationary'])

print('\nThe new dataframe with labels assigned\n')
print(frame1)


The output of the program is shown below: 

The dataframe

     1  5  2  3       pen
0  2  7  8  5    pencil
1  3  3  6  7     paper
2  2  2  8  3    eraser
3  4  4  2  1  sharpner

The new dataframe with header=None

     0  1  2  3         4
0  1  5  2  3       pen
1  2  7  8  5    pencil
2  3  3  6  7     paper
3  2  2  8  3    eraser
4  4  4  2  1  sharpner

The new dataframe with labels assigned

       red  blue  yellow  green stationary
0       1     5       2           3        pen
1       2     7       8          5      pencil
2       3     3       6          7      paper
3       2     2       8          3     eraser
4      4     4       2          1      sharpner
------------------
(program exited with code: 0)

Press any key to continue . . .


Suppose we want to create a dataframe with a hierarchical structure by reading a CSV file, we can do this if we extend the functionality of the read_csv() function by adding the index_col option, assigning all the columns to be converted into indexes.

Let's create a new CSV file(mydata3.csv) with two columns to be used as indexes of the hierarchy:

color,status,item1,item2,item3
black,up,3,4,6
black,down,2,6,7
white,up,5,5,5
white,down,3,3,2
white,left,1,2,1
red,up,2,2,2
red,down,1,1,4


In the following program we'll read the mydata3.csv file using the read_csv():

import numpy as np
import pandas as pd

frame1 = pd.read_csv('mydata3.csv',index_col=['color','status'])

print('\nThe dataframe\n')
print(frame1) 


The output of the program is shown below:

The dataframe

                         item1  item2  item3
color   status
black   up           3        4           6
           down       2        6           7
white   up           5        5           5
           down       3        3           2
           left           1        2           1
red      up            2        2           2
           down       1        1           4

------------------
(program exited with code: 0)

Press any key to continue . . .


As seen in the output, two columns 'color','status' have been used as indexes of the hierarchy in the dataframe.

Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!
Share:

0 comments:

Post a Comment