Tuesday, April 2, 2019

Pandas - 6 (Other Functionalities on Indexes)

Reindexing

We already know that an Index object cannot be modified once it has been declared in a data structure. Re-indexing gets around this restriction: instead of changing the existing index, it builds a new data structure from the existing one, in which the labels can be defined again. To re-index a series, pandas provides the reindex() function.

This function creates a new series object with the values of the previous series rearranged according to the new sequence of labels. During re-indexing, it is possible to change the order of the labels, delete some of them, or add new ones. In the case of a new label, pandas adds NaN as the corresponding value. See the following program:

import pandas as pd

ser = pd.Series([2,5,7,4], index=['one','two','three','four'])
print(ser)
print('\n')
print('\nThe re-indexed series:\n')
print(ser.reindex(['three','four','five','one']))


The output of the program is shown below:

one      2
two      5
three    7
four     4
dtype: int64

The re-indexed series:

three    7.0
four     4.0
five     NaN
one      2.0
dtype: float64
------------------
(program exited with code: 0)

Press any key to continue . . .


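As we can see from the value returned, the order of the labels has been completely rearranged. The value corresponding to the label two has been dropped and a new label called five is present in the series with the value NaN. Because NaN is a floating-point value, the dtype of the result changes from int64 to float64.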
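
As a small sketch of the same call, the snippet below checks that reindex() leaves the original series untouched. It also uses the fill_value option, which the program above does not use, to supply a default instead of NaN for the new label. The variable names new_ser and filled are only illustrative.

```python
import pandas as pd

ser = pd.Series([2, 5, 7, 4], index=['one', 'two', 'three', 'four'])

# reindex() returns a new series; ser itself is not modified
new_ser = ser.reindex(['three', 'four', 'five', 'one'])
print(ser.index.tolist())
print(new_ser)

# fill_value replaces the NaN that would otherwise be used for 'five'
filled = ser.reindex(['three', 'four', 'five', 'one'], fill_value=0)
print(filled)
```

Choosing a fill value such as 0 only makes sense when 0 is a meaningful default for the data; otherwise NaN is the safer marker for "no value".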

However, writing out the complete list of labels can be awkward, especially with a large series or dataframe. In such cases you can use an option that fills in the missing values automatically.

Let's take another example to better understand the functioning of this mode of automatic re-indexing. See the following program:

import pandas as pd

ser = pd.Series([1,5,6,3],index=[0,3,5,6])
print(ser)
print('\n')
print('\nThe re-indexed series with ffill:\n')
print(ser.reindex(range(6),method='ffill'))
print('\nThe re-indexed series with bfill:\n')
print(ser.reindex(range(6),method='bfill'))



The output of the program is shown below:

0    1
3    5
5    6
6    3
dtype: int64

The re-indexed series with ffill:

0    1
1    1
2    1
3    5
4    5
5    6
dtype: int64

The re-indexed series with bfill:

0    1
1    5
2    5
3    5
4    6
5    6
dtype: int64


As you can see in this example, the index of the original series is not a complete sequence of numbers; the labels 1, 2, and 4 are missing. A common need is to fill these gaps in order to obtain the complete sequence. To achieve this, we use re-indexing with the method option set to ffill. We also need to set a range of values for the new index. In this case, to specify the values from 0 to 5, we use range(6) as an argument.

With the method option set to ffill, the labels that were not present in the original series are added, and each one takes the value of the closest preceding label in the original series. In fact, the labels 1 and 2 have the value 1, which belongs to index 0, and the label 4 has the value 5, which belongs to index 3.

If we want each new label to take the value of the following label instead, we have to use the bfill method. In this case, the value assigned to the labels 1 and 2 is 5, which belongs to index 3, and the label 4 gets the value 6, which belongs to index 5.
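
To compare the options side by side, here is a minimal sketch that re-indexes the same series without a method, with ffill, with bfill, and with ffill combined with the limit option. The limit option is not used in the program above; it caps how many consecutive missing labels are filled. The variable names are only illustrative.

```python
import pandas as pd

ser = pd.Series([1, 5, 6, 3], index=[0, 3, 5, 6])

# Without a method, labels missing from the original index become NaN
plain = ser.reindex(range(6))

# ffill copies the value of the previous existing label forward
forward = ser.reindex(range(6), method='ffill')

# bfill copies the value of the next existing label backward
backward = ser.reindex(range(6), method='bfill')

# limit=1 fills at most one consecutive missing label per gap
limited = ser.reindex(range(6), method='ffill', limit=1)

print(pd.DataFrame({'plain': plain, 'ffill': forward,
                    'bfill': backward, 'ffill_limit1': limited}))
```

Note that ffill and bfill only work when the original index is sorted (monotonically increasing or decreasing), as it is here.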

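Now let's apply the concepts of re-indexing a series to the dataframe. Here the rearrangement can involve the index (rows), the columns, or both. As we know, adding a new column or index label is possible, and pandas adds NaN for any label that is not present in the original data structure unless a fill method is given. In the following program method='ffill' acts on the columns as well: the labels colors and new do not exist in the original dataframe, so each of them is filled with the values of color, the closest preceding column label in alphabetical order. See the following program: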

import pandas as pd


my_dict = {'color' : ['blue','green','yellow','red','white'],
           'object' : ['ball','pen','pencil','paper','mug'],
           'price' : [1.2,1.0,0.6,0.9,1.7]}

frame = pd.DataFrame(my_dict)
# 'colors' and 'new' are not columns of frame; ffill fills them from 'color'
print(frame.reindex(range(5), method='ffill',columns=['colors','price','new','object']))



The output of the program is shown below:

   colors  price     new  object
0    blue    1.2    blue    ball
1   green    1.0   green     pen
2  yellow    0.6  yellow  pencil
3     red    0.9     red   paper
4   white    1.7   white     mug
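
For comparison, the following sketch re-indexes the same dataframe without any fill method, so that the NaN values become visible. The row label 5 and the column label new are chosen only for illustration; neither exists in the original dataframe.

```python
import pandas as pd

frame = pd.DataFrame({'color': ['blue', 'green', 'yellow', 'red', 'white'],
                      'object': ['ball', 'pen', 'pencil', 'paper', 'mug'],
                      'price': [1.2, 1.0, 0.6, 0.9, 1.7]})

# No fill method: the unknown column 'new' and the unknown row 5 hold NaN
result = frame.reindex(index=[0, 2, 4, 5], columns=['color', 'price', 'new'])
print(result)
```

The rows 1 and 3 and the column object are simply left out of the result, because they are not listed in the new labels.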


Dropping

Dropping is another operation tied to Index objects. Deleting a row or a column is simple because labels identify both the index entries and the column names. pandas provides a specific function for dropping, called drop(). This method returns a new object without the items that you want to delete.

In the following program we remove a single item, and then several items, from a series:

import pandas as pd
import numpy as np

ser = pd.Series(np.arange(4.), index = ['Python','Java','C','C++'])
print('Original series\n')
print(ser)

print('\nSeries after deleting an item\n')
print(ser.drop('Java'))#delete the item corresponding to the label Java
print('\nSeries after deleting multiple items\n')
print(ser.drop(['C','C++']))


The output of the program is shown below: 

Original series

Python    0.0
Java      1.0
C         2.0
C++       3.0
dtype: float64

Series after deleting an item

Python    0.0
C         2.0
C++       3.0
dtype: float64

Series after deleting multiple items

Python    0.0
Java      1.0
dtype: float64


In the above program we first define a generic series of four elements with four distinct labels. To delete the item corresponding to a specific label (Java in our case), we pass that label as an argument to drop(). To remove more than one item, we pass a list with the corresponding labels (['C','C++'] in our case). In both cases the original series is left unchanged, which is why Java reappears in the last output.
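
The sketch below illustrates two details of drop() on the same series: the original object keeps all of its items, and a label that does not exist raises a KeyError unless errors='ignore' is passed. The label Ruby is a made-up example of a missing label.

```python
import pandas as pd
import numpy as np

ser = pd.Series(np.arange(4.), index=['Python', 'Java', 'C', 'C++'])

# drop() returns a new series; the original keeps all four items
shorter = ser.drop('Java')
print(len(ser), len(shorter))

# A label that does not exist raises KeyError by default
try:
    ser.drop('Ruby')
except KeyError as err:
    print('KeyError:', err)

# errors='ignore' skips the labels that are not found
remaining = ser.drop(['Ruby', 'C'], errors='ignore')
print(remaining.index.tolist())
```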

Dropping from Dataframe

Values can be deleted from a dataframe by referring to the labels of either axis. See the following program:

import pandas as pd
import numpy as np

frame = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])
                   
print('Original dataframe\n')
print(frame)

print('\nDelete rows\n')
print(frame.drop(['blue','yellow']))

print('\nDelete columns\n')
print(frame.drop(['pen','pencil'],axis=1))



The output of the program is shown below:

Original dataframe

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

Delete rows

       ball  pen  pencil  paper
red       0    1       2      3
white    12   13      14     15

Delete columns

        ball  paper
red        0      3
blue       4      7
yellow     8     11
white     12     15


We first declared a dataframe and printed it to see its contents. Then we used the drop() method to delete the blue and yellow rows by passing their index labels. By default drop() looks for the labels in the row index, so to delete columns we pass the column names and also specify the axis from which to delete them, using the axis option. To refer to the column names, we specify axis=1 in our program.
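
As an alternative to the axis option, drop() also accepts the index and columns keywords, which some readers find easier to read. The following sketch shows both keywords on the same dataframe; the variable names are only illustrative.

```python
import pandas as pd
import numpy as np

frame = pd.DataFrame(np.arange(16).reshape((4, 4)),
                     index=['red', 'blue', 'yellow', 'white'],
                     columns=['ball', 'pen', 'pencil', 'paper'])

# Same result as frame.drop(['blue', 'yellow'])
rows_removed = frame.drop(index=['blue', 'yellow'])

# Same result as frame.drop(['pen', 'pencil'], axis=1)
cols_removed = frame.drop(columns=['pen', 'pencil'])

# Rows and columns can be removed in a single call
both = frame.drop(index=['blue'], columns=['paper'])

print(rows_removed)
print(cols_removed)
print(both)
```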


Arithmetic and Data Alignment

pandas can align indexes coming from two different data structures. This is especially useful when we are performing an arithmetic operation on them. During these operations, the labels of the two structures may be in a different order, and some labels may be present in only one of the two structures. pandas matches the values label by label before it carries out the operation. See the following program:

import pandas as pd
import numpy as np

s1 = pd.Series([3,2,5,1],['white','yellow','green','blue'])
s2 = pd.Series([1,4,7,2,1],['white','yellow','black','blue','brown'])

print('First series\n')
print(s1)

print('\nSecond series\n')
print(s2)

print('\nAdding the series\n')
print(s1+s2)


The output of the program is shown below:

First series

white     3
yellow    2
green     5
blue      1
dtype: int64

Second series

white     1
yellow    4
black     7
blue      2
brown     1
dtype: int64

Adding the series

black     NaN
blue      3.0
brown     NaN
green     NaN
white     4.0
yellow    6.0
dtype: float64




In the above program we created two series whose labels do not match perfectly: some labels are present in both, while others are present in only one of the two. Next we add the two series. When a label is present in both operands, the two values are added. When a label is present in only one operand, it still appears in the result (a new series), but with the value NaN. The result is sorted by label, and its dtype is float64 because of the NaN values.
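
If NaN is not what we want for the labels found in only one series, the add() method with the fill_value option is one way to treat a missing value as a default before adding. This is a sketch of an alternative to the + operator used above, not part of the original program.

```python
import pandas as pd

s1 = pd.Series([3, 2, 5, 1], index=['white', 'yellow', 'green', 'blue'])
s2 = pd.Series([1, 4, 7, 2, 1],
               index=['white', 'yellow', 'black', 'blue', 'brown'])

# A label missing from one series is treated as 0 instead of producing NaN
total = s1.add(s2, fill_value=0)
print(total)
```

Here black, brown, and green keep the value from the one series that contains them, while the shared labels are added as before.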

Now let us add two dataframes. This is more complex because the alignment is carried out both for the rows and for the columns. See the following program:

import pandas as pd
import numpy as np


frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])
                   
frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
                    index=['blue','green','white','yellow'],
                    columns=['mug','pen','ball'])                   
                   
print('dataframe 1\n')
print(frame1)

print('dataframe 2\n')
print(frame2)

print('\nAdding the dataframe\n')
print(frame1+frame2)


The output of the program is shown below:

dataframe 1

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15
dataframe 2

        mug  pen  ball
blue      0    1     2
green     3    4     5
white     6    7     8
yellow    9   10    11

Adding the dataframe

        ball  mug  paper   pen  pencil
blue     6.0  NaN    NaN   6.0     NaN
green    NaN  NaN    NaN   NaN     NaN
red      NaN  NaN    NaN   NaN     NaN
white   20.0  NaN    NaN  20.0     NaN
yellow  19.0  NaN    NaN  19.0     NaN




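As seen in the output, the alignment follows the same principle as in the case of series, but it is carried out both for the rows and for the columns. A cell holds a number only when its row label and its column label are present in both dataframes; every other cell is NaN.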
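
The add() method with fill_value can be used with dataframes too. In the sketch below, a cell that is missing from only one dataframe is treated as 0, while a cell that is missing from both stays NaN. As before, this is an illustrative alternative to the + operator and not part of the original program.

```python
import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])

frame2 = pd.DataFrame(np.arange(12).reshape((4, 3)),
                      index=['blue', 'green', 'white', 'yellow'],
                      columns=['mug', 'pen', 'ball'])

# Cells missing from one frame count as 0; cells missing from both stay NaN
result = frame1.add(frame2, fill_value=0)
print(result)
```

For example, the cell (red, mug) remains NaN because frame1 has no mug column and frame2 has no red row.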


Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn! 