Pandas - 7 (Operations Between Data Structures) ~ Python is easy to learn

In this post we will focus on operations that can be performed between the two pandas data structures (series and dataframe).

1. Flexible Arithmetic Methods

In the previous post we saw how to use mathematical operators directly on the pandas data structures. The same operations can also be performed using appropriate methods, called flexible arithmetic methods. Some of these methods are:

• add()
• sub()
• div()
• mul()

These methods are called differently from the mathematical operators. For example, to add two series s1 and s2, instead of writing s1+s2 we write s1.add(s2). The following program adds two dataframes, first with the + operator and then with the add() method:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])

frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),
                    index=['blue','green','white','yellow'],
                    columns=['mug','pen','ball'])

print('dataframe 1\n')
print(frame1)

print('\ndataframe 2\n')
print(frame2)

print('\nAdding the dataframe using operator\n')
print(frame1+frame2)

print('\nAdding the dataframe using add() method\n')
print(frame1.add(frame2))

The output of the program is shown below:

dataframe 1

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

dataframe 2

        mug  pen  ball
blue      0    1     2
green     3    4     5
white     6    7     8
yellow    9   10    11

Adding the dataframe using operator

        ball  mug  paper   pen  pencil
blue     6.0  NaN    NaN   6.0     NaN
green    NaN  NaN    NaN   NaN     NaN
red      NaN  NaN    NaN   NaN     NaN
white   20.0  NaN    NaN  20.0     NaN
yellow  19.0  NaN    NaN  19.0     NaN

Adding the dataframe using add() method

        ball  mug  paper   pen  pencil
blue     6.0  NaN    NaN   6.0     NaN
green    NaN  NaN    NaN   NaN     NaN
red      NaN  NaN    NaN   NaN     NaN
white   20.0  NaN    NaN  20.0     NaN
yellow  19.0  NaN    NaN  19.0     NaN

As you can see from the output, the results are the same as what you’d get using the addition operator +.
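
One practical reason to use the flexible methods instead of the bare operators is that they accept extra arguments. The following is only a small sketch: the frames left and right are made up for illustration and are not part of the program above. It uses the fill_value argument of add(), which substitutes a value for an entry that is missing in one of the two operands before the addition is done:

```python
import pandas as pd

# Hypothetical frames for illustration; they share only some labels.
left = pd.DataFrame({'ball': [1, 2], 'pen': [3, 4]}, index=['red', 'blue'])
right = pd.DataFrame({'ball': [10, 20]}, index=['blue', 'green'])

# With the operator, any label missing from one operand gives NaN.
plain = left + right

# With add(), an entry missing from one operand is treated as 0.
# An entry missing from BOTH operands still stays NaN.
filled = left.add(right, fill_value=0)

print(plain)
print(filled)
```

Here the operator version keeps a number only where both frames have a value, while the add() version keeps every value that exists in at least one of them.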

2. Operations Between DataFrame and Series

The following program shows an operation between a dataframe and a series:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])

ser = pd.Series(np.arange(4), index=['ball','pen','pencil','paper'])

print('dataframe \n')
print(frame1)

print('\nseries\n')
print(ser)

print('\nSubtracting the series from dataframe using - operator\n')
print(frame1-ser)

print('\nSubtracting the series from dataframe using sub() method\n')
print(frame1.sub(ser))

In the above program the two data structures are defined so that the index labels of the series match the column names of the dataframe. This way, we can apply the operation directly. Each element of the series is subtracted from the dataframe column whose name matches its index label, and the same value is subtracted from every entry of that column, regardless of the row.

The output of the program is shown below:

dataframe

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

series

ball      0
pen       1
pencil    2
paper     3
dtype: int32

Subtracting the series from dataframe using - operator

        ball  pen  pencil  paper
red        0    0       0      0
blue       4    4       4      4
yellow     8    8       8      8
white     12   12      12     12

Subtracting the series from dataframe using sub() method

        ball  pen  pencil  paper
red        0    0       0      0
blue       4    4       4      4
yellow     8    8       8      8
white     12   12      12     12

If a label is present in only one of the two data structures, the result contains a column for that label in which every element is NaN. See the following program:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])

ser = pd.Series(np.arange(4), index=['ball','pen','pencil','paper'])

print('dataframe \n')
print(frame1)

print('\nseries\n')
print(ser)

ser['mug'] = 9

print('\nmodified series\n')
print(ser)

print('\nSubtracting the series from dataframe using - operator\n')
print(frame1-ser)

The output of the program is shown below:

dataframe

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

series

ball      0
pen       1
pencil    2
paper     3
dtype: int32

modified series

ball      0
pen       1
pencil    2
paper     3
mug       9
dtype: int64

Subtracting the series from dataframe using - operator

        ball  mug  paper  pen  pencil
red        0  NaN      0    0       0
blue       4  NaN      4    4       4
yellow     8  NaN      8    8       8
white     12  NaN     12   12      12
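
The programs in this section match the series against the column names of the dataframe. The flexible methods also take an axis argument for matching against the row labels instead. The following is a minimal sketch of that case; the series row_ser is a made-up example and does not appear in the programs above:

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])

# Hypothetical series whose labels match the ROW index of frame1.
row_ser = pd.Series([0, 4, 8, 12], index=['red', 'blue', 'yellow', 'white'])

# axis='index' aligns the series with the rows, so each row has its own
# value subtracted from every column of that row.
result = frame1.sub(row_ser, axis='index')

print(result)
```

Since the value chosen for each row is the first element of that row, every row of the result becomes 0, 1, 2, 3.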

3. Functions by Element

We know that the pandas library is built on the foundations of NumPy and extends many of its features by adapting them to new data structures such as series and dataframe. Among these features are the universal functions, called ufuncs. This class of functions operates element by element on the data structure. In the following program we calculate the square root of each value in the dataframe using NumPy's np.sqrt():

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])

ser = pd.Series(np.arange(4), index=['ball','pen','pencil','paper'])

print('dataframe \n')
print(frame1)

print('\nSquare root of each value in the dataframe\n')
print(np.sqrt(frame1))

The output of the program is shown below:

dataframe

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

Square root of each value in the dataframe

            ball       pen    pencil     paper
red     0.000000  1.000000  1.414214  1.732051
blue    2.000000  2.236068  2.449490  2.645751
yellow  2.828427  3.000000  3.162278  3.316625
white   3.464102  3.605551  3.741657  3.872983
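
The same ufuncs work on a series too, and the result is again a pandas object that keeps the original labels. A short sketch, using a made-up series of perfect squares so that the result is easy to check by eye:

```python
import numpy as np
import pandas as pd

# Hypothetical series for illustration.
ser = pd.Series([1, 4, 9, 16], index=['ball', 'pen', 'pencil', 'paper'])

# The ufunc is applied element by element; the index is preserved.
roots = np.sqrt(ser)

print(roots)
print(type(roots))
```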

4. Functions by Row or Column

Functions can also be applied along the rows or columns of a dataframe, and they are not limited to ufuncs: user-defined functions work as well. The important point is that they operate on a one-dimensional array and, in the simplest case, give a single number as a result. For example, you can define a function that calculates the range covered by the elements in an array and pass it to the apply() method. See the following program:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])

# Range of a one-dimensional array; equivalent to the lambda
# f = lambda x: x.max() - x.min()
def f(x):
    return x.max() - x.min()

print('dataframe \n')
print(frame1)

print('\nUsing the apply() function on each column of the dataframe\n')
print(frame1.apply(f))

print('\nUsing the apply() function on each row of the dataframe\n')
print(frame1.apply(f,axis=1))

The output of the program is shown below:

dataframe

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

Using the apply() function on each column of the dataframe

ball      12
pen       12
pencil    12
paper     12
dtype: int64

Using the apply() function on each row of the dataframe

red       3
blue      3
yellow    3
white     3
dtype: int64

So far we have seen the apply() method used with a function that returns a scalar value. The function can also return a series. A useful case is applying several functions at once, so that we get two or more values for each column or row. This can be done by defining a function as shown in the following program:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])

# Return both the minimum and the maximum as a labelled series
def f(x):
    return pd.Series([x.min(), x.max()], index=['min','max'])


print('dataframe \n')
print(frame1)

print('\nUsing the apply() function on each column of the dataframe\n')
print(frame1.apply(f))

print('\nUsing the apply() function on each row of the dataframe\n')
print(frame1.apply(f,axis=1))

The output of the program is shown below:

dataframe

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

Using the apply() function on each column of the dataframe

     ball  pen  pencil  paper
min     0    1       2      3
max    12   13      14     15

Using the apply() function on each row of the dataframe

        min  max
red       0    3
blue      4    7
yellow    8   11
white    12   15
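
For the common case of applying several named statistics at once, pandas also has the agg() method, which accepts a list of function names. This is an alternative to the hand-written function above, not the approach this post uses, and is shown only as a sketch:

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])

# One row of the result per requested function, one column per
# column of the original dataframe.
summary = frame1.agg(['min', 'max'])

print(summary)
```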

5. Statistics Functions

Most of the statistical functions for arrays are also valid for dataframes, so using the apply() function is not necessary for them. For example, functions such as sum() and mean() calculate the sum and the average, respectively, of the elements in each column of a dataframe. There is also a function called describe() that returns several summary statistics at once. See the following program:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])


print('dataframe \n')
print(frame1)

print('\nUsing the sum()function on the dataframe \n')
print(frame1.sum())

print('\nUsing the mean()function on the dataframe \n')
print(frame1.mean())

print('\nUsing the describe()function on the dataframe \n')
print(frame1.describe())

The output of the program is shown below:

dataframe

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

Using the sum()function on the dataframe

ball      24
pen       28
pencil    32
paper     36
dtype: int64

Using the mean()function on the dataframe

ball      6.0
pen       7.0
pencil    8.0
paper     9.0
dtype: float64

Using the describe()function on the dataframe

            ball        pen     pencil      paper
count   4.000000   4.000000   4.000000   4.000000
mean    6.000000   7.000000   8.000000   9.000000
std     5.163978   5.163978   5.163978   5.163978
min     0.000000   1.000000   2.000000   3.000000
25%     3.000000   4.000000   5.000000   6.000000
50%     6.000000   7.000000   8.000000   9.000000
75%     9.000000  10.000000  11.000000  12.000000
max    12.000000  13.000000  14.000000  15.000000

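
The program above works column by column, which is the default. The same functions accept an axis argument, so as a small sketch we can compute the statistics for each row instead by passing axis=1:

```python
import numpy as np
import pandas as pd

frame1 = pd.DataFrame(np.arange(16).reshape((4, 4)),
                      index=['red', 'blue', 'yellow', 'white'],
                      columns=['ball', 'pen', 'pencil', 'paper'])

# axis=1 combines the values across the columns of each row.
row_totals = frame1.sum(axis=1)
row_means = frame1.mean(axis=1)

print(row_totals)
print(row_means)
```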

6. Sorting and Ranking

Sorting the data is often a necessity and it is very important to be able to do it easily. pandas provides the sort_index() function, which returns a new object that is identical to the original except that its elements are ordered by their index labels. In the following program we'll see how you can sort the items in a series. The operation is quite simple since there is only one index to order:

import pandas as pd
import numpy as np

ser = pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])

print('The series:\n')
print(ser)

print('\nThe sorted() series:\n')
print(ser.sort_index())

print('\nThe sorted() series in descending order:\n')
print(ser.sort_index(ascending=False))

The output of the program is shown below:

The series:

red       5
blue      0
yellow    3
white     8
green     4
dtype: int64

The sorted() series:

blue      0
green     4
red       5
white     8
yellow    3
dtype: int64

The sorted() series in descending order:

yellow    3
white     8
red       5
green     4
blue      0
dtype: int64

As we can see, the items were sorted in ascending alphabetical order based on their labels (from A to Z). This is the default behavior, but you can set the opposite order by setting the ascending option to False.
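
Because sort_index() returns a new object, the original series is left untouched. A quick sketch to confirm this with the same series (the variable name ordered is just for illustration):

```python
import pandas as pd

ser = pd.Series([5, 0, 3, 8, 4],
                index=['red', 'blue', 'yellow', 'white', 'green'])

# The sorted result is a separate object.
ordered = ser.sort_index()

print(list(ser.index))      # labels still in their original order
print(list(ordered.index))  # labels in alphabetical order
```

If we want to keep the sorted version, we have to assign the result to a name, as done here.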

In the next program we'll see how you can sort the items in a dataframe. With a dataframe, the sorting can be performed independently on each of its two axes. If you want to order the rows by their index labels, you continue to use the sort_index() function without arguments as you've seen before; if you prefer to order the columns by their names, you need to set the axis option to 1. See the following program:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])



print('The dataframe:\n')
print(frame1)

print('\nThe sorted() dataframe order by row:\n')
print(frame1.sort_index())

print('\nThe sorted() dataframe order by columns:\n')
print(frame1.sort_index(axis=1))

The output of the program is shown below:

The dataframe:

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

The sorted() dataframe order by row:

        ball  pen  pencil  paper
blue       4    5       6      7
red        0    1       2      3
white     12   13      14     15
yellow     8    9      10     11

The sorted() dataframe order by columns:

        ball  paper  pen  pencil
red        0      3    1       2
blue       4      7    5       6
yellow     8     11    9      10
white     12     15   13      14

We have seen how to sort the values according to the indexes. But very often we may need to sort the values contained in the data structure. In this case, we have to differentiate depending on whether we have to sort the values of a series or a dataframe.

If you want to order a series by its values, use the sort_values() function. To order the values in a dataframe, use the same sort_values() function but with the by option, which specifies the name of the column on which to sort. If the sorting criterion is based on two or more columns, you can assign a list containing the column names to the by option.

See the following program:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])


ser = pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])

print('Sort the values of a series:\n')
print(ser.sort_values())

print('\nThe sorted dataframe:\n')
print(frame1.sort_values(by='pen'))

print('\nThe sorted dataframe based on two or more columns:\n')
print(frame1.sort_values(by=['pen','pencil']))

The output of the program is shown below:

Sort the values of a series:

blue      0
yellow    3
green     4
red       5
white     8
dtype: int64

The sorted dataframe:

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15

The sorted dataframe based on two or more columns:

        ball  pen  pencil  paper
red        0    1       2      3
blue       4    5       6      7
yellow     8    9      10     11
white     12   13      14     15
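
The dataframe used above is already in increasing order in every column, so sorting it changes nothing. To see the effect of a second sort key and of the ascending option, here is a small sketch with a made-up dataframe in which the pen column contains repeated values:

```python
import pandas as pd

# Hypothetical data: 'pen' has ties, so the second key decides their order.
frame = pd.DataFrame({'pen': [2, 1, 2, 1], 'pencil': [7, 9, 5, 3]},
                     index=['red', 'blue', 'yellow', 'white'])

# Sort on one column, largest values first.
by_one = frame.sort_values(by='pen', ascending=False)

# Sort on 'pen' first, then on 'pencil' among rows with equal 'pen'.
by_two = frame.sort_values(by=['pen', 'pencil'])

print(by_one)
print(by_two)
```

In the second result the rows with pen equal to 1 come first, ordered by their pencil value, followed by the rows with pen equal to 2.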

Ranking is an operation closely related to sorting. It mainly consists of assigning a rank (that is, a value that starts at 1 and then increases gradually) to each element of the series. The rank is assigned starting from the lowest value to the highest.

By default, elements with equal values share the average of the ranks they would occupy. If you set the method option to 'first', ties are instead ranked in the order in which the data appear in the data structure. The series in the following program has no repeated values, so both calls give the same result. By default, the ranking also follows an ascending sort. To reverse this criterion, set the ascending option to False.

See the following program:

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])


ser = pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])

print('assigning a rank to each element of a series:\n')
print(ser.rank())

print('\nThe rank assigned in the order of data:\n')
print(ser.rank(method='first'))

print('\nThe ranking assigned in descending order:\n')
print(ser.rank(ascending=False))

The output of the program is shown below:

assigning a rank to each element of a series:

red       4.0
blue      1.0
yellow    2.0
white     5.0
green     3.0
dtype: float64

The rank assigned in the order of data:

red       4.0
blue      1.0
yellow    2.0
white     5.0
green     3.0
dtype: float64

The ranking assigned in descending order:

red       2.0
blue      5.0
yellow    4.0
white     1.0
green     3.0
dtype: float64
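
The method option only makes a difference when the series contains repeated values. The following sketch uses a made-up series with one tie to compare three of the available methods:

```python
import pandas as pd

# Hypothetical series: 'red' and 'yellow' hold the same value.
tied = pd.Series([7, 3, 7, 1], index=['red', 'blue', 'yellow', 'white'])

avg = tied.rank()                  # default: tied values share the average rank
first = tied.rank(method='first')  # ties ranked in order of appearance
low = tied.rank(method='min')      # tied values all get the lowest rank

print(avg)
print(first)
print(low)
```

The two tied values would occupy ranks 3 and 4, so they get 3.5 each with the default, 3 and 4 with 'first', and 3 each with 'min'.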

7. Correlation and Covariance

The correlation and covariance calculations are expressed in pandas by the corr() and cov() functions. These kinds of calculations normally involve two series. See the following program:

import pandas as pd
import numpy as np

seq2 = pd.Series([3,4,3,4,5,4,3,2],['2006','2007','2008','2009','2010','2011','2012','2013'])
seq = pd.Series([1,2,3,4,4,3,2,1],['2006','2007','2008','2009','2010','2011','2012','2013'])

print('Correlation\n')
print(seq.corr(seq2))

print('\nCovariance\n')
print(seq.cov(seq2))

The output of the program is shown below:

Correlation

0.7745966692414835

Covariance

0.8571428571428571
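
The two results are related: the correlation is the covariance divided by the product of the two standard deviations. As a sketch, we can check this with the same two series (the name manual is introduced here only for the check):

```python
import pandas as pd

years = ['2006', '2007', '2008', '2009', '2010', '2011', '2012', '2013']
seq2 = pd.Series([3, 4, 3, 4, 5, 4, 3, 2], years)
seq = pd.Series([1, 2, 3, 4, 4, 3, 2, 1], years)

# cov() and std() both use the sample (n - 1) normalisation by default,
# so their ratio reproduces the correlation coefficient.
manual = seq.cov(seq2) / (seq.std() * seq2.std())

print(manual)
print(seq.corr(seq2))
```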

Covariance and correlation can also be applied to a single dataframe. In this case, they return their corresponding matrices in the form of two new dataframe objects. See the following program:

import pandas as pd
import numpy as np

frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],index=['red','blue','yellow','white'],columns=['ball','pen','pencil','paper'])

print('Correlation\n')
print(frame2.corr())

print('\nCovariance\n')
print(frame2.cov())

The output of the program is shown below:

Correlation

            ball       pen    pencil     paper
ball    1.000000 -0.276026  0.577350 -0.763763
pen    -0.276026  1.000000 -0.079682 -0.361403
pencil  0.577350 -0.079682  1.000000 -0.692935
paper  -0.763763 -0.361403 -0.692935  1.000000

Covariance

            ball       pen    pencil     paper
ball    2.000000 -0.666667  2.000000 -2.333333
pen    -0.666667  2.916667 -0.333333 -1.333333
pencil  2.000000 -0.333333  6.000000 -3.666667
paper  -2.333333 -1.333333 -3.666667  4.666667

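We can calculate the pairwise correlations between the columns or rows of a dataframe and a series or another dataframe using the corrwith() method. Note that in the following program both calls give identical results, because every column of frame1 is a linear function of the values of ser on the shared row labels, and the extra label green in ser is ignored. See the following program: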

import pandas as pd
import numpy as np

frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),
                    index=['red','blue','yellow','white'],
                    columns=['ball','pen','pencil','paper'])
frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],index=['red','blue','yellow','white'],columns=['ball','pen','pencil','paper'])

ser = pd.Series([0,1,2,3,9],index=['red','blue','yellow','white','green'])

print('Correlation with series\n')
print(frame2.corrwith(ser))

print('\nCorrelation with dataframe\n')
print(frame2.corrwith(frame1))

The output of the program is shown below:

Correlation with series

ball      0.730297
pen      -0.831522
pencil    0.210819
paper    -0.119523
dtype: float64

Correlation with dataframe

ball      0.730297
pen      -0.831522
pencil    0.210819
paper    -0.119523
dtype: float64

Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!


Wednesday, April 3, 2019
