Wednesday, March 27, 2019

Pandas -2 (Series II)

In the previous post we discussed creating the series data structure and accessing, modifying, and filtering its elements. In this post we shall focus on some more features of the series data structure.

Operations on Series data structures

As a Series is based on a NumPy array, the arithmetic operators (+, -, *, and /) and the mathematical functions that apply to NumPy arrays also work on a series. Each operation is applied element by element. The following program shows some of these operations:

import pandas as pd
import numpy as np

arr = np.array([1,12,3,34,2,16,7])

s = pd.Series(arr)

print('Original Series\n')
print(s)

print('\nDivision by 2\n')
print(s/2)

print('\nMultiplication by 2\n')
print(s*2)

print('\nAddition by 2\n')
print(s+2)

print('\nSubtraction by 2\n')
print(s-2)


The output of the program is shown below:

Original Series

0     1
1    12
2     3
3    34
4     2
5    16
6     7
dtype: int32

Division by 2

0     0.5
1     6.0
2     1.5
3    17.0
4     1.0
5     8.0
6     3.5
dtype: float64

Multiplication by 2

0     2
1    24
2     6
3    68
4     4
5    32
6    14
dtype: int32

Addition by 2

0     3
1    14
2     5
3    36
4     4
5    18
6     9
dtype: int32

Subtraction by 2

0    -1
1    10
2     1
3    32
4     0
5    14
6     5
dtype: int32
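
The same element-wise behaviour applies to the other arithmetic operators as well. The following is only a minimal sketch using the same values as above; the variable names squared, remainder and halved are chosen for illustration:

```python
import pandas as pd
import numpy as np

s = pd.Series(np.array([1, 12, 3, 34, 2, 16, 7]))

# Each operator is applied to every element; the index is preserved
squared = s ** 2
remainder = s % 2

# Operators also have method forms; s.div(2) gives the same result as s / 2
halved = s.div(2)

print(squared)
print(remainder)
print(halved)
```

The method forms (add, sub, mul, div) can be handy when the operation has to be passed around as a function or chained with other calls.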


Mathematical functions on Series data structures

We can use the NumPy mathematical functions by specifying the function referenced with np and the instance of the series passed as an argument. See the following example:

import pandas as pd
import numpy as np

arr = np.array([1,12,3,34,2,16,7])

s = pd.Series(arr)

print('Log values\n')
print(np.log(s))

print('\nSine values\n')
print(np.sin(s))

print('\nCosine values\n')
print(np.cos(s))


The output of the program is shown below:

Log values

0    0.000000
1    2.484907
2    1.098612
3    3.526361
4    0.693147
5    2.772589
6    1.945910
dtype: float64

Sine values

0    0.841471
1   -0.536573
2    0.141120
3    0.529083
4    0.909297
5   -0.287903
6    0.656987
dtype: float64

Cosine values

0    0.540302
1    0.843854
2   -0.989992
3   -0.848570
4   -0.416147
5   -0.957659
6    0.753902
dtype: float64
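
A NumPy function applied to a series returns a new series and keeps the original index labels. The following short sketch illustrates this with a small labelled series; the values and labels are made up for the example:

```python
import pandas as pd
import numpy as np

# Hypothetical values and labels, chosen so the square roots are exact
s = pd.Series([1, 4, 9, 16], index=['a', 'b', 'c', 'd'])

# The result is again a Series, with the same labels as the input
roots = np.sqrt(s)

print(roots)
print(type(roots))
```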


Finding duplicate values in a series

To list the values contained in a series without duplicates, we use the unique() function. It returns an array of the distinct values in order of first appearance (the array is not sorted). A related function is value_counts(), which not only returns the unique values but also counts how many times each one occurs in the series. See the following program:

import pandas as pd
import numpy as np

sd = pd.Series([1,0,2,1,2,3], index=['red','red','blue','green','green','yellow'])
print('\nSeries with duplicate values\n')
print(sd)

print('\nSeries with unique values\n')
print(sd.unique())

print('\nSeries with unique values and occurrences within a series\n')
print(sd.value_counts())


The output of the program is shown below:


Series with duplicate values

red       1
red       0
blue      2
green     1
green     2
yellow    3
dtype: int64

Series with unique values

[1 0 2 3]

Series with unique values and occurrences within a series

2    2
1    2
3    1
0    1
dtype: int64
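
To see which elements are the repeated ones, rather than only the distinct values, the duplicated() function of a series can be used. This is a minimal sketch on the same data; the name dup_mask is only illustrative:

```python
import pandas as pd

sd = pd.Series([1, 0, 2, 1, 2, 3],
               index=['red', 'red', 'blue', 'green', 'green', 'yellow'])

# duplicated() marks every value that has already appeared earlier
dup_mask = sd.duplicated()
print(dup_mask)

# Keep only the repeated occurrences
print(sd[dup_mask])

# value_counts() orders by count; sort_index() orders by value instead
counts_by_value = sd.value_counts().sort_index()
print(counts_by_value)
```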


Checking if the values are contained in the data structure

The isin() function evaluates membership: given a list of values, it tells you whether each element of the data structure is contained in that list. The Boolean values it returns are very useful when filtering data in a series or in a column of a dataframe. See the following program:

import pandas as pd
import numpy as np

sd = pd.Series([1,0,2,1,2,3], index=['red','red','blue','green','green','yellow'])
print('\nSeries with duplicate values\n')
print(sd)

print('\nevaluating the membership\n')
print(sd.isin([0,3]))

print('\nfiltering data\n')
print(sd[sd.isin([0,3])]) 



The output of the program is shown below:

Series with duplicate values

red       1
red       0
blue      2
green     1
green     2
yellow    3
dtype: int64

evaluating the membership

red       False
red        True
blue      False
green     False
green     False
yellow     True
dtype: bool

filtering data

red       0
yellow    3
dtype: int64
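
The Boolean series returned by isin() can also be inverted to keep the elements that are not in the list. A minimal sketch on the same data follows; the names mask and excluded are only illustrative:

```python
import pandas as pd

sd = pd.Series([1, 0, 2, 1, 2, 3],
               index=['red', 'red', 'blue', 'green', 'green', 'yellow'])

mask = sd.isin([0, 3])

# The ~ operator inverts a Boolean series
excluded = sd[~mask]
print(excluded)

# True counts as 1, so sum() gives the number of matches
print(mask.sum())
```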



NaN Values


NaN (Not a Number) is used in pandas data structures to indicate the presence of an empty field or something that’s not definable numerically.

Generally, these NaN values are a problem and must be managed in some way, especially during data analysis. They often arise when extracting data from an unreliable source or when the source is missing data. NaN values can also be generated in special cases, such as taking the logarithm of a negative value, or by exceptions during the execution of some calculation or function. See the following program:

import pandas as pd
import numpy as np

arr = np.array([1,12,-3,34,2,16,7])

s = pd.Series(arr)

print('Log values\n')
print(np.log(s))


The output of the program is shown below:

Log values

pandaseg.py:9: RuntimeWarning: invalid value encountered in log
  print(np.log(s))
0    0.000000
1    2.484907
2         NaN
3    3.526361
4    0.693147
5    2.772589
6    1.945910
dtype: float64
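
If the negative values are expected, there are a couple of ways to deal with the warning. The sketch below shows two possibilities, under the assumption that either a NaN result or dropping the offending values is acceptable; np.errstate is a NumPy context manager, and the names logs and positive_logs are only illustrative:

```python
import pandas as pd
import numpy as np

s = pd.Series([1, 12, -3, 34, 2, 16, 7])

# Silence the RuntimeWarning locally; the NaN is still produced
with np.errstate(invalid='ignore'):
    logs = np.log(s)
print(logs)

# Alternatively, take logarithms of the positive values only
positive_logs = np.log(s[s > 0])
print(positive_logs)
```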


pandas allows us to explicitly define NaNs and add them to a data structure, such as a series. Within the array containing the values, we enter np.nan wherever we want to define a missing value (the older spelling np.NaN was removed in NumPy 2.0). See the following program:

import pandas as pd
import numpy as np

s = pd.Series([5,-3,np.nan,14])

print('The series:\n')
print(s)


The output of the program is shown below:

The series:

0     5.0
1    -3.0
2     NaN
3    14.0
dtype: float64
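
A few properties of NaN explain the output above. The following is a small sketch, not an exhaustive list: NaN is a floating-point value, it never compares equal to anything, and the pandas reductions skip it by default.

```python
import pandas as pd
import numpy as np

s = pd.Series([5, -3, np.nan, 14])

# NaN is a float, which is why the whole series becomes float64
print(s.dtype)

# NaN never compares equal to anything, including itself
print(np.nan == np.nan)

# Reductions such as sum() skip NaN by default
total = s.sum()
print(total)
```

Because NaN is not equal to itself, a test such as s == np.nan cannot be used to find missing values.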


Identifying the indexes without a value

The isnull() and notnull() functions are very useful for identifying the index positions that have no value. See the following program:

import pandas as pd
import numpy as np

s = pd.Series([5,-3,np.nan,14])

print('The series:\n')
print(s)

print('\nThe null values:\n')
print(s.isnull())

print('\nThe not null values:\n')
print(s.notnull())


The output of the program is shown below:

The series:

0     5.0
1    -3.0
2     NaN
3    14.0
dtype: float64
 

The null values:

0    False
1    False
2     True
3    False
dtype: bool
 

The not null values:

0     True
1     True
2    False
3     True
dtype: bool


Each of these functions returns a series of Boolean values, True or False, depending on whether the item is a NaN value or not. The isnull() function returns True at the NaN values in the series; conversely, the notnull() function returns True where the values are not NaN. These functions are often placed inside filters to make a condition. See the following program:

import pandas as pd
import numpy as np

s = pd.Series([5,-3,np.nan,14])

print('The series:\n')
print(s)

print('\nUsing the notnull filter:\n')
print(s[s.notnull()])

print('\nUsing the null filter:\n')
print(s[s.isnull()])


The output of the program is shown below:

The series:

0     5.0
1    -3.0
2     NaN
3    14.0
dtype: float64

Using the notnull filter:

0     5.0
1    -3.0
3    14.0
dtype: float64

Using the null filter:

2   NaN
dtype: float64
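
pandas also provides shortcuts for the most common uses of these filters. The sketch below counts the missing values, drops them with dropna(), and replaces them with fillna(); the replacement value 0 is just an assumption for the example:

```python
import pandas as pd
import numpy as np

s = pd.Series([5, -3, np.nan, 14])

# True counts as 1, so this is the number of missing values
n_missing = s.isnull().sum()
print(n_missing)

# dropna() gives the same result as s[s.notnull()]
cleaned = s.dropna()
print(cleaned)

# fillna() replaces each NaN with the given value (0 chosen for illustration)
filled = s.fillna(0)
print(filled)
```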


Series as Dictionaries

A series can also be thought of as a dict object, that is, a mapping from index labels to values. Accordingly, a dict can be passed directly when defining a series object. See the following program:

import pandas as pd
import numpy as np

mydict = {'Python': 2000, 'Java': 1000, 'C': 500,'C++': 1000}

s = pd.Series(mydict)
print('The series:\n')
print(s)


The output of the program is shown below:

The series:

Python    2000
Java      1000
C          500
C++       1000
dtype: int64

As shown in the output of the program, the index array is filled with the keys of the dict, while the data are filled with the corresponding values. We can also define the index array separately. In this case, pandas matches the keys of the dict against the labels in the index array. If a label has no matching key, pandas adds a NaN value for it. See the following program:

import pandas as pd
import numpy as np

mydict = {'Python': 2000, 'Java': 1000, 'C': 500,'C++': 1000}
languages = ['Python','Java','C','C++','Perl']

s = pd.Series(mydict,index=languages)
print('The series:\n')
print(s) 


The output of the program is shown below:

The series:

Python    2000.0
Java      1000.0
C          500.0
C++       1000.0
Perl         NaN
dtype: float64
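
The dict analogy also holds once the series has been created: several dict-style operations work on its labels. The following is a minimal sketch on the same data:

```python
import pandas as pd

mydict = {'Python': 2000, 'Java': 1000, 'C': 500, 'C++': 1000}
s = pd.Series(mydict)

# The in operator tests the index labels, like the keys of a dict
print('Java' in s)
print('Perl' in s)

# Lookup by label, and keys() for the labels
print(s['Python'])
print(list(s.keys()))

# get() returns a default instead of raising an error for a missing label
print(s.get('Perl', 0))

# to_dict() converts the series back to a plain dict
print(s.to_dict())
```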


It is possible to perform arithmetic operations between two series, but in this case the labels also come into play. In fact, one of the great strengths of this type of data structure is that series can align differently indexed data by identifying their corresponding labels. In the following program, we add two series that have only some labels in common:

import pandas as pd
import numpy as np

mydict = {'Python': 2000, 'Java': 1000, 'C': 500,'C++': 1000}
mydict2 = {'Python':400,'Java':1000,'Perl':700,'Delphi':600}
s = pd.Series(mydict)
s2 = pd.Series(mydict2)
print('The series after addition:\n')
print(s+s2)


The output of the program is shown below:

The series after addition:

C            NaN
C++          NaN
Delphi       NaN
Java      2000.0
Perl         NaN
Python    2400.0
dtype: float64


As seen from the output, we get a new series object in which only the items with the same label are added. All other labels present in only one of the two series are still included in the result, but with a NaN value.
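
When a label that is missing from one series should count as zero instead of producing NaN, the add() function with its fill_value parameter can be used. This is a minimal sketch on the same two series; treating a missing label as 0 is an assumption that may not suit every data set:

```python
import pandas as pd

s = pd.Series({'Python': 2000, 'Java': 1000, 'C': 500, 'C++': 1000})
s2 = pd.Series({'Python': 400, 'Java': 1000, 'Perl': 700, 'Delphi': 600})

# fill_value=0 substitutes 0 for a label missing from one of the two series
total = s.add(s2, fill_value=0)
print(total)
```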


Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!
















