By now we have a decent background on how to prepare data for analysis and ready to begin the second stage of data manipulation: the data transformation. After we arrange the form of data and their disposal within the data structure, it is important to transform their values. We will see some common issues and the steps required to overcome them using functions of the pandas library.
Some of these operations involve the presence of duplicate or invalid values, with possible removal or replacement. Other operations relate instead by modifying the indexes. Other steps include handling and processing the numerical values of the data and strings. Let's start with:
1. Removing Duplicates
Duplicate rows might be present in a dataframe for various reasons. In dataframes of enormous size, the detection of these rows can be very problematic. In this case, pandas provides a series of tools to analyze the duplicate data present in large data structures. See the following program:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'color': ['white','white','red','red','white'],
'value': [2,1,3,3,2]})
print('The dataframe \n')
print(mydataframe)
print('\nDetect the duplicate rows \n')
print(mydataframe.duplicated())
print('\nKnow which are the duplicate rows\n')
print(mydataframe[mydataframe.duplicated()])
In the above program we first create a simple dataframe with some duplicate rows. Then the duplicated() function is applied to the dataframe which can detect the rows that appear to be duplicated. It returns a series of Booleans where each element corresponds to a row, with True if the row is duplicated (i.e., only the other occurrences, not the first), and with False if there are no duplicates in the previous elements.
Finally we check which are the duplicate rows in our dataframe. The output of the program is shown below:
The dataframe
color value
0 white 2
1 white 1
2 red 3
3 red 3
4 white 2
Detect the duplicate rows
0 False
1 False
2 False
3 True
4 True
dtype: bool
Know which are the duplicate rows
color value
3 red 3
4 white 2
-----------------
(program exited with code: 0)
Press any key to continue . . .
pandas provides the drop_duplicates() function, which returns the dataframes without duplicate rows as shown in the following program:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'color': ['white','white','red','red','white'],
'value': [2,1,3,3,2]})
print('The dataframe \n')
print(mydataframe)
print('\nThe duplicate rows\n')
print(mydataframe[mydataframe.duplicated()])
print('\nThe dataframe after deleting the duplicate rows \n')
print(mydataframe.drop_duplicates())
The drop_duplicates() deletes the other occurrences of the rows. The output of the program is shown below:
The dataframe
color value
0 white 2
1 white 1
2 red 3
3 red 3
4 white 2
The duplicate rows
color value
3 red 3
4 white 2
The dataframe after deleting the duplicate rows
color value
0 white 2
1 white 1
2 red 3
------------------
(program exited with code: 0)
Press any key to continue . . .
2. Mapping
Mapping is the creation of a list of matches between two different values, with the ability to bind a value to a particular label or string. The pandas library provides a set of functions which, exploit mapping to perform some operations. The dict objects are used to define mapping as shown below:
mymapping = {
'label1' : 'value1,
'label2' : 'value2,
...
}
The pandas library provides the following set of specific functions:
• replace()—Replaces values
• map()—Creates a new column
• rename()—Replaces the index values
Let's see how to use these functions:
Replacing Values via Mapping
Sometimes in our data structure there are values that do not meet our needs. For example, the text may be in a foreign language, or may be a synonym of another value, or may not be expressed in the desired shape. In such cases, a replace operation of various values is often a necessary process.
See the following program:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
'color':['white','lal','hara','black','yellow'],
'price':[5.56,4.20,1.30,0.56,2.75]})
print('The dataframe \n')
print(mydataframe)
newcolors = {'lal': 'red','hara': 'green'}
print('\nThe dataframe after replacing the values \n')
print(mydataframe.replace(newcolors))
In the above shown program we have created a dataframe containing various objects and colors, including two colors, lal and hara, that are not in English. To be able to replace the incorrect values with new values, we define a mapping of correspondences, containing as a key the new values which are English equivalents of the two colors. Finally we use the replace() function with the mapping as an argument and get the dataframe with the correct values.
The output of the program is shown below:
The dataframe
item color price
0 ball white 5.56
1 mug lal 4.20
2 pen hara 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
The dataframe after replacing the values
item color price
0 ball white 5.56
1 mug red 4.20
2 pen green 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
------------------
(program exited with code: 0)
Press any key to continue . . .
A common use of the replace() is in the replacement of NaN values with another value, for example 0. See the following program:
import pandas as pd
import numpy as np
myseries = pd.Series([1,3,np.nan,4,6,np.nan,3])
print('\nOriginal Series\n')
print(myseries)
print('\nModified Series\n')
print(myseries.replace(np.nan,0))
In the above program we have replaced all the occurrences on NaN with 0. The output of the program is shown below:
Original Series
0 1.0
1 3.0
2 NaN
3 4.0
4 6.0
5 NaN
6 3.0
dtype: float64
Modified Series
0 1.0
1 3.0
2 0.0
3 4.0
4 6.0
5 0.0
6 3.0
dtype: float64
------------------
(program exited with code: 0)
Press any key to continue . . .
Adding Values via Mapping
We also sometimes need to add values in a column depending on the values contained in another. This is accomplished use the map() as shown in the following program:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
'color':['white','red','green','black','yellow']})
print('The dataframe \n')
print(mydataframe)
prices = { 'ball' : 5.56,
'mug' : 4.20,
'bottle' : 1.30,
'scissors' : 3.41,
'pen' : 1.30,
'pencil' : 0.56,
'ashtray' : 2.75
}
mydataframe['price'] = mydataframe['item'].map(prices)
print('\nThe dataframe with prices added\n')
print(mydataframe)
In the above program we first create a dataframe with two colomns. Later we want to add a column to indicate the price of the item shown in the dataframe. The prices are defined for each item in the prices dictionary. We then applied the mapping of the prices on the column item using the map(), making sure to add a column to the price dataframe. The map() function applied to a series or to a column of a dataframe accepts a function or an object containing a dict with mapping.
The output of the program is shown below:
The dataframe
item color
0 ball white
1 mug red
2 pen green
3 pencil black
4 ashtray yellow
The dataframe with prices added
item color price
0 ball white 5.56
1 mug red 4.20
2 pen green 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
------------------
(program exited with code: 0)
Press any key to continue . . . .
Rename the Indexes of the Axes
Just like the values contained within the series and the dataframe, the axis label can be transformed in a very similar way using the mapping. To replace the label indexes, pandas provides the rename() function, which takes the mapping as an argument, that is, a dict object. Let's replace the label index in our previous program so that instead of 01234 it displays first second third fourth. Add the following code to the previous program:
reindex = {0: 'first',1: 'second',2: 'third',3: 'fourth',4: 'fifth'}
print('\nThe dataframe with label renamed\n')
print(mydataframe.rename(reindex))
The output of the program is shown below:
The dataframe with label renamed
item color price
first ball white 5.56
second mug red 4.20
third pen green 1.30
fourth pencil black 0.56
fifth ashtray yellow 2.75
------------------
(program exited with code: 0)
Press any key to continue . . .
As we can see the indexes are renamed. If we want to rename columns we must use the columns option. In this case we assign various mapping explicitly to the two index and columns options as shown below:
recolumn = {'item':'object','price': 'value'}
print('\nThe dataframe with column renamed\n')
print(mydataframe.rename(index=reindex, columns=recolumn))
Add this code to the previous program and see the output which should be:
The dataframe with column renamed
object color value
first ball white 5.56
second mug red 4.20
third pen green 1.30
fourth pencil black 0.56
fifth ashtray yellow 2.75
------------------
(program exited with code: 0)
Press any key to continue . . .
It is also possible to rename single values as shown in the following program:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
'color':['white','red','green','black','yellow']})
prices = { 'ball' : 5.56,
'mug' : 4.20,
'bottle' : 1.30,
'scissors' : 3.41,
'pen' : 1.30,
'pencil' : 0.56,
'ashtray' : 2.75
}
mydataframe['price'] = mydataframe['item'].map(prices)
print('\nThe original dataframe\n')
print(mydataframe)
print('\nThe dataframe with label renamed\n')
print(mydataframe.rename(index={1:'first'}, columns={'item':'object'}))
The output of the program is shown below:
The original dataframe
item color price
0 ball white 5.56
1 mug red 4.20
2 pen green 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
The dataframe with label renamed
object color price
0 ball white 5.56
first mug red 4.20
2 pen green 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
The original dataframe
item color price
0 ball white 5.56
1 mug red 4.20
2 pen green 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
------------------
(program exited with code: 0)
Press any key to continue . . .
As was desired, the index 1 is renames to first. As you have seen that the rename() function returns a dataframe with the changes, leaving unchanged the original dataframe. If we want the changes to take
effect on the object on which you call the function, you will set the inplace option to True. Add the following code to the previous program:
mydataframe.rename(columns={'item':'object'}, inplace=True)
print('\nThe dataframe with change in item object\n')
print(mydataframe)
The output of the program is shown below:
The original dataframe
item color price
0 ball white 5.56
1 mug red 4.20
2 pen green 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
The dataframe with label renamed
object color price
0 ball white 5.56
first mug red 4.20
2 pen green 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
The original dataframe
item color price
0 ball white 5.56
1 mug red 4.20
2 pen green 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
The dataframe with change in item object
object color price
0 ball white 5.56
1 mug red 4.20
2 pen green 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
------------------
(program exited with code: 0)
Press any key to continue . . .
If you print the original data frame again what will be the output?
Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!
Some of these operations involve the presence of duplicate or invalid values, with possible removal or replacement. Other operations relate instead by modifying the indexes. Other steps include handling and processing the numerical values of the data and strings. Let's start with:
1. Removing Duplicates
Duplicate rows might be present in a dataframe for various reasons. In dataframes of enormous size, the detection of these rows can be very problematic. In this case, pandas provides a series of tools to analyze the duplicate data present in large data structures. See the following program:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'color': ['white','white','red','red','white'],
'value': [2,1,3,3,2]})
print('The dataframe \n')
print(mydataframe)
print('\nDetect the duplicate rows \n')
print(mydataframe.duplicated())
print('\nKnow which are the duplicate rows\n')
print(mydataframe[mydataframe.duplicated()])
In the above program we first create a simple dataframe with some duplicate rows. Then the duplicated() function is applied to the dataframe which can detect the rows that appear to be duplicated. It returns a series of Booleans where each element corresponds to a row, with True if the row is duplicated (i.e., only the other occurrences, not the first), and with False if there are no duplicates in the previous elements.
Finally we check which are the duplicate rows in our dataframe. The output of the program is shown below:
The dataframe
color value
0 white 2
1 white 1
2 red 3
3 red 3
4 white 2
Detect the duplicate rows
0 False
1 False
2 False
3 True
4 True
dtype: bool
Know which are the duplicate rows
color value
3 red 3
4 white 2
-----------------
(program exited with code: 0)
Press any key to continue . . .
pandas provides the drop_duplicates() function, which returns the dataframes without duplicate rows as shown in the following program:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'color': ['white','white','red','red','white'],
'value': [2,1,3,3,2]})
print('The dataframe \n')
print(mydataframe)
print('\nThe duplicate rows\n')
print(mydataframe[mydataframe.duplicated()])
print('\nThe dataframe after deleting the duplicate rows \n')
print(mydataframe.drop_duplicates())
The drop_duplicates() deletes the other occurrences of the rows. The output of the program is shown below:
The dataframe
color value
0 white 2
1 white 1
2 red 3
3 red 3
4 white 2
The duplicate rows
color value
3 red 3
4 white 2
The dataframe after deleting the duplicate rows
color value
0 white 2
1 white 1
2 red 3
------------------
(program exited with code: 0)
Press any key to continue . . .
2. Mapping
Mapping is the creation of a list of matches between two different values, with the ability to bind a value to a particular label or string. The pandas library provides a set of functions which, exploit mapping to perform some operations. The dict objects are used to define mapping as shown below:
mymapping = {
'label1' : 'value1,
'label2' : 'value2,
...
}
The pandas library provides the following set of specific functions:
• replace()—Replaces values
• map()—Creates a new column
• rename()—Replaces the index values
Let's see how to use these functions:
Replacing Values via Mapping
Sometimes in our data structure there are values that do not meet our needs. For example, the text may be in a foreign language, or may be a synonym of another value, or may not be expressed in the desired shape. In such cases, a replace operation of various values is often a necessary process.
See the following program:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
'color':['white','lal','hara','black','yellow'],
'price':[5.56,4.20,1.30,0.56,2.75]})
print('The dataframe \n')
print(mydataframe)
newcolors = {'lal': 'red','hara': 'green'}
print('\nThe dataframe after replacing the values \n')
print(mydataframe.replace(newcolors))
In the above shown program we have created a dataframe containing various objects and colors, including two colors, lal and hara, that are not in English. To be able to replace the incorrect values with new values, we define a mapping of correspondences, containing as a key the new values which are English equivalents of the two colors. Finally we use the replace() function with the mapping as an argument and get the dataframe with the correct values.
The output of the program is shown below:
The dataframe
item color price
0 ball white 5.56
1 mug lal 4.20
2 pen hara 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
The dataframe after replacing the values
item color price
0 ball white 5.56
1 mug red 4.20
2 pen green 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
------------------
(program exited with code: 0)
Press any key to continue . . .
A common use of the replace() is in the replacement of NaN values with another value, for example 0. See the following program:
import pandas as pd
import numpy as np
myseries = pd.Series([1,3,np.nan,4,6,np.nan,3])
print('\nOriginal Series\n')
print(myseries)
print('\nModified Series\n')
print(myseries.replace(np.nan,0))
In the above program we have replaced all the occurrences on NaN with 0. The output of the program is shown below:
Original Series
0 1.0
1 3.0
2 NaN
3 4.0
4 6.0
5 NaN
6 3.0
dtype: float64
Modified Series
0 1.0
1 3.0
2 0.0
3 4.0
4 6.0
5 0.0
6 3.0
dtype: float64
------------------
(program exited with code: 0)
Press any key to continue . . .
Adding Values via Mapping
We also sometimes need to add values in a column depending on the values contained in another. This is accomplished use the map() as shown in the following program:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
'color':['white','red','green','black','yellow']})
print('The dataframe \n')
print(mydataframe)
prices = { 'ball' : 5.56,
'mug' : 4.20,
'bottle' : 1.30,
'scissors' : 3.41,
'pen' : 1.30,
'pencil' : 0.56,
'ashtray' : 2.75
}
mydataframe['price'] = mydataframe['item'].map(prices)
print('\nThe dataframe with prices added\n')
print(mydataframe)
In the above program we first create a dataframe with two colomns. Later we want to add a column to indicate the price of the item shown in the dataframe. The prices are defined for each item in the prices dictionary. We then applied the mapping of the prices on the column item using the map(), making sure to add a column to the price dataframe. The map() function applied to a series or to a column of a dataframe accepts a function or an object containing a dict with mapping.
The output of the program is shown below:
The dataframe
item color
0 ball white
1 mug red
2 pen green
3 pencil black
4 ashtray yellow
The dataframe with prices added
item color price
0 ball white 5.56
1 mug red 4.20
2 pen green 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
------------------
(program exited with code: 0)
Press any key to continue . . . .
Rename the Indexes of the Axes
Just like the values contained within the series and the dataframe, the axis label can be transformed in a very similar way using the mapping. To replace the label indexes, pandas provides the rename() function, which takes the mapping as an argument, that is, a dict object. Let's replace the label index in our previous program so that instead of 01234 it displays first second third fourth. Add the following code to the previous program:
reindex = {0: 'first',1: 'second',2: 'third',3: 'fourth',4: 'fifth'}
print('\nThe dataframe with label renamed\n')
print(mydataframe.rename(reindex))
The output of the program is shown below:
The dataframe with label renamed
item color price
first ball white 5.56
second mug red 4.20
third pen green 1.30
fourth pencil black 0.56
fifth ashtray yellow 2.75
------------------
(program exited with code: 0)
Press any key to continue . . .
As we can see the indexes are renamed. If we want to rename columns we must use the columns option. In this case we assign various mapping explicitly to the two index and columns options as shown below:
recolumn = {'item':'object','price': 'value'}
print('\nThe dataframe with column renamed\n')
print(mydataframe.rename(index=reindex, columns=recolumn))
Add this code to the previous program and see the output which should be:
The dataframe with column renamed
object color value
first ball white 5.56
second mug red 4.20
third pen green 1.30
fourth pencil black 0.56
fifth ashtray yellow 2.75
------------------
(program exited with code: 0)
Press any key to continue . . .
It is also possible to rename single values as shown in the following program:
import pandas as pd
import numpy as np
mydataframe = pd.DataFrame({ 'item':['ball','mug','pen','pencil','ashtray'],
'color':['white','red','green','black','yellow']})
prices = { 'ball' : 5.56,
'mug' : 4.20,
'bottle' : 1.30,
'scissors' : 3.41,
'pen' : 1.30,
'pencil' : 0.56,
'ashtray' : 2.75
}
mydataframe['price'] = mydataframe['item'].map(prices)
print('\nThe original dataframe\n')
print(mydataframe)
print('\nThe dataframe with label renamed\n')
print(mydataframe.rename(index={1:'first'}, columns={'item':'object'}))
The output of the program is shown below:
The original dataframe
item color price
0 ball white 5.56
1 mug red 4.20
2 pen green 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
The dataframe with label renamed
object color price
0 ball white 5.56
first mug red 4.20
2 pen green 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
The original dataframe
item color price
0 ball white 5.56
1 mug red 4.20
2 pen green 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
------------------
(program exited with code: 0)
Press any key to continue . . .
As was desired, the index 1 is renames to first. As you have seen that the rename() function returns a dataframe with the changes, leaving unchanged the original dataframe. If we want the changes to take
effect on the object on which you call the function, you will set the inplace option to True. Add the following code to the previous program:
mydataframe.rename(columns={'item':'object'}, inplace=True)
print('\nThe dataframe with change in item object\n')
print(mydataframe)
The output of the program is shown below:
The original dataframe
item color price
0 ball white 5.56
1 mug red 4.20
2 pen green 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
The dataframe with label renamed
object color price
0 ball white 5.56
first mug red 4.20
2 pen green 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
The original dataframe
item color price
0 ball white 5.56
1 mug red 4.20
2 pen green 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
The dataframe with change in item object
object color price
0 ball white 5.56
1 mug red 4.20
2 pen green 1.30
3 pencil black 0.56
4 ashtray yellow 2.75
------------------
(program exited with code: 0)
Press any key to continue . . .
If you print the original data frame again what will be the output?
Here I am ending today’s post. Until we meet again keep practicing and learning Python, as Python is easy to learn!
0 comments:
Post a Comment