The dataframe is a spreadsheet-like tabular data structure designed to extend series to multiple dimensions. It consists of an ordered collection of columns, each of which can hold values of a different type (numeric, string, Boolean, etc.). The figure below shows a dataframe data structure:
The dataframe has two index arrays. The first index array, associated with the rows, works much like the index array in a series: each label is associated with all the values in its row. The second array contains a series of labels, each associated with a particular column. We can also view a dataframe as a dict of series, where the keys are the column names and the values are the series that form the columns of the dataframe. All elements in each series are mapped according to a shared array of labels, called the index.
A dataframe can be constructed by passing a dict object to the DataFrame() constructor. This dict object contains a key for each column we want to define, with an array of values for each of them.
As depicted in the figure above, if we treat the column names as “Keys” and the list of items under each column as “Values”, we can easily represent the same data as a Python dictionary:
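To make the "dict of series" view concrete, here is a small sketch. The two series and their labels are made up for illustration and are not part of the example used in the rest of this post. It suggests how pandas lines the values up by index label and leaves a missing value where a series has no entry for a label:

```python
import pandas as pd

# Two hypothetical Series that share some, but not all, index labels.
age = pd.Series([20, 27, 35], index=["a", "b", "c"])
designation = pd.Series(["VP", "CEO", "MD"], index=["a", "b", "d"])

# Each dict key becomes a column name; the rows are aligned on the index labels.
df = pd.DataFrame({"age": age, "designation": designation})

print(df)
print(df.index.tolist())    # the row labels
print(df.columns.tolist())  # the column labels
```

Labels that appear in only one of the series still get a row, with a missing value (NaN) in the other column.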
my_dict = {
'name' : ["a", "b", "c", "d", "e","f", "g"],
'age' : [20,27, 35, 55, 18, 21, 35],
'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]
}
We can create a Pandas DataFrame out of this dictionary as:
import pandas as pd
df = pd.DataFrame(my_dict)
See the following program:
import pandas as pd
my_dict = {
'name' : ["a", "b", "c", "d", "e","f", "g"],
'age' : [20,27, 35, 55, 18, 21, 35],
'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]
}
df = pd.DataFrame(my_dict)
print(df)
The output of the program shows the created dataframe:
name age designation
0 a 20 VP
1 b 27 CEO
2 c 35 CFO
3 d 55 VP
4 e 18 VP
5 f 21 CEO
6 g 35 MD
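Since each column of a dataframe can hold a different type, it can help to check what pandas inferred after construction. The following is a minimal sketch using the same dictionary; the exact type names printed may vary with the pandas version and platform:

```python
import pandas as pd

my_dict = {
    'name': ["a", "b", "c", "d", "e", "f", "g"],
    'age': [20, 27, 35, 55, 18, 21, 35],
    'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]
}
df = pd.DataFrame(my_dict)

# shape is a (rows, columns) tuple
print(df.shape)

# dtypes lists the inferred type of every column
print(df.dtypes)
```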
We can also create a dataframe from a selection of the columns by passing a sequence of column names to the columns option of the constructor. The columns are created in the order given in that sequence, regardless of their order in the dict object. See the following program:
import pandas as pd
my_dict = {
'name' : ["a", "b", "c", "d", "e","f", "g"],
'age' : [20,27, 35, 55, 18, 21, 35],
'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]
}
df1 = pd.DataFrame(my_dict, columns=['age','designation'])
df2 = pd.DataFrame(my_dict, columns=['age','name'])
df3 = pd.DataFrame(my_dict, columns=['name','designation'])
print(df1)
print('\n')
print(df2)
print('\n')
print(df3)
The output of the program is shown below:
age designation
0 20 VP
1 27 CEO
2 35 CFO
3 55 VP
4 18 VP
5 21 CEO
6 35 MD
age name
0 20 a
1 27 b
2 35 c
3 55 d
4 18 e
5 21 f
6 35 g
name designation
0 a VP
1 b CEO
2 c CFO
3 d VP
4 e VP
5 f CEO
6 g MD
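One detail worth knowing about the columns option: if the sequence contains a name that is not a key in the dict, pandas still creates that column and fills it with missing values. The sketch below illustrates this with a 'salary' column, which is a made-up name used only for this example:

```python
import pandas as pd

my_dict = {
    'name': ["a", "b", "c", "d", "e", "f", "g"],
    'age': [20, 27, 35, 55, 18, 21, 35],
    'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]
}

# 'salary' is a hypothetical column that does not exist in my_dict,
# and 'age' is deliberately left out of the selection.
df = pd.DataFrame(my_dict, columns=['designation', 'name', 'salary'])

print(df)
print(df['salary'].isna().all())  # True when every value in the column is missing
```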
If the row labels of a dataframe are not explicitly specified, pandas automatically assigns a numeric sequence starting from 0 (as shown in the output above). If we want to assign our own labels to the rows instead, we have to use the index option and pass it an array containing the labels, as shown in the following program:
import pandas as pd
my_dict = {
'name' : ["a", "b", "c", "d", "e","f", "g"],
'age' : [20,27, 35, 55, 18, 21, 35],
'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]
}
df = pd.DataFrame(my_dict, index=['one', 'two', 'three', 'four', 'five', 'six', 'seven'])
print(df)
The output of the program is shown below:
name age designation
one a 20 VP
two b 27 CEO
three c 35 CFO
four d 55 VP
five e 18 VP
six f 21 CEO
seven g 35 MD
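Once the rows carry custom labels, we can select a row by its label with loc or by its position with iloc. The sketch below uses a shortened three-row version of the dictionary to keep the output small. It also shows, as far as I know, what happens when the number of labels does not match the number of rows: pandas raises a ValueError.

```python
import pandas as pd

# A shortened version of the dictionary, used only for this illustration.
my_dict = {
    'name': ["a", "b", "c"],
    'age': [20, 27, 35],
    'designation': ["VP", "CEO", "CFO"]
}
df = pd.DataFrame(my_dict, index=['one', 'two', 'three'])

by_label = df.loc['two']    # select the row by its label
by_position = df.iloc[1]    # select the same row by its position
print(by_label)

# The index must supply exactly one label per row.
raised = False
try:
    pd.DataFrame(my_dict, index=['one', 'two'])
except ValueError as exc:
    raised = True
    print('ValueError:', exc)
```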
Instead of using a dict object, we can define a dataframe by passing three arguments to the constructor, in the following order: a data matrix, an array of labels assigned to the index option, and an array of column names assigned to the columns option. See the following program:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(16).reshape((4, 4)), index=['Canada', 'China', 'US', 'India'], columns=['Python', 'Java', 'PHP', 'C++'])
print(df)
The output of the program is shown below:
Python Java PHP C++
Canada 0 1 2 3
China 4 5 6 7
US 8 9 10 11
India 12 13 14 15
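The same construction reads a little more clearly when every argument is passed by keyword. This is only a sketch of the call above with intermediate variable names (data, countries, languages) that I have introduced for readability. It also looks up one cell by its row and column labels:

```python
import numpy as np
import pandas as pd

# The same 4x4 matrix of the numbers 0..15 used above.
data = np.arange(16).reshape((4, 4))
countries = ['Canada', 'China', 'US', 'India']
languages = ['Python', 'Java', 'PHP', 'C++']

df = pd.DataFrame(data=data, index=countries, columns=languages)

# loc accepts a row label and a column label together.
value = df.loc['US', 'PHP']
print(value)
```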
Once a dataframe is created, we can:
a. Get the names of all the columns of a dataframe through the columns attribute of the dataframe object (df.columns).
b. Get the row labels through the index attribute of the dataframe object (df.index).
c. Get the entire set of data contained within the data structure through the values attribute of the dataframe object (df.values).
d. Select only the contents of a column (df['name'] or df.name). The attribute form works only when the column name is a valid Python identifier and does not clash with an existing dataframe attribute or method.
e. Extract a particular row by using the loc attribute with the index label of the row (df.loc[2]).
f. Select multiple rows by passing an array with the labels of the rows to select (df.loc[[2,4]]).
g. Extract a portion of a DataFrame by slicing with the positions of the rows we want (df[0:1], df[1:3]).
h. Retrieve a single value within a dataframe by using the name of the column and then the index or the label of the row (df['age'][3]).
The following program implements all of the above mentioned features:
import pandas as pd
import numpy as np
my_dict = {
'name' : ["a", "b", "c", "d", "e","f", "g"],
'age' : [20,27, 35, 55, 18, 21, 35],
'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]
}
df = pd.DataFrame(my_dict)
print('The dataframe:\n')
print(df)
print('\nThe columns of dataframe:\n')
print(df.columns)
print('\nThe indexes of dataframe:\n')
print(df.index)
print('\nThe entire set of data contained in dataframe:\n')
print(df.values)
print('\nThe contents of a column in dataframe:\n')
print(df['name'])
print(df.name)
print('\nExtracted row from dataframe:\n')
print(df.loc[2])
print('\nmultiple rows of dataframe:\n')
print(df.loc[[2,4]])
print('\nExtracted portion of a DataFrame:\n')
print(df[0:1])
print(df[1:3] )
print('\n Single value within a dataframe:\n')
print(df['age'][3])
The output of the program is shown below:
The dataframe:
name age designation
0 a 20 VP
1 b 27 CEO
2 c 35 CFO
3 d 55 VP
4 e 18 VP
5 f 21 CEO
6 g 35 MD
The columns of dataframe:
Index(['name', 'age', 'designation'], dtype='object')
The indexes of dataframe:
RangeIndex(start=0, stop=7, step=1)
The entire set of data contained in dataframe:
[['a' 20 'VP']
['b' 27 'CEO']
['c' 35 'CFO']
['d' 55 'VP']
['e' 18 'VP']
['f' 21 'CEO']
['g' 35 'MD']]
The contents of a column in dataframe:
0 a
1 b
2 c
3 d
4 e
5 f
6 g
Name: name, dtype: object
0 a
1 b
2 c
3 d
4 e
5 f
6 g
Name: name, dtype: object
Extracted row from dataframe:
name c
age 35
designation CFO
Name: 2, dtype: object
multiple rows of dataframe:
name age designation
2 c 35 CFO
4 e 18 VP
Extracted portion of a DataFrame:
name age designation
0 a 20 VP
name age designation
1 b 27 CEO
2 c 35 CFO
Single value within a dataframe:
55
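As a complement to the program above, the following sketch shows some alternative ways to reach the same data. It is not a replacement for the approach used in this post, only a variation: loc and at take a row label and a column name in one step, iloc works purely by position, and to_numpy() returns the data as a NumPy array much like the values attribute does.

```python
import pandas as pd

my_dict = {
    'name': ["a", "b", "c", "d", "e", "f", "g"],
    'age': [20, 27, 35, 55, 18, 21, 35],
    'designation': ["VP", "CEO", "CFO", "VP", "VP", "CEO", "MD"]
}
df = pd.DataFrame(my_dict)

# Single value: row label 3, column 'age', in a single indexing step.
print(df.loc[3, 'age'])
print(df.at[3, 'age'])    # at is intended for single-value access

# Single value by position: fourth row, second column.
print(df.iloc[3, 1])

# Rows at positions 1 and 2 (the end of the slice is excluded).
print(df.iloc[1:3])

# The whole data set as a NumPy array.
print(df.to_numpy())
```

With the default numeric index the labels and the positions happen to coincide, so loc and iloc return the same rows here. With custom labels such as 'one', 'two', 'three' they would not be interchangeable.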
Here I am ending today’s post. Until we meet again, keep practicing and learning Python, as Python is easy to learn!