Wednesday, August 12, 2020

Subsetting Rows and Columns

Pandas DataFrame – Data Science with Python

We’ve been using the colon, :, in loc and iloc to the left of the comma. When we do so, we select all the rows in our dataframe. However, we can choose to put values to the left of the comma if we want to select specific rows along with specific columns.

# using loc
print(df.loc[42, 'country']) 

Output - Angola 

# using iloc
print(df.iloc[42, 0]) 

Output - Angola 

Just make sure you don’t forget the differences between loc and iloc.

# will cause an error
print(df.loc[42, 0])
 

Output - Traceback (most recent call last):
  File "<ipython-input-1-2b69d7150b5e>", line 2, in <module>
    print(df.loc[42, 0])
TypeError: cannot do label indexing on <class 'pandas.core.indexes.base.Index'> with these indexers [0] of <class 'int'>
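To make the difference concrete, here is a minimal sketch on a small made-up dataframe (the name toy and its values are placeholders, not the dataset used above). The row labels are deliberately set to 10, 20, 30 so that labels and positions do not coincide. Depending on the pandas version, passing a position to loc may raise a TypeError, as in the traceback above, or a KeyError, so the sketch catches both.

```python
import pandas as pd

# hypothetical toy data; the index labels (10, 20, 30) are chosen
# so that they differ from the positions (0, 1, 2)
toy = pd.DataFrame(
    {"country": ["A", "B", "C"], "year": [1952, 1957, 1962]},
    index=[10, 20, 30],
)

by_label = toy.loc[20, "country"]  # row label 20, column label 'country'
by_position = toy.iloc[1, 0]       # 2nd row, 1st column
print(by_label, by_position)

# loc expects labels, and there is no column labelled 0
try:
    toy.loc[20, 0]
    mixed_ok = True
except (KeyError, TypeError) as err:
    mixed_ok = False
    print(type(err).__name__)
```

Both lookups reach the same cell, one by label and one by position, while the mixed call fails because 0 is not a column label.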

We can combine the row and column subsetting syntax with the multiple-row and multiple-column subsetting syntax to get various slices of our data.

# get the 1st, 100th, and 1000th rows
# from the 1st, 4th, and 6th columns
# the columns we are hoping to get are
# country, lifeExp, and gdpPercap
print(df.iloc[[0, 99, 999], [0, 3, 5]])

 

I usually try to pass in the actual column names when subsetting data whenever possible. That approach makes the code more readable, since you do not need to look up the list of column names to know which column a given position refers to. Additionally, selecting by position can lead to problems if the column order gets changed for some reason. This is just a general rule of thumb, as there will be exceptions where using the index position is a better option.
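As a rough illustration of the column-order problem, the sketch below builds a small made-up dataframe (toy and its values are placeholders), reorders its columns, and then runs the same name-based and position-based selections on both versions.

```python
import pandas as pd

# hypothetical toy data with three columns
toy = pd.DataFrame({
    "country": ["A", "B"],
    "year": [1952, 1957],
    "lifeExp": [30.0, 40.0],
})

# same data with the columns in a different order
reordered = toy[["year", "lifeExp", "country"]]

# selecting by name returns the same columns either way
names_before = list(toy.loc[:, ["country", "lifeExp"]].columns)
names_after = list(reordered.loc[:, ["country", "lifeExp"]].columns)

# selecting by position returns whatever sits at those positions
pos_before = list(toy.iloc[:, [0, 2]].columns)
pos_after = list(reordered.iloc[:, [0, 2]].columns)

print(names_before, names_after)
print(pos_before, pos_after)
```

The name-based selection is unaffected by the reordering, while the position-based one quietly returns different columns without raising any error.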

Remember, you can use the slicing syntax on the row portion of the loc and iloc attributes.

print(df.loc[10:13, ['country', 'lifeExp', 'gdpPercap']])
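One detail worth keeping in mind when slicing rows: loc slices by label and includes the end label, while iloc slices by position and excludes the end position, like ordinary Python slicing. A minimal sketch on a made-up dataframe with the default 0, 1, 2, ... index (toy is a placeholder name):

```python
import pandas as pd

# hypothetical toy data with the default integer index 0..19
toy = pd.DataFrame({"value": range(20)})

label_slice = toy.loc[10:13]      # labels 10, 11, 12, 13 (end included)
position_slice = toy.iloc[10:13]  # positions 10, 11, 12 (end excluded)

print(len(label_slice), len(position_slice))
```

So the same 10:13 gives four rows with loc but only three with iloc.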

 

So that covers looking at columns, rows, and cells with slicing and indexing. Our next focus will be on grouped and aggregated calculations. See you soon with my next post.
