Subsetting Rows
Rows can be subset in multiple ways, by row name or row index. Here is a quick overview of the various methods:
- loc - Subset based on index label (row name)
- iloc - Subset based on row index (row number)
- ix - (deprecated in Pandas v0.20 and removed in v1.0) Subset based on index label or row index
Subset Rows by Index Label: loc
Let’s take a look at part of our Gapminder data.
print(df.head())
On the left side of the printed dataframe, we see what appear to be row numbers. This unnamed column of values holds the index labels of the dataframe. Think of an index label as being like a column name, but for a row instead of a column. By default, Pandas fills in the index labels with the row numbers (note that it starts counting from 0). A common case where the index labels differ from the row numbers is time series data, where each index label is typically a timestamp. For now, though, we will keep the default row number values.
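To make the difference between an index label and a row number concrete, here is a minimal sketch. It uses a small made-up dataframe (the name ts and its values are assumptions for illustration, not part of the Gapminder data) whose index labels are timestamps.

```python
import pandas as pd

# Hypothetical toy data for illustration only; not the Gapminder dataset.
dates = pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-03"])
ts = pd.DataFrame({"value": [10, 20, 30]}, index=dates)

# The index labels are now timestamps rather than 0, 1, 2.
first_label = ts.index[0]

# loc looks rows up by label, so we pass a timestamp, not a row number.
row = ts.loc[pd.Timestamp("2020-01-02")]
print(row)
```

Here the second row has row number 1, but its index label is the timestamp 2020-01-02, and that label is what loc expects.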
We can use the loc attribute on the dataframe to subset rows based on the index label.
print(df.loc[0])
print(df.loc[99])
print(df.loc[-1])
Note that passing -1 to loc raises a KeyError, because loc looks for a row whose index label is -1, and no such label exists in our example (the default labels run from 0 upward). Instead, we can use a bit of Python to calculate the number of rows and pass the last row's label into loc.
number_of_rows = df.shape[0]
last_row_index = number_of_rows - 1
print(df.loc[last_row_index])
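The behaviour described above can be checked on a small stand-in dataframe. This is only a sketch: the name toy and its contents are made up for illustration. It also shows one alternative way to get the last label, by reading it from the index itself, which works even when the labels are not the default row numbers.

```python
import pandas as pd

# Hypothetical stand-in for the Gapminder dataframe, with the default 0..n-1 index.
toy = pd.DataFrame({"country": ["Afghanistan", "Albania", "Algeria"],
                    "year": [1952, 1952, 1952]})

# loc treats -1 as a label, and no row is labelled -1, so this raises KeyError.
try:
    toy.loc[-1]
    raised = False
except KeyError:
    raised = True
print(raised)

# Positional indexing on the index object gives the label of the last row.
last_label = toy.index[-1]
last_row = toy.loc[last_label]
print(last_row)
```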
Alternatively, we can use the tail method to return the last 1 row, instead of the default 5.
print(df.tail(n=1))
Notice that when we used tail() and loc, the results were printed out differently. Let’s look at which type each one returns. We use head(n=1) here, which returns the same kind of object as tail(n=1).
subset_loc = df.loc[0]
subset_head = df.head(n=1)
print(type(subset_loc))
<class 'pandas.core.series.Series'>
print(type(subset_head))
<class 'pandas.core.frame.DataFrame'>
Depending on which method we use and how many rows we return, Pandas will return a different object. The way an object gets printed to the screen can be an indicator of the type, but it’s always best to use the type function to be sure.
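As a small sketch of this point, the same loc attribute can return either type depending on what we pass in. The dataframe toy below is made up for illustration.

```python
import pandas as pd

# Hypothetical toy dataframe for illustration only.
toy = pd.DataFrame({"country": ["Afghanistan", "Albania"],
                    "year": [1952, 1952]})

as_series = toy.loc[0]    # a single label returns a Series
as_frame = toy.loc[[0]]   # a list holding one label returns a DataFrame

print(type(as_series))
print(type(as_frame))
```

Wrapping the label in a list is a handy way to keep a one-row result as a dataframe.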
Subsetting Multiple Rows
Just as for columns, we can select multiple rows by passing a list of index labels to loc.
print(df.loc[[0, 99, 999]])
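One detail worth seeing in a sketch: the rows come back in the order of the labels in the list, not in their original order. The dataframe toy is again a made-up stand-in.

```python
import pandas as pd

# Hypothetical toy dataframe for illustration only.
toy = pd.DataFrame({"country": ["Afghanistan", "Albania", "Algeria"],
                    "year": [1952, 1952, 1952]})

# Ask for the row labelled 2 first, then the row labelled 0.
subset = toy.loc[[2, 0]]
print(subset)
```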
Here this post comes to an end. We saw how we can use the loc attribute on the dataframe to subset rows based on the index label. In the next post we'll see how iloc does the same thing.