We already saw how we can use specific indices to subset our data. Only rarely, however, will we know the exact row or column index we need. Typically you are looking for values that meet (or don’t meet) a particular condition, whether calculated or observed. To explore this process, let’s use a larger data set.
import pandas as pd
scientists = pd.read_csv('../data/scientists.csv')
We just saw how we can calculate basic descriptive metrics of vectors. The describe method calculates multiple descriptive statistics in a single method call.
ages = scientists['Age']
print(ages)
# get basic stats
print(ages.describe())
print(ages.mean())
What if we wanted to subset our ages by identifying those above the mean?
print(ages[ages > ages.mean()])
Let’s tease apart this statement and look at what ages > ages.mean() returns.
print(ages > ages.mean())
print(type(ages > ages.mean()))
This statement returns a Series with a dtype of bool. In other words, we can subset values not only by labels and indices, but also by supplying a vector of boolean values. Python has many functions and methods, and depending on how each one is implemented, it may return labels, indices, or booleans. Keep this point in mind as you learn new methods and seek to piece together various parts for your work. If we like, we can manually supply a vector of bools to subset our data.
# get index 0, 1, 4, 5, and 7
manual_bool_values = [True, True, False, False, True, True, False, True]
print(ages[manual_bool_values])
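For readers without the scientists.csv file at hand, here is a minimal, self-contained sketch of the same idea. The four ages are made-up values chosen for illustration, not rows from the data set, and the variable names mask, above_mean, and manual are only assumptions for this example.

```python
import pandas as pd

# Hypothetical ages for illustration only (not taken from scientists.csv)
ages = pd.Series([30, 60, 90, 40], name="Age")

# Comparing a Series with a scalar produces a boolean Series of the same length
mask = ages > ages.mean()  # the mean of these four values is 55.0

# Indexing with the boolean Series keeps only the rows where it is True
above_mean = ages[mask]

# A hand-written list of bools with the same length selects the same rows
manual = ages[[False, True, True, False]]

print(mask)
print(above_mean)
print(above_mean.equals(manual))
```

The list of bools has to be the same length as the Series, which is why the manual list in the post has eight entries for the eight scientists.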
If you’re familiar with programming, you may find it strange that ages > ages.mean() returns a vector without any for loops. Many of the methods that work on Series (and also DataFrames) are vectorized, meaning that they work on the entire vector simultaneously. This approach makes the code easier to read, and typically optimizations are available to make calculations faster. Our next post will discuss this approach in more detail.
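To make the contrast with an explicit loop concrete, the sketch below builds the same boolean result two ways. The ages are again made-up values, and loop_result and vectorized are hypothetical names used only here.

```python
import pandas as pd

# Hypothetical ages for illustration only
ages = pd.Series([30, 60, 90, 40])
threshold = ages.mean()

# Explicit loop: compare one element at a time and collect the results
loop_result = []
for age in ages:
    loop_result.append(bool(age > threshold))

# Vectorized: a single expression compares every element
vectorized = ages > threshold

# Both approaches give the same sequence of booleans
print(loop_result == vectorized.tolist())
```

The vectorized form also keeps the original index, so the result can be passed straight back into ages[...] to subset the data.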