NumPy’s array is optimized for homogeneous numeric data that’s accessed via integer indices. Data science presents unique demands for which more customized data structures are required. Big data applications must support mixed data types, customized indexing, missing data, data that’s not structured consistently and data that needs to be manipulated into forms appropriate for the databases and data analysis packages you use.
Pandas is the most popular library for dealing with such data. It provides two key collections: Series for one-dimensional collections and DataFrames for two-dimensional collections. You can use pandas’ MultiIndex to manipulate multi-dimensional data in the context of Series and DataFrames.
Wes McKinney created pandas in 2008 while working in industry. The name pandas is derived from the term “panel data,” which is data for measurements over time, such as stock prices or historical temperature readings. McKinney needed a library in which the same data structures could handle both time- and non-time-based data with support for data alignment, missing data, common database-style data manipulations, and more.
NumPy and pandas are intimately related. Series and DataFrames use arrays “under the hood.” Series and DataFrames are valid arguments to many NumPy operations. Similarly, arrays are valid arguments to many Series and DataFrame operations.
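As a minimal sketch of this interoperability (the variable names here are illustrative and not part of the original session), a NumPy universal function can take a Series, and an array can initialize a Series:

```python
import numpy as np
import pandas as pd

grades = pd.Series([87, 100, 94])

# A NumPy universal function accepts a Series and returns a new Series
roots = np.sqrt(grades)

# A NumPy array is a valid Series initializer
from_array = pd.Series(np.array([1.5, 2.5, 3.5]))

# to_numpy() exposes the Series' values as a NumPy array
values = grades.to_numpy()

print(type(roots).__name__)   # Series
print(type(values).__name__)  # ndarray
```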
pandas Series
A Series is an enhanced one-dimensional array. Whereas arrays use only zero-based integer indices, Series support custom indexing, including non-integer indices like strings. Series also offer additional capabilities that make them more convenient for many data-science-oriented tasks. For example, a Series may have missing data, and many Series operations ignore missing data by default.
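The following sketch illustrates both ideas. The student names are hypothetical, chosen only for illustration, and the missing grade is represented with NumPy's NaN:

```python
import numpy as np
import pandas as pd

# Hypothetical student names used as string indices; Eva's grade is missing
grades = pd.Series([87, np.nan, 94], index=['Wally', 'Eva', 'Sam'])

eva_missing = pd.isna(grades['Eva'])  # True: the value is NaN

# mean() skips NaN by default, so this averages only 87 and 94
average = grades.mean()

# count() reports only the non-missing values
count = grades.count()

print(average)  # 90.5
print(count)    # 2
```

Note that the missing value forces the Series to use a floating-point dtype, because NaN is a floating-point value.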
By default, a Series has integer indices numbered sequentially from 0. The following creates a Series of student grades from a list of integers:
In [1]: import pandas as pd
In [2]: grades = pd.Series([87, 100, 94])
The initializer also may be a tuple, a dictionary, an array, another Series or a single value. We’ll show a single value momentarily.
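Here is a brief sketch of those other initializers. The variable names and the dictionary keys are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

# From a tuple: behaves like a list, with default indices 0, 1, 2
from_tuple = pd.Series((87, 100, 94))

# From a dictionary: the keys become the indices
from_dict = pd.Series({'Wally': 87, 'Eva': 100, 'Sam': 94})

# From a NumPy array
from_array = pd.Series(np.array([87, 100, 94]))

# From another Series
from_series = pd.Series(from_tuple)

print(list(from_dict.index))  # ['Wally', 'Eva', 'Sam']
```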
Pandas displays a Series in two-column format with the indices left aligned in the left column and the values right aligned in the right column. After listing the Series elements, pandas shows the data type (dtype) of the underlying array’s elements:
In [3]: grades
Out[3]:
0     87
1    100
2     94
dtype: int64
Note how easy it is to display a Series in this format, compared to the corresponding code for displaying a list in the same two-column format.
You can create a Series of elements that all have the same value:
In [4]: pd.Series(98.6, range(3))
Out[4]:
0    98.6
1    98.6
2    98.6
dtype: float64
The second argument is a one-dimensional iterable object (such as a list, an array or a range) containing the Series’ indices. The number of indices determines the number of elements.
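To see how the indices drive the size, consider this small sketch (the index labels are arbitrary). It also shows that when the data is a list rather than a single value, its length must match the number of indices:

```python
import pandas as pd

# Four indices, so the single value is repeated four times
temps = pd.Series(98.6, index=['a', 'b', 'c', 'd'])
length = len(temps)

# A list of data must match the index length; a mismatch raises ValueError
try:
    pd.Series([1, 2], index=range(3))
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True

print(length)             # 4
print(mismatch_rejected)  # True
```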
You can access a Series’ elements via square brackets containing an index:
In [5]: grades[0]
Out[5]: 87
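With the default indices, the label 0 and the position 0 refer to the same element. As a hedged aside that goes slightly beyond the snippet above, pandas also provides the loc and iloc accessors, which make it explicit whether you mean a label or a position:

```python
import pandas as pd

grades = pd.Series([87, 100, 94])

first = grades[0]             # looks up the label 0 (the default labels are 0, 1, 2)
by_label = grades.loc[0]      # explicitly label-based
by_position = grades.iloc[0]  # explicitly position-based
last = grades.iloc[-1]        # positions support negative offsets

print(first, by_label, by_position, last)  # 87 87 87 94
```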
We’ll continue the discussion in upcoming posts.