We have covered the Pandas library before, but I felt the need to explore it further, so I am starting this series again.
Pandas is an open source Python library for data analysis. It gives Python the ability to work with spreadsheet-like data for fast data loading, manipulating, aligning, and merging, among other functions. To give Python these enhanced features, Pandas introduces two new data types to Python: Series and DataFrame. The DataFrame represents your entire spreadsheet or rectangular data, whereas the Series is a single column of the DataFrame. A Pandas DataFrame can also be thought of as a dictionary or collection of Series objects.
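To make the DataFrame and Series relationship concrete, here is a minimal sketch. The toy column names and values are made up for illustration and are not part of the Gapminder data.

```python
import pandas as pd

# Hypothetical toy data, built from a dictionary of columns.
toy = pd.DataFrame({
    'name': ['Ada', 'Grace', 'Linus'],
    'age': [36, 45, 28],
})

# Selecting a single column by its name gives back a Series.
ages = toy['age']

print(type(toy))
print(type(ages))
```

The dictionary keys become the column names, which is why it is natural to think of a DataFrame as a collection of Series objects.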
Loading Your First Data Set
When given a data set, we first load it and begin looking at its structure and contents. The simplest way of
looking at a data set is to examine and subset specific rows and columns. We can see which type of information is stored in each column, and can start looking for patterns by aggregating descriptive statistics. Since Pandas is not part of the Python standard library, we have to first tell Python to load (import) the library.
import pandas
In our examples we will use the Gapminder data set, which originally comes from www.gapminder.org. The repository can be found at www.github.com/jennybc/gapminder.
With the library loaded, we can use the read_csv function to load a CSV data file. To access the read_csv function from Pandas, we use dot notation.
df = pandas.read_csv('../data/gapminder.tsv', sep='\t')
Note that by default the read_csv function reads a comma-separated file, but our Gapminder data are separated by tabs. We use the sep parameter to indicate a tab with \t.
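If you want to see the effect of sep without a file on disk, the sketch below reads two tiny in-memory texts instead. The use of io.StringIO and the two-column sample are assumptions made only for this illustration.

```python
import io
import pandas as pd

# Two tiny in-memory "files", so no file on disk is needed.
comma_text = "country,year\nAfghanistan,1952\n"
tab_text = "country\tyear\nAfghanistan\t1952\n"

# The default separator is a comma.
comma_df = pd.read_csv(io.StringIO(comma_text))

# For tab-separated text, tell read_csv to split on tabs.
tab_df = pd.read_csv(io.StringIO(tab_text), sep='\t')

# Forgetting sep='\t' leaves each line squeezed into a single column.
wrong_df = pd.read_csv(io.StringIO(tab_text))

print(tab_df.shape)
print(wrong_df.shape)
```

A data frame with only one column right after loading is usually a hint that the separator is wrong.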
print(df.head())
We use the head method so Python shows us only the first 5 rows from the data frame. So the output of our code will be:
country continent year lifeExp pop gdpPercap
0 Afghanistan Asia 1952 28.801 8425333 779.445314
1 Afghanistan Asia 1957 30.332 9240934 820.853030
2 Afghanistan Asia 1962 31.997 10267083 853.100710
3 Afghanistan Asia 1967 34.020 11537966 836.197138
4 Afghanistan Asia 1972 36.088 13079460 739.981106
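The head method also accepts the number of rows to show. A small sketch, using a made-up ten-row frame as a stand-in for the Gapminder data:

```python
import pandas as pd

# Hypothetical stand-in frame with ten rows.
demo = pd.DataFrame({'x': range(10)})

first_five = demo.head()     # with no argument, head returns the first 5 rows
first_three = demo.head(3)   # pass a number to ask for a different count

print(len(first_five))
print(first_three)
```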
When working with Pandas functions, it is common practice to give pandas the alias pd. Thus the following code is equivalent to the preceding example:
import pandas as pd
df = pd.read_csv('../data/gapminder.tsv', sep='\t')
We can check whether we are working with a Pandas DataFrame by using the built-in type function (i.e., it comes directly from Python, not any package such as Pandas).
print(type(df))
Output:
<class 'pandas.core.frame.DataFrame'>
The type function is handy when you begin working with many different types of Python objects and need to know which object you are currently working on.
The data set we loaded is currently saved as a Pandas DataFrame object and is relatively small. Every
DataFrame object has a shape attribute that will give us the number of rows and columns of the
DataFrame.
So if we execute this:
print(df.shape)
Our output will be:
(1704, 6)
The shape attribute returns a tuple in which the first value is the number of rows and the second
number is the number of columns. From the preceding results, we see our Gapminder data set has 1704 rows and 6 columns.
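Because shape is an ordinary tuple, it can be unpacked into two variables. A minimal sketch on a made-up frame (the names n_rows and n_cols are just illustrative):

```python
import pandas as pd

# Hypothetical frame with 3 rows and 2 columns.
demo = pd.DataFrame({'a': [1, 2, 3], 'b': [4.0, 5.0, 6.0]})

# Unpack the (rows, columns) tuple in one step.
n_rows, n_cols = demo.shape

print(n_rows, n_cols)
```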
Since shape is an attribute of the DataFrame, and not a function or method, it does not
have parentheses after it. If you made the mistake of putting parentheses after the shape attribute, Python would raise an error. So the code:
print(df.shape())
Should result in an error as shown below:
Traceback (most recent call last):
File "<ipython-input-1-e05f133c2628>", line 2, in <module>
print(df.shape())
TypeError: 'tuple' object is not callable
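If you are unsure whether something is an attribute or a method, the built-in callable function can tell you. The sketch below, on a made-up frame, also catches the error so the script keeps running:

```python
import pandas as pd

# Hypothetical one-column frame.
demo = pd.DataFrame({'a': [1, 2, 3]})

try:
    demo.shape()  # wrong on purpose: shape holds a tuple, it is not a method
except TypeError as err:
    message = str(err)
    print(message)

print(callable(demo.shape))  # shape is a plain tuple
print(callable(demo.head))   # head is a method
```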
Typically, when first looking at a data set, we want to know how many rows and columns there are (we just did that). To get the gist of which information it contains, we look at the columns. Like shape, the column names are available through an attribute of the DataFrame object, in this case the columns attribute.
print(df.columns)
Output:
Index(['country', 'continent', 'year', 'lifeExp', 'pop',
'gdpPercap'],
dtype='object')
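The Index returned by columns can be turned into a plain Python list or used for membership checks. A small sketch on a one-row frame built from the first Gapminder row shown earlier:

```python
import pandas as pd

# One-row stand-in frame with a subset of the Gapminder columns.
demo = pd.DataFrame({
    'country': ['Afghanistan'],
    'year': [1952],
    'lifeExp': [28.801],
})

# Convert the Index of column names to a regular list.
column_names = list(demo.columns)
print(column_names)

# Check whether a column exists before using it.
print('year' in demo.columns)
print('pop' in demo.columns)
```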
The Pandas DataFrame object is similar to the DataFrame-like objects found in other languages (e.g.,
Julia and R). Each column (Series) has to be the same type, whereas each row can contain mixed types. In our current example, we can expect the country column to be all strings and the year to be integers. However, it’s best to make sure that is the case by using the dtypes attribute or the info method.
print(df.dtypes)
Output:
country object
continent object
year int64
lifeExp float64
pop int64
gdpPercap float64
dtype: object
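The same check can be done one column at a time. The sketch below uses a two-row stand-in frame built from the first Gapminder rows. The select_dtypes and iloc calls are not covered in this post and are used here only to illustrate the point about columns and rows.

```python
import pandas as pd

# Two-row stand-in frame with a subset of the Gapminder columns.
demo = pd.DataFrame({
    'country': ['Afghanistan', 'Afghanistan'],
    'year': [1952, 1957],
    'lifeExp': [28.801, 30.332],
})

# A single column (Series) has exactly one dtype.
print(demo['year'].dtype)

# Keep only the numeric columns.
numeric = demo.select_dtypes(include='number')
print(list(numeric.columns))

# A row mixes strings, integers and floats, so Pandas falls back to object.
first_row = demo.iloc[0]
print(first_row.dtype)
```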
To get more information about our data:
print(df.info())
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country 1704 non-null object
continent 1704 non-null object
year 1704 non-null int64
lifeExp 1704 non-null float64
pop 1704 non-null int64
gdpPercap 1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
None
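The trailing None appears because info prints its summary itself and returns None, which print then displays. A minimal sketch on a made-up two-row frame shows this, and also how the summary can be captured as text through the buf parameter:

```python
import io
import pandas as pd

# Hypothetical two-row frame.
demo = pd.DataFrame({'year': [1952, 1957], 'lifeExp': [28.801, 30.332]})

# info() writes the summary to the screen and returns None.
result = demo.info()
print(result is None)

# To keep the summary as a string, write it into a buffer instead.
buffer = io.StringIO()
demo.info(buf=buffer)
summary_text = buffer.getvalue()
print('RangeIndex' in summary_text)
```

Calling df.info() on its own, without print, gives the same summary without the extra None line.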
Now that we’re able to load a simple data file, we want to be able to inspect its contents, and this will be the focus of my next post.