Thursday, August 6, 2020

Pandas DataFrame Basics

We have already covered the Pandas library before, but I felt the need to explore it further, so I am starting this series again.


Pandas is an open source Python library for data analysis. It gives Python the ability to work with spreadsheet-like data for fast data loading, manipulating, aligning, and merging, among other functions. To give Python these enhanced features, Pandas introduces two new data types to Python: Series and DataFrame. The DataFrame represents your entire spreadsheet or rectangular data, whereas the Series is a single column of the DataFrame. A Pandas DataFrame can also be thought of as a dictionary or collection of Series objects.
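To make the "dictionary of Series" idea concrete, here is a minimal sketch. The three sample rows and the variable names (country, year, small_df, column_types) are made up for illustration and are not part of the Gapminder file used later.

```python
import pandas

# Hypothetical sample values; each column starts life as its own Series.
country = pandas.Series(['Afghanistan', 'Albania', 'Algeria'])
year = pandas.Series([1952, 1952, 1952])

# Build the DataFrame from a dictionary mapping column names to Series.
small_df = pandas.DataFrame({'country': country, 'year': year})

# Like a dictionary, the DataFrame can be walked as (name, Series) pairs.
column_types = {}
for name, column in small_df.items():
    column_types[name] = type(column).__name__

print(column_types)
```

Every value that comes back from the loop is a Series, which is what the description above suggests.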

Loading Your First Data Set

When given a data set, we first load it and begin looking at its structure and contents. The simplest way of looking at a data set is to examine and subset specific rows and columns. We can see which type of information is stored in each column, and can start looking for patterns by aggregating descriptive statistics. Since Pandas is not part of the Python standard library, we first have to tell Python to load (import) the library.

import pandas

In our examples we will use the Gapminder data set, which originally comes from www.gapminder.org. The repository can be found at: www.github.com/jennybc/gapminder.

With the library loaded, we can use the read_csv function to load a CSV data file. To access the read_csv function from Pandas, we use dot notation.

df = pandas.read_csv('../data/gapminder.tsv', sep='\t')

Note that by default the read_csv function reads a comma-separated file, whereas our Gapminder data are separated by tabs. We therefore use the sep parameter and indicate a tab with \t.
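If you do not have the Gapminder file at hand, the effect of the sep parameter can be tried on a small in-memory sample. This is only a sketch: the text in tsv_text is a made-up three-column excerpt, and io.StringIO stands in for a file on disk.

```python
import io
import pandas

# A tiny tab-separated sample held in memory (illustrative, not the real file).
tsv_text = (
    "country\tyear\tpop\n"
    "Afghanistan\t1952\t8425333\n"
    "Afghanistan\t1957\t9240934\n"
)

# sep='\t' tells read_csv to split each line on tab characters.
with_tabs = pandas.read_csv(io.StringIO(tsv_text), sep='\t')

# With the default comma separator, every line ends up in a single column.
with_commas = pandas.read_csv(io.StringIO(tsv_text))

print(with_tabs.shape)
print(with_commas.shape)
```

The first frame has three columns, while the second has only one, because no commas were found to split on.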

print(df.head())

We use the head method so Python shows us only the first 5 rows from the data frame. So the output of our code will be:

       country continent  year  lifeExp       pop   gdpPercap
0  Afghanistan      Asia  1952   28.801   8425333  779.445314
1  Afghanistan      Asia  1957   30.332   9240934  820.853030
2  Afghanistan      Asia  1962   31.997  10267083  853.100710
3  Afghanistan      Asia  1967   34.020  11537966  836.197138
4  Afghanistan      Asia  1972   36.088  13079460  739.981106


When working with Pandas functions, it is common practice to give pandas the alias pd. Thus the following code is equivalent to the import and read_csv call shown earlier:

import pandas as pd
df = pd.read_csv('../data/gapminder.tsv', sep='\t')

We can check whether we are working with a Pandas DataFrame by using the built-in type function (i.e., it comes directly from Python, not any package such as Pandas).

print(type(df))

Output: <class 'pandas.core.frame.DataFrame'>

The type function is handy when you begin working with many different types of Python objects and need to know which object you are currently working on.
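As a rough illustration of why this helps, the sketch below calls type on a few related objects. The frame and its two rows are a hypothetical stand-in for the Gapminder data.

```python
import pandas as pd

# Hypothetical stand-in for the loaded data.
frame = pd.DataFrame({'year': [1952, 1957]})

# type() reports the class of whatever object it is handed.
frame_type = type(frame)            # the whole table
column_type = type(frame['year'])   # a single column
list_type = type([1952, 1957])      # a plain Python list

print(frame_type)
print(column_type)
print(list_type)
```

The table, one of its columns, and an ordinary list each report a different class, even though they hold the same numbers.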

The data set we loaded is currently saved as a Pandas DataFrame object and is relatively small. Every DataFrame object has a shape attribute that gives us the number of rows and columns of the DataFrame.

So if we execute this:

print(df.shape)

Our output will be:

(1704, 6)

The shape attribute returns a tuple in which the first value is the number of rows and the second value is the number of columns. From the preceding results, we see our Gapminder data set has 1704 rows and 6 columns.
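Because shape is an ordinary tuple, it can be unpacked or indexed like any other tuple. A small sketch, using a hypothetical three-row frame named demo rather than the full data set:

```python
import pandas as pd

# Hypothetical three-row frame; the values mirror the first rows shown earlier.
demo = pd.DataFrame({
    'year': [1952, 1957, 1962],
    'lifeExp': [28.801, 30.332, 31.997],
})

# Unpack the tuple into two separate names.
n_rows, n_cols = demo.shape

print(n_rows, n_cols)

# The row count is also what len() reports for a DataFrame.
print(len(demo) == demo.shape[0])
```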

Since shape is an attribute of the DataFrame, and not a function or method, it is written without parentheses after its name. If you made the mistake of putting parentheses after the shape attribute, Python would raise an error. So the code:

print(df.shape())

should result in an error, as shown below:

Traceback (most recent call last):
  File "<ipython-input-1-e05f133c2628>", line 2, in <module>
    print(df.shape())
TypeError: 'tuple' object is not callable
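One way to tell an attribute from a method before making this mistake is the built-in callable function. The sketch below is only an illustration on a hypothetical two-row frame called demo; it also catches the error so the script keeps running.

```python
import pandas as pd

# Hypothetical two-row frame.
demo = pd.DataFrame({'year': [1952, 1957]})

# shape is plain data (a tuple), while head is a method that can be called.
shape_is_callable = callable(demo.shape)
head_is_callable = callable(demo.head)
print(shape_is_callable, head_is_callable)

# Calling the tuple raises TypeError; catch it to inspect the message.
try:
    demo.shape()
except TypeError as err:
    message = str(err)
    print(message)
```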


Typically, when first looking at a data set, we want to know how many rows and columns there are (we just did that). To get the gist of which information it contains, we look at the columns. The column names, like the shape, are available through an attribute of the DataFrame object, in this case the columns attribute.

print(df.columns)

Output:

Index(['country', 'continent', 'year', 'lifeExp', 'pop', 'gdpPercap'], dtype='object')
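The Index object returned by columns can be turned into a plain list or used for membership checks. A short sketch on a hypothetical one-row frame named demo:

```python
import pandas as pd

# Hypothetical one-row frame with three of the Gapminder column names.
demo = pd.DataFrame({
    'country': ['Afghanistan'],
    'year': [1952],
    'pop': [8425333],
})

# Convert the Index of column names to an ordinary Python list.
column_names = list(demo.columns)
print(column_names)

# Membership tests work directly on the Index.
has_year = 'year' in demo.columns
has_continent = 'continent' in demo.columns
print(has_year, has_continent)
```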

The Pandas DataFrame object is similar to the DataFrame-like objects found in other languages (e.g., Julia and R). All values within a column (Series) have to be the same type, whereas each row can contain mixed types. In our current example, we can expect the country column to be all strings and the year to be integers. However, it’s best to make sure that is the case by using the dtypes attribute or the info method.

print(df.dtypes)

Output:

country       object
continent     object
year           int64
lifeExp      float64
pop            int64
gdpPercap    float64
dtype: object
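The difference between column types and row types can be seen on a small example. This is a sketch on a hypothetical two-row frame called demo; the iloc accessor used to pull out a row is not covered in this post and appears here only for illustration.

```python
import pandas as pd

# Hypothetical two-row frame using values from the first rows shown earlier.
demo = pd.DataFrame({
    'country': ['Afghanistan', 'Afghanistan'],
    'year': [1952, 1957],
    'lifeExp': [28.801, 30.332],
})

# Each column has a single type of its own.
year_dtype = demo['year'].dtype
life_dtype = demo['lifeExp'].dtype
print(year_dtype, life_dtype)

# A row cuts across columns, so it holds mixed values and
# falls back to the generic object type.
first_row = demo.iloc[0]
print(first_row.dtype)
```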


To get more information about our data:

print(df.info())

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1704 entries, 0 to 1703
Data columns (total 6 columns):
country      1704 non-null object
continent    1704 non-null object
year         1704 non-null int64
lifeExp      1704 non-null float64
pop          1704 non-null int64
gdpPercap    1704 non-null float64
dtypes: float64(2), int64(2), object(2)
memory usage: 80.0+ KB
None
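The info method writes its report directly and returns None, so wrapping the call in print adds a trailing None line. If you would rather keep the report as a string, one option is the buf parameter of info, sketched below on a hypothetical two-row frame named demo.

```python
import io
import pandas as pd

# Hypothetical two-row frame.
demo = pd.DataFrame({'year': [1952, 1957], 'lifeExp': [28.801, 30.332]})

# Send the report to an in-memory buffer instead of the screen.
buffer = io.StringIO()
result = demo.info(buf=buffer)
summary = buffer.getvalue()

print(result)    # info() itself returns None
print(summary)   # the report, now held as a regular string
```

Calling df.info() on its own line, without print, gives the same report on screen without the extra None.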

Now that we are able to load a simple data file, we want to be able to inspect its contents, and this will be the focus of my next post.