Whether you are manually inputting data or creating a small test example, knowing how to create dataframes without loading data from a file is a useful skill. It is especially helpful when you are asking a question on Stack Overflow about an error.
Creating a Series
The Pandas Series is a one-dimensional container, similar to the built-in Python list. It is the data type that represents each column of the DataFrame. All the values within a single column must share the same dtype. Since a dataframe can be thought of as a dictionary of Series objects, where each key is the column name and each value is a Series, we can conclude that a Series is very similar to a Python list, except that every element must be of the same dtype. Those who have used the numpy library will recognize this as the same behavior as the ndarray.
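To make the relationship between columns and Series concrete, here is a minimal sketch. The column names and values are made up for illustration and are not part of the original example.

```python
import pandas as pd

# hypothetical two-column dataframe, used only for illustration
df = pd.DataFrame({'fruit': ['banana', 'apple'], 'count': [42, 7]})

# pulling out a single column gives back a Series
col = df['count']
print(type(col))

# a Series has exactly one dtype for all of its values
print(col.dtype)

# each column of the dataframe reports its own dtype
print(df.dtypes)
```

Each column carries a single dtype, while different columns of the same dataframe are free to have different dtypes.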
The easiest way to create a Series is to pass in a Python list. If we pass in a list of mixed types, the most general dtype that can represent all of the values will be used. Typically the dtype will be object.
import pandas as pd
s = pd.Series(['banana', 42])
print(s)
Output
0    banana
1        42
dtype: object
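As a small sketch of how the dtype is chosen, compare a list that mixes integers and floats with one that mixes strings and integers. The variable names here are illustrative only.

```python
import pandas as pd

# integers and floats can both be represented as floats
ints_and_floats = pd.Series([1, 2.5])
print(ints_and_floats.dtype)

# strings and integers have no common numeric type, so pandas uses object
strs_and_ints = pd.Series(['banana', 42])
print(strs_and_ints.dtype)

# inside an object Series, each element keeps its original Python type
print(type(strs_and_ints.iloc[0]), type(strs_and_ints.iloc[1]))
```

The first Series ends up as float64, while the second falls back to object because nothing more specific fits both values.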
Notice on the left that the “row number” is shown. This is actually the index for the series. It is similar to the row name and row index we saw in previous posts for dataframes. It implies that we can actually assign a “name” to values in our series.
# manually assign index values to a series
# by passing a Python list
s = pd.Series(['Wes McKinney', 'Creator of Pandas'],
              index=['Person', 'Who'])
print(s)
Output
Person         Wes McKinney
Who       Creator of Pandas
dtype: object
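Once the values have names, they can be looked up by those names. The following is a short sketch using the same series; it assumes nothing beyond the standard pandas indexing accessors.

```python
import pandas as pd

s = pd.Series(['Wes McKinney', 'Creator of Pandas'],
              index=['Person', 'Who'])

# look up a value by its index label
print(s['Person'])
print(s.loc['Who'])

# positional access still works through iloc
print(s.iloc[0])

# the labels themselves live in the index
print(list(s.index))
```

Using loc for labels and iloc for positions keeps the two kinds of lookup explicit.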
Creating a DataFrame
A DataFrame can be thought of as a dictionary of Series objects. This is why dictionaries are the most common way of creating a DataFrame. The key represents the column name, and the values are the contents of the column.
scientists = pd.DataFrame({
    'Name': ['Rosalind Franklin', 'William Gosset'],
    'Occupation': ['Chemist', 'Statistician'],
    'Born': ['1920-07-25', '1876-06-13'],
    'Died': ['1958-04-16', '1937-10-16'],
    'Age': [37, 61]})
print(scientists)
Output
Notice that, on older versions of Python and pandas, the column order is not guaranteed. If we look at the documentation for DataFrame (DataFrame documentation: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html ), we see that we can use the columns parameter to specify the column order. If we wanted to use the name column for the row index, we can use the index parameter.
scientists = pd.DataFrame(
    data={'Occupation': ['Chemist', 'Statistician'],
          'Born': ['1920-07-25', '1876-06-13'],
          'Died': ['1958-04-16', '1937-10-16'],
          'Age': [37, 61]},
    index=['Rosalind Franklin', 'William Gosset'],
    columns=['Occupation', 'Born', 'Died', 'Age'])
print(scientists)
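The columns and index parameters can be explored a little further. This sketch reuses the same data but deliberately passes only two column names, to show that columns selects as well as orders when the data is a dictionary. The row lookups at the end are an addition for illustration.

```python
import pandas as pd

scientists = pd.DataFrame(
    data={'Occupation': ['Chemist', 'Statistician'],
          'Born': ['1920-07-25', '1876-06-13'],
          'Died': ['1958-04-16', '1937-10-16'],
          'Age': [37, 61]},
    index=['Rosalind Franklin', 'William Gosset'],
    # only these two columns are kept, in this order
    columns=['Age', 'Occupation'])

print(list(scientists.columns))

# with names as the row index, rows can be looked up by label
row = scientists.loc['William Gosset']
print(row['Occupation'])

# a single cell: row label first, then column name
print(scientists.loc['William Gosset', 'Age'])
```

Setting the index at construction time means no separate Name column is needed for lookups.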
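The order is not guaranteed on older versions because, before Python 3.7, dictionaries did not promise to keep insertion order (pandas 0.23 and later preserve it on Python 3.6+). If we want an ordered dictionary on any version, we can use the OrderedDict from the collections module.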
from collections import OrderedDict
# note the round brackets after OrderedDict
# then we pass a list of 2-tuples
scientists = pd.DataFrame(OrderedDict([
    ('Name', ['Rosalind Franklin', 'William Gosset']),
    ('Occupation', ['Chemist', 'Statistician']),
    ('Born', ['1920-07-25', '1876-06-13']),
    ('Died', ['1958-04-16', '1937-10-16']),
    ('Age', [37, 61])
]))
print(scientists)
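On a current setup the OrderedDict is usually not required. The sketch below assumes Python 3.7 or later and pandas 0.23 or later, where a plain dictionary keeps its insertion order and the DataFrame constructor respects it.

```python
import pandas as pd

# assumes Python 3.7+ and pandas 0.23+, where dict order is preserved
scientists = pd.DataFrame({
    'Name': ['Rosalind Franklin', 'William Gosset'],
    'Occupation': ['Chemist', 'Statistician'],
    'Born': ['1920-07-25', '1876-06-13'],
    'Died': ['1958-04-16', '1937-10-16'],
    'Age': [37, 61]})

# the columns come out in the order the keys were written
print(list(scientists.columns))
```

If the code has to run on older versions, the OrderedDict approach or the columns parameter remains the safer choice.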
In this post we saw how to create our own series and dataframe. In the next post we'll explore series in detail.