Friday, August 7, 2020

Inspecting contents of a simple data file (Subsetting Columns)


Now that we’re able to load a simple data file, we want to inspect its contents. We could print the whole dataframe, but today’s datasets often have too many cells for the printed output to be readable. A better approach is to inspect the data in parts, looking at various subsets of it. We already saw that the head method of a dataframe shows the first five rows of our data. This is useful for checking that the data loaded properly and for getting a sense of each column’s name and contents. Sometimes, however, we may want to see only particular rows, columns, or values from our data.

Subsetting Columns

If we want to examine multiple columns, we can specify them by names, positions, or ranges.

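1. If we want only a specific column from our data, we can access it by putting the column name in square brackets. A single column selected this way comes back as a pandas Series rather than a dataframe, which is why the output below ends with a Name and dtype line. See the following program: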

# just get the country column and save it to its own variable
country_df = df['country']
# show the first 5 observations
print(country_df.head())

Output:


0    Afghanistan
1    Afghanistan
2    Afghanistan
3    Afghanistan
4    Afghanistan
Name: country, dtype: object

 

# show the last 5 observations
print(country_df.tail())

 

Output:

1699    Zimbabwe
1700    Zimbabwe
1701    Zimbabwe
1702    Zimbabwe
1703    Zimbabwe
Name: country, dtype: object
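To make the behaviour of single-column selection easier to try without the original data file, here is a minimal sketch. The small df built here is a hypothetical stand-in with made-up rows, not the file loaded in this series; only the column names match the post.

```python
import pandas as pd

# Hypothetical stand-in for the post's df, which is loaded from a file
df = pd.DataFrame({
    'country': ['Afghanistan', 'Afghanistan', 'Zimbabwe', 'Zimbabwe'],
    'continent': ['Asia', 'Asia', 'Africa', 'Africa'],
    'year': [1952, 1957, 2002, 2007],
})

country_df = df['country']

# Single-bracket selection gives a Series, not a DataFrame
print(type(country_df))

# The Series keeps the column name and the original row labels
print(country_df.name)
print(country_df.index.tolist())

# head() and tail() accept a row count when five is not what we want
print(country_df.head(n=2))
print(country_df.tail(n=2))
```

The variable keeps the country_df name used above, even though the object it holds is a Series.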

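2. To select multiple columns by name, we pass a Python list of column names between the square brackets. This may look a bit strange, since there are two sets of square brackets: the outer pair does the selection on the dataframe, and the inner pair creates the list. The result is a dataframe containing only the listed columns. See the following program: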

# Looking at country, continent, and year
subset = df[['country', 'continent', 'year']]
print(subset.head())

 

Output:

       country continent  year
0  Afghanistan      Asia  1952
1  Afghanistan      Asia  1957
2  Afghanistan      Asia  1962
3  Afghanistan      Asia  1967
4  Afghanistan      Asia  1972

 

print(subset.tail())

Output:

       country continent  year
1699  Zimbabwe    Africa  1987
1700  Zimbabwe    Africa  1992
1701  Zimbabwe    Africa  1997
1702  Zimbabwe    Africa  2002
1703  Zimbabwe    Africa  2007
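The double brackets become less mysterious once the inner list is stored in its own variable. The sketch below assumes a small hypothetical df with made-up rows and an extra illustrative column called value, which is not part of the post's data.

```python
import pandas as pd

# Hypothetical stand-in for the post's df; 'value' is an invented extra column
df = pd.DataFrame({
    'country': ['Afghanistan', 'Afghanistan', 'Zimbabwe', 'Zimbabwe'],
    'continent': ['Asia', 'Asia', 'Africa', 'Africa'],
    'year': [1952, 1957, 2002, 2007],
    'value': [1.0, 2.0, 3.0, 4.0],
})

# The inner brackets are just an ordinary Python list of column names
cols = ['country', 'continent', 'year']

# Passing the list gives the same result as df[['country', 'continent', 'year']]
subset = df[cols]
print(type(subset))
print(subset.columns.tolist())

# Even a one-element list returns a DataFrame rather than a Series
one_col = df[['country']]
print(type(one_col))
```

Columns come back in the order given in the list, so the same approach can also be used to reorder columns.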

We can also print the entire subset dataframe instead of just its head or tail. I won’t do that here, as the output would take up an unnecessary amount of space, but feel free to try it on your own machine.
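If you do want to see every row, here is one possible way to do it. This is a sketch on a generated 100-row dataframe (hypothetical data, not the post's file); by default pandas shortens the printed form of long dataframes, so the example uses to_string and a temporary display option to get the full output.

```python
import pandas as pd

# Hypothetical 100-row stand-in for the post's subset dataframe
subset = pd.DataFrame({
    'country': ['Afghanistan'] * 50 + ['Zimbabwe'] * 50,
    'year': list(range(1900, 1950)) * 2,
})

# to_string() renders every row, regardless of the display settings
full_text = subset.to_string()
print(full_text)

# Alternatively, lift the row limit only inside this block
with pd.option_context('display.max_rows', None):
    print(subset)
```

Using option_context keeps the change temporary, so later prints go back to the shortened form.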

In the next post, we'll look at subsetting columns by index position.

