Monday, November 21, 2022

pandas DataFrames

A pandas DataFrame is a 2D labeled data structure with columns that can be of different types. A DataFrame can be thought of as a dictionary-like container for Series objects, where each key in the dictionary is a column label and each value is a Series.
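
The dictionary analogy can be sketched in a few lines. This is a minimal illustration rather than anything from the original post: the sample values and the variable names (names, emails, col) are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
names = pd.Series(['Jeff Russell', 'Jane Boorman'], index=[9001, 9002])
emails = pd.Series(['jeff.russell', 'jane.boorman'], index=[9001, 9002])

# Each dictionary key becomes a column label; each Series becomes a column.
df = pd.DataFrame({'names': names, 'emails': emails})

# Selecting a column by its label gives you back a Series.
col = df['names']
print(type(col))
print(list(df.columns))
```

Looking up df['names'] behaves much like looking up a key in a dictionary, which is why the comparison is a useful mental model.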

If you are familiar with relational databases, you’ll notice that a pandas DataFrame is similar to a regular SQL table. The figure below illustrates an example of a pandas DataFrame.


Notice that the DataFrame includes an index column. Like with Series, pandas uses zero-based numeric indexing for DataFrames by default.

However, you can replace the default index with one or more existing columns. The figure below shows the same DataFrame but with the Date column set as the index.




In this particular example, the index is a column of type date. In fact, pandas allows DataFrame index labels of almost any hashable type. The most commonly used index types are integers and strings. However, you are not limited to simple types. You might use labels of a hashable sequence type, such as tuples, or even an object type that is not built into Python; this could be a third-party type or your own object type.
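
Replacing the default index might look like the sketch below. The column names other than Date and all of the values are hypothetical, since the figures from the original post are not reproduced here; set_index() is the standard pandas method for this.

```python
import pandas as pd

# Hypothetical data; only the 'Date' column name comes from the text above.
df = pd.DataFrame({
    'Date': pd.to_datetime(['2022-01-10', '2022-01-11', '2022-01-12']),
    'Open': [100.0, 101.5, 99.8],
    'Close': [101.0, 100.2, 100.5],
})
print(df.index)  # default zero-based numeric index

# set_index() returns a new DataFrame; df itself is left unchanged.
by_date = df.set_index('Date')
print(by_date.index)

# Passing a list of columns builds an index with more than one level.
by_two = df.set_index(['Date', 'Open'])
print(by_two.index.nlevels)
```

After the call, Date is no longer a regular column of by_date; it serves as the row labels instead.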


Thursday, November 17, 2022

Combining Series into a DataFrame

Multiple Series can be combined to form a DataFrame. Let’s try this by creating another Series and combining it with the emps_names Series: 

data = ['jeff.russell','jane.boorman','tom.heints']

emps_emails = pd.Series(data,index=[9001,9002,9003], name ='emails')

emps_names.name = 'names'

df = pd.concat([emps_names,emps_emails], axis=1)

print(df)

To create the new Series, you call the Series() constructor, passing the following arguments: the list to be converted to a Series, the index labels of the Series, and the name of the Series.

You need to name Series before concatenating them into a DataFrame, because their names will become the names of the corresponding DataFrame columns. Since you didn’t name the emps_names Series when you created it earlier, you name it here by setting its name property to 'names'. After that, you can concatenate it with the emps_emails Series. You specify axis=1 in order to concatenate along the columns.

The resulting DataFrame looks like this:

             names        emails

9001  Jeff Russell  jeff.russell

9002  Jane Boorman  jane.boorman

9003    Tom Heints    tom.heints
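
One detail worth seeing in action is that concatenating along the columns aligns the Series on their index labels. The sketch below is self-contained and uses a hypothetical emails Series whose labels deliberately differ from the names Series (the 9004 entry and its value are made up for illustration).

```python
import pandas as pd

names = pd.Series(['Jeff Russell', 'Jane Boorman', 'Tom Heints'],
                  index=[9001, 9002, 9003], name='names')

# Hypothetical: no email for 9003, plus an extra entry for 9004.
emails = pd.Series(['jeff.russell', 'jane.boorman', 'ann.lee'],
                   index=[9001, 9002, 9004], name='emails')

# The Series names become the column labels; rows are matched by index label.
df = pd.concat([names, emails], axis=1)
print(df)

# Labels present in only one Series get a missing value in the other column.
print(df.isna().sum())
```

By default, pd.concat() keeps the union of the index labels, so no row is dropped; the gaps are filled with missing values.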


Monday, November 14, 2022

Accessing Data in a Series

To access an element in a Series, specify the Series name followed by the element’s index within square brackets, as shown here:

print(emps_names[9001])

This outputs the element corresponding to index 9001:

Jeff Russell

Alternatively, you can use the loc property of the Series object:

print(emps_names.loc[9001])

Although you’re using custom indices in this Series object, you can still access its elements by position (that is, use integer location–based indexing) via the iloc property. Here, for example, you print the first element in the Series: 

print(emps_names.iloc[0])

You can access multiple elements by their indices with a slice operation:

print(emps_names.loc[9001:9002])

This produces the following output:

9001 Jeff Russell

9002 Jane Boorman

Notice that slicing with loc includes the right endpoint (in this case, index 9002), whereas standard Python slicing excludes it.

You can also use slicing to define the range of elements by position rather than by index. For instance, the preceding results could instead be generated by the following code:

print(emps_names.iloc[0:2])

or simply as follows:

print(emps_names[0:2])

As you can see, unlike slicing with loc, slicing with [] or iloc works the same as usual Python slicing: the start position is included but the stop is not. Thus, [0:2] leaves out the element in position 2 and returns only the first two elements.
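
The difference between the two slicing styles can be checked side by side. This is a small self-contained sketch that rebuilds the emps_names Series from the earlier post; the variable names by_label and by_position are introduced here only for illustration.

```python
import pandas as pd

emps_names = pd.Series(['Jeff Russell', 'Jane Boorman', 'Tom Heints'],
                       index=[9001, 9002, 9003])

# Label-based slice: both endpoints (9001 and 9002) are included.
by_label = emps_names.loc[9001:9002]

# Position-based slice: position 0 is included, position 2 is not.
by_position = emps_names.iloc[0:2]

# Both slices select the same two elements.
print(by_label.equals(by_position))
```

Using loc and iloc explicitly, rather than bare square brackets, makes it clear to a reader whether a number is meant as a label or as a position.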


Thursday, November 10, 2022

pandas Series

A pandas Series is a 1D labeled array. By default, elements in a Series are labeled with integers according to their position, like in a Python list.

However, you can specify custom labels instead. These labels need not be unique, but they must be of a hashable type, such as integers, floats, strings, or tuples.
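
To see what non-unique labels mean in practice, consider the sketch below. The Series s and its values are hypothetical and exist only to illustrate the point.

```python
import pandas as pd

# Hypothetical Series in which the label 'a' appears twice.
s = pd.Series([10, 20, 30], index=['a', 'b', 'a'])

print(s.index.is_unique)

# A repeated label selects every matching element, returned as a Series.
print(s.loc['a'])

# A label that occurs once selects a single value.
print(s.loc['b'])
```

Because a lookup on a repeated label returns several elements instead of one, unique labels are usually easier to work with, even though pandas does not require them.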

The elements of a Series can be of any type (integers, strings, floats, Python objects, and so on), but a Series works best if all its elements are of the same type. Ultimately, a Series may become one column in a larger DataFrame, and it’s unlikely you’ll want to store different kinds of data in the same column. 

Creating a Series

There are several ways to create a Series. In most cases, you feed it some kind of 1D dataset. Here’s how you create a Series from a Python list:

import pandas as pd

data = ['Jeff Russell','Jane Boorman','Tom Heints']

emps_names = pd.Series(data)

print(emps_names)

You start by importing the pandas library and aliasing it as pd. Then you create a list of items to be used as the data for the Series. Finally, you create the Series by passing the list to the Series constructor.

This gives you a Series with numeric indices set by default, starting from 0:

0 Jeff Russell

1 Jane Boorman

2 Tom Heints

dtype: object

The dtype attribute indicates the type of the underlying data for the given Series. By default, pandas uses the data type object to store strings.
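
As a rough illustration of how dtype depends on the data, the sketch below builds three small Series from hypothetical values. Note that the exact dtype reported for string data may depend on the pandas version you have installed, so the example sticks to numeric and mixed data.

```python
import pandas as pd

# Hypothetical values, chosen only to show different dtypes.
ints = pd.Series([1, 2, 3])
floats = pd.Series([1.5, 2.5])
mixed = pd.Series([1, 'two', 3.0])

# Homogeneous numeric data gets a numeric dtype;
# mixed data falls back to the generic object dtype.
print(ints.dtype, floats.dtype, mixed.dtype)
```

The fallback to object for mixed data is one reason a Series works best when all of its elements share a type.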

You can create a Series with user-defined indices as follows:

data = ['Jeff Russell','Jane Boorman','Tom Heints']

emps_names = pd.Series(data,index=[9001,9002,9003])

print(emps_names)

This time the data in the emps_names Series object appears as follows:

9001 Jeff Russell

9002 Jane Boorman

9003 Tom Heints

dtype: object



Monday, November 7, 2022

Using NumPy Statistical Functions

NumPy’s statistical functions allow you to analyze the contents of an array. For example, you can find the maximum value of an entire array or the maximum value of an array along a given axis.

Let’s say you want to find the maximum value in the salary_bonus array you created in the previous post. You can do this with the NumPy array’s max() method:

print(salary_bonus.max())

The function returns the maximum amount paid in the past three months to any employee in the dataset:

3400

NumPy can also find the maximum value of an array along a given axis. If you want to determine the maximum amount paid to each employee in the past three months, you can use NumPy’s amax() function, as shown here:

print(np.amax(salary_bonus, axis = 1))

By specifying axis = 1, you instruct amax() to search horizontally across the columns for a maximum in the salary_bonus array, thus applying the function across each row. This calculates the maximum monthly amount paid to each employee in the past three months:

[3400 3200 3000]

Similarly, you can calculate the maximum amount paid each month to any employee by changing the axis parameter to 0: 

print(np.amax(salary_bonus, axis = 0))

The results are as follows:

[3200 3400 3400] 
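
The calls above can be run on their own with the sketch below. It rebuilds salary_bonus directly from the values printed in the element-wise operations post; the extra mean() call is an addition for illustration and is not part of the original walkthrough.

```python
import numpy as np

# Values copied from the salary_bonus output shown in the previous post.
salary_bonus = np.array([[3200, 3400, 3400],
                         [3200, 3100, 3200],
                         [2500, 3000, 2900]])

print(salary_bonus.max())           # maximum over the whole array
print(np.max(salary_bonus, axis=1)) # maximum per row (per employee)
print(np.max(salary_bonus, axis=0)) # maximum per column (per month)

# Other statistical functions accept the same axis argument.
print(salary_bonus.mean(axis=1))
```

Here np.max() is used in place of np.amax(); the two names refer to the same function, so the results match those shown above.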


Thursday, November 3, 2022

Performing Element-Wise Operations on NumPy Arrays

It’s easy to perform element-wise operations on multiple NumPy arrays of the same dimensions. For example, you can add the base_salary and bonus arrays together to determine the total amount paid each month to each employee:

salary_bonus = base_salary + bonus

print(type(salary_bonus))

print(salary_bonus)

As you can see, the addition operation is a one-liner. The resulting dataset is a NumPy array too, in which each element is the sum of the corresponding elements in the base_salary and bonus arrays:    

<class 'numpy.ndarray'>

[[3200 3400 3400]

[3200 3100 3200]

[2500 3000 2900]]
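
For a version you can run in isolation, here is a sketch with stand-in inputs. The base_salary and bonus values below are hypothetical, since the arrays were defined in an earlier post that is not shown here; they were picked only so that their sum matches the output above.

```python
import numpy as np

# Hypothetical inputs: one row per employee, one column per month.
base_salary = np.array([[3000, 3000, 3000],
                        [3000, 3000, 3000],
                        [2500, 2500, 2500]])
bonus = np.array([[200, 400, 400],
                  [200, 100, 200],
                  [0, 500, 400]])

# The + operator adds the arrays element by element.
salary_bonus = base_salary + bonus

print(type(salary_bonus))
print(salary_bonus)
```

Both arrays have the same shape, (3, 3), so each element of the result is simply the sum of the two elements in the same position.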

