Thursday, December 1, 2022

Components and applications of pandas

The pandas library consists of the following components:

• pandas/core: This contains the implementations of the basic data structures of pandas, such as Series and DataFrame. These structures are the basic tools for data manipulation and are used extensively by data scientists.

• pandas/src: This consists of algorithms that provide the basic functionalities of pandas. These functionalities are part of the architecture of pandas, which you will not be using explicitly. This layer is written in C or Cython.

• pandas/io: This comprises toolsets for the input and output of files and data. These toolsets facilitate data input from sources such as CSV and text and allow you to write data to formats such as text and CSV.

• pandas/tools: This layer contains all the code and algorithms for pandas functions and methods, such as merge, join, and concat.

• pandas/sparse: This contains the functionalities for handling missing values within its data structures, such as DataFrames and Series.

• pandas/stats: This contains a set of tools for handling statistical functions such as regression and classification.

• pandas/util: This contains all the utilities for debugging the library.

• pandas/rpy: This is the interface for connecting to R.
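To make the merge and concat functions from pandas/tools concrete, here is a minimal sketch. The employee and salary tables are hypothetical sample data invented for this illustration; only pd.merge and pd.concat themselves are part of pandas.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
employees = pd.DataFrame({"emp_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
salaries = pd.DataFrame({"emp_id": [1, 2, 4], "salary": [3000, 2800, 2500]})

# merge: SQL-style inner join on the shared emp_id column.
# Only emp_id values present in both frames survive.
merged = pd.merge(employees, salaries, on="emp_id", how="inner")

# concat: stack two frames with the same columns on top of each other.
new_hires = pd.DataFrame({"emp_id": [5], "name": ["Dee"]})
stacked = pd.concat([employees, new_hires], ignore_index=True)

print(merged)
print(stacked)
```

An inner join is only one option here; passing how="left" or how="outer" to pd.merge would keep the unmatched rows as well.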

The versatility of its different architectural components makes pandas useful in many real-world applications. Various data-wrangling functionalities in pandas (such as merge, join, and concatenation) save time when building real-world applications. Some notable applications where the pandas library can come in handy are as follows:

• Recommendation systems

• Advertising

• Stock predictions

• Neuroscience

• Natural language processing (NLP)

The list goes on. What's more important to note is that these applications have an impact on people's daily lives. For this reason, learning pandas has the potential to give your analytics career a boost.



Monday, November 21, 2022

pandas DataFrames

A pandas DataFrame is a 2D labeled data structure with columns that can be of different types. A DataFrame can be thought of as a dictionary-like container for Series objects, where each key in the dictionary is a column label and each value is a Series.
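The dictionary analogy can be sketched in a few lines. The column names and values below are hypothetical; the point is only that each column label maps to a Series and that columns may have different types.

```python
import pandas as pd

# Hypothetical stock data for illustration.
prices = pd.Series([10.5, 11.0, 10.8])
volumes = pd.Series([1200, 1500, 900])

# Each dictionary key becomes a column label; each value is a Series.
df = pd.DataFrame({"Close": prices, "Volume": volumes})

# Looking up a column by its label returns a Series, as with a dictionary.
close_column = df["Close"]

print(type(close_column).__name__)
print(df.dtypes)  # one column of floats, one of integers
```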

If you are familiar with relational databases, you’ll notice that a pandas DataFrame is similar to a regular SQL table. The figure below illustrates an example of a pandas DataFrame.


Notice that the DataFrame includes an index column. As with Series, pandas uses zero-based numeric indexing for DataFrames by default.

However, you can replace the default index with one or more existing columns. The figure below shows the same DataFrame but with the Date column set as the index.




In this particular example, the index is a column of type date. In fact, pandas allows you to have DataFrame indexes of many types, as long as the labels are hashable. The most commonly used index types are integers and strings. However, you are not limited to using only simple types. You might define an index of a sequence type, such as a tuple, or even use an object type that is not built into Python; this could be a third-party type or even your own object type.
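Since the figures are not reproduced here, the following sketch shows the same idea in code. The dates and prices are hypothetical stand-ins for the data in the figures; set_index is the pandas method that promotes an existing column to the index.

```python
import pandas as pd

# Hypothetical data standing in for the DataFrame shown in the figures.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2022-01-10", "2022-01-11", "2022-01-12"]),
    "Close": [10.5, 11.0, 10.8],
})

# By default the index is zero-based and numeric.
default_index = list(df.index)

# Replace the default index with the existing Date column.
by_date = df.set_index("Date")

# Rows can now be looked up by date label.
close = by_date.loc[pd.Timestamp("2022-01-11"), "Close"]

print(by_date)
```

Note that set_index returns a new DataFrame and leaves df unchanged.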


Thursday, November 17, 2022

Combining Series into a DataFrame

Multiple Series can be combined to form a DataFrame. Let’s try this by creating another Series and combining it with the emps_names Series: 

data = ['jeff.russell','jane.boorman','tom.heints']

emps_emails = pd.Series(data,index=[9001,9002,9003], name ='emails')

emps_names.name = 'names'

df = pd.concat([emps_names,emps_emails], axis=1)

print(df)

To create the new Series, you call the Series() constructor, passing the following arguments: the list to be converted to a Series, the indices of the Series, and the name of the Series.

You need to name Series before concatenating them into a DataFrame, because their names will become the names of the corresponding DataFrame columns. Since you didn’t name the emps_names Series when you created it earlier, you name it here by setting its name property to 'names'. After that, you can concatenate it with the emps_emails Series. You specify axis=1 in order to concatenate along the columns.

The resulting DataFrame looks like this:

             names        emails
9001  Jeff Russell  jeff.russell
9002  Jane Boorman  jane.boorman
9003    Tom Heints    tom.heints
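One detail worth seeing in code is that concatenation along the columns aligns the Series by index label, not by position. The sketch below reuses names and emails from the example, but deliberately gives the two Series different (hypothetical) index sets so the alignment is visible.

```python
import pandas as pd

# Two Series whose indices only partly overlap (hypothetical variation
# of the example above, chosen to show label alignment).
names = pd.Series(["Jeff Russell", "Jane Boorman"],
                  index=[9001, 9002], name="names")
emails = pd.Series(["jane.boorman", "tom.heints"],
                   index=[9002, 9003], name="emails")

# axis=1 places each Series in its own column; rows are matched by label.
df = pd.concat([names, emails], axis=1)

print(df)  # labels missing from one Series show up as missing values
```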


Monday, November 14, 2022

Accessing Data in a Series

To access an element in a Series, specify the Series name followed by the element’s index within square brackets, as shown here:

print(emps_names[9001])

This outputs the element corresponding to index 9001:

Jeff Russell

Alternatively, you can use the loc property of the Series object:

print(emps_names.loc[9001])

Although you’re using custom indices in this Series object, you can still access its elements by position (that is, use integer location–based indexing) via the iloc property. Here, for example, you print the first element in the Series: 

print(emps_names.iloc[0])

You can access multiple elements by their indices with a slice operation:

print(emps_names.loc[9001:9002])

This produces the following output:

9001 Jeff Russell

9002 Jane Boorman

Notice that slicing with loc includes the right endpoint (in this case, index 9002), whereas usually Python slice syntax does not.

You can also use slicing to define the range of elements by position rather than by index. For instance, the preceding results could instead be generated by the following code:

print(emps_names.iloc[0:2])

or simply as follows:

print(emps_names[0:2])

As you can see, unlike slicing with loc, slicing with [] or iloc works the same as usual Python slicing: the start position is included but the stop is not. Thus, [0:2] leaves out the element in position 2 and returns only the first two elements.


Thursday, November 10, 2022

pandas Series

A pandas Series is a 1D labeled array. By default, elements in a Series are labeled with integers according to their position, like in a Python list.

However, you can specify custom labels instead. These labels need not be unique, but they must be of a hashable type, such as integers, floats, strings, or tuples.

The elements of a Series can be of any type (integers, strings, floats, Python objects, and so on), but a Series works best if all its elements are of the same type. Ultimately, a Series may become one column in a larger DataFrame, and it’s unlikely you’ll want to store different kinds of data in the same column. 
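Two of the points above, non-unique labels and mixed element types, are easy to check in a short sketch. The values and labels below are hypothetical.

```python
import pandas as pd

# Hypothetical data: the label "a" appears twice.
s = pd.Series([10, 20, 30], index=["a", "b", "a"])

# A duplicated label returns every matching element as a Series...
dupes = s.loc["a"]
# ...while a unique label returns a single value.
single = s.loc["b"]

# Mixed element types are allowed, but pandas falls back to the
# generic object dtype instead of a specialized numeric one.
mixed = pd.Series([1, "two", 3.0])
uniform = pd.Series([1, 2, 3])

print(dupes)
print(mixed.dtype, uniform.dtype)
```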

Creating a Series

There are several ways to create a Series. In most cases, you feed it some kind of 1D dataset. Here’s how you create a Series from a Python list:

import pandas as pd

data = ['Jeff Russell','Jane Boorman','Tom Heints']

emps_names = pd.Series(data)

print(emps_names)

You start by importing the pandas library and aliasing it as pd. Then you create a list of items to be used as the data for the Series. Finally, you create the Series, passing the list to the Series constructor.

This gives you a Series with numeric indices set by default, starting from 0:

0 Jeff Russell

1 Jane Boorman

2 Tom Heints

dtype: object

The dtype attribute indicates the type of the underlying data for the given Series. By default, pandas uses the data type object to store strings.

You can create a Series with user-defined indices as follows:

data = ['Jeff Russell','Jane Boorman','Tom Heints']

emps_names = pd.Series(data,index=[9001,9002,9003])

print(emps_names)

This time the data in the emps_names Series object appears as follows:

9001 Jeff Russell

9002 Jane Boorman

9003 Tom Heints

dtype: object


Monday, November 7, 2022

Using NumPy Statistical Functions

NumPy’s statistical functions allow you to analyze the contents of an array. For example, you can find the maximum value of an entire array or the maximum value of an array along a given axis.

Let’s say you want to find the maximum value in the salary_bonus array you created in the previous post. You can do this with the NumPy array’s max() method:

print(salary_bonus.max())

The function returns the maximum amount paid in the past three months to any employee in the dataset:

3400

NumPy can also find the maximum value of an array along a given axis. If you want to determine the maximum amount paid to each employee in the past three months, you can use NumPy’s amax() function, as shown here:

print(np.amax(salary_bonus, axis = 1))

By specifying axis = 1, you instruct amax() to search horizontally across the columns for a maximum in the salary_bonus array, thus applying the function across each row. This calculates the maximum monthly amount paid to each employee in the past three months:

[3400 3200 3000]

Similarly, you can calculate the maximum amount paid each month to any employee by changing the axis parameter to 0: 

print(np.amax(salary_bonus, axis = 0))

The results are as follows:

[3200 3400 3400] 
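For readers who want to run these calls on their own, here is a self-contained sketch. It rebuilds salary_bonus from the values printed in the element-wise operations post; the mean calculation at the end is an extra illustration of the same axis convention, not part of the original example.

```python
import numpy as np

# Rows are employees, columns are months (values from the printed array).
salary_bonus = np.array([
    [3200, 3400, 3400],
    [3200, 3100, 3200],
    [2500, 3000, 2900],
])

overall_max = salary_bonus.max()             # single largest payment
max_per_employee = salary_bonus.max(axis=1)  # collapse the columns
max_per_month = salary_bonus.max(axis=0)     # collapse the rows

# Other statistical functions follow the same axis convention.
mean_per_employee = salary_bonus.mean(axis=1)

print(overall_max, max_per_employee, max_per_month)
```

A handy way to remember the axis argument is that it names the dimension that disappears from the result.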


Thursday, November 3, 2022

Performing Element-Wise Operations on NumPy arrays

It’s easy to perform element-wise operations on multiple NumPy arrays of the same dimensions. For example, you can add the base_salary and bonus arrays together to determine the total amount paid each month to each employee:

salary_bonus = base_salary + bonus

print(type(salary_bonus))

print(salary_bonus)

As you can see, the addition operation is a one-liner. The resulting dataset is a NumPy array too, in which each element is the sum of the corresponding elements in the base_salary and bonus arrays:    

<class 'numpy.ndarray'>

[[3200 3400 3400]

[3200 3100 3200]

[2500 3000 2900]]
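The base_salary and bonus arrays are defined in an earlier post that is not shown here, so the sketch below uses hypothetical values, chosen only so that their sum reproduces the output above.

```python
import numpy as np

# Hypothetical inputs: rows are employees, columns are months.
base_salary = np.array([
    [3000, 2900, 3100],
    [2600, 2800, 3000],
    [2300, 2750, 2500],
])
bonus = np.array([
    [200, 500, 300],
    [600, 300, 200],
    [200, 250, 400],
])

# The + operator works element by element on arrays of the same shape.
salary_bonus = base_salary + bonus

# Other arithmetic operators behave the same way, for example the
# share of each payment that came from the bonus.
bonus_share = bonus / salary_bonus

print(salary_bonus)
```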



Wednesday, October 26, 2022

tox

tox is a tool to automatically manage virtual environments, usually for tests and builds. It is used to make sure that those run in well-defined environments and is smart about caching them to reduce churn. True to its roots as a test-running tool, tox is configured in terms of test environments.

tox itself is a PyPI package usually installed in a virtual environment. Because tox creates ad hoc temporary virtual environments for testing, the virtual environment tox is installed in can be common to many projects. A common pattern is to create a virtual environment dedicated to tox.

$ python -m venv ~/.venvs/tox

$ ~/.venvs/tox/bin/python -m pip install tox

$ alias tox=~/.venvs/tox/bin/tox

tox uses a unique ini-based configuration format. Remembering the subtleties of this format can be hard, which can make writing configurations difficult. However, the format is powerful: once mastered, it can express test and build runs clearly and concisely.

One thing that tox lacks is a notion of dependencies between build steps. This means that those are usually managed from the outside by running specific test runs after others and sharing artifacts somewhat ad hoc.

A tox environment more or less corresponds to a section in the configuration file. By default, tox uses the tox.ini file.

[testenv:some-name]

.

.

.

Note that if the name of the environment contains pyNM (for example, py36), then tox defaults to using CPython, the standard Python implementation, version N.M (3.6, in this case) as the Python interpreter for that test environment.

tox also supports name-based environment guessing for more esoteric implementations of Python. For example, PyPy, an implementation of Python in Python, is supported with the name pypyNM.
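The naming convention can be illustrated with a small sketch. This is not tox's actual implementation; guess_interpreter is a hypothetical helper that only mimics the pyNM and pypyNM rule described above, checking each dash-separated part of the environment name.

```python
import re

def guess_interpreter(env_name):
    """Return an interpreter name such as 'python3.6', or None.

    Hypothetical helper: a simplified imitation of name-based guessing.
    """
    for part in env_name.split("-"):
        # One digit for the major version, the rest for the minor version.
        match = re.fullmatch(r"(py|pypy)(\d)(\d+)", part)
        if match is None:
            continue
        prefix, major, minor = match.groups()
        base = "python" if prefix == "py" else "pypy"
        return f"{base}{major}.{minor}"
    return None

print(guess_interpreter("py36"))    # python3.6
print(guess_interpreter("pypy38"))  # pypy3.8
print(guess_interpreter("docs"))    # None
```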

If the name does not include one of the supported short names, or if there is a need to override the default, a basepython field in the section can be used to indicate a specific Python version. By default, tox looks for Python available in the path. However, if the plug-in tox-pyenv is installed in the virtual environment that tox itself is installed in, tox will query pyenv if it cannot find the right Python on the path.
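To show how sections and the basepython field fit together, the sketch below parses a small configuration with Python's standard configparser module. The environment names, commands, and Python version are hypothetical, and this is only a way to inspect the file's structure, not how tox itself reads it.

```python
import configparser

# A hypothetical tox.ini, embedded as a string so the sketch is runnable.
TOX_INI = """
[tox]
envlist = py36, docs

[testenv:py36]
commands = pytest

[testenv:docs]
basepython = python3.9
commands = sphinx-build docs build/docs
"""

parser = configparser.ConfigParser()
parser.read_string(TOX_INI)

# Each [testenv:NAME] section describes one environment. Collect the
# explicit basepython override for each, or None when there is none.
envs = {}
for section in parser.sections():
    if section.startswith("testenv:"):
        name = section.split(":", 1)[1]
        envs[name] = parser[section].get("basepython")

print(envs)
```

Here py36 has no override, so its interpreter would come from the name, while docs has no recognizable short name and needs the explicit basepython field.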


Friday, October 21, 2022

Execution and utility modules

For historical reasons, execution modules go in the _modules subdirectory of the file roots. Similar to state modules, they are synchronized when state.highstate is applied and when explicitly synchronized via saltutil.sync_all.

As an example, let’s write an execution module to delete several files to simplify the state module.

def multiremove(files):
    for fname in files:
        __salt__['file.remove'](fname)

Note that __salt__ is usable in execution modules as well. However, while it can cross-call other execution modules (in this example, file), it cannot cross-call into state modules.

You put this code in _modules/multifile.py, and you can change the state module to have

__salt__['multifile.multiremove'](mean_files)

instead of

for fname in mean_files:
    __salt__['file.remove'](fname)

Execution modules are often simpler than state modules, as in this example. In this toy example, the execution module barely does anything except coordinate calls to other execution modules.

This is not completely atypical, however. Salt has so much logic for managing machines that all an execution module often has to do is coordinate calls to other execution modules. 
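Because __salt__ is injected by Salt at load time, a module like this cannot be run directly. One way to try the logic outside Salt is to substitute a stand-in dictionary, as in this sketch. The stub and the file paths are assumptions made for the illustration; real code would rely on the dictionary Salt provides.

```python
# Stand-in for the __salt__ dictionary that Salt injects into modules.
# Instead of deleting anything, the stub just records each file name.
removed = []
__salt__ = {"file.remove": removed.append}

def multiremove(files):
    # Delegate each removal to the file execution module.
    for fname in files:
        __salt__["file.remove"](fname)

# Hypothetical paths, never touched on disk.
multiremove(["/tmp/mean1", "/tmp/mean2"])
print(removed)
```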

Utility

When writing several execution or state modules, sometimes there is common code that can be factored out.

This code can sit in utility modules under the _utils directory of the file roots. It is available through the __utils__ dictionary.

As an example, you can factor out the calculation of the return value in the state module.

def return_value(name, old_files):
    if len(old_files) == 0:
        comment = "No changes made"
        result = True
    elif __opts__['test']:
        comment = f"{name} will be changed"
        result = None
    else:
        comment = f"{name} has been changed"
        result = True
    changes = dict(old=old_files, new=[])
    return dict(
        name=name,
        comment=comment,
        result=result,
        changes=changes,
    )

You get a simpler state module if you use the execution module and the utility modules.

def enforce_no_mean_files(name):
    mean_files = __salt__['file.find'](name, path="*mean*")
    if len(mean_files) == 0 or __opts__['test']:
        return __utils__['removal.return_value'](name, mean_files)
    __salt__['multifile.multiremove'](mean_files)
    return __utils__['removal.return_value'](name, mean_files)

In this case, you could have written the function as a regular function in the state module. It is placed in a utility module here only to show how to call functions in utility modules.
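The three outcomes of the utility function are easiest to see by calling it directly. The sketch below copies return_value and replaces the __opts__ dictionary, which Salt normally injects, with a hand-made stand-in; the directory and file names are hypothetical.

```python
# Stand-in for the __opts__ dictionary that Salt injects into modules.
__opts__ = {"test": True}

def return_value(name, old_files):
    if len(old_files) == 0:
        comment = "No changes made"
        result = True
    elif __opts__["test"]:
        comment = f"{name} will be changed"
        result = None       # None signals "would change" in test mode
    else:
        comment = f"{name} has been changed"
        result = True
    changes = dict(old=old_files, new=[])
    return dict(name=name, comment=comment, result=result, changes=changes)

# Nothing to remove: already converged, regardless of test mode.
converged = return_value("/srv/data", [])

# Files found while in test mode: report what would happen.
dry_run = return_value("/srv/data", ["/srv/data/mean.txt"])

# Same files with test mode off: report what was done.
__opts__["test"] = False
applied = return_value("/srv/data", ["/srv/data/mean.txt"])

print(converged["comment"], dry_run["comment"], applied["comment"], sep="\n")
```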

Sometimes it is useful to have third-party dependencies, especially when writing new state and execution modules. This is straightforward to do when installing a minion. You just make sure to install the minion in a virtual environment with those third-party dependencies.

When using Salt with SSH, this is significantly less trivial. In that case, it is sometimes best to bootstrap from SSH to a real minion. One way to achieve that is to have a persistent state in the SSH minion directory and have the installation of the minion set a grain of completely_disable in the SSH minion. This would ensure that the SSH configuration does not crosstalk with the regular minion configuration.


Monday, October 17, 2022

Salt Extensions

Since Salt is written in Python, it is fully extensible in Python. The easiest way to extend Salt is to put files in the file_roots directory on the Salt master; unfortunately, there is no package manager for Salt extensions yet. Those files are automatically synchronized to the minions, either when running state.apply or when explicitly running saltutil.sync_states. The latter is useful if you want to test, for example, a dry run of the state with the modified modules but without causing any changes.

States

State modules go in the _states subdirectory under the root directory for the environment. If you want to share state modules between environments, it is possible to make a custom root and share that root between the right environments.

The following is an example of a module that ensures there are no files that have the name mean in them under a specific directory. It is probably not very useful, although making sure that unneeded files are not there could be important. For example, you might want to enforce no .git directories.

def enforce_no_mean_files(name):
    mean_files = __salt__['file.find'](name, path="*mean*")
    # ...continues below...

The name of the function maps to the name of the state in the SLS state file. If you put this code in mean.py, the appropriate way to address this state would be mean.enforce_no_mean_files.

The right way to find files or do anything in a Salt state extension is to call Salt executors. In most non-toy examples, this means writing a matching pair: a Salt executor extension and a Salt state extension.

Since you want to progress one thing at a time, you use a prewritten Salt executor: the file module, which has the find function.

def enforce_no_mean_files(name):
    # ...continued...
    if mean_files == []:
        return dict(
            name=name,
            result=True,
            comment='No mean files detected',
            changes={},
        )
    # ...continues below...

One of the things the state module is responsible for, and often the most important thing, is doing nothing if the state is already achieved. This is what being a convergence loop is all about—optimizing to achieve convergence.

def enforce_no_mean_files(name):
    # ...continued...
    changes = dict(
        old=mean_files,
        new=[],
    )
    # ...continues below...

You now know what the changes are going to be. Calculating it here means you can guarantee consistency between the responses in the test vs. non-test mode.

def enforce_no_mean_files(name):
    # ...continued...
    if __opts__['test']:
        return dict(
            name=name,
            result=None,
            comment=f"The state of {name} will be changed",
            changes=changes,
        )
    # ...continues below...

The next important responsibility is to support the test mode. It is considered a best practice to always test before applying a state. You want to clearly articulate the changes that this module would make if applied.

def enforce_no_mean_files(name):
    # ...continued...
    for fname in mean_files:
        __salt__['file.remove'](fname)
    # ...continues below...

In general, you should only be calling one function from the execution module that matches the state module. Since you are using file as the execution module in this example, you call the remove function in a loop.

def enforce_no_mean_files(name):
    # ...continued...
    return dict(
        name=name,
        changes=changes,
        result=True,
        comment=f"The state of {name} was changed",
    )

Finally, you return a dictionary with the same changes as those documented in the test mode but with a comment indicating that these have already run.

This is the typical structure of a state module: one (or more) functions that accept a name (and possibly more arguments) and then return a result. The structure of checking if changes are needed and whether you are in test mode, and then performing the changes is also typical.
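Putting the fragments together gives the sketch below. To make it runnable outside Salt, __salt__ and __opts__ are replaced with stand-ins: the find stub simply returns a fixed, hypothetical list and accepts the same arguments the fragments pass, without reproducing the real file.find signature, and the remove stub only records what it was asked to delete.

```python
# Stand-ins for what Salt injects; the paths are hypothetical.
found = ["/srv/data/mean.txt"]
removed = []
__salt__ = {
    "file.find": lambda directory, path=None: list(found),
    "file.remove": removed.append,
}
__opts__ = {"test": True}

def enforce_no_mean_files(name):
    mean_files = __salt__["file.find"](name, path="*mean*")
    # 1. Already converged: do nothing.
    if not mean_files:
        return dict(name=name, result=True,
                    comment="No mean files detected", changes={})
    # 2. Compute the changes once, so both modes report the same thing.
    changes = dict(old=mean_files, new=[])
    # 3. Test mode: describe the changes without making them.
    if __opts__["test"]:
        return dict(name=name, result=None, changes=changes,
                    comment=f"The state of {name} will be changed")
    # 4. Apply the changes and report them.
    for fname in mean_files:
        __salt__["file.remove"](fname)
    return dict(name=name, changes=changes, result=True,
                comment=f"The state of {name} was changed")

dry_run = enforce_no_mean_files("/srv/data")
removed_after_dry_run = list(removed)

__opts__["test"] = False
applied = enforce_no_mean_files("/srv/data")
print(dry_run["comment"], applied["comment"], sep="\n")
```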


Wednesday, October 12, 2022

py renderer

You indicate that a file should be parsed with the py renderer by putting #!py at the top.

In that case, the file is interpreted as a Python file. Salt looks for a run function, runs it, and treats the return value as the state.

When running, __grains__ and __pillar__ contain the grain and pillar data.

As an example, you can implement the same logic with a py renderer.

#!py

def run():
    if __grains__['os'] == 'CentOS':
        package_name = 'python-devel'
    elif __grains__['os'] == 'Debian':
        package_name = 'python-dev'
    else:
        raise ValueError("Unrecognized operating system",
                         __grains__['os'])
    return { package_name: dict(pkg='installed') }

Since the py renderer is not a combination of two unrelated parsers, mistakes are sometimes easier to diagnose.

You get the following if you reintroduce the bug from the first version.

#!py

def run():
    if __grains__['os'] == 'CentOS':
        package_name = 'python-devel'
    elif __grains__['os'] == 'Debian':
        package_name = 'python-dev'
    return { package_name: dict(pkg='installed') }

In this case, the result is a NameError pinpointing the erroneous line and the missing name.
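The failure can be reproduced in plain Python by supplying a stand-in for __grains__, which Salt normally injects. The grain values below are hypothetical. In plain Python the exception raised is UnboundLocalError, which is a subclass of NameError.

```python
def run():
    # Buggy version: no else branch, so package_name may never be set.
    if __grains__["os"] == "CentOS":
        package_name = "python-devel"
    elif __grains__["os"] == "Debian":
        package_name = "python-dev"
    return {package_name: dict(pkg="installed")}

# Stand-in for the grains dictionary, with a recognized OS.
__grains__ = {"os": "Debian"}
ok = run()

# An unrecognized OS leaves package_name unassigned.
__grains__ = {"os": "FreeBSD"}
error = None
try:
    run()
except NameError as exc:
    error = exc

print(ok)
print(type(error).__name__, error)
```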

The trade-off is that reading it in YAML form is more straightforward if the configuration is big and mostly static.
