Tuesday, November 26, 2019

NumPy - Statistics in Python

NumPy is the foundation for scientific computing in Python, and understanding it is a fundamental step toward mastering data analysis. Once you are comfortable with NumPy, you can build on it with other libraries such as Pandas.

Once you learn the basics of NumPy, you can advance into data analytics, using linear algebra and statistics to analyze data. These are two of the most important mathematical areas for any data analyst. During data analysis, you will often need to summarize raw data or make predictions from it. For example, you might be asked to report the standard deviation or arithmetic mean of a data set.
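
As a small illustration of those two statistics, here is a minimal sketch using np.mean() and np.std(). The sample values are made up for this example and do not come from any real data set.

```python
import numpy as np

# Hypothetical sample values, invented purely for illustration.
samples = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = np.mean(samples)               # arithmetic mean
std_population = np.std(samples)      # population standard deviation (ddof=0 is the default)
std_sample = np.std(samples, ddof=1)  # sample standard deviation (divides by n - 1)

print(mean)            # 5.0
print(std_population)  # 2.0
print(std_sample)
```

Note that np.std() divides by n by default. Pass ddof=1 if you want the sample standard deviation instead.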

In linear algebra, the emphasis is on solving systems of linear equations, which you can do with NumPy and SciPy. Mastering the NumPy basics lets you build on what you already know and perform more complex operations in Python.
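
To make this concrete, here is a minimal sketch that solves a small system of linear equations with np.linalg.solve(). The system itself (2x + y = 5 and x + 3y = 10) is invented for illustration, and only NumPy is used here, not SciPy.

```python
import numpy as np

# Hypothetical system, chosen only for illustration:
#   2x +  y = 5
#    x + 3y = 10
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])   # coefficient matrix
b = np.array([5.0, 10.0])    # right-hand side

solution = np.linalg.solve(A, b)  # solves A @ solution = b

# Check the result by substituting it back into the system.
print(np.allclose(A @ solution, b))  # True
```

The solution here is x = 1 and y = 3, which you can confirm by hand.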

Another thing to learn in NumPy is file I/O. Most of the data you analyze is read from files, so it is important to learn the basic operations for reading and writing them. One benefit of NumPy is that all the items in an array share the same type. Because of this, you can easily work out how much storage the array needs.
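
Here is a rough sketch of basic read and write operations, assuming you want both a plain-text file and NumPy's binary .npy format. The file names and the temporary directory are placeholders chosen for this example.

```python
import os
import tempfile

import numpy as np

data = np.array([[1.5, 2.5],
                 [3.5, 4.5]])

# A temporary directory keeps the example from leaving files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    text_path = os.path.join(tmp_dir, "data.txt")    # hypothetical file name
    binary_path = os.path.join(tmp_dir, "data.npy")  # hypothetical file name

    np.savetxt(text_path, data)        # write as human-readable text
    from_text = np.loadtxt(text_path)  # read the text file back

    np.save(binary_path, data)         # write in NumPy's binary format
    from_binary = np.load(binary_path) # read the binary file back

print(np.array_equal(from_text, data))    # True
print(np.array_equal(from_binary, data))  # True
```

The text format is easy to inspect in an editor, while the .npy format also preserves the dtype and shape of the array.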

Once you have NumPy installed, import the package into a new Python session as follows:

import numpy as np

As you work with NumPy, you will realize that most of what you do is built around the N-dimensional array, commonly called the ndarray. An ndarray is a multidimensional array that holds a fixed number of items, set when the array is created.

The ndarray is also homogeneous, meaning that all the items in the array have the same size and type. That type is described by a separate object, the data type (dtype). With this in mind, each ndarray is always associated with exactly one dtype.
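
A quick way to see what "one dtype per array" means in practice is to mix integers and floats. The sketch below uses made-up values and shows how NumPy settles on a single type for the whole array.

```python
import numpy as np

# Mixing integers and a float: NumPy picks one dtype that can hold every item.
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)  # float64

# An integer array keeps its integer dtype, so a float assigned
# into it is converted to an integer (the fractional part is dropped).
ints = np.array([1, 2, 3])
ints[0] = 9.7
print(ints[0])      # 9
```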

The items in an array are arranged along one or more dimensions. The number of dimensions and the number of items along each one define the shape of the array. The dimensions are called axes, and the number of axes is called the rank.

To create a new array, pass a Python list containing the elements to the array() function, as shown below:

>>> x = np.array([5, 7, 9])
>>> x
array([5, 7, 9])

To check that the object you just created is indeed an ndarray, pass it to the type() function as shown below:

>>> type(x)
<class 'numpy.ndarray'>

Every ndarray has an associated dtype. To see it, read the dtype attribute (the exact integer type, such as int32 or int64, depends on your platform and NumPy version):

>>> x.dtype
dtype('int32')
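
If you do not want to rely on the default integer type, you can request a dtype explicitly. This is a small sketch that reuses the same three values and assumes you want to compare 32-bit and 64-bit integers.

```python
import numpy as np

# Request the dtype explicitly instead of relying on the platform default.
x32 = np.array([5, 7, 9], dtype=np.int32)
x64 = np.array([5, 7, 9], dtype=np.int64)

print(x32.dtype, x32.itemsize)  # int32 4
print(x64.dtype, x64.itemsize)  # int64 8

# astype() returns a converted copy; the original array is unchanged.
as_float = x32.astype(np.float64)
print(as_float.dtype)           # float64
```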

The array above has only one axis, so its rank is 1. Its shape is (3,), a tuple with a single entry because there is a single axis holding three items. How do you determine these values from the array? Use the ndim attribute to get the number of axes, the size attribute to get the total number of items in the array, and the shape attribute to get the shape of the array, as shown below:

>>> x.ndim
1
>>> x.size
3
>>> x.shape
(3,)
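
The three attributes are related: ndim is the length of the shape tuple, and size is the product of its entries. The sketch below checks this, and also shows that a one-dimensional array of three items is not the same thing as a 3-by-1 two-dimensional array. The reshape() call is used here only for illustration.

```python
import numpy as np

x = np.array([5, 7, 9])

# ndim is the number of entries in shape; size is their product.
print(x.ndim == len(x.shape))      # True
print(x.size == np.prod(x.shape))  # True

# Reshaping into a column gives a different, two-dimensional array.
column = x.reshape(3, 1)
print(column.ndim)   # 2
print(column.shape)  # (3, 1)
print(column.size)   # 3
```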

In the examples above, we have been working with a one-dimensional array. As you proceed in data analysis, you will come across arrays that have more than one dimension. Let's use a two-dimensional example to explain this further.

>>> y = np.array([[12.3, 22.4],[20.3, 24.1]])
>>> y.dtype
dtype('float64')
>>> y.ndim
2
>>> y.size
4
>>> y.shape
(2, 2)

This array has two axes, so its rank is 2, and the length of each axis is 2. The itemsize attribute tells us the size in bytes of each item in the array, and the data attribute is the buffer that holds the array's actual elements, as shown in the example below:

>>> y.itemsize
8
>>> y.data
<memory at 0x0000000003D5FEA0>
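
Because every item has the same size, the storage needed for the elements is simply the number of items times itemsize. Here is a minimal sketch for the same array, assuming Python 3, where the data attribute is a memoryview.

```python
import numpy as np

y = np.array([[12.3, 22.4], [20.3, 24.1]])

# 4 items of 8 bytes each.
total_bytes = y.size * y.itemsize
print(total_bytes)    # 32

# nbytes reports the same figure directly.
print(y.nbytes)       # 32

# The underlying buffer has the same length (Python 3 memoryview).
print(y.data.nbytes)  # 32
```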

In the next post, we'll learn about the different ways of creating arrays.