Monday, January 31, 2022

Combining biotechnology and machine learning

In recent years, scientific advances in biotechnology, boosted by applications of machine learning and other predictive technologies, have led to many major accomplishments, such as the discovery of novel treatments, faster and more accurate diagnostic tests, greener manufacturing methods, and much more. There are countless areas where machine learning can be applied within the biotechnology sector; however, they can be narrowed down to three general categories:

• Science and Innovation: All things related to the research and development of products.

• Business and Operations: All things related to processes that bring products to market.

• Patients and Human Health: All things related to patient health and consumers.

These three categories essentially form a product pipeline that begins with science and innovation, where products are conceived; moves to business and operations, where the product is manufactured, packaged, and marketed; and ends with the patients and consumers who use the products.

Let's take a look at a few examples of applications of machine learning as they relate to these areas:


The figure above illustrates the development of a product, highlighting areas where AI can be applied. Throughout the life cycle of a given product or therapy, there are numerous areas where machine learning can be applied – the only limitation is the existence of data to support the development of a new model.

Within the scope of science and innovation, there have been significant advances when it comes to predicting molecular properties, generating molecular structures to suit specific therapeutic targets, and even sequencing genes for advanced diagnostics. In each of these examples, AI has been – and continues to be – useful in aiding and accelerating the research and development of novel products.

Within the scope of business and operations, there are many examples of AI being used to improve processes, such as intelligently manufacturing materials to reduce waste, applying natural language processing to extract insights from scientific literature, or forecasting demand to improve supply chain processes. In each of these examples, AI has been crucial in reducing costs and increasing efficiency.

Finally, when it comes to patients and health, AI has proven pivotal in recruiting for and shaping clinical trials, developing recommendation engines designed to avoid drug interactions, and even speeding up diagnoses given a patient's symptoms. In each of these applications, data was obtained, used to generate a model, and then validated.

The applications of AI we have observed thus far are only a few examples of the areas where powerful predictive models can be applied. In almost every process throughout the cycle where data is available, a model can be prepared in some way, shape, or form. As we begin to explore the development of many of these models in various areas throughout this process, we will need a few software-based tools to help us. We will get to know these tools in the next post.


Friday, January 28, 2022

Arithmetic with NumPy Arrays

Arrays are important because they enable you to express batch operations on data without writing any for loops. NumPy users call this vectorization. Any arithmetic operation between equal-size arrays applies the operation element-wise:

In [51]: arr = np.array([[1., 2., 3.], [4., 5., 6.]])

In [52]: arr

Out[52]:

array([[ 1., 2., 3.],

[ 4., 5., 6.]])

In [53]: arr * arr

Out[53]:

array([[ 1., 4., 9.],

[ 16., 25., 36.]])

In [54]: arr - arr

Out[54]:

array([[ 0., 0., 0.],

[ 0., 0., 0.]])

Arithmetic operations with scalars propagate the scalar argument to each element in the array:

In [55]: 1 / arr

Out[55]:

array([[ 1. , 0.5 , 0.3333],

[ 0.25 , 0.2 , 0.1667]])

In [56]: arr ** 0.5

Out[56]:

array([[ 1. , 1.4142, 1.7321],

[ 2. , 2.2361, 2.4495]])

Comparisons between arrays of the same size yield boolean arrays:

In [57]: arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])

In [58]: arr2

Out[58]:

array([[ 0., 4., 1.],

[ 7., 2., 12.]])

In [59]: arr2 > arr

Out[59]:

array([[False, True, False],

[ True, False, True]], dtype=bool)

Arithmetic between differently sized arrays is called broadcasting. Broadcasting describes how arithmetic works between arrays of different shapes. It can be a powerful feature, but one that can cause confusion, even for experienced users. The simplest example of broadcasting occurs when combining a scalar value with an array:

In [79]: arr = np.arange(5)

In [80]: arr

Out[80]: array([0, 1, 2, 3, 4])

In [81]: arr * 4

Out[81]: array([ 0, 4, 8, 12, 16])

Here we say that the scalar value 4 has been broadcast to all of the other elements in the multiplication operation. Broadcasting is not limited to scalars; for example, we can demean each column of an array by subtracting the column means. In this case, it is very simple:

In [82]: arr = np.random.randn(4, 3)

In [83]: arr.mean(0)

Out[83]: array([-0.3928, -0.3824, -0.8768])

In [84]: demeaned = arr - arr.mean(0)

In [85]: demeaned

Out[85]:

array([[ 0.3937, 1.7263, 0.1633],

[-0.4384, -1.9878, -0.9839],

[-0.468 , 0.9426, -0.3891],

[ 0.5126, -0.6811, 1.2097]])

In [86]: demeaned.mean(0)

Out[86]: array([-0., 0., -0.])

Demeaning the rows as a broadcast operation requires a bit more care. Fortunately, broadcasting potentially lower dimensional values across any dimension of an array (like subtracting the row means from each column of a two-dimensional array) is possible as long as you follow the rules.

This brings us to: The Broadcasting Rule

Two arrays are compatible for broadcasting if for each trailing dimension (i.e., starting from the end) the axis lengths match or if either of the lengths is 1. Broadcasting is then performed over the missing or length 1 dimensions.
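As a quick sketch of the rule (the shapes here are hypothetical, chosen only to illustrate which combinations work):

a = np.ones((4, 3))
a + np.ones(3)         # OK: trailing axis lengths 3 and 3 match
a + np.ones((4, 1))    # OK: lengths 3 and 1 -> the length-1 axis is broadcast
# a + np.ones(4)       # would raise ValueError: trailing lengths 3 and 4 don't match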


Even as an experienced NumPy user, I often find myself having to pause and draw a diagram as I think about the broadcasting rule. Consider the last example and suppose we wished instead to subtract the mean value from each row. Since arr.mean(0) has length 3, it is compatible for broadcasting across axis 0 because the trailing dimension in arr is 3 and therefore matches. According to the rules, to subtract over axis 1 (i.e., subtract the row mean from each row), the smaller array must have shape (4, 1):

In [87]: arr
Out[87]:
array([[ 0.0009, 1.3438, -0.7135],
[-0.8312, -2.3702, -1.8608],
[-0.8608, 0.5601, -1.2659],
[ 0.1198, -1.0635, 0.3329]])
In [88]: row_means = arr.mean(1)
In [89]: row_means.shape
Out[89]: (4,)
In [90]: row_means.reshape((4, 1))
Out[90]:
array([[ 0.2104],
[-1.6874],
[-0.5222],
[-0.2036]])
In [91]: demeaned = arr - row_means.reshape((4, 1))
In [92]: demeaned.mean(1)
Out[92]: array([ 0., -0., 0., 0.])

See the Figure below for an illustration of this operation.


The next topic, basic indexing and slicing, is very interesting and highly important. See you then!







Thursday, January 27, 2022

The NumPy ndarray: A Multidimensional Array Object

One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.

To give you a flavor of how NumPy enables batch computations with similar syntax to scalar values on built-in Python objects, I first import NumPy and generate a small array of random data:

In [12]: import numpy as np

# Generate some random data

In [13]: data = np.random.randn(2, 3)

In [14]: data

Out[14]:

array([[-0.2047, 0.4789, -0.5194],

[-0.5557, 1.9658, 1.3934]])

I then write mathematical operations with data:

In [15]: data * 10

Out[15]:

array([[ -2.0471, 4.7894, -5.1944],

[ -5.5573, 19.6578, 13.9341]])

In [16]: data + data

Out[16]:

array([[-0.4094, 0.9579, -1.0389],

[-1.1115, 3.9316, 2.7868]])

In the first example, all of the elements have been multiplied by 10. In the second, the corresponding values in each “cell” in the array have been added to each other.

An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array:

In [17]: data.shape

Out[17]: (2, 3)

In [18]: data.dtype

Out[18]: dtype('float64')

We have already discussed how to create an ndarray. As a quick revision, here are some of the array creation functions:



Data Types for ndarrays

The data type or dtype is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data:

In [33]: arr1 = np.array([1, 2, 3], dtype=np.float64)

In [34]: arr2 = np.array([1, 2, 3], dtype=np.int32)

In [35]: arr1.dtype

Out[35]: dtype('float64')

In [36]: arr2.dtype

Out[36]: dtype('int32')

dtypes are a source of NumPy’s flexibility for interacting with data coming from other systems. In most cases they provide a mapping directly onto an underlying disk or memory representation, which makes it easy to read and write binary streams of data to disk and also to connect to code written in a low-level language like C or Fortran. The numerical dtypes are named the same way: a type name, like float or int, followed by a number indicating the number of bits per element. A standard double precision floating-point value (what’s used under the hood in Python’s float object) takes up 8 bytes or 64 bits. Thus, this type is known in NumPy as float64. See Table below for a full listing of NumPy’s supported data types.


You can explicitly convert or cast an array from one dtype to another using ndarray’s astype method:

In [37]: arr = np.array([1, 2, 3, 4, 5])

In [38]: arr.dtype

Out[38]: dtype('int64')

In [39]: float_arr = arr.astype(np.float64)

In [40]: float_arr.dtype

Out[40]: dtype('float64')

In this example, integers were cast to floating point. If I cast some floating-point numbers to be of integer dtype, the decimal part will be truncated:

In [41]: arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

In [42]: arr

Out[42]: array([ 3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

In [43]: arr.astype(np.int32)

Out[43]: array([ 3, -1, -2, 0, 12, 10], dtype=int32)

If you have an array of strings representing numbers, you can use astype to convert them to numeric form:

In [44]: numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.string_)

In [45]: numeric_strings.astype(float)

Out[45]: array([ 1.25, -9.6 , 42. ])

If casting were to fail for some reason (like a string that cannot be converted to float64), a ValueError will be raised. Here I was a bit lazy and wrote float instead of np.float64; NumPy aliases the Python types to its own equivalent dtypes.
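As a small illustrative sketch (the array contents below are made up, not taken from the examples above), the ValueError can be seen by trying to cast a string that isn't numeric:

bad_strings = np.array(['1.25', 'not a number'], dtype=np.string_)
try:
    bad_strings.astype(np.float64)
except ValueError as err:
    print('casting failed:', err)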

You can also use another array’s dtype attribute:

In [46]: int_array = np.arange(10)

In [47]: calibers = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64)

In [48]: int_array.astype(calibers.dtype)

Out[48]: array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

There are shorthand type code strings you can also use to refer to a dtype:

In [49]: empty_uint32 = np.empty(8, dtype='u4')

In [50]: empty_uint32

Out[50]:

array([ 0, 1075314688, 0, 1075707904, 0,

1075838976, 0, 1072693248], dtype=uint32)

In the next post we will discuss arithmetic operations with NumPy arrays. Keep practicing and revising.


Wednesday, January 26, 2022

Removing Items from a NumPy Array

To delete an item from an array, you may use the delete() method. You need to pass the existing array and the index of the item to be deleted to the delete() method. The following script deletes an item at index 1 (second item) from the my_array array.

my_array = np.array(["Red", "Green", "Orange"])

print(my_array)

print("After deletion")

updated_array = np.delete(my_array, 1)

print(updated_array)

The output shows that the item at index 1, i.e., “Green,” is deleted. 

Output:

['Red' 'Green' 'Orange']

After deletion

['Red' 'Orange']

If you want to delete multiple items from an array, you can pass the item indexes in the form of a list to the delete() method. For example, the following script deletes the items at index 1 and 2 from the NumPy array named my_array.

my_array = np.array(["Red", "Green", "Orange"])

print(my_array)

print("After deletion")

updated_array = np.delete(my_array, [1,2])

print(updated_array)

Output:

['Red' 'Green' 'Orange']

After deletion

['Red']

You can delete a row or column from a 2-D array using the delete method. However, just as you did with the append() method for adding items, you need to specify whether you want to delete a row or column using the axis attribute.

The following script creates an integer array with four rows and five columns. Next, the delete() method is used to delete the row at index 1 (the second row). Notice here that to delete a row, the value of the axis attribute is set to 0.

integer_random = np.random.randint(1,11, size=(4, 5))

print(integer_random)

print("After deletion")

updated_array = np.delete(integer_random, 1, axis = 0)

print(updated_array)

The output shows that the second row is deleted from the input 2-D array.

Output:

[[ 2 3 3 3 4]

[ 1 7 6 7 10]

[ 7 1 6 6 8]

[ 3 7 8 10 7]]

After deletion

[[ 2 3 3 3 4]

[ 7 1 6 6 8]

[ 3 7 8 10 7]]

Finally, to delete a column, you can set the value of the axis attribute to 1, as shown below:

print(integer_random)

print("After deletion")

updated_array = np.delete(integer_random, 1, axis = 1)

print(updated_array)

The output shows that the column at index 1 (the second column) is deleted from the input 2-D array.

Output:

[[ 9 10 10 5 5]

[ 5 1 2 4 2]

[ 5 1 3 7 8]

[ 5 1 8 2 5]]

After deletion

[[ 9 10 5 5]

[ 5 2 4 2]

[ 5 3 7 8]

[ 5 8 2 5]]

Next we will talk about the N-dimensional array object, or ndarray. Till then, keep practicing and revising.


Tuesday, January 25, 2022

Adding Items in a NumPy Array

To add items to a NumPy array, you can use the append() method from the NumPy module. You need to pass the original array and the item that you want to append to the append() method. The append() method returns a new array with the newly added items appended to the end of the original array.

The following script adds a text item “Yellow” to an existing array with three items.

my_array = np.array(["Red", "Green", "Orange"])

print(my_array)

extended = np.append(my_array, "Yellow")

print(extended)

Output:

['Red' 'Green' 'Orange']

['Red' 'Green' 'Orange' 'Yellow']

In addition to adding one item at a time, you can also append an array of items to an existing array. The method remains similar to appending a single item. You just have to pass the existing array and the new array to the append() method, which returns a concatenated array where items from the new array are appended at the end of the original array.

my_array = np.array(["Red", "Green", "Orange"])

print(my_array)

extended = np.append(my_array, ["Yellow", "Pink"])

print(extended)

Output:

['Red' 'Green' 'Orange']

['Red' 'Green' 'Orange' 'Yellow' 'Pink']

To add items to a two-dimensional NumPy array, you have to specify whether you want to add the new item as a row or as a column. To do so, you can use the axis attribute of the append() method.

Let’s first create a 3 x 3 array of all zeros.

zeros_array = np.zeros((3,3))

print(zeros_array)

The output shows a 3 x 3 array of all zeros.

Output:

[[0. 0. 0.]

[0. 0. 0.]

[0. 0. 0.]]

To add a new row to the above 3 x 3 array, you need to pass the original array, the new array in the form of a row vector, and the axis attribute to the append() method. To add the new array in the form of a row, you need to set 0 as the value for the axis attribute.

Here is an example script.

zeros_array = np.zeros((3,3))

print(zeros_array)

print("Extended Array")

extended = np.append(zeros_array, [[1, 2, 3]], axis = 0)

print(extended)

In the output below, you can see that a new row has been appended to our original 3 x 3 array of all zeros.

Output:

[[0. 0. 0.]

[0. 0. 0.]

[0. 0. 0.]]

Extended Array

[[0. 0. 0.]

[0. 0. 0.]

[0. 0. 0.]

[1. 2. 3.]]

To append a new array as a column in the existing 2-D array, you need to set the value of the axis attribute to 1.

zeros_array = np.zeros((3,3))

print(zeros_array)

print("Extended Array")

extended = np.append(zeros_array, [[1],[2],[3]], axis = 1)

print(extended)

Output:

[[0. 0. 0.]

[0. 0. 0.]

[0. 0. 0.]]

Extended Array

[[0. 0. 0. 1.]

[0. 0. 0. 2.]

[0. 0. 0. 3.]]

The topic of discussion for our next post will be removing items from a NumPy array. Don't forget to practice and revise whatever we have covered so far.


Sunday, January 23, 2022

Printing NumPy Arrays

Depending on the dimensions, there are various ways to display the NumPy arrays. The simplest way to print a NumPy array is to pass the array to the print method, as you have already seen in the previous posts. An example is given below: 

my_array = np.array([10,12,14,16,20,25])

print(my_array)

Output:

[10 12 14 16 20 25]

You can also use loops to display items in a NumPy array. It is a good idea to know the dimensions of a NumPy array before printing the array on the console. To see the dimensions of a NumPy array, you can use the ndim attribute, which prints the number of dimensions for a NumPy array. To see the shape of your NumPy array, you can use the shape attribute.

print(my_array.ndim)

print(my_array.shape)

The script shows that our array is one-dimensional. The shape is (6,), which means our array is a vector with 6 items.

Output:

1

(6,)

To print items in a one-dimensional NumPy array, you can use a single for loop, as shown below:

for i in my_array:
    print(i)

Output:

10

12

14

16

20

25

Now, let’s see another example of how you can use the for loop to print items in a two-dimensional NumPy array.

The following script creates a two-dimensional NumPy array with four rows and five columns. The array contains random integers between 1 and 10. The array is then printed on the console.

integer_random = np.random.randint(1,11, size=(4, 5))

print(integer_random)

In the output below, you can see your newly created array.

Output:

[[ 7 7 10 9 8]

[ 6 10 2 5 9]

[ 2 9 2 10 2]

[ 9 6 3 2 1]]

Let’s now try to see the number of dimensions and shape of our NumPy array.

print(integer_random.ndim)

print(integer_random.shape)

The output below shows that our array has two dimensions and the shape of the array is (4,5), which refers to four rows and five columns.

Output:

2

(4, 5)

To traverse through items in a two-dimensional NumPy array, you need two for loops: one for each row and the other for each column in the row.

Let’s first use one for loop to print items in our two-dimensional NumPy array.

for i in integer_random:
    print(i)

The output shows all the rows from our two-dimensional NumPy array.

Output:

[7 7 10 9 8]

[6 10 2 5 9]

[2 9 2 10 2]

[9 6 3 2 1]

To traverse through all the items in the two-dimensional array, you can use nested for loops, as follows:

for rows in integer_random:
    for column in rows:
        print(column)

Output:

7

7

10

9

8

6

10

2

5

9

2

9

2

10

2

9

6

3

2

1

In the next post, you will see how to add, remove, and sort elements in a NumPy array.


Saturday, January 22, 2022

Creating NumPy Arrays

Depending on the type of data you need inside your NumPy array, different methods can be used to create a NumPy array. 

1. Using Array Method - To create a NumPy array, you can pass a list to the array() method of the NumPy module, as shown below:

nums_list = [10,12,14,16,20]

nums_array = np.array(nums_list)

type(nums_array)

Output:

numpy.ndarray

You can also create a multi-dimensional NumPy array. To do so, you need to create a list of lists where each internal list corresponds to the row in a two dimensional array. Here is an example of how to create a two-dimensional array using the array() method.

row1 = [10,12,13]

row2 = [45,32,16]

row3 = [45,32,16]

nums_2d = np.array([row1, row2, row3])

nums_2d.shape

Output:

(3, 3)

2. Using Arange Method - With the arange() method, you can create a NumPy array that contains a range of integers. The first parameter to the arange() method is the lower bound, and the second parameter is the upper bound. The lower bound is included in the array. However, the upper bound is not included. The following script creates a NumPy array with integers 5 to 10.

nums_arr = np.arange(5,11)

print(nums_arr)

Output:

[5 6 7 8 9 10]

You can also specify the step as a third parameter in the arange() function. A step defines the distance between two consecutive points in the array. The following script creates a NumPy array from 5 to 11 with a step size of 2.

nums_arr = np.arange(5,12,2)

print(nums_arr)

Output:

[5 7 9 11]

3. Using Ones Method - The ones() method can be used to create a NumPy array of all ones. Here is an example.

ones_array = np.ones(6)

print(ones_array)

Output:

[1. 1. 1. 1. 1. 1.]

You can create a two-dimensional array of all ones by passing the number of rows and columns as the first and second parameters of the ones() method, as shown below:

ones_array = np.ones((6,4))

print(ones_array)

Output:

[[1. 1. 1. 1.]

[1. 1. 1. 1.]

[1. 1. 1. 1.]

[1. 1. 1. 1.]

[1. 1. 1. 1.]

[1. 1. 1. 1.]]

4. Using Zeros Method - The zeros() method can be used to create a NumPy array of all zeros. Here is an example.

zeros_array = np.zeros(6)

print(zeros_array)

Output:

[0. 0. 0. 0. 0. 0.]

You can create a two-dimensional array of all zeros by passing the number of rows and columns as the first and second parameters of the zeros() method, as shown below:

zeros_array = np.zeros((6,4))

print(zeros_array)

Output:

[[0. 0. 0. 0.]

[0. 0. 0. 0.]

[0. 0. 0. 0.]

[0. 0. 0. 0.]

[0. 0. 0. 0.]

[0. 0. 0. 0.]]

5. Using Eye Method - The eye() method is used to create an identity matrix in the form of a two-dimensional NumPy array. An identity matrix contains 1s along the diagonal, while the rest of the elements are 0 in the array.

eyes_array = np.eye(5)

print(eyes_array)

Output:

[[1. 0. 0. 0. 0.]

[0. 1. 0. 0. 0.]

[0. 0. 1. 0. 0.]

[0. 0. 0. 1. 0.]

[0. 0. 0. 0. 1.]]

6. Using Random Method - The random.rand() function from the NumPy module can be used to create a NumPy array with values drawn from a uniform distribution between 0 and 1.

uniform_random = np.random.rand(4, 5)

print(uniform_random)

Output:

[[0.36728531 0.25376281 0.05039624 0.96432236 0.08579293]

[0.29194804 0.93016399 0.88781312 0.50209692 0.63069239]

[0.99952044 0.44384871 0.46041845 0.10246553 0.53461098]

[0.75817916 0.36505441 0.01683344 0.9887365 0.21490949]]

The random.randn() function from the NumPy module can be used to create a NumPy array with values drawn from a standard normal distribution, as shown in the following example:

normal_random = np.random.randn(4, 5)

print(normal_random)

The output will contain values drawn from a standard normal distribution (mean 0, standard deviation 1); unlike the uniform example above, they are not confined to the range 0 to 1, and they will differ on every run.

Finally, the random.randint() function from the NumPy module can be used to create a NumPy array with random integers between a certain range. The first parameter to the randint() function specifies the lower bound, the second parameter specifies the upper bound (which is excluded), and the last parameter specifies the number of random integers to generate within that range. The following example generates five random integers between 10 and 49.

integer_random = np.random.randint(10, 50, 5)

print(integer_random)

Output:

[25 49 21 35 17]

In the next post we will discuss printing NumPy arrays.


Friday, January 21, 2022

NumPy Array 2

We can convert the data in a NumPy array to another data type via the astype() method. To do so, you need to specify the target data type in the astype() method.

For instance, the following script converts the array you created in the previous script (previous post) to the datetime data type. You can see that “M” is passed as a parameter value to the astype() function. “M” stands for the datetime data type, as mentioned earlier.

my_array3 = my_array.astype("M")

print(my_array3.dtype)

print(my_array3.dtype.itemsize)

Output:

datetime64[D]

8

In addition to converting arrays from one type to another, you can also specify the data type for a NumPy array at the time of definition via the dtype parameter.

For instance, in the following script, you specify “M” as the value for the dtype parameter, which tells the Python interpreter that the items must be stored as datetime values.

my_array = np.array(["1990-10-04", "1989-05-06", "1990-11-04"], dtype = "M")

print(my_array)

print(my_array.dtype)

print(my_array.dtype.itemsize)

Output:

['1990-10-04' '1989-05-06' '1990-11-04']

datetime64[D]

8

In the next post we will learn how to create NumPy arrays. Till then, keep practicing and perform some experiments on your own.


Thursday, January 20, 2022

NumPy Arrays

The main data structure in the NumPy library is the NumPy array, which is an extremely fast and memory-efficient data structure. The NumPy array is much faster than the common Python list and provides vectorized matrix operations. Let us see the different data types that you can store in a NumPy array, the different ways to create the NumPy arrays, how you can access items in a NumPy array, and how to add or remove items from a NumPy array.

The NumPy library supports all the default Python data types in addition to some of its intrinsic data types. Thus the default Python data types, e.g., strings, integers, floats, Booleans, and complex data types, can be stored in NumPy arrays.

You can check the data type in a NumPy array using the dtype property. In the coming posts, you will see the different ways of creating NumPy arrays in detail.

Here, we will show you the array() function and then print the type of the NumPy array using the dtype property. Here is an example:

import numpy as np

my_array = np.array([10,12,14,16,20,25])

print(my_array)

print(my_array.dtype)

print(my_array.dtype.itemsize)

The script above defines a NumPy array with six integers. Next, the array type is displayed via the dtype attribute. Finally, the size of each item in the array (in bytes) is displayed via the itemsize attribute.

The output below prints the array and the type of the items in the array, i.e., int32 (integer type), followed by the size of each item in the array, which is 4 bytes (32 bits). 

Output:

[10 12 14 16 20 25]

int32

4

The Python NumPy library supports the following data types including the default Python types.

• i – integer

• b – boolean

• u – unsigned integer

• f – float

• c – complex float

• m – timedelta

• M – datetime

• o – object

• S – string

• U – Unicode string

• V – fixed chunk of memory for other type ( void )

Next, let’s see another example of how Python stores text. The following script creates a NumPy array with three text items and displays the data type and size of each item.

import numpy as np

my_array = np.array(["Red", "Green", "Orange"])

print(my_array)

print(my_array.dtype)

print(my_array.dtype.itemsize)

The output below shows that NumPy stores text in the form of the Unicode string data type, denoted by U. Here, the digit 6 represents the number of characters in the longest item.

Output:

['Red' 'Green' 'Orange']

<U6

24

Though the NumPy array is usually intelligent enough to guess the data type of the items stored in it, the guess is not always what you want. For instance, in the following script, you store some dates in a NumPy array. Since the dates are stored in the form of text (enclosed in double quotations), by default, the NumPy array treats the dates as text. Hence, if you print the data type of the items stored, you will see that it is a Unicode string (U10). See the code below:

my_array = np.array(["1990-10-04", "1989-05-06", "1990-11-04"])

print(my_array)

print(my_array.dtype)

print(my_array.dtype.itemsize)

Output:

['1990-10-04' '1989-05-06' '1990-11-04']

<U10

40

We will continue to explore NumPy arrays in the next post. Till then, just play around with the code and perform some experiments on your own.


Inferential statistics

Inferential statistics deals with inferring or deducing things from the sample data we have in order to make statements about the population as a whole. When we're looking to state our conclusions, we have to be mindful of whether we conducted an observational study or an experiment. With an observational study, the independent variable is not under the control of the researchers, and so we are observing those taking part in our study (think about studies on smoking—we can't force people to smoke). The fact that we can't control the independent variable means that we cannot conclude causation.

With an experiment, we are able to directly influence the independent variable and randomly assign subjects to the control and test groups, such as A/B tests (for anything from website redesigns to ad copy). Note that the control group doesn't receive treatment; they can be given a placebo (depending on what the study is). The ideal setup for this is double-blind, where the researchers administering the treatment don't know which treatment is the placebo and also don't know which subject belongs to which group. 

Inferential statistics gives us tools to translate our understanding of the sample data into a statement about the population. Remember that the sample statistics we discussed earlier are estimators for the population parameters. Our estimators need confidence intervals, which provide a point estimate and a margin of error around it. This is the range that the true population parameter will be in at a certain confidence level. At the 95% confidence level, 95% of the confidence intervals that are calculated from random samples of the population contain the true population parameter. The 95% confidence level is chosen frequently in statistics, although 90% and 99% are also common; the higher the confidence level, the wider the interval.

Hypothesis tests allow us to test whether the true population parameter is less than, greater than, or not equal to some value at a certain significance level (called alpha). The process of performing a hypothesis test starts with stating our initial assumption or null hypothesis: for example, the true population mean is 0. We pick a level of statistical significance, usually 5%, which is the probability of rejecting the null hypothesis when it is true. Then, we calculate the critical value for the test statistic, which will depend on the amount of data we have and the type of statistic (such as the mean of one population or the proportion of votes for a candidate) we are testing. The critical value is compared to the test statistic from our data and, based on the result, we either reject or fail to reject the null hypothesis. Hypothesis tests are closely related to confidence intervals. The significance level is equivalent to 1 minus the confidence level. This means that a result is statistically significant if the null hypothesis value is not in the confidence interval.
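As a minimal sketch of this workflow (assuming SciPy is available; the sample data is made up), a one-sample t-test checks whether the true population mean could be 0. In practice, most tools report a p-value, which we compare against alpha rather than computing the critical value by hand:

import numpy as np
from scipy import stats

sample = np.random.randn(30) + 0.5                        # hypothetical sample data
t_stat, p_value = stats.ttest_1samp(sample, popmean=0)    # null hypothesis: population mean is 0
if p_value < 0.05:                                        # alpha = 5%
    print('reject the null hypothesis')
else:
    print('fail to reject the null hypothesis')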


Wednesday, January 19, 2022

Difference between the dotted and solid portions of the regression line

When we make predictions using the solid portion of the line, we are using interpolation, meaning that we will be predicting ice cream sales for temperatures the regression was created on. On the other hand, if we try to predict how many ice creams will be sold at 45°C, it is called extrapolation (the dotted portion of the line), since we didn't have any temperatures this high when we ran the regression. Extrapolation can be very dangerous as many trends don't continue indefinitely. People may decide not to leave their houses because it is so hot. This means that instead of selling the predicted 39.54 ice creams, they would sell zero.

When working with time series, our terminology is a little different: we often look to forecast future values based on past values. Forecasting is a type of prediction for time series. Before we try to model the time series, however, we will often use a process called time series decomposition to split the time series into components, which can be combined in an additive or multiplicative fashion and may be used as parts of a model.

The trend component describes the behavior of the time series in the long term without accounting for seasonal or cyclical effects. Using the trend, we can make broad statements about the time series in the long run, such as the population of Earth is increasing or the value of a stock is stagnating. The seasonality component explains the systematic and calendar-related movements of a time series. For example, the number of ice cream trucks on the streets of New York City is high in the summer and drops to nothing in the winter; this pattern repeats every year, regardless of whether the actual amount each summer is the same. Lastly, the cyclical component accounts for anything else unexplained or irregular with the time series; this could be something such as a hurricane driving the number of ice cream trucks down in the short term because it isn't safe to be outside. This component is difficult to anticipate with a forecast due to its unexpected nature.

We can use Python to decompose the time series into trend, seasonality, and noise or residuals. The cyclical component is captured in the noise (random, unpredictable data); after we remove the trend and seasonality from the time series, what we are left with is the residual:
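A minimal sketch of such a decomposition, assuming the statsmodels package and a hypothetical pandas Series called ice_cream_sales with a monthly DatetimeIndex:

from statsmodels.tsa.seasonal import seasonal_decompose

result = seasonal_decompose(ice_cream_sales, model='additive', period=12)
result.trend       # long-term behavior
result.seasonal    # repeating, calendar-related pattern
result.resid       # what is left over (the noise/irregular component)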


When building models to forecast time series, some common methods include exponential smoothing and ARIMA-family models. ARIMA stands for autoregressive (AR), integrated (I), moving average (MA). Autoregressive models take advantage of the fact that an observation at time t is correlated to a previous observation, for example, at time t - 1. Later, we will look at some techniques for determining whether a time series is autoregressive; note that not all time series are. The integrated component concerns the differenced data, or the change in the data from one time to another. For example, if we were concerned with a lag (distance between times) of 1, the differenced data would be the value at time t subtracted by the value at time t - 1. Lastly, the moving average component uses a sliding window to average the last x observations, where x is the length of the sliding window. If, for example, we have a 3-period moving average, by the time we have all of the data up to time 5, our moving average calculation only uses time periods 3, 4, and 5 to forecast time 6.

The moving average puts equal weight on each time period in the past involved in the calculation. In practice, this isn't always a realistic expectation of our data. Sometimes, all past values are important, but they vary in their influence on future data points. For these cases, we can use exponential smoothing, which allows us to put more weight on more recent values and less weight on values further away from what we are predicting.
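With pandas, both ideas can be sketched in a few lines (the sales series below is hypothetical): a rolling mean weights the last three observations equally, while an exponentially weighted mean puts more weight on recent values:

import pandas as pd

sales = pd.Series([3, 4, 6, 5, 8, 9, 12])    # made-up data
sales.rolling(window=3).mean()               # 3-period moving average (equal weights)
sales.ewm(alpha=0.5).mean()                  # exponential smoothing (recent values weigh more)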

Note that we aren't limited to predicting numbers; in fact, depending on the data, our predictions could be categorical in nature—things such as determining which flavor of ice cream will sell the most on a given day or whether an email is spam or not. This type of prediction will also be discussed in the future. Our topic of discussion for the next post will be inferential statistics.



Tuesday, January 18, 2022

Ice cream shop sales prediction

Say our favorite ice cream shop has asked us to help predict how many ice creams they can expect to sell on a given day. They are convinced that the temperature outside has a strong influence on their sales, so they have collected data on the number of ice creams sold at a given temperature. We agree to help them, and the first thing we do is make a scatter plot of the data they collected:


We can observe an upward trend in the scatter plot: more ice creams are sold at higher temperatures. In order to help out the ice cream shop, though, we need to find a way to make predictions from this data. We can use a technique called regression to model the relationship between temperature and ice cream sales with an equation. Using this equation, we will be able to predict ice cream sales at a given temperature.

There are many types of regression that will yield a different type of equation, such as linear (which we will use for this example) and logistic. Our first step will be to identify the dependent variable, which is the quantity we want to predict (ice cream sales), and the variables we will use to predict it, which are called independent variables. While we can have many independent variables, our ice cream sales example only has one: temperature. Therefore, we will use simple linear regression to model the relationship as a line:
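As a rough sketch of fitting such a line (the temperature and sales numbers below are made up, not the shop's actual data), NumPy's polyfit can estimate the slope and intercept of the best-fit line:

import numpy as np

temperature = np.array([20, 22, 25, 28, 30, 33])                      # hypothetical data
ice_creams_sold = np.array([5, 7, 10, 14, 16, 20])
slope, intercept = np.polyfit(temperature, ice_creams_sold, deg=1)    # degree-1 (linear) fit
predicted = slope * 35 + intercept                                    # predicted sales at 35 degrees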


The regression line in the previous scatter plot yields the following equation for the relationship:


Suppose that today the temperature is 35°C—we would plug that in for temperature in the equation. The result predicts that the ice cream shop will sell 24.54 ice creams. This prediction is along the red line in the previous plot. Note that the ice cream shop can't actually sell fractions of ice cream.

Before leaving the model in the hands of the ice cream shop, it's important to discuss the difference between the dotted and solid portions of the regression line that we obtained. This we will do in the next post.


Sunday, January 16, 2022

Anscombe's quartet

There is a very interesting dataset illustrating how careful we must be when only using summary statistics and correlation coefficients to describe our data. It also shows us that plotting is not optional. Anscombe's quartet is a collection of four different datasets that have identical summary statistics and correlation coefficients, but when plotted, it is obvious they are not similar. Each dataset consists of eleven (x,y) points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data when analyzing it, and the effect of outliers and other influential observations on statistical properties. He described the article as being intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough."

Notice that each of the plots in the figure shown above has an identical best-fit line defined by the equation y = 0.50x + 3.00. Later, we will discuss, at a high level, how this line is created and what it means.

The quartet is still often used to illustrate the importance of looking at a set of data graphically before starting to analyze according to a particular type of relationship, and the inadequacy of basic statistic properties for describing realistic datasets. 

Summary statistics are very helpful when we're getting to know the data, but be wary of relying exclusively on them. Remember, statistics can be misleading; be sure to also plot the data before drawing any conclusions or proceeding with the analysis.

In the next post we will learn about prediction and forecasting using regression.


Multivariate statistics

With multivariate statistics, we seek to quantify relationships between variables and attempt to make predictions for future behavior. The covariance is a statistic for quantifying the relationship between variables by showing how one variable changes with respect to another (also referred to as their joint variance):
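Written out (a standard definition, supplied here since the original formula image is not reproduced):

\[
\mathrm{cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big]
\]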

E[X] is a new notation for us. It is read as the expected value of X or the expectation of X, and it is calculated by summing all the possible values of X multiplied by their probability—it's the long-run average of X.

The magnitude of the covariance isn't easy to interpret, but its sign tells us whether the variables are positively or negatively correlated. However, we would also like to quantify how strong the relationship is between the variables, which brings us to correlation. Correlation tells us how variables change together both in direction (same or opposite) and magnitude (strength of the relationship). To find the correlation, we calculate the Pearson correlation coefficient, symbolized by ρ (the Greek letter rho), by dividing the covariance by the product of the standard deviations of the variables:
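In symbols (again a standard formula, since the original image is not shown), with s_X and s_Y denoting the standard deviations of X and Y:

\[
\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{s_X \, s_Y}
\]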

This normalizes the covariance and results in a statistic bounded between -1 and 1, making it easy to describe both the direction of the correlation (sign) and the strength of it (magnitude). Correlations of 1 are said to be perfect positive (linear) correlations, while those of -1 are perfect negative correlations. Values near 0 aren't correlated. If correlation coefficients are near 1 in absolute value, then the variables are said to be strongly correlated; those closer to 0.5 are said to be weakly correlated.

Let's look at some examples using scatter plots. In the leftmost subplot of Figure 1.12 (ρ = 0.11), we see that there is no correlation between the variables: they appear to be random noise with no pattern. The next plot with ρ = -0.52 has a weak negative correlation: we can see that the variables appear to move together with the x variable increasing, while the y variable decreases, but there is still a bit of randomness. In the third plot from the left (ρ = 0.87), there is a strong positive correlation: x and y are increasing together. The rightmost plot with ρ = -0.99 has a near-perfect negative correlation: as x increases, y decreases. We can also see how the points form a line:


To quickly eyeball the strength and direction of the relationship between two variables (and see whether there even seems to be one), we will often use scatter plots rather than calculating the exact correlation coefficient. This is for a couple of reasons:

1. It's easier to find patterns in visualizations, but it's more work to arrive at the same conclusion by looking at numbers and tables.

2. We might see that the variables seem related, but they may not be linearly related. Looking at a visual representation will make it easy to see if our data is actually quadratic, exponential, logarithmic, or some other non-linear function.

Both of the following plots depict data with strong positive correlations, but it's pretty obvious when looking at the scatter plots that these are not linear. The one on the left is logarithmic, while the one on the right is exponential:


It's very important to remember that while we may find a correlation between X and Y, it doesn't mean that X causes Y or that Y causes X. There could be some Z that actually causes both; perhaps X causes some intermediary event that causes Y, or it is actually just a coincidence. Keep in mind that we often don't have enough information to report causation—correlation does not imply causation.

Next we will take up the pitfalls of summary statistics. See you in the next post.




Friday, January 14, 2022

Common distributions

While there are many probability distributions, each with specific use cases, there are some that we will come across often. The Gaussian, or normal, distribution looks like a bell curve and is parameterized by its mean (μ) and standard deviation (σ). The standard normal (Z) has a mean of 0 and a standard deviation of 1.

Many things in nature happen to follow the normal distribution, such as heights. The Poisson distribution is a discrete distribution that is often used to model arrivals. The time between arrivals can be modeled with the exponential distribution. Both are defined by their mean, lambda (λ). The uniform distribution places equal likelihood on each value within its bounds. We often use this for random number generation. When we generate a random number to simulate a single success/failure outcome, it is called a Bernoulli trial. This is parameterized by the probability of success (p). When we run the same experiment multiple times (n), the total number of successes is then a binomial random variable. Both the Bernoulli and binomial distributions are discrete.
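As a quick sketch of drawing samples from these distributions with NumPy (the parameter values are arbitrary):

import numpy as np

np.random.normal(loc=0, scale=1, size=1000)    # Gaussian with mean 0, standard deviation 1
np.random.poisson(lam=3, size=1000)            # Poisson arrivals with mean 3
np.random.exponential(scale=1/3, size=1000)    # time between arrivals
np.random.uniform(low=0, high=1, size=1000)    # equal likelihood on [0, 1)
np.random.binomial(n=10, p=0.5, size=1000)     # successes in 10 Bernoulli trials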

We can visualize both discrete and continuous distributions; however, discrete distributions give us a probability mass function (PMF) instead of a PDF:


In order to compare variables from different distributions, we would have to scale the data, which we could do with the range by using min-max scaling. We take each data point, subtract the minimum of the dataset, then divide by the range. This normalizes our data (scales it to the range [0, 1]):
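As a formula (standard min-max scaling, not copied from the original figure):

\[
x_{scaled} = \frac{x - \min(X)}{\max(X) - \min(X)}
\]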


This isn't the only way to scale data; we can also use the mean and standard deviation. In this case, we would subtract the mean from each observation and then divide by the standard deviation to standardize the data. This gives us what is known as a Z-score:
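In symbols (a standard formula, with x̄ the sample mean and s the sample standard deviation):

\[
z = \frac{x - \bar{x}}{s}
\]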


We are left with a normalized distribution with a mean of 0 and a standard deviation (and variance) of 1. The Z-score tells us how many standard deviations from the mean each observation is; the mean has a Z-score of 0, while an observation 0.5 standard deviations below the mean will have a Z-score of -0.5.

There are, of course, additional ways to scale our data, and the one we end up choosing will be dependent on our data and what we are trying to do with it. By keeping the measures of central tendency and measures of dispersion in mind, you will be able to identify how the scaling of data is being done in any other methods you come across.

Until now we were dealing with univariate statistics and were only able to say something about the variable we were looking at. Next we will focus on multivariate statistics.


Kernel density estimate

KDEs are similar to histograms, except rather than creating bins for the data, they draw a smoothed curve, which is an estimate of the distribution's probability density function (PDF). The PDF is for continuous variables and tells us how probability is distributed over the values. Higher values for the PDF indicate higher likelihoods:
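A minimal sketch of estimating and evaluating a KDE with SciPy (the data here is made up):

import numpy as np
from scipy.stats import gaussian_kde

data = np.random.randn(1000)        # hypothetical sample
kde = gaussian_kde(data)            # smoothed estimate of the PDF
grid = np.linspace(-4, 4, 100)
density = kde(grid)                 # estimated PDF values over the grid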


When the distribution starts to get a little lopsided with long tails on one side, the mean measure of center can easily get pulled to that side. Distributions that aren't symmetric have some skew to them. A left (negative) skewed distribution has a long tail on the left-hand side; a right (positive) skewed distribution has a long tail on the right-hand side. In the presence of negative skew, the mean will be less than the median, while the reverse happens with a positive skew. When there is no skew, both will be equal:


There is also another statistic called kurtosis, which compares the density of the center of the distribution with the density at the tails. Both skewness and kurtosis can be calculated with the SciPy package.
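For example (a small sketch assuming SciPy; the data is arbitrary):

import numpy as np
from scipy.stats import skew, kurtosis

data = np.random.randn(1000)
print(skew(data))        # near 0 for symmetric data; negative/positive for left/right skew
print(kurtosis(data))    # compares tail density to the normal distribution (near 0 here)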

Each column in our data is a random variable, because every time we observe it, we get a value according to the underlying distribution—it's not static. When we are interested in the probability of getting a value of x or less, we use the cumulative distribution function (CDF), which is the integral (area under the curve) of the PDF:
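In symbols (a standard definition, given here since the figure is not reproduced), with f denoting the PDF:

\[
F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\, dt
\]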



The probability of the random variable X being less than or equal to the specific value of x is denoted as P(X ≤ x). With a continuous variable, the probability of getting exactly x is 0. This is because the probability will be the integral of the PDF from x to x (area under a curve with zero width), which is 0:

In order to visualize this, we can find an estimate of the CDF from the sample, called the empirical cumulative distribution function (ECDF). Since this is cumulative, at the point where the value on the x-axis is equal to x, the y value is the cumulative probability of P(X ≤ x). Let's visualize P(X ≤ 50), P(X = 50), and P(X > 50) as an example:
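A minimal way to compute an ECDF by hand (a sketch with made-up data):

import numpy as np

data = np.random.randn(100) * 20 + 40          # hypothetical sample
x = np.sort(data)                              # sorted observations
y = np.arange(1, len(data) + 1) / len(data)    # cumulative proportion at each x
prob_at_most_50 = np.mean(data <= 50)          # empirical estimate of P(X <= 50)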


In addition to examining the distribution of our data, we may find the need to utilize probability distributions for uses such as simulation or hypothesis testing. In the next posts we will take a look at a few distributions that we are likely to come across.


Thursday, January 13, 2022

Summarizing data

We have seen many examples of descriptive statistics that we can use to summarize our data by its center and dispersion; in practice, looking at the 5-number summary and visualizing the distribution prove to be helpful first steps before diving into some of the other aforementioned metrics. The 5-number summary, as its name indicates, provides five descriptive statistics that summarize our data: 


A box plot (or box and whisker plot) is a visual representation of the 5-number summary. The median is denoted by a thick line in the box. The top of the box is Q3 and the bottom of the box is Q1. Lines (whiskers) extend from both sides of the box boundaries toward the minimum and maximum. Based on the convention our plotting tool uses, though, they may only extend to a certain statistic; any values beyond these statistics are marked as outliers (using points). In general, the lower bound of the whiskers will be Q1 – 1.5 * IQR and the upper bound will be Q3 + 1.5 * IQR, which is called the Tukey box plot:
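A quick sketch of computing the 5-number summary and the Tukey whisker bounds with NumPy (the data is hypothetical):

import numpy as np

data = np.random.randn(100)
data_min, q1, median, q3, data_max = np.percentile(data, [0, 25, 50, 75, 100])
iqr = q3 - q1
lower_whisker = q1 - 1.5 * iqr    # points below this are drawn as outliers
upper_whisker = q3 + 1.5 * iqr    # points above this are drawn as outliers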


While the box plot is a great tool for getting an initial understanding of the distribution, we don't get to see how things are distributed inside each of the quartiles. For this purpose, we turn to histograms for discrete variables (for instance, the number of people or books) and kernel density estimates (KDEs) for continuous variables (for instance, heights or time). There is nothing stopping us from using KDEs on discrete variables, but it is easy to confuse people that way. Histograms work for both discrete and continuous variables; however, in both cases, we must keep in mind that the number of bins we choose to divide the data into can easily change the shape of the distribution we see.

To make a histogram, a certain number of equal-width bins are created, and then bars with heights for the number of values we have in each bin are added. The following plot is a histogram with 10 bins, showing the three measures of central tendency for the same data that was used to generate the box plot in the previous figure.


In the next post we will see KDEs which are similar to histograms, except rather than creating bins for the data, they draw a smoothed curve, which is an estimate of the distribution's probability density function (PDF).


Wednesday, January 12, 2022

Interquartile range

As mentioned earlier, the median is the 50th percentile, or the 2nd quartile (Q2). Percentiles and quartiles are both quantiles—values that divide data into equal groups each containing the same percentage of the total data. Percentiles divide the data into 100 parts, while quartiles do so into four (25%, 50%, 75%, and 100%).

Since quantiles neatly divide up our data, and we know how much of the data goes in each section, they are a perfect candidate for helping us quantify the spread of our data. One common measure for this is the interquartile range (IQR), which is the distance between the 3rd and 1st quartiles (Q3 and Q1):
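In symbols:

\[
IQR = Q_3 - Q_1
\]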


The IQR gives us the spread of data around the median and quantifies how much dispersion we have in the middle 50% of our distribution. It can also be useful when checking the data for outliers. In addition, the IQR can be used to calculate a unitless measure of dispersion, which we will discuss next.

Just like we had the coefficient of variation when using the mean as our measure of central tendency, we have the quartile coefficient of dispersion when using the median as our measure of center. This statistic is also unitless, so it can be used to compare datasets. It is calculated by dividing the semiquartile range (half the IQR) by the midhinge (midpoint between the first and third quartiles):
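In symbols (a standard formula, since the original image is not shown):

\[
QCD = \frac{(Q_3 - Q_1)/2}{(Q_1 + Q_3)/2}
\]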


In the next post we will look at how we can use measures of central tendency and dispersion to summarize our data. 


Tuesday, January 11, 2022

Standard deviation

We can use the standard deviation to see how far from the mean data points are on average. A small standard deviation means that values are close to the mean, while a large standard deviation means that values are dispersed more widely. This is tied to how we would imagine the distribution curve: the smaller the standard deviation, the thinner the peak of the curve (0.5); the larger the standard deviation, the wider the peak of the curve (2):


The standard deviation is simply the square root of the variance. By performing this operation, we get a statistic in units that we can make sense of again ($ for our income example):
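In symbols (a standard formula for the sample standard deviation, supplied since the original image is not reproduced):

\[
s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
\]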


When we moved from variance to standard deviation, we were looking to get to units that made sense; however, if we then want to compare the level of dispersion of one dataset to another, we would need to have the same units once again. One way around this is to calculate the coefficient of variation (CV), which is unitless. The CV is the ratio of the standard deviation to the mean:
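In symbols:

\[
CV = \frac{s}{\bar{x}}
\]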


Since the CV is unitless, we can use it to compare the volatility of different assets. So far, other than the range, we have discussed mean-based measures of dispersion; next, we will look at how we can describe the spread with the median as our measure of central tendency. Our next post will focus on the interquartile range and related topics.



Monday, January 10, 2022

Range and variance

The range is the distance between the smallest value (minimum) and the largest value (maximum). The units of the range will be the same units as our data. Therefore, unless two distributions of data are in the same units and measuring the same thing, we can't compare their ranges and say one is more dispersed than the other:
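In symbols:

\[
\mathrm{range} = \max(X) - \min(X)
\]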


Just from the definition of the range, we can see why it wouldn't always be the  best way to measure the spread of our data. It gives us upper and lower bounds on what we have in the data; however, if we have any outliers in our data, the range will be rendered useless.

Another problem with the range is that it doesn't tell us how the data is dispersed around its center; it really only tells us how dispersed the entire dataset is. This brings us to the variance. The variance describes how far apart observations are spread out from their average value (the mean). The population variance is denoted as σ² (pronounced sigma-squared), and the sample variance is written as s². It is calculated as the average squared distance from the mean. Note that the distances must be squared so that distances below the mean don't cancel out those above the mean.

If we want the sample variance to be an unbiased estimator of the population variance, we divide by n - 1 instead of n to account for using the sample mean instead of the population mean; this is called Bessel's correction. Most statistical tools will give us the sample variance by default, since it is very rare that we would have data for the entire population:
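In symbols (a standard formula with Bessel's correction, supplied since the original image is not reproduced):

\[
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
\]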



The variance gives us a statistic with squared units. This means that if we started with data on income in dollars ($), then our variance would be in dollars squared ($²). This isn't really useful when we're trying to see how this describes the data; we can use the magnitude (size) itself to see how spread out something is (large values = large spread), but beyond that, we need a measure of spread with units that are the same as our data. For this purpose, we use the standard deviation, which we will see in the next post.



Thursday, January 6, 2022

Median and mode

Unlike the mean, the median is robust to outliers. Consider income in the US; the top 1% is much higher than the rest of the population, so this will skew the mean to be higher and distort the perception of the average person's income. However, the median will be more representative of the average income because it is the 50th percentile of our data; this means that 50% of the values are greater than the median and 50% are less than the median.

The median is calculated by taking the middle value from an ordered list of values; in cases where we have an even number of values, we take the mean of the middle two values. If we take the numbers 0, 1, 1, 2, and 9 again, our median is 1. Notice that the mean and median for this dataset are different; however, depending on the distribution of the data, they may be the same.

The mode is the most common value in the data (if we, once again, have the numbers 0, 1, 1, 2, and 9, then 1 is the mode). In practice, we will often hear things such as the distribution is bimodal or multimodal (as opposed to unimodal) in cases where the distribution has two or more most popular values. This doesn't necessarily mean that each of them occurred the same amount of times, but rather, they are more common than the other values by a significant amount. As shown in the following plots, a unimodal distribution has only one mode (at 0), a bimodal distribution has two (at -2 and 3), and a multimodal distribution has many (at -2, 0.4, and 3):



Understanding the concept of the mode comes in handy when describing continuous distributions; however, most of the time when we're describing our continuous data, we will use either the mean or the median as our measure of central tendency. When working with categorical data, on the other hand, we will typically use the mode.

Knowing where the center of the distribution is only gets us partially to being able to summarize the distribution of our data—we need to know how values fall around the center and how far apart they are. Measures of spread tell us how the data is dispersed; this will indicate how thin (low dispersion) or wide (very spread out) our distribution is. As with measures of central tendency, we have several ways to describe the spread of a distribution, and which one we choose will depend on the situation and the data.

We will discuss range and variance in the next post.


Wednesday, January 5, 2022

Descriptive statistics

Let us begin our discussion of descriptive statistics with univariate statistics; univariate simply means that these statistics are calculated from one (uni) variable. Everything in this section can be extended to the whole dataset, but the statistics will be calculated per variable we are recording (meaning that if we had 100 observations of speed and distance pairs, we could calculate the averages across the dataset, which would give us the average speed and average distance statistics).

Descriptive statistics are used to describe and/or summarize the data we are working with. We can start our summarization of the data with a measure of central tendency, which describes where most of the data is centered around, and a measure of spread or dispersion, which indicates how far apart values are.

Measures of central tendency describe the center of our distribution of data. There are three common statistics that are used as measures of center: mean, median, and mode. Each has its own strengths, depending on the data we are working with.

1. Mean
Perhaps the most common statistic for summarizing data is the average, or mean. The population mean is denoted by μ (the Greek letter mu), and the sample mean is written as x̄ (pronounced X-bar). The sample mean is calculated by summing all the values and dividing by the count of values; for example, the mean of the numbers 0, 1, 1, 2, and 9 is 2.6 ((0 + 1 + 1 + 2 + 9)/5):
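In symbols (a standard formula, supplied since the original image is not reproduced):

\[
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
\]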

 

We use xᵢ to represent the ith observation of the variable X. Note how the variable as a whole is represented with a capital letter, while the specific observation is lowercase. Σ (the Greek capital letter sigma) is used to represent a summation, which, in the equation for the mean, goes from 1 to n, the number of observations.

One important thing to note about the mean is that it is very sensitive to outliers (values created by a different generative process than our distribution). In the previous example, we were dealing with only five values; nevertheless, the 9 is much larger than the other numbers and pulled the mean higher than all but the 9. In cases where we suspect outliers to be present in our data, we may want to instead use the median as our measure of central tendency. 

Median and mode will be the topic of discussion of our next post.

 


Tuesday, January 4, 2022

Statistical foundations for data analysis

When we want to make observations about the data we are analyzing, we often, if not always, turn to statistics in some fashion. The data we have is referred to as the sample, which was observed from (and is a subset of) the population.

Two broad categories of statistics are descriptive and inferential statistics. With descriptive statistics, as the name implies, we are looking to describe the sample. Inferential statistics involves using the sample statistics to infer, or deduce, something about the population, such as the underlying distribution.

Often, the goal of an analysis is to create a story for the data; unfortunately, it is very easy to misuse statistics. This is especially true of inferential statistics, which is used in many scientific studies and papers to show the significance of the researchers' findings. We will focus on descriptive statistics to help explain the data we are analyzing.

There's an important thing to remember before we attempt any analysis: our sample must be a random sample that is representative of the population. This means that the data must be sampled without bias (for example, if we are asking people whether they like a certain sports team, we can't only ask fans of the team) and that we should have (ideally) members of all distinct groups from the population in our sample (in the sports team example, we can't just ask men).

When we discuss machine learning in the future, we will need to sample our data, which will be a sample to begin with. This is called resampling. Depending on the data, we will have to pick a different method of sampling. Often, our best bet is a simple random sample: we use a random number generator to pick rows at random. When we have distinct groups in the data, we want our sample to be a stratified random sample, which will preserve the proportion of the groups in the data.

In some cases, we don't have enough data for the aforementioned sampling strategies, so we may turn to random sampling with replacement (bootstrapping); this is called a bootstrap sample. Note that our underlying sample needs to have been a random sample or we risk increasing the bias of the estimator (we could pick certain rows more often because they are in the data more often if it was a convenience sample, while in the true population these rows aren't as prevalent).
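A hedged sketch of these sampling strategies with pandas (the DataFrame and its 'group' column are hypothetical, and the stratified groupby sample assumes a reasonably recent pandas version):

import pandas as pd

df = pd.DataFrame({'group': ['a', 'a', 'b', 'b', 'b', 'b'], 'value': [1, 2, 3, 4, 5, 6]})
simple = df.sample(n=3, random_state=0)                           # simple random sample
stratified = df.groupby('group', group_keys=False).sample(frac=0.5, random_state=0)  # keeps group proportions
bootstrap = df.sample(n=len(df), replace=True, random_state=0)    # random sampling with replacement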

We will continue our discussion of descriptive statistics in the next post.
