Tuesday, December 31, 2019

Plotting the sine function

on December 31, 2019 with No comments

Let’s plot the sine function over the interval from 0 to 4π. The main plotting function in matplotlib, plot, does not plot functions per se; it plots (x, y) data sets. As we shall see, we can instruct plot either to draw just a point or dot at each data point, or to draw straight lines between the data points. To create the illusion of the smooth curve that the sine function is, we need enough (x, y) data points that the curve appears smooth when plot draws straight lines between them.

The sine function undergoes two full oscillations, with two maxima and two minima, between 0 and 4π. So let’s start by creating an array with 33 data points between 0 and 4π, and then let matplotlib draw straight lines between them. Our code consists of four parts:

• Import the NumPy and matplotlib modules.
• Create the (x, y) data arrays.
• Have plot draw straight lines between the (x, y) data points.
• Display the plot in a figure window using the show function.

See the following program:

import numpy as np
import matplotlib.pyplot as plt
x = np.linspace(0, 4.*np.pi, 33)
y = np.sin(x)
plt.plot(x, y)
plt.show()

The output is shown below. It consists of the sine function plotted over the interval from 0 to 4π, as advertised, as well as axes annotated with nice whole numbers over the appropriate interval.

One problem, however, is that while the plot oscillates like a sine wave, it is not smooth (look at the peaks). This is because we did not create the (x, y) arrays with enough data points. To correct this, we need more data points, which we can create using the same program but with 129 (x, y) data points instead of 33. In the program above, just replace 33 in line 3 with 129 (a few more or fewer is fine) so that the function linspace creates an array with 129 data points instead of 33. The output is shown below:

In making this plot, matplotlib has made a number of choices, such as the size of the figure, the color of the line, even the fact that by default a line is drawn between successive data points in the (x;y) arrays. All of these choices can be changed by explicitly instructing matplotlib to do so. This involves including more arguments in the function calls we have used and using new functions that control other properties of the plot. See the previous posts and try a few of the simpler embellishments that are possible.
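To put a number on the difference between 33 and 129 points, the sketch below compares the straight-line plot with the sine curve itself. It assumes a dense grid of 10,001 points as a stand-in for the exact curve, and the helper name max_gap_from_sine is made up for this illustration.

```python
import numpy as np

def max_gap_from_sine(n_points):
    """Largest vertical gap between sin(x) and the straight-line plot through n_points samples."""
    x = np.linspace(0, 4 * np.pi, n_points)     # sample positions, as in the program above
    fine = np.linspace(0, 4 * np.pi, 10001)     # dense grid standing in for the exact curve
    polyline = np.interp(fine, x, np.sin(x))    # the straight segments that plot() would draw
    return np.max(np.abs(polyline - np.sin(fine)))

for n in (33, 129):
    print(n, max_gap_from_sine(n))
```

The gap for 129 points comes out far smaller than for 33, which is why the second plot looks smooth to the eye.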

Introducing New Elements to a Plot

on December 30, 2019 with No comments

Charts are supposed to make your data visually appealing. To do this, it is important to ensure you use the correct chart to represent the data you need, because not all charts are suitable for any kind of data. The basic lines and markers will not be sufficient in making the charts appealing. You should think of getting additional elements into the chart for this purpose.

How to add text to a chart

With the title() function, you can add a title to the chart. You can also add axis labels with the xlabel() and ylabel() functions. Each of these functions takes the text to display as a string argument. We start with the axis labels because they identify the values plotted along each axis. Your code should look like this:

import matplotlib.pyplot as plt
plt.axis([0,5,0,20])
plt.title('My first plot')
plt.xlabel('Counting')
plt.ylabel('Square values')
plt.plot([1,2,3,4], [1,4,9,16],'ro')
plt.show()

When you run the above program you should have the following plot:

You can apply basic formatting to all the text that describes the plot. This includes changing the font, font size, and color, or making any other tweaks needed to make the plot appealing. Following the example above, we can further tweak the title and labels as follows:

plt.axis([0,5,0,20])
plt.title('My first plot',fontsize=18,fontname='Comic Sans MS')
plt.xlabel('Counting',color='black')
plt.ylabel('Square values',color='black')
plt.plot([1,2,3,4], [1,4,9,16],'ro')
plt.show()

The output should be as shown below:

Matplotlib allows you to make further edits to the chart. For example, you can add new text to the chart using the text() function:

text(x, y, s, fontdict=None, **kwargs)

In the function outlined above, the coordinates x and y give the location of the text you are adding to the chart, and s is the string of text to place at that location. The optional fontdict parameter is a dictionary of font properties for the new text. Beyond these, you can pass further keyword arguments to style the text. Let’s have a look at the example below to illustrate this:

plt.axis([0,5,0,20])
plt.title('My first plot',fontsize=20,fontname='Times New Roman')
plt.xlabel('Counting',color='gray')
plt.ylabel('Square values',color='gray')
plt.text(1,1.4,'First')
plt.text(2,4.4,'Second')
plt.text(3,9.4,'Third')
plt.text(4,16.4,'Fourth')
plt.plot([1,2,3,4], [1,4,9,16],'ro')
plt.show()

The output should be as shown below:
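As a separate sketch, the optional fontdict parameter of text() can carry several font properties at once. The dictionary contents below are hypothetical, and the Agg backend is selected only so the code runs without opening a window.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical font settings, chosen only for illustration
label_font = {'color': 'gray', 'size': 12, 'weight': 'bold'}

plt.figure()
plt.axis([0, 5, 0, 20])
plt.plot([1, 2, 3, 4], [1, 4, 9, 16], 'ro')
label = plt.text(2, 4.4, 'Second', fontdict=label_font)

# The returned Text object reports the properties taken from the dictionary
print(label.get_text(), label.get_color(), label.get_fontsize(), label.get_weight())
```

In an interactive session, drop the matplotlib.use line and call plt.show() to see the result.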

Matplotlib can render mathematical expressions written in LaTeX syntax. Enclose the expression in dollar signs; when it is keyed in correctly, matplotlib recognizes it and converts it into the corresponding graphic. This is how to introduce formulas, expressions, or other special characters into your plot. When writing LaTeX expressions, remember to put an r before the string so that Python reads it as a raw string and leaves the backslashes untouched.

plt.axis([0,5,0,20])
plt.title('My first plot',fontsize=20,fontname='Times New Roman')
plt.xlabel('Counting',color='gray')
plt.ylabel('Square values',color='gray')
plt.text(1,1.4,'First')
plt.text(2,4.4,'Second')
plt.text(3,9.4,'Third')
plt.text(4,16.4,'Fourth')
plt.text(1.1,12,r'$y = x^2$',fontsize=20,bbox={'facecolor':'yellow','alpha':0.2})
plt.plot([1,2,3,4], [1,4,9,16],'ro')
plt.show()

Your plot should have a y = x² expression on a yellow background as shown below:
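The r prefix matters as soon as an expression contains a backslash. The small check below uses \theta as an illustrative expression (it does not appear in the program above) to show what Python does with and without the prefix.

```python
plain = '$\theta$'    # without r, Python turns \t into a tab character
raw = r'$\theta$'     # with r, the backslash survives for matplotlib to interpret

print(len(plain), len(raw))          # prints: 7 8
print('\t' in plain, '\t' in raw)    # prints: True False
```

Only the raw version still contains the backslash command that matplotlib needs in order to draw the Greek letter.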

Many online charting tools let you add or remove a grid automatically. You can do this in Python, too. A grid is useful because it makes the position of every point plotted on the chart easier to read. To add a grid, call the grid() function as shown below, passing True as its argument.

plt.axis([0,5,0,20])
plt.title('My first plot',fontsize=20,fontname='Times New Roman')
plt.xlabel('Counting',color='gray')
plt.ylabel('Square values',color='gray')
plt.text(1,1.4,'First')
plt.text(2,4.4,'Second')
plt.text(3,9.4,'Third')
plt.text(4,16.4,'Fourth')
plt.text(1.1,12,r'$y = x^2$',fontsize=20,bbox={'facecolor':'yellow','alpha':0.2})
plt.grid(True)
plt.plot([1,2,3,4], [1,4,9,16],'ro')
plt.show()

The output is as shown below:

If you want to do away with the grid, pass False instead, as shown below:

plt.grid(False)

How to Create a Chart

on December 29, 2019 with No comments

Before you begin, import pyplot to your programming environment and set the name as plt as shown below:

import matplotlib.pyplot as plt
plt.plot([1,2,3,4])

When you enter this code, you will have created a Line2D object. This object is the line that represents the trend of the data you plot in the chart. To view the plot, you will use the function below:

plt.show()

The result should be a plotting window similar to the one below:

Depending on the platform you are using, your chart may display without calling the show() function, especially if you are using the IPython QtConsole. Because only one list was passed to plot(), matplotlib treats its values as the y values and uses the indices 0, 1, 2, 3 as the x values; to control both axes, provide two arrays, one for x and one for y. The blue line in the example above connects all the points in your plot. This is the default appearance when your plot does not have a legend, axis labels, or a title.
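When plot() is given a single list, matplotlib uses it as the y values and the positions 0, 1, 2, ... as the x values. A minimal sketch to inspect this on the object that plot() returns (the Agg backend is an assumption so that no window is needed):

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

plt.figure()
lines = plt.plot([1, 2, 3, 4])   # plot() returns a list of Line2D objects
line = lines[0]

print(type(line).__name__)       # the class of the returned object
print(list(line.get_xdata()))    # x values filled in by matplotlib
print(list(line.get_ydata()))    # y values taken from the list
```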

Beyond using pyplot commands for a single figure, you can work with several figures at the same time in Matplotlib. You can also place several plots within each figure: the subplot() function divides the main figure into multiple drawing areas.

The subplot() function also helps you choose the subplot to focus your work on. Once selected, any commands passed will be called on the current subplot. A careful look at the subplot() function reveals three integers, each of which serves a unique role.

The first integer sets how many parts the figure is split into vertically (the number of rows). The second integer sets how many parts the figure is split into horizontally (the number of columns). The third integer selects the subplot to which your commands are directed.

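import matplotlib.pyplot as plt
import numpy as np

t = np.arange(0,5,0.1)
y1 = np.sin(2*np.pi*t)
y2 = np.sin(2*np.pi*t)
plt.subplot(211)            # two rows, one column, upper subplot
plt.plot(t,y1,'b-.')
plt.subplot(212)            # two rows, one column, lower subplot
plt.plot(t,y2,'r--')
plt.show()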

You should have the following plot:

In the next example, we divide the figure into two side-by-side subplots using the code below:

t = np.arange(0.,1.,0.05)
y1 = np.sin(2*np.pi*t)
y2 = np.cos(2*np.pi*t)

plt.subplot(121)
plt.plot(t,y1,'b-.')
plt.subplot(122)
plt.plot(t,y2,'r--')
plt.show()

You should have the plot below:
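The three integers can also be passed as separate arguments, so subplot(2, 1, 1) means the same as subplot(211). The sketch below, with variable names chosen for illustration, checks where each drawing area ends up in the two layouts used above.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

fig_rows = plt.figure()
top = plt.subplot(2, 1, 1)       # same as plt.subplot(211)
bottom = plt.subplot(2, 1, 2)    # same as plt.subplot(212)

fig_cols = plt.figure()
left = plt.subplot(1, 2, 1)      # same as plt.subplot(121)
right = plt.subplot(1, 2, 2)     # same as plt.subplot(122)

# get_position() gives each subplot's bounding box in figure coordinates
print(top.get_position().y0 > bottom.get_position().y0)   # first subplot sits above the second
print(left.get_position().x0 < right.get_position().x0)   # first subplot sits left of the second
```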

Display Tools in Matplotlib

on December 28, 2019 with No comments

There are different display tools you can use to help you understand a plot the first time you see it. Legends and annotations serve this purpose. Legends identify the different series of data within your plot. To add one, you call the matplotlib function legend().

Annotations, on the other hand, help in identifying the important points in the plot. Annotations are added using the matplotlib function annotate(). An annotation has a label and, optionally, an arrow, each of which can be described by different parameters. For this reason, calling help(annotate) is the best way to get a full explanation.

Other display tools include labels, grids, and titles. Each axis can carry a label, set with the functions xlabel() and ylabel() for the x and y axes respectively. The title of your plot is set using the title() function, while the grid is controlled using the grid() function. Note that you can turn the grid on or off where necessary.
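A minimal sketch of these display tools working together is shown below. The data, label texts, and annotation position are made up for illustration, and the Agg backend is an assumption so that no window is needed.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)
plt.figure()
plt.plot(x, np.sin(x), label='sin(x)')        # label is the text the legend will show
plt.plot(x, np.cos(x), '--', label='cos(x)')
legend = plt.legend()

# xy is the point being annotated, xytext is where the label is written
note = plt.annotate('maximum', xy=(np.pi / 2, 1), xytext=(3, 0.8),
                    arrowprops={'arrowstyle': '->'})

plt.xlabel('angle')
plt.ylabel('value')
plt.title('Legend and annotation')
plt.grid(True)

print([entry.get_text() for entry in legend.get_texts()])
print(note.get_text())
```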

In Matplotlib, you will be working with a lot of tools and functions that enhance manipulation and representation of the objects you work with, alongside any internal objects that might be present. By design, matplotlib is built in three layers as shown below:

● The scripting layer

This layer is also referred to as pyplot. This is where functions and artist classes operate. Pyplot is an interface used in data visualization and analysis.

● The artist layer

This is an intermediate Matplotlib layer. All the elements in this layer are used in building charts, and include things like markers, titles, and labels assigned to the x and y axis.

● The backend layer

This is the lowest level in Matplotlib. All the APIs are found in this layer. At this point, graphic element implementation takes place, albeit at the lowest possible level. Each of these layers can only share communication with the layer beneath it, but not the one above it, hence the nature of communication in Matplotlib is unidirectional.
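To see the lower two layers without the scripting layer, the sketch below builds a chart without importing pyplot at all. It assumes the Agg backend that ships with matplotlib, and the title text is made up for illustration.

```python
from matplotlib.figure import Figure
from matplotlib.backends.backend_agg import FigureCanvasAgg

fig = Figure()                       # artist layer: the top-level container
canvas = FigureCanvasAgg(fig)        # backend layer: the surface the figure is rendered on
ax = fig.add_subplot(111)            # artist layer: an Axes inside the figure
line, = ax.plot([1, 2, 3, 4], [1, 4, 9, 16])
ax.set_title('Squares')

canvas.draw()                        # ask the backend to render the artists
width, height = canvas.get_width_height()
print(ax.get_title(), width, height)
```

The pyplot functions used elsewhere in these posts create and manage these same objects for you.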

Having mentioned pyplot, you should also learn about pylab. Pylab is a module that is installed together with Matplotlib, while pyplot on the other hand runs as an internal package in Matplotlib. Your import statements for these two will look like this:

from pylab import *
and
import matplotlib.pyplot as plt
import numpy as np

Pylab allows you to enjoy the benefits of using pyplot and NumPy within the same namespace, without necessarily having to import NumPy as a separate package. If you already have pylab imported, you do not need to prefix the NumPy and pyplot functions with np. or plt., because their names are available directly, much as in MATLAB, as shown below:

Instead of having
plt.plot(x,y)
np.array([1,2,3,4])

You will have
plot(x,y)
array([1,2,3,4])

Essentially, the role of the pyplot package is to give you a simple, function-based way to drive the matplotlib library from Python.

Scatter Plots

on December 27, 2019 with No comments

The role of a scatter plot is to identify the relationship between a couple of variables displayed in a coordinate system. Each data point is identified according to the variable values. From the scatter graph, you can tell whether there is a relationship between the variables or not.

When studying a scatter plot diagram, the direction of the trend tells you the nature of correlation. A positive correlation, for example, is represented by an upward pattern. A scatter plot can also be used alongside a bubble chart. Bubble charts introduce a third variable beyond the two identified in the scatter plot. The size of the bubble around the data points is used to determine the value of the third variable.

In matplotlib, scatter plots are created through the scatter() function. The following commands are used to access the scatter function’s documentation:

$ ipython --pylab
In [1]: help(scatter)

In the example below, we use three parameters: s for the size of the bubbles, alpha for the transparency of the bubbles when plotted on the chart, and c for the colors. The alpha values range from 0 (completely transparent) to 1 (completely opaque). The scatter call looks like this:

plt.scatter(years, cnt_log, c=200 * years, s=20 + 200 * gpu_counts/gpu_counts.max(), alpha=0.5)

The complete program is as follows:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.read_csv('transcount.csv')
df = df.groupby('year').aggregate(np.mean)
gpu = pd.read_csv('gpu_transcount.csv')
gpu = gpu.groupby('year').aggregate(np.mean)
df = pd.merge(df, gpu, how='outer', left_index=True, right_index=True)
df = df.replace(np.nan, 0)
print(df)
years = df.index.values
counts = df['trans_count'].values
gpu_counts = df['gpu_trans_count'].values
cnt_log = np.log(counts)
plt.scatter(years, cnt_log, c=200 * years, s=20 + 200 * gpu_counts/gpu_counts.max(), alpha=0.5)
plt.show()

The output printed in the output window is as follows:

       trans_count gpu_trans_count
year
1971 2.300000e+03     0.000000e+00
1972 3.500000e+03     0.000000e+00
1974 4.533333e+03     0.000000e+00
1975 3.510000e+03     0.000000e+00
1976 7.500000e+03     0.000000e+00
1978 1.900000e+04     0.000000e+00
1979 4.850000e+04     0.000000e+00
1982 9.450000e+04     0.000000e+00
1983 8.500000e+03     0.000000e+00
1984 2.000000e+05     0.000000e+00
1985 1.053333e+05     0.000000e+00
1986 2.500000e+04     0.000000e+00
1988 2.500000e+05     0.000000e+00
1989 7.401175e+05     0.000000e+00
1991 6.900000e+05     0.000000e+00
1993 3.100000e+06     0.000000e+00
1994 5.789770e+05     0.000000e+00
1995 5.500000e+06     0.000000e+00
1996 4.300000e+06     0.000000e+00
1997 8.150000e+06     3.500000e+06
1998 7.500000e+06     0.000000e+00
1999 1.760000e+07     1.350000e+07
2000 3.150000e+07     2.500000e+07
2001 4.500000e+07     5.850000e+07
2002 1.375000e+08     8.500000e+07
2003 1.900667e+08     1.260000e+08
2004 3.520000e+08     1.910000e+08
2005 1.690000e+08     3.120000e+08
2006 6.040000e+08     5.325000e+08
2007 3.716000e+08     7.270000e+08
2008 9.032000e+08     1.179500e+09
2009 3.450000e+09     2.154000e+09
2010 1.511667e+09     2.946667e+09
2011 1.733500e+09     4.312712e+09
2012 2.014826e+09     5.310000e+09
2013 5.000000e+09     6.300000e+09
2014 4.310000e+09     0.000000e+00

The following plot will be obtained:
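The program above needs the files transcount.csv and gpu_transcount.csv. As a stand-alone sketch of the same three parameters, the example below uses made-up numbers (not real transistor data), with a third variable mapped to bubble size in the same way.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Made-up values for illustration only, not real measurements
x = np.arange(1, 11)
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2])
third = np.array([0, 0, 1, 2, 3, 5, 8, 13, 21, 34])    # third variable, shown as bubble size

sizes = 20 + 200 * third / third.max()    # same scaling idea as in the program above
plt.figure()
bubbles = plt.scatter(x, y, c=x, s=sizes, alpha=0.5)

print(bubbles.get_alpha())
print(sizes.min(), sizes.max())
print(np.corrcoef(x, y)[0, 1] > 0)        # an upward trend means positive correlation
```

The offset of 20 keeps points visible even where the third variable is zero.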

Logarithmic Plots (Log Plots)

on December 26, 2019 with No comments

A logarithmic plot is essentially a basic plot, but it is set on a logarithmic scale. The difference between this and a normal linear scale is that equal distances along the axis correspond to equal ratios, so each interval marks an order of magnitude. We have two different types of log plots: the log-log plot and the semi-log plot.

The log-log plot has logarithmic scales on both the x and y axes. In matplotlib, this plot is created with the function matplotlib.pyplot.loglog(). The semi-log plot, on the other hand, uses two different scales. It has a logarithmic scale on one axis and a linear scale on the other. It is created with the following functions:

semilogx() for the x axis, and semilogy() for the y axis.

A straight line in a semi-log plot indicates an exponential law, while a straight line in a log-log plot indicates a power law. The code below uses data on transistor counts over a range of years. We will use it to study the procedure for creating logarithmic plots:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
df = pd.read_csv('transcount.csv')
df = df.groupby('year').aggregate(np.mean)
years = df.index.values
counts = df['trans_count'].values
poly = np.polyfit(years, np.log(counts), deg=1)
print ("Poly", poly)
plt.semilogy(years, counts, 'o')
plt.semilogy(years, np.exp(np.polyval(poly, years)))
plt.show()

Step 1:

Fit a straight line to the logarithm of the counts using the following function:

poly = np.polyfit(years, np.log(counts), deg=1)
print("Poly", poly)

Step 2:

The fit above returns an array of polynomial coefficients, arranged in descending order of power: for a degree-1 fit, the slope comes first, followed by the intercept.

Step 3:

To evaluate the polynomial created, use the NumPy function polyval(). Plot the data and the fitted curve using the y-axis semi-log function as shown:

plt.semilogy(years, counts, 'o')
plt.semilogy(years, np.exp(np.polyval(poly, years)))

Now run the program and you will get the following plot:

Also, Poly [ 3.61559210e-01 -7.05783195e+02] will be printed on the output window.
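The two printed numbers are the slope and intercept of the line fitted to log(count) against year. As a small worked example, the sketch below copies those printed values and derives the doubling time they imply. The variable names and the choice of the year 2000 are made for illustration.

```python
import numpy as np

poly = np.array([3.61559210e-01, -7.05783195e+02])   # coefficients printed above
slope, intercept = poly

# log(count) grows by `slope` per year, so the count doubles every log(2)/slope years
doubling_time = np.log(2) / slope
print(round(doubling_time, 2))   # 1.92

# polyval evaluates slope*year + intercept; exp undoes the log taken before the fit
estimate_2000 = np.exp(np.polyval(poly, 2000))
print(estimate_2000)
```

A doubling time of just under two years is what the straight line in the semi-log plot expresses.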

Basic Matplotlib Plots

on December 25, 2019 with No comments

A simple plot

To create a plot in matplotlib, you use the plot() function from the matplotlib.pyplot subpackage. It gives you a basic plot from x-axis and y-axis values. Optionally, you can also pass format parameters to set the line style you are using. To see the format parameters and options available, the following commands apply:

$ ipython --pylab
In [1]: help(plot)

In the example below, you are creating two distinct lines. The first one uses the default solid line style, while the second one is a dashed line. Study the code snippet below. We will use it to describe the procedure for creating a simple plot.

import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 20)
plt.plot(x, .5 + x)
plt.plot(x, 1 + 2 * x, '--')
plt.show()

Use the following procedure to plot the lines described above:

Step 1:

Determine the x coordinates using linspace(), a NumPy function. The x coordinates start at 0 and end at 20, hence you should have the following function:

x = np.linspace(0, 20)

Step 2:

Plot the lines on your axis in the following order:

plt.plot(x, .5 + x)
plt.plot(x, 1 + 2 * x, '--')

Step 3:

At this point, you have two options. You can save the plot or view it on a screen. The savefig() function is used to save the plot to a file. If you want to view it, the show() function is used. To view the plot on the screen, use the following function:

plt.show()

The plot obtained is shown below:
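The other option mentioned in Step 3, saving, can be sketched as follows. To avoid writing to disk, this version saves into an in-memory buffer, which is an assumption of the sketch; passing a file name such as 'lines.png' (a hypothetical name) to savefig() would write a file instead.

```python
import io
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 20)            # linspace returns 50 points when no count is given
plt.figure()
plt.plot(x, .5 + x)
plt.plot(x, 1 + 2 * x, '--')

buffer = io.BytesIO()             # in-memory stand-in for a file
plt.savefig(buffer, format='png')
print(len(x), len(buffer.getvalue()) > 0)
```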

Fundamentals of Matplotlib

on December 24, 2019 with 1 comment

Let's look at some of the important concepts that you will come across and use in Matplotlib, and their meanings or roles:

Axis – This represents a number line, and is used to determine the graph limits.
Axes – These represent what we construe as plots. A single figure can hold many Axes. Each Axes contains two Axis objects, or three in the case of a 3D plot. Note that every Axes has an x label and a y label.
Artist – Refers to everything that you can see on your figure, for example collection objects, Line2D objects and Text objects. You will notice that most of the Artists are on the Axes.
Figure – Refers to the entire figure you are working on. It might include more than one plot or Axes.
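These four terms map directly onto objects you can inspect. The sketch below is one way to look at them; the layout of one row with two plots is an arbitrary choice, and the Agg backend is an assumption so that no window is needed.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
from matplotlib.artist import Artist
from matplotlib.axis import Axis

fig, (ax1, ax2) = plt.subplots(1, 2)    # one Figure holding two Axes
line, = ax1.plot([0, 1], [0, 1])
ax1.set_xlabel('x')
ax1.set_ylabel('y')

print(len(fig.axes))                                               # Axes held by the Figure
print(isinstance(ax1.xaxis, Axis), isinstance(ax1.yaxis, Axis))    # each Axes owns its Axis objects
print(isinstance(line, Artist), isinstance(ax1, Artist), isinstance(fig, Artist))
```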

Pyplot is a Matplotlib module that allows you to work with simple functions, in the process adding elements like text, images, and lines within the figure you are working on. To start, import the modules as follows:

import matplotlib.pyplot as plt
import numpy as np

There are lots of command functions that you can use to help you work with Matplotlib. Each of these pyplot functions changes figures in one way or the other when executed. The following is a list of the plots you will use in Matplotlib:

● Quiver – Used to create 2D arrow fields
● Step – Used to create a step plot
● Stem – Used to build a stem plot
● Scatter – Creates a scatter plot of x against y
● Stackplot – Used to create a plot for a stacked area
● Plot – Draws lines or markers on your axes
● Polar – Creates a polar plot
● Pie – Creates a pie chart
● Barh – Creates a horizontal bar plot
● Bar – Creates a bar plot
● Boxplot – Creates a box and whisker plot
● Hist – Used to create a histogram
● Hist2d – Used to create a histogram plot in 2D
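As a brief sketch of two entries from this list, the example below calls hist() and bar() on a handful of made-up values. Both functions return objects that describe what was drawn.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Made-up values for illustration only
plt.figure()
counts, edges, patches = plt.hist([1, 2, 2, 3, 3, 3], bins=3)
print(list(counts))        # how many values fell into each of the three bins

plt.figure()
bars = plt.bar(['a', 'b', 'c'], [3, 5, 2])
print(len(bars), bars[1].get_height())
```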

Given that you might be working with images from time to time during data analysis, you will frequently use the following image functions:

● Imshow – Used to show images on your axes
● Imsave – Used to save arrays in the form of an image file
● Imread – Used to read image files into arrays

Now it's time to create a plot but you must first import the Pyplot module from your Matplotlib package before you can create a plot. This is done as shown below:

import matplotlib.pyplot as plt

After importing the module, you introduce arrays into the plot. The NumPy library has predefined array functions that you will use going forward. These are imported as follows:

import numpy as np

With this done, proceed to create the x values using the NumPy library’s arange() function as shown below (the math module, which supplies pi, must be imported as well):

x = np.arange(0, math.pi*2, 0.05)

With this data, you can then proceed to specify the x and y axis labels, and the plot title as shown:

plt.xlabel("angle")
plt.ylabel("sine")
plt.title('sine wave')

To view the window, use the show() function below:

plt.show()

At this juncture, your program should look like this:

from matplotlib import pyplot as plt
import numpy as np
import math #will help in defining pi
x = np.arange(0, math.pi*2, 0.05)
y = np.sin(x)
plt.plot(x,y)
plt.xlabel("angle")
plt.ylabel("sine")
plt.title('sine wave')
plt.show()

Now run the program and see the plot obtained. It should look like this:

Data Visualization with Matplotlib

on December 23, 2019 with No comments

Data visualization is one of the first things you have to perform before you analyze data. The moment you have a glance at some data, your mind creates a rough idea of the way you want it to look when you map it on a graph.

Matplotlib might seem rather complex at first, but with basic coding knowledge, it should be easier for you. We will highlight some of the important concepts that will guide your work going forward.

Plotting data for visualization will need you to work with different data ranges. You might need to work with general or specific data ranges. The whole point behind Matplotlib is to help you work with data with as minimal challenges as possible. As a data analyst, you are in full control over the data you use, hence you must also understand the necessary commands to alter the same.

Remember that the plotting environment in Matplotlib is very similar to MATLAB. Therefore, if you have some experience with MATLAB, you should find things easier here. All the work you do in Matplotlib is organized in a hierarchical manner. At the highest level, you have the state-machine environment provided by pyplot, while at the lowest level you have the object-oriented interface, where pyplot performs only a limited number of functions. At this level, it is up to you to build figures, and from them you can create axes. The axes will serve most, if not all, of your plotting needs.

To install Matplotlib on your machine, run the following commands:

python -m pip install -U pip
python -m pip install -U matplotlib

Alternatively, you can install Matplotlib using the following commands:

pip install matplotlib
xcode-select --install (if you are working on a Mac)
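Once the installation finishes, a quick check from Python confirms that the packages can be imported. This is only a sanity check, and the version numbers printed will depend on your machine.

```python
import matplotlib
import numpy

# Both packages expose their version as a string
print(matplotlib.__version__)
print(numpy.__version__)
```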

There are several dependencies that you might need to install with Matplotlib, including NumPy, and Python itself if it is not already installed on your device. To further enhance your interface output, you might also need to install other packages like Tornado and pycairo.

If you are going to work on animations from time to time, you might need to install ImageMagick or any other packages that could assist you like LaTeX.

How to Avoid Data Contamination

on December 21, 2019 with No comments

From empty data fields to data duplication and invalid addresses, there are so many ways you can end up with contaminated data. Having looked at possible causes and methods of cleaning data, it is important for an expert in your capacity to put measures in place to prevent data contamination in the future. The challenges you experienced in cleaning data could easily be avoided, especially if the data collection processes are within your control.

Looking back to the losses your business suffers in dealing with contaminated data and the resource wastage in terms of time, you can take significant measures to reduce inefficiencies, which will eventually have an impact on your customers and their level of satisfaction.

One of the most important steps today is to invest in the appropriate CRM programs to help in data handling. Having data in one place makes it easier to verify the credibility and integrity of data within your database. The following are some simple methods you can employ in your organization to prevent data contamination, and ensure you are using quality data for decision-making:

● Proper configurations

Irrespective of the data handling programs you use, one of the most important things is to make sure you configure applications properly. Your company could be using CRM programs or simple Excel sheets. Whichever the case, it is important to configure your programs properly. Start with the critical information. Make sure the entries are accurate and complete. One of the challenges of incomplete data is that there is always the possibility that someone could complete them with inaccurate data to make them presentable, when this is not the real picture.

Data integrity is just as important, so make sure you have the appropriate data privileges in place for anyone who has to access critical information. Set the correct range for your data entries. This way, anyone keying in data will be unable to enter incorrect data not within the appropriate range. Where possible, set your system up such that you can receive notifications whenever someone enters the wrong range, or is struggling, so that you can follow up later on and ensure you captured the correct data.
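The range check described above can be sketched in a few lines. The field names and limits below are hypothetical, and a real CRM or spreadsheet would have its own way of configuring such rules.

```python
# Hypothetical fields and limits, for illustration only
ALLOWED_RANGES = {'age': (0, 120), 'discount_percent': (0, 100)}

def check_entry(field, value, allowed=ALLOWED_RANGES):
    """Return None for an acceptable value, or a message describing the problem."""
    low, high = allowed[field]
    if low <= value <= high:
        return None
    return f'{field}={value} is outside the allowed range {low} to {high}'

print(check_entry('age', 34))     # acceptable entry
print(check_entry('age', 340))    # flagged entry
```

The returned message is the kind of notification that could be logged or sent to whoever follows up on incorrect entries.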

● Proper training

Human error is one of a data analyst’s worst nightmares when trying to prevent data contamination. Other than innocent mistakes, many errors from human entry are usually about context. It is important that you train everyone handling data on how to go about it. This is a good way to improve accuracy and data integrity from the foundation - data entry.

Your team must also understand the challenges you experience when using contaminated data, and more importantly why they need to be keen at data entry. If you are using CRM programs, make sure they understand different functionality levels so they know the type of data they should enter.

Another issue is how to find the data they need. When under duress, most people key in random or inaccurate data to get some work done or bypass some restrictions. By training them on how to search for specific data, it is easier to avoid unnecessary challenges with erroneous entries. This is usually a problem when you have new members joining your team. Ensure you train them accordingly, and encourage them to ask for help whenever they are unsure of anything.

● Entry formats

The data format is equally important as the desired level of accuracy. Think about this from a logical perspective. If someone sends you a text message written in all capital letters, you will probably disregard it or be offended by the tone of the message. However, if the same message is sent with proper formatting, your response is more positive.

The same applies to data entry. Try and make sure that everyone who participates in data handling is careful enough to enter data using the correct format. Ensure the formats are easy to understand, and remind the team to update data they come across if they realize it is not in the correct format. Such changes will go a long way in making your work easier during analysis.

● Empower data handlers

Beyond training your team, you also need to make sure they are empowered and aware of their roles in data handling. One of the best ways of doing this is to assign someone the data advocacy role. A data advocate is someone whose role is to ensure and champion consistency in data handling. Such a person will essentially be your data administrator. Their role is usually important, especially when implementing new systems. They come up with a plan to ensure data is cleaned and organized. One of their deliverables should include proper data collection procedures to help you improve the results obtained from using the data in question.

● Overcoming data duplication

Data duplication happens in so many organizations because the same data is processed at different levels. Duplication might eventually see you discard important and accurate data accidentally, affecting any results derived from the said data. For example, ensure your team searches for specific items before they create new ones. Provide an in-depth search process that increases the search results and reduces the possibility of data duplication. For example, beyond looking for a customer’s name, the entry should also include contact information.

Provide as many relevant fields that can be searched into, thereby increasing the possibility of arresting and avoiding duplicates. You can find data for a customer named COVRI Solutions PVT LTD in different databases labeled as COVRI SOLUTIONS P LTD or COVRI Solutions PVT LT. The moment you come across such duplicates, the last thing you want to do is to eliminate them from the database. Instead, investigate further to ascertain the similarities and differences between the entries.

Consult, verify, and update the correct entry accordingly. Alternatively, you can escalate such issues to your data advocate for further action. At the same time, put measures in place that scan your database and warn users whenever they are about to create a duplicate entry.
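One way to surface near-duplicates like the company-name variants above is to compare a normalized version of the name together with a second field. The following is only a minimal sketch: the customers table, the phone numbers, the name_key helper, and the list of suffixes to ignore are all assumptions made for illustration.

```python
import re
import pandas as pd

# Hypothetical customer records; the phone numbers are made up.
customers = pd.DataFrame({
    "name": [
        "COVRI Solutions PVT LTD",
        "COVRI SOLUTIONS P LTD",
        "COVRI Solutions PVT LT",
        "Acme Traders",
    ],
    "phone": ["555-0101", "555-0101", "555-0101", "555-0199"],
})

# Assumed list of company-suffix tokens to ignore when comparing names.
SUFFIXES = {"pvt", "p", "ltd", "lt", "limited"}

def name_key(name: str) -> str:
    """Lowercase the name, strip punctuation, and drop suffix tokens."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)

customers["name_key"] = customers["name"].map(name_key)

# keep=False marks every member of a duplicate group, so nothing is
# deleted here; the rows are only listed for manual review.
possible_dupes = customers[
    customers.duplicated(subset=["name_key", "phone"], keep=False)
]
print(possible_dupes)
```

The flagged rows are candidates only. As described above, they should be investigated and verified before anything is merged or removed.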

● Data filtration

Perhaps one of the best solutions would be cleaning data before it gets into your database. A good way of doing this would be creating clear guidelines on the correct data format to use. With such procedures in place, you have an easier time handling data. If all the conditions are met, you will handle most data cleaning at the entry point instead of once the data is in your database, making your work easier.

Create filters to determine the right data to collect and the data that can be updated later. It doesn’t make sense to collect a lot of information to give you the illusion of a complete and elaborate database, when in reality very little of what you have is relevant to your cause.

The misinformation that arises from inaccurate data can be avoided if you take the right precautionary measures in data handling. Data security is also important, especially if you are using data sources where lots of other users have access. Restrict access to data where possible, and make sure you create different access privileges for all users.

How to Clean Data

on December 19, 2019 with No comments

Having gone through the procedures described in the previous post and identified unclean data, your next challenge is how to clean it and use accurate data for analysis.

You have five possible alternatives for handling such a situation:

● Data imputation

If you are unable to find the necessary values, you can impute them by filling in the gaps for the inaccurate values. The closest explanation for imputation is that it is a clever way of guessing the missing values, but through a data-driven scientific procedure. Some of the techniques you can use to impute missing data include stratification and statistical indicators like mode, mean and median. If you have studied the data and identified unique patterns, you can stratify the missing values based on the trend identified. For example, men are generally taller than women. You can use this presumption to fill in missing values based on the data you already have.

The most important thing, however, is to try and seek a second opinion on the data before imputing your new values. Some datasets are very critical, and imputing might introduce a personal bias which eventually affects the outcome.
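As a rough illustration of stratified imputation, the sketch below fills missing heights with the median of each group rather than the overall median. The people table, its column names, and the values are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with two missing heights.
people = pd.DataFrame({
    "sex": ["M", "M", "M", "F", "F", "F"],
    "height_cm": [178.0, np.nan, 182.0, 164.0, 168.0, np.nan],
})

# Median height per group, aligned row by row with the original table.
group_median = people.groupby("sex")["height_cm"].transform("median")

# Fill each gap with the median of the group the row belongs to.
people["height_imputed"] = people["height_cm"].fillna(group_median)
print(people)
```

Keeping the imputed values in a separate column makes it easy to show a reviewer exactly which entries were guessed.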

● Data scaling

Data scaling is a process where you change the range of the data so that variables fall within a comparable range. Without this, some algorithms give more weight to variables whose values happen to be larger. For example, the age of a sample population generally exists within a much smaller range than the population of a city. Some algorithms will give the population priority over age, and might ignore the age variable altogether.

By scaling such entries, you maintain a proportional relationship between different variables, ensuring that they are within a similar range. A simple way of doing this is to use a baseline for the large values, or use percentage values for the variables.
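A common way to bring variables into a similar range is min-max scaling, which maps every column onto the interval from 0 to 1. This is one possible technique, not the only one, and the data below is made up for illustration.

```python
import pandas as pd

# Hypothetical variables on very different scales.
data = pd.DataFrame({
    "age": [20, 35, 50],
    "city_population": [100_000, 550_000, 1_000_000],
})

# Min-max scaling: subtract the column minimum, divide by the column range.
scaled = (data - data.min()) / (data.max() - data.min())
print(scaled)
```

After scaling, both columns run from 0 to 1, so neither dominates purely because of its units.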

● Correcting data

Correcting data is a far better alternative than removing data. This involves intuition and clarification. If you are concerned about the accuracy of some data, getting clarification can help allay your fears. With the new information, you can fix the problems you identified and use data you are confident about in your analysis.

● Data removal

One of the first things you could think about is to eliminate the missing entries from your dataset. Before you do this, it is advisable that you investigate to determine why the entries are missing. In some cases, the best option is to remove the data from your analysis altogether. If, for example, more than 80% of the entries in a row are missing and you cannot replace them from any other source, that row will not be useful to your analysis. It makes sense to remove it.

Data removal comes with caveats. If you have to eliminate any data from your analysis, you must give a reason for this decision in a report accompanying your analysis. This is important so as to safeguard yourself from claims of data manipulation or doctoring data to suit a narrative. Some types of data are irreplaceable, so you must consult experts in the associated fields before you remove them. Most of the time, data removal is applied when you identify duplicates in the data, especially if removing the duplicates does not affect the outcome of your analysis.

● Flagging data

There are situations where you have columns missing some values, but you cannot afford to eliminate all of them. If you are working with numeric data, a reprieve would be to introduce a new column where you indicate all the missing values. The algorithm you are using should identify these values as such. In case the flagged values are necessary in your analysis, you can impute them or find a better way to correct them then use them in your analysis. In case this is not possible, make sure you highlight this in your report.
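Flagging can be as simple as adding a boolean column next to the one with gaps. The sketch below assumes a hypothetical income column; the names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric column with gaps.
readings = pd.DataFrame({"income": [52000.0, np.nan, 61000.0, np.nan]})

# New column that records which values were missing in the source data.
readings["income_missing"] = readings["income"].isna()
print(readings)
```

The flag survives even if the gaps are imputed later, so the report can still state how many values were originally missing.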

Cleaning erroneous data can be a difficult process. A lot of data scientists generally hope to avoid it, especially since it is time-consuming. However, it is a necessary process that will bring you closer to using appropriate data for your analysis. Remember that the main objective is to use clean data that will give you the closest reflection of the true picture of events.

Identify Inaccurate Data

on December 19, 2019 with No comments

More often than not, you need to make a judgement call to determine whether the data you are accessing is accurate or not.

As you go through data, you must make a logical decision based on what you see. The following are some factors you should think about:

● Study the range

First, check the range of data. This is usually one of the easiest problems to identify. Let’s say you are working on data for primary school kids. You know the definitive age bracket for the students. If you identify age entries that are either too young or too old for primary school kids whose data you have, you need to investigate further.

Essentially what you are doing here is a max-min check. With these ranges in mind, you can skim through data and identify erroneous entries. Skimming through is easy if you are working with a few entries. If you have thousands or millions of data entries, a few lines of code that compute the minimum and maximum can help you identify the wrong entries in an instant. You can also plot the data on a graph and visually detect the values that don’t fall within the expected distribution pattern.
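A range check of this kind might look like the sketch below. The pupils table and the age limits of 5 and 14 are assumptions chosen for the example, so substitute the bracket that applies to your own data.

```python
import pandas as pd

# Hypothetical primary school records.
pupils = pd.DataFrame({
    "name": ["Ann", "Ben", "Cal", "Dee", "Eli"],
    "age": [7, 9, 42, 2, 11],
})

# Assumed valid age bracket for this illustration.
MIN_AGE, MAX_AGE = 5, 14

# Quick overview of the extremes.
print(pupils["age"].min(), pupils["age"].max())

# Rows whose age falls outside the bracket (between() includes both ends).
out_of_range = pupils[~pupils["age"].between(MIN_AGE, MAX_AGE)]
print(out_of_range)
```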

● Investigate the categories

How many categories of data do you expect? This is another important factor that will help you determine whether your data is accurate or not. If you expect a dataset with nine categories, anything less is acceptable, but not more. If you have more than nine categories, you should investigate to determine the legitimacy of the additional categories. Say you are working with data on marital status, and your expected options are single, married, divorced, or widowed. If the data has six categories, you should investigate to determine why there are two more.
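To find out which categories are unexpected, you can compare the unique values in the column against the set you expect. The values below are invented; the extra entries stand in for the kind of stray categories you might find.

```python
import pandas as pd

# Categories we expect to see.
expected = {"single", "married", "divorced", "widowed"}

# Hypothetical column with two stray categories.
status = pd.Series(["single", "Married", "married", "divorced", "widowed", "n/a"])

# Anything present in the data but absent from the expected set.
unexpected = set(status.unique()) - expected
print(unexpected)
```

Here the extras are a capitalization variant and a placeholder, which suggests a consistency problem rather than genuinely new categories.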

● Data consistency

Look at the data in question and ensure all entries are consistent. In some cases, inaccuracies appear as a result of inconsistency. This is common when working with percentages. Percentages can either be fed into data sets as basis points or decimal points. If you have data that has both sets of entries, they might be incompatible.

● Inaccuracies across multiple fields

This is perhaps one of the most difficult challenges you will overcome when cleaning inaccurate data. The following entries, for example, are valid individually. A 4-year old girl is a valid age entry. 5 children is also a valid entry. However, a datapoint that depicts Grace as a 4-year old girl with 5 children is absurd. You would need to check for inconsistencies and inaccuracies in several rows and columns.
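Cross-field checks are usually written as rules that combine two or more columns. The sketch below uses a single assumed rule (anyone under 12 should have no children); the threshold, table, and column names are hypothetical.

```python
import pandas as pd

# Hypothetical records; each field is valid on its own.
records = pd.DataFrame({
    "name": ["Grace", "Amina", "Tom"],
    "age": [4, 34, 29],
    "children": [5, 2, 0],
})

# Assumed rule for this illustration only.
MIN_PARENT_AGE = 12

# Rows where the combination of fields is implausible.
suspicious = records[(records["age"] < MIN_PARENT_AGE) & (records["children"] > 0)]
print(suspicious)
```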

● Data visualization

Plotting data in visual form is one of the easiest ways of identifying abnormal distributions or any other errors in the data. Say you are working with data whose visualization should result in a bimodal distribution, but when you plot the data you end up with a normal distribution. This would immediately alert you that something is not right, and you need to check your data for accuracy.

● Number of errors in your data set

Having identified the unique errors in the data set, you must enumerate them. Enumeration will help you make a final decision on how and whether to use the data. How many errors are there? If you have more than half of the data as inaccurate, it is obvious that your presentation would be greatly flawed. You must then follow up with the individuals who prepared the data for clarification or find an alternative.

● Missing entries

A common data concern that data analysts deal with is working with datasets missing some entries. Missing entries is relative. If you are missing two or three entries, this should not be a big issue. However, if your data set is missing many entries, you have to find out the reason behind this. Missing entries usually happen when you are collating data from multiple sources, and in the process some of the data is either deleted, overwritten, or skipped. You must investigate the missing entries because the answer might help you determine whether you are missing only a few entries that might be insignificant going forward, or important entries whose absence affects the outcome.

Data Cleaning

on December 18, 2019 with No comments

Data cleaning is one of the most important procedures you should learn in data analysis. You will constantly be working with different sets of data and the accuracy or completeness of the same is never guaranteed. Because of this reason, you should learn how to handle such data and make sure the incompleteness or errors present do not affect the final outcome.

Why should you clean data, especially if you did not produce it in the first place?

Using unclean data is a sure way to get poor results. You might be using a very powerful computer capable of performing calculations at a very high speed, but what it lacks is intuition. Because of this, you must make a judgement call each time you go through a set of data. In data analysis, your final presentation should be a reflection of the reality in the data you use. For this reason, you must eliminate any erroneous entries.

Possible Causes of Unclean Data

One of the most expensive overheads in many organizations is data cleaning. Unclean data is present in different forms. Your company might suffer in the form of omissions and errors present in the master data you need for analytical purposes. Since this data is used in important decision-making processes, the effects are costly. By understanding the different ways dirty data finds its way into your organization, you can find ways of preventing it, thereby improving the quality of data you use.

In most instances, automation is applied in data collection. Because of this, you might experience some challenges with the quality or consistency of the data collected. Since data is often obtained from different sources, it must be collated into one file before processing. It is during this process that concerns about the integrity of the data might arise. The following are some explanations as to why you have unclean data:

● Incomplete data

The problem of incomplete data is very common in most organizations. When using incomplete data, you end up with many important parts of the data blank. For example, if you are yet to categorize your customers according to the target industry, it is impossible to create a segment in your sales report according to industry classification. This important part of your data analysis will be missing, so your efforts will be futile, or expensive in terms of the time and resources invested before you get complete and appropriate data.

● Errors at input

Most of the mistakes that lead to erroneous data happen at data entry points. The individual in charge might enter the wrong data, use the wrong formula, misread the data, or innocently mistype it. In the case of open-ended reports like questionnaires, the respondents might input data with typos or use words and phrases that computers cannot decipher appropriately. Human error at input points is always the biggest challenge in data accuracy.

● Data inaccuracies

Inaccurate data is in most cases a matter of context. You could have the correct data, but for the wrong purpose. Using such data can have far-reaching effects, most of which are very costly in the long run. Think about the example of a data analyst preparing a delivery schedule for clients, but the addresses are inaccurate. The company could end up delivering products to their customers, but with the wrong address details. As a matter of context, the company does have the correct addresses for their clients, but they are not matched correctly.

● Duplicate data

In cases where you collect data from different sources, there is always a high chance of data duplication. You must have a lot of checks in place to ensure that duplicates are identified. For example, one report might list student scores under Results, while another will have them under Performance. The data under these tags will be similar, but your system will consider them as two independent entities.

● Problematic sensors

Unless you are using a machine that periodically checks for errors and corrects them or alerts you, it is possible to encounter errors as a result of problematic sensors. Machines can be faulty or break down too, which increases the likelihood of a problematic data entry.

● Incorrect data entries

An incorrect entry will always deliver the wrong result. Incorrect entry happens when your dataset includes entries that are not within the acceptable range. For example, data for the month of February should range from 1 to 28 or 29. If you have data for February ranging up to 31, there is definitely an error in your entries.

● Data mangling

If at your data entry point you use a machine with problematic sensors, it is possible to record erroneous values. You might be recording people’s ages, and the machine inputs a negative figure. In some cases, the machine could actually record correct data, but between the input point and the data collection point the data might be mangled, hence the erroneous results. If you are accessing data over a public internet connection, a network outage during data transmission might also affect the integrity of the data.

● Standardization concerns

For data obtained from different sources, one of the concerns is often how to standardize the data. You should have a system or method in place to identify similar data and represent them accordingly. Unfortunately, it is not easy to manage this level of standardization. As a result, you end up with erroneous entries. Apart from data obtained from multiple sources, you can also experience challenges dealing with data obtained from the same source. Everyone inputs data uniquely, and this might pose a challenge at data analysis.

Data Manipulation

on December 17, 2019 with No comments

By this point, you are aware of how to draw summaries from the data in your possession. Beyond this, you should learn how to slice, select, and extract data from your DataFrame. I mentioned earlier that DataFrames and Series share many similarities, especially in the methods used on them. However, their attributes are not similar. Therefore you must be keen to make sure you are using the right attributes, or you will end up with attribute errors.

To extract a column, you use square brackets as shown below:

position_col = squad_df['position']
type(position_col)

You will get the output below:

pandas.core.series.Series

The result is a Series. However, if you need to return the column as a DataFrame, you must pass the column name inside a list as shown below:

position_col = squad_df[['position']]
type(position_col)

You will get the output below:

pandas.core.frame.DataFrame

Passing a list of column names inside the square brackets is what returns a DataFrame. To select an additional column, add its name to the list as follows:

subset = squad_df[['position', 'earnings']]
subset.head()

You should get the output below:

Next, we will look at how to call data from your DataFrame using rows. You can do this using any of the following means:

● Locating the name (.loc)
● Locating the numerical index (.iloc)

Since our DataFrame is still indexed by the Teams column, we use .loc and pass it the name of the team as shown below:

eve = squad_df.loc["Everton"]
eve

Another option is to use .iloc for the numerical index of Everton as shown below:

eve = squad_df.iloc[1]

Slicing with .iloc works in the same way that you slice lists in Python: the item at the end index is excluded.
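The difference between the two accessors is easiest to see when slicing. A label slice with .loc includes both end labels, while a positional slice with .iloc excludes the end position. The small squad_df below is a stand-in built for this sketch; its teams and values are assumptions, not the real dataset.

```python
import pandas as pd

# Stand-in for the squad DataFrame, indexed by team name.
squad_df = pd.DataFrame(
    {"position": [1, 2, 3, 4], "earnings": [150, 120, 110, 90]},
    index=pd.Index(
        ["Manchester United", "Everton", "Arsenal", "Swansea"], name="Teams"
    ),
)

# Label-based slice: both "Everton" and "Arsenal" are included.
by_label = squad_df.loc["Everton":"Arsenal"]

# Position-based slice: rows 1 and 2 are returned, row 3 is excluded.
by_position = squad_df.iloc[1:3]

print(by_label)
print(by_position)
```

Both slices return the same two rows here, even though the end point is written differently in each case.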

Describing Variables

on December 16, 2019 with No comments

There is so much more information you can get from your DataFrames. A summary of the continuous variables can be arrived at using the following syntax:

squad_df.describe()

This will return summary statistics for the continuous (numeric) columns. This information is useful when you are uncertain about the kind of plot to use for visual representation. .describe() is also a useful method on a categorical column, where it returns the number of rows, the number of unique categories, and the top category with its frequency:

squad_df['position'].describe()

The syntax above will return an output in the following format:

count xx
unique xx
top xx
freq xx
Name: position, dtype: object

What we can deduce from this output is the number of rows with a value (count), the number of unique values in the selected column (unique), the most common value (top), and the number of times that top value shows up (freq). To determine the frequency of the values in the position column, limited here to the ten most frequent, you use the syntax below:

squad_df['position'].value_counts().head(10)

You can also find out the relationship between different continuous variables using the .corr() syntax as shown below:

squad_df.corr()

The output is a correlation table that represents different relationships in your dataset. You will notice positive and negative values in the output table. Positive results show a positive correlation between the variables. This means that one variable rises as the other rises, and falls as the other falls. Negative results show an inverse correlation between the variables. This means that one variable will rise as the other falls. A perfect positive correlation is represented by 1.0, which is what you will always see for each column against itself.
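Correlation only makes sense for numeric columns, and recent pandas versions raise an error if text columns are included. One way around this, sketched below with made-up data and column names, is to select the numeric columns first.

```python
import pandas as pd

# Hypothetical team statistics with one text column.
stats = pd.DataFrame({
    "goals_for": [10, 20, 30, 40],
    "goals_against": [40, 30, 20, 10],
    "manager": ["A", "B", "C", "D"],
})

# Keep only numeric columns, then compute pairwise correlations.
corr = stats.select_dtypes(include="number").corr()
print(corr)
```

In this toy data the two goal columns move in exactly opposite directions, so their correlation is -1, while each column against itself gives 1.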

Data Imputation

on December 13, 2019 with No comments

Imputation is a cleaning process that allows you to maintain valuable data in your DataFrames, even if they have null values. This is important in situations where eliminating rows that contain null values might eliminate a lot of data from your dataset. Instead of losing all values, you can use the median or mean of the column in place of the null value.

Using the example above, assume there is a new column for the earnings from gate receipts that the clubs collected over the season, and that some values are missing in that revenue column. To begin, you must extract the revenue column and store it in a variable. This is done as shown below:

earnings = squad_df['earnings_billions']

Take note that when you are selecting columns to use from a DataFrame, you must enclose them with square brackets as shown above. To handle the missing values, we can use the mean as follows:

earnings_mean = earnings.mean()
earnings_mean

The output is the mean of all the non-null values in the column. Once you have this, you use it to replace the null values with fillna() as shown below:

earnings.fillna(earnings_mean, inplace=True)

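This will replace all the null values in the earnings Series with the mean of that column. The argument inplace=True modifies the Series directly instead of returning a new one. Whether the change also reaches the original squad_df depends on your pandas version and its copy-on-write setting, so assigning the filled column back to squad_df is the more reliable approach.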
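A variant that does not rely on inplace=True is to assign the filled column back to the DataFrame. The sketch below uses a small stand-in for squad_df with invented values.

```python
import numpy as np
import pandas as pd

# Stand-in for squad_df with two missing earnings values.
squad_df = pd.DataFrame(
    {"earnings_billions": [2.0, np.nan, 4.0, np.nan]},
    index=["Manchester United", "Everton", "Arsenal", "Swansea"],
)

# mean() skips null values by default.
earnings_mean = squad_df["earnings_billions"].mean()

# Assign the filled column back so the DataFrame itself is updated.
squad_df["earnings_billions"] = squad_df["earnings_billions"].fillna(earnings_mean)
print(squad_df)
```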

Computation with Missing Values

on December 12, 2019 with No comments

One thing you can be certain about as a data analyst is that you will not always come across complete sets of data. Since data is collected by different people, they might not use the same conventions you prefer. Therefore, you can always expect to bump into some challenges with missing values in datasets.

In Python, missing values typically show up as None or as NumPy's np.nan. Since you must proceed with your work, you must learn how to handle such scenarios. You have two options: either replace the null values with non-null values or eliminate the rows or columns that have null values.

First, you must determine the number of null values present in each column within your dataset. You can do this in the syntax below:

squad_df.isnull()

The result is a DataFrame that has True or False in each cell, in relation to the null status of the cell in question. From here, you can also determine the number of null returns in every column through an aggregate summation function as shown below:

squad_df.isnull().sum()

The result will list all the columns, and the number of null values in each. To eliminate null values from your data, you have to be careful. It is only advisable to eliminate such data if you have deep knowledge of the explanation behind the null values. Besides, it is only advisable to eliminate null data if you are missing a small amount. This should not have a noteworthy effect on the data. The following syntax will help you eliminate null data from your work:

squad_df.dropna()

The syntax above eliminates all rows with at least one null value from your dataset. Note that it returns a new DataFrame and leaves the original DataFrame you have been using unchanged.

The problem with this operation is that it will eliminate data from the rows with null values. However, some of the columns might still contain some useful information in the eliminated rows. To circumvent this challenge, we must learn how to perform imputation on such datasets.

Instead of eliminating rows, you can choose to eliminate columns that contain null values. This is done by passing axis=1 to dropna():

squad_df.dropna(axis=1)

What is the explanation behind the axis=1 argument? Why does it have to be 1 in order to work for columns? To understand this, we take a closer look at the .shape output discussed earlier.

squad_df.shape

Output
(20, 2)

In the example above, .shape returns the dimensions of the DataFrame as a tuple of 20 rows and 2 columns. In this tuple, rows sit at index zero, while columns sit at index one. This is why axis=1 works on columns.
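The steps in this post can be tried end to end on a tiny table. The DataFrame below is a made-up example with a single missing value, used only to show how the row and column variants of dropna() differ.

```python
import numpy as np
import pandas as pd

# Hypothetical table with one null value in the earnings column.
df = pd.DataFrame({"position": [1, 2, 3], "earnings": [2.0, np.nan, 4.0]})

# Number of null values in each column.
null_counts = df.isnull().sum()

# Default (axis=0): drop every row that contains a null value.
rows_dropped = df.dropna()

# axis=1: drop every column that contains a null value.
cols_dropped = df.dropna(axis=1)

print(null_counts)
print(rows_dropped.shape, cols_dropped.shape)
```

Dropping rows loses one observation, while dropping columns loses the whole earnings variable, which is why the choice depends on how much data is missing and where.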

Cleaning Data in a Column

on December 11, 2019 with No comments

We often come across datasets whose column names are inconsistent: they contain typos, spaces, and a mixture of upper and lower-case words.

Cleaning up these columns will make it easier for you to choose the correct column for your computations.

In the example shown in the previous post, the syntax below will help us print the column names:

squad_df.columns

You will have the following output:

Index(['Position', 'Designation'])

Once you have this information, you can use the .rename() method to rename some or all of the columns in your data. Column names that contain spaces or parentheses are awkward to work with, so we will rename them.

Assuming the Designation column was named Designation (Next Season), you would rename it as follows:

squad_df.rename(columns={
    'Designation (Next Season)': 'Designation_next_season',
}, inplace=True)
squad_df.columns

Our output would look like this:

Index(['Position', 'Designation_next_season'])

You can also use the same process to change the column names from upper to lower case without having to type each one individually. A list comprehension will do this for you instead of manually changing each name in the column list, as shown below:

squad_df.columns = [col.lower() for col in squad_df]
squad_df.columns

You will have the following output:

Index(['position', 'designation_next_season'])

Over time, you will use a lot of dict and list operations in Pandas. To make your work easier, it is advisable to do away with special characters and use lowercase names instead. You should also use underscores instead of spaces.
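All of these conventions can be applied in one pass with the string methods available on the column index. The column names below are invented to show a few typical problems; this is a sketch, not the only way to do it.

```python
import pandas as pd

# Hypothetical messy column names.
df = pd.DataFrame(columns=["Position", " Designation (Next Season) ", "Earnings $"])

df.columns = (
    df.columns.str.strip()                           # remove outer spaces
    .str.lower()                                     # lowercase everything
    .str.replace(r"[^a-z0-9]+", "_", regex=True)     # special characters to "_"
    .str.strip("_")                                  # trim leftover underscores
)
print(list(df.columns))
```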

Dealing with Duplicates

on December 10, 2019 with No comments

The example we used in the previous post does not have any duplicate rows, so to learn how to identify duplicates we first need to create some. We can do this by appending the squad DataFrame to itself, which doubles it. The DataFrame.append() method was removed in pandas 2.0, so pd.concat() is used here:

temp_df = pd.concat([squad_df, squad_df])
temp_df.shape

Our output will be as follows:

(40, 2)

pd.concat() returns a copy of the data without altering the initial DataFrame. We store the result in temp_df so that the real data stays untouched. In order to do away with the duplicates, we can use the following method:

temp_df = temp_df.drop_duplicates()
temp_df.shape

Our output will be as follows:

(20, 2)

The drop_duplicates() method also returns a new copy of the DataFrame rather than modifying the original, this time with the duplicates removed. In the same example, .shape confirms that the dataset is back to the 20 rows present in the original file.

In Pandas, the keyword argument inplace is used to modify the DataFrame object itself as shown below:

temp_df.drop_duplicates(inplace=True)

The syntax above modifies temp_df directly instead of returning a new DataFrame. The drop_duplicates() method also accepts a keep argument, which takes one of the following values:

● 'first' – Eliminates all duplicates other than the first one.
● 'last' – Eliminates all duplicates other than the last one.
● False – Eliminates all duplicates.

In the examples we used above, the keep argument has not been defined. When keep is not passed, it defaults to 'first'. What this means is that if you have two duplicate rows, Pandas will maintain the first one but do away with the second.

If you use 'last', Pandas will drop the first row but maintain the second one. Using keep=False, however, will eliminate all the duplicates: if two rows are identical, both of them are dropped. Let’s look at an example using temp_df below:

temp_df = pd.concat([squad_df, squad_df])  # generate a fresh copy
temp_df.drop_duplicates(inplace=True, keep=False)
temp_df.shape

We will have the output below:

(0, 2)

In the above example, we doubled the squad DataFrame, so every row has a duplicate. As a result, keep=False eliminated all the rows, leaving us with zero rows. This might sound absurd, but it is actually a useful technique for separating the rows that are duplicated from the rows that are unique in the dataset you are working on.
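The effect of keep is easier to see when only some rows are duplicated. The sketch below builds a small stand-in for squad_df (the values are invented), repeats one row, and compares the options. It also uses duplicated(keep=False) to list the duplicated rows instead of dropping them.

```python
import pandas as pd

# Stand-in for the squad DataFrame.
squad_df = pd.DataFrame(
    {"position": [1, 2], "designation": ["A", "B"]},
    index=["Manchester United", "Everton"],
)

# Repeat the first row so that exactly one row has a duplicate.
temp_df = pd.concat([squad_df, squad_df.iloc[[0]]])

keep_first = temp_df.drop_duplicates()            # default keep='first'
keep_none = temp_df.drop_duplicates(keep=False)   # drop every duplicated row
dupes = temp_df[temp_df.duplicated(keep=False)]   # list all duplicated rows

print(keep_first.shape, keep_none.shape, dupes.shape)
```

Note that duplicates are judged on the column values, so the repeated index label alone does not make a row a duplicate.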

Extracting Information from Data

on December 09, 2019 with No comments

The .info() command will help you derive information from your data sets. The syntax is as follows:

squad_df.info()

You will have the following output:

<class 'pandas.core.frame.DataFrame'>
Index: 20 entries, Manchester United to Swansea
Data columns (total 2 columns):
Position       20 non-null int64
Designation    20 non-null object
dtypes: int64(1), object(1)
memory usage: 35.7+ KB

The .info() command will deliver all the important information you need about the dataset, including how many non-null values are available, the number of columns and rows, memory used by the DataFrame, and the type of data available in every column.

The dataset you are using might contain missing values in some columns. You will need to learn how to address these, to help in cleaning the data for final presentation.

Why do you need to determine the datatype?

Without this, you might struggle to interpret data correctly. If, for example, you are using a JSON file but the integers are stored as strings, most of your numeric operations will not work, because you cannot perform mathematical computations on strings. This is why .info() is useful: it tells you the kind of content present in every column.

The .shape attribute can also help you because it delivers the tuple of rows and columns in the dataset. In the example above, you can have it as follows:

squad_df.shape

Your output will be as follows:

(20, 2)

It is also important to remember that there are no parentheses used in the .shape attribute. It basically returns the tuple format for rows and columns. In the example above, we have 20 rows and 2 columns in the squad DataFrame. As you work with different sets of data, you will use the .shape attribute a lot to transform and clean data.
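When a numeric column turns out to be stored as strings, it can usually be converted before any computation. The sketch below uses pd.to_numeric on a made-up column; real data may need extra handling for values that cannot be parsed.

```python
import pandas as pd

# Hypothetical column where the integers were loaded as strings.
df = pd.DataFrame({"position": ["1", "2", "3"]})
print(df.dtypes)

# Convert the strings to numbers so arithmetic works as expected.
df["position"] = pd.to_numeric(df["position"])
print(df.dtypes)
print(df["position"].sum())
```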

Obtaining Data from SQL Databases

on December 06, 2019 with No comments

Before you begin, make sure the Python library needed to connect to your database is available. Python ships with the sqlite3 module, which you can use to establish a connection with a SQLite database. Once the connection is established, you can pass a SELECT query to Pandas and get a DataFrame back. First, create the connection as follows:

import sqlite3
import pandas as pd

con = sqlite3.connect("database.db")

Using our car dealership example from the previous posts, the SQL database will have a table named sales, which includes an index column. We can read from the database using the command below:

df = pd.read_sql_query("SELECT * FROM sales", con)
df

You will have the following output:

Just as we did with the CSV files, you can also set the index column explicitly as follows:

df = df.set_index('index')
df

You will have the output below:

Once you are done with your data, you need to save it in a file format that is relevant to your needs. In Pandas, you can write a DataFrame to any of the formats discussed above in much the same way that you read them:

df.to_csv('new_sales.csv')
df.to_sql('new_sales', con)
df.to_json('new_sales.json')
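To try the read and write steps without an existing database file, you can use an in-memory SQLite database. The sketch below is self-contained; the sales table contents are invented, and ":memory:" replaces the database.db file used above.

```python
import sqlite3
import pandas as pd

# In-memory database, so nothing is written to disk.
con = sqlite3.connect(":memory:")

# Hypothetical sales data.
sales = pd.DataFrame({"model": ["Sedan", "Hatchback"], "units": [3, 5]})

# Write the table, storing the DataFrame index in a column named "index".
sales.to_sql("sales", con, index_label="index")

# Read it back and restore that column as the index.
df = pd.read_sql_query("SELECT * FROM sales", con).set_index("index")
con.close()

print(df)
```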

In data analysis, there are lots of methods that you can employ when using DataFrames, all of which are important to your analysis. Some operations are useful in performing simple data transformations, while others are necessary for complex statistical approaches.

In the examples below, we will use a dataset from the English Premier League:

squad_df = pd.read_csv("EPL-Data.csv", index_col="Teams")

As we load this dataset from the CSV file, we use the Teams column as our index. To view the data, start by printing out the first few rows as follows:

squad_df.head()

You will have the following output:

.head() will by default print the first five rows of your DataFrame. However, if you need more rows displayed, you can input a specific number to be printed as follows:

squad_df.head(7)

This will output the top seven rows as shown below:

In case you need to display only the last rows, use the .tail() syntax. You can also input a specific number. Assuming we want to determine the last three teams, we will use the syntax below:

squad_df.tail(3)

Our output will be as follows:

Generally, whenever you access any dataset, you will often access the first five rows to determine whether you are looking at the correct data set. From the display, you can see the index, column names, and the preset values. You will notice from the example above that the index for our DataFrame is the Teams column.
