Wednesday, July 3, 2019

A Simple Data Science Project

Those who follow this blog regularly must have gotten a good understanding of NumPy and Pandas. Now let's use these libraries to work on a simple data science project. I am going to choose our dataset from the website Kaggle.com. Kaggle is a Google-owned online community of data scientists. It lets users find datasets, download the data, and, in most cases, use the data under very open licenses.

I'll use the “diamonds” dataset from Kaggle.com, as it has a fairly simple structure and only about 54,000 rows. Just download it at https://www.kaggle.com/shivam2503/diamonds. Once you have the diamonds dataset, take a look at its metadata: it consists of variables, which can also be thought of as column headers. See the table below:

carat   - weight of the diamond
cut     - quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color   - diamond color, from J (worst) to D (best)
clarity - clarity grade (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
depth   - total depth percentage
table   - width of the diamond's top relative to its widest point
price   - price in US dollars
x       - length in mm
y       - width in mm
z       - depth in mm
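
If you want to confirm these columns yourself, a quick check (a minimal sketch, assuming the file has been saved as diamonds.csv, as in the rest of this post) is:

import pandas as pd

# load the CSV and inspect its structure
df = pd.read_csv('diamonds.csv')
print(df.columns.tolist())   # column names (plus an unnamed row-index column)
print(df.dtypes)             # the data type of each column
print(len(df))               # number of rows (about 54,000)
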
A dataset like this could eventually serve as the training set for a machine-learning program built with NumPy and TensorFlow, but here we'll stick to simple pandas-based data analysis to read our data and ask some questions about it.

I’m going to use a pandas DataFrame, a 2D labeled data structure with columns that can be of different types. (pandas once also offered a 3D Panel container, but it has since been deprecated.)
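
To make the idea concrete, here is a tiny hand-built DataFrame; the rows are made up purely for illustration:

import pandas as pd

# a small DataFrame: each column can hold a different type
toy = pd.DataFrame({
    'carat': [0.23, 0.31, 0.70],            # floats
    'cut':   ['Ideal', 'Good', 'Premium'],  # strings
    'price': [326, 335, 2757]               # integers
})
print(toy)
print(toy.dtypes)

The diamonds data is just a much larger version of the same structure. See the program below: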


import numpy as np
import pandas as pd

# read the diamonds CSV file
# build a DataFrame from the data
df = pd.read_csv('diamonds.csv')

# print the first 10 rows
print(df.head(10))
print()

# calculate the total value of the diamonds
# (named total_price so it doesn't shadow Python's built-in sum)
total_price = df.price.sum()
print("Total $ Value of Diamonds: ${:0,.2f}".format(total_price))

# calculate the mean price of the diamonds
mean_price = df.price.mean()
print("Mean $ Value of Diamonds: ${:0,.2f}".format(mean_price))

# summarize the numeric carat column
descrip = df.carat.describe()
print()
print(descrip)

# summarize the nonnumeric (text) columns
descrip = df.describe(include='object')
print()
print(descrip)


The output is shown below:



We start by importing the needed libraries, NumPy and pandas. Then we read the diamonds file into a pandas DataFrame and print the first 10 rows.

import numpy as np
import pandas as pd
# read the diamonds CSV file
# build a DataFrame from the data
df = pd.read_csv('diamonds.csv')
print(df.head(10))
print()


Next we calculate a couple of values from the column named price. Note that we can access the column directly as an attribute of the DataFrame object (df.price).

# calculate the total value of the diamonds
total_price = df.price.sum()
print("Total $ Value of Diamonds: ${:0,.2f}".format(total_price))
# calculate the mean price of the diamonds
mean_price = df.price.mean()
print("Mean $ Value of Diamonds: ${:0,.2f}".format(mean_price))


Then we use the built-in describe method to summarize the data in the carat column.

# summarize the data
descrip = df.carat.describe()
print()
print(descrip)


The following lines then print out a description of all the nonnumeric columns in our DataFrame: specifically, the cut, color, and clarity columns:

descrip = df.describe(include='object')
print()
print(descrip)
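
If you just want to see exactly which categories those text columns contain, a quick check (again reusing the same df) is:

# list the distinct values in each text column
for col in ['cut', 'color', 'clarity']:
    print(col, df[col].unique())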

Our next target is data visualization, which we'll do with Matplotlib. Our first plot is a scatter plot showing diamond clarity versus diamond carat size. See the program below:

# import the numpy, pandas, and matplotlib libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# read the diamonds CSV file
# build a DataFrame from the data
df = pd.read_csv('diamonds.csv')

# scatter plot of carat size (y) against clarity grade (x)
carat = df.carat
clarity = df.clarity
plt.scatter(clarity, carat)
plt.show()

The plot produced by the program is shown below:

The diamond clarity is measured by how obvious inclusions are within the diamond: FL, IF, VVS1, VVS2, VS1, VS2, SI1, SI2, I1, I2, I3 (in order from best to worst: FL = flawless, I3 = level 3 inclusions). Just to let you know, there are no flawless diamonds in our diamond dataset. Based on the plot, we can say that in this dataset the clarity grade ‘I1’ has the largest diamonds.
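
You can back up that reading of the scatter plot with a number; a short sketch (reusing the same df) is to average the carat size within each clarity grade:

# mean carat size per clarity grade, largest first
print(df.groupby('clarity')['carat'].mean().sort_values(ascending=False))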

What if we need to find the number of diamonds in each clarity type? Our next program aims to do so:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# read the diamonds CSV file
# build a DataFrame from the data
df = pd.read_csv('diamonds.csv')

# count the number of each textual type of clarity
clarityindexes = df['clarity'].value_counts().index.tolist()
claritycount = df['clarity'].value_counts().values.tolist()

print(clarityindexes)
print(claritycount)

plt.bar(clarityindexes, claritycount)
plt.show()


The resulting plot is shown below:

From this graph, we can see that the medium-quality clarity grades SI1, VS2, and SI2 are the most represented in our diamond dataset.
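
If you would rather see the bars ordered from best to worst grade instead of by count, one way (a sketch assuming the grade order given earlier, and reusing the df and plt from the program above) is to reindex the counts before plotting:

# reorder the clarity counts from best to worst grade
order = ['IF', 'VVS1', 'VVS2', 'VS1', 'VS2', 'SI1', 'SI2', 'I1']
counts = df['clarity'].value_counts().reindex(order)
plt.bar(counts.index, counts.values)
plt.show()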

Just like we looked at clarity, let’s now look at color type in our pile of diamonds through the following program:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# read the diamonds CSV file
# build a DataFrame from the data
df = pd.read_csv('diamonds.csv')

# count the number of each textual type of color
colorindexes = df['color'].value_counts().index.tolist()
colorcount = df['color'].value_counts().values.tolist()

print(colorindexes)
print(colorcount)

plt.bar(colorindexes, colorcount)
plt.show()


The resulting plot is shown below:
 

The color “G” is the most common grade, accounting for roughly a fifth of our sample. “G” diamonds are nearly colorless, and the general rule is: the less color, the higher the price.
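
You can confirm those shares directly with value_counts(normalize=True), which returns fractions of the total instead of raw counts:

# fraction of the dataset in each color grade
print(df['color'].value_counts(normalize=True).round(3))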

Using pandas we can also generate a heat map, which graphically shows the correlations between the numeric values in our dataset. In this plot we take all the numerical columns and create a correlation matrix that shows how closely they correlate with each other. To quickly and easily generate this graph, we use another Python library called seaborn, which provides a plotting API built on top of Matplotlib that integrates with pandas DataFrames. See the program below:

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

# read the diamonds CSV file
# build a DataFrame from the data
df = pd.read_csv('diamonds.csv')

# drop the index column
df = df.drop('Unnamed: 0', axis=1)

f, ax = plt.subplots(figsize=(10, 8))

# correlation matrix of the numeric columns only
# (cut, color, and clarity are text; selecting the numeric columns first
#  keeps this working on newer pandas versions, which no longer drop
#  text columns automatically)
corr = df.select_dtypes(include='number').corr()
print(corr)

# color version
#sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool),
#            cmap=sns.diverging_palette(220, 10, as_cmap=True),
#            square=True, ax=ax)

# grayscale version (plain bool is used because np.bool is deprecated;
# the all-False mask simply leaves every cell visible)
cmap = sns.cubehelix_palette(50, hue=0.05, rot=0, light=0.95, dark=0.05, as_cmap=True)
sns.heatmap(corr, mask=np.zeros_like(corr, dtype=bool),
            cmap=cmap, square=True, ax=ax)

plt.show()

The resulting heat plot is shown below:


We can see from the plot that the darker the cell (or the redder, if you use the commented-out color palette), the higher the correlation between the two variables. The diagonal stripe from top left to bottom right shows that, for example, carat correlates 100 percent with carat. No surprise there. The x, y, and z variables correlate strongly with each other, which says that as the diamonds in our dataset increase in one dimension, they increase in the other two dimensions as well.

How about price? As carat and size increase, so does price. This makes sense. Interestingly, depth (the height of a diamond, measured from the culet to the table, divided by its average girdle diameter) does not correlate strongly with price at all and in fact is slightly negatively correlated. Heat maps are fabulous for spotting general cross-correlations in our data.
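
To read the price row of the heat map as plain numbers rather than shades, you can sort that column of the correlation matrix computed in the program above:

# correlation of each numeric column with price, strongest first
print(corr['price'].sort_values(ascending=False))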

It would be interesting to see the correlation between color/clarity and price. Try to find this correlation as an exercise; one possible starting point is sketched below.
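
Since color and clarity are text columns and do not appear in a numeric correlation matrix, a minimal sketch (just one of several reasonable approaches; the grade ordering below is my assumption based on the standard clarity scale) is to compare mean prices per grade, or to encode the grades as ordered categories first:

# mean price for each color grade and each clarity grade
print(df.groupby('color')['price'].mean().sort_values())
print(df.groupby('clarity')['price'].mean().sort_values())

# or encode clarity as an ordered category (worst to best) and
# correlate its integer codes with price
clarity_order = ['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF']
clarity_codes = pd.Categorical(df['clarity'],
                               categories=clarity_order,
                               ordered=True).codes
print(pd.Series(clarity_codes).corr(df['price']))

And here I will end this post. Till we meet again, keep practicing and learning Python, as Python is easy to learn!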

