Tuesday, July 20, 2021

Understanding data analysis

In today's data-driven world, data analysis supports effective decision-making in business and government operations. Data analysis is the activity of inspecting, preprocessing, exploring, describing, and visualizing a given dataset. The main objective of the data analysis process is to discover the information required for decision-making. Data analysis offers multiple approaches, tools, and techniques, all of which can be applied to diverse domains such as business, social science, and fundamental science.

Let's look at some of the fundamental data analysis libraries of the Python ecosystem:

NumPy: This is short for Numerical Python. It is the foundational scientific library in Python for handling multidimensional arrays and matrices, with methods for performing numerical computations efficiently.
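
A minimal NumPy sketch (the array values are arbitrary):

import numpy as np

# Create a 2x3 array and apply a few of its efficient numerical methods
a = np.array([[1, 2, 3], [4, 5, 6]])
print(a.shape)   # (2, 3)
print(a.mean())  # 3.5
print(a @ a.T)   # 2x2 matrix product of the array with its transpose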

SciPy: This is a scientific computing library, built on top of NumPy, for performing scientific, mathematical, and engineering operations such as optimization, integration, interpolation, and statistics.
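
For instance, a minimal sketch using its numerical integration routines:

from scipy import integrate

# Numerically integrate x^2 over [0, 1]; the exact answer is 1/3
value, abs_error = integrate.quad(lambda x: x ** 2, 0, 1)
print(value)  # approximately 0.3333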

Pandas: This is a data exploration and manipulation library that offers tabular data structures such as DataFrames and various methods for data analysis and manipulation.
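
A minimal pandas sketch (the column names and values are made up for illustration):

import pandas as pd

# Build a small DataFrame and apply common analysis methods
df = pd.DataFrame({"city": ["Ottawa", "Toronto", "Ottawa"],
                   "price": [450000, 700000, 520000]})
print(df.describe())                       # summary statistics
print(df.groupby("city")["price"].mean())  # mean price per city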

Scikit-learn: The name is short for "SciPy Toolkit" for machine learning; the project began as a SciPy extension called scikits.learn. It is a machine learning library that offers a variety of supervised and unsupervised algorithms, such as regression, classification, dimensionality reduction, cluster analysis, and anomaly detection.
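
A minimal sketch using the library's built-in Iris dataset:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Train a simple classifier and evaluate it on held-out data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_test, y_test))  # classification accuracy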

Matplotlib: This is the core data visualization library in Python and the foundation for many other Python visualization libraries, such as Seaborn. It offers 2D and 3D plots, graphs, charts, and figures for data exploration, and it runs on top of NumPy.
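
A minimal 2D plotting sketch:

import matplotlib.pyplot as plt
import numpy as np

# Draw a basic 2D line chart of the sine function
x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x))
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.show()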

Seaborn: This is based on Matplotlib and offers a high-level interface for drawing attractive, well-organized statistical plots with very little code.
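
A minimal sketch using the bundled "tips" sample dataset (note that load_dataset fetches the data over the network):

import matplotlib.pyplot as plt
import seaborn as sns

# Draw an organized statistical plot with a single high-level call
tips = sns.load_dataset("tips")
sns.boxplot(x="day", y="total_bill", data=tips)
plt.show()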

Plotly: This is a data visualization library that offers high-quality, interactive graphs, such as scatter charts, line charts, bar charts, histograms, box plots, heatmaps, and subplots.
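
A minimal sketch using the bundled Iris sample data:

import plotly.express as px

# Build an interactive scatter chart; hovering shows each point's values
df = px.data.iris()
fig = px.scatter(df, x="sepal_width", y="sepal_length", color="species")
fig.show()  # renders in a browser or notebook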

The standard process of data analysis

Data analysis refers to investigating the data, finding meaningful insights from it, and drawing conclusions. The main goal of this process is to collect, filter, clean, transform, explore, describe, visualize, and communicate insights from the data in order to uncover information that supports decision-making. Generally, the data analysis process comprises the following phases (a minimal code sketch follows the list):

1. Collecting Data: Collect and gather data from several sources.
2. Preprocessing Data: Filter, clean, and transform the data into the required format.
3. Analyzing and Finding Insights: Explore, describe, and visualize the data to find insights and draw conclusions.
4. Interpreting Insights: Understand the insights and determine the impact that each variable has on the system.
5. Storytelling: Communicate your results in the form of a story so that a layman can understand them. 
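
As a minimal sketch of phases 1 through 3 in pandas (the file name and column names are hypothetical):

import pandas as pd

# 1. Collect: read raw data ("sales.csv" is a hypothetical file)
df = pd.read_csv("sales.csv")

# 2. Preprocess: drop incomplete rows and normalize a text column
df = df.dropna()
df["region"] = df["region"].str.strip().str.lower()

# 3. Analyze: describe the data and surface a simple insight
print(df.describe())
print(df.groupby("region")["revenue"].sum().sort_values(ascending=False))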

In the next post, we will discuss the KDD process. 


Monday, July 19, 2021

Discrete Data Versus Continuous Data

As a simple rule of thumb: discrete data is a set of values that can be counted, whereas continuous data must be measured. Discrete data can “reasonably” fit in a drop-down list of values, but there is no exact cutoff for making such a determination. One person might think that a list of 500 values is discrete, whereas another might consider it continuous.

For example, the list of provinces of Canada and the list of states of the United States are discrete data values, but is the same true for the number of countries in the world (roughly 200) or for the number of languages in the world (more than 7,000)?

On the other hand, values for temperature, humidity, and barometric pressure are considered continuous. Currency is also treated as continuous, even though there is a measurable difference between two consecutive values. The smallest unit of U.S. currency is one penny, which is 1/100th of a dollar (accounting-based measurements use the “mil”, which is 1/1,000th of a dollar).

Continuous data types can have subtle differences. For example, someone who is 200 centimeters tall is twice as tall as someone who is 100 centimeters tall; the same is true for 100 kilograms versus 50 kilograms. However, temperature is different: 80 degrees Fahrenheit is not twice as hot as 40 degrees Fahrenheit, because the Fahrenheit scale has an arbitrary zero point rather than a true zero.

Furthermore, keep in mind that the meaning of the word “continuous” in mathematics is not necessarily the same as continuous in machine learning. In the former, a continuous function (let’s say in the 2D Euclidean plane) can have an uncountably infinite number of values. On the other hand, a feature in a dataset that can have more values than can be “reasonably” displayed in a drop-down list is treated as though it’s a continuous variable.

For instance, values for stock prices are discrete: they must differ by at least a penny (or some other minimal unit of currency), which is to say, it’s meaningless to say that the stock price changes by one-millionth of a penny. However, since there are “so many” possible stock values, it’s treated as a continuous variable. The same comments apply to car mileage, ambient temperature, barometric pressure, and so forth.
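
A rough pandas sketch of this rule of thumb (the data, and any cutoff you might apply, are purely illustrative):

import pandas as pd

df = pd.DataFrame({"province": ["ON", "QC", "BC", "ON"] * 250,
                   "price": [round(10 + i * 0.01, 2) for i in range(1000)]})

# "province" has a handful of distinct values: treat it as discrete.
# "price" has 1,000 distinct values: treat it as continuous, even
# though prices change in steps of one cent.
print(df["province"].nunique())  # 3
print(df["price"].nunique())     # 1000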


Wednesday, July 14, 2021

DATASETS and DATA TYPES

A dataset is a source of data (such as a text file) that contains rows and columns of data. Each row is typically called a “data point”, and each column is called a “feature”. A dataset can be a CSV (comma-separated values) file, a TSV (tab-separated values) file, an Excel spreadsheet, a table in an RDBMS (Relational Database Management System), a document in a NoSQL database, the output from a Web service, and so forth. We need to analyze the dataset to determine which features are the most important and which features can be safely ignored in order to train a model with the given dataset.

A dataset can vary from very small (a couple of features and 100 rows) to very large (more than 1,000 features and more than one million rows). If you are unfamiliar with the problem domain, then you might struggle to determine the most important features in a large dataset. In this situation, you might need a “domain expert” who understands the importance of the features, their interdependencies (if any), and whether or not the data values for the features are valid. In addition, there are algorithms (called dimensionality reduction algorithms) that can help you determine the most important features. For example, PCA (Principal Component Analysis) is one such algorithm.
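
A minimal PCA sketch (the data here is random, just to show the mechanics):

import numpy as np
from sklearn.decomposition import PCA

# Reduce a hypothetical 10-feature dataset to its 3 main components
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 rows, 10 features
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (100, 3)
print(pca.explained_variance_ratio_)  # variance captured per component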

If you have written computer programs, then you know that explicit data types exist in many programming languages, such as C, C++, Java, TypeScript, and so forth. Some programming languages, such as JavaScript and awk, do not require declaring variables with an explicit type: the type of a variable is inferred dynamically via an implicit type system (i.e., one that is not directly exposed to a developer).
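
Python is in the latter camp; a minimal sketch:

# No type declarations are needed; the type is inferred at runtime
x = 42
print(type(x))  # <class 'int'>
x = "hello"     # the same name can later refer to a different type
print(type(x))  # <class 'str'>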


In machine learning, datasets can contain features that have different types of data, such as a combination of one or more of the following:

• numeric data (integer/floating point and discrete/continuous)
• character/categorical data (different languages)
• date-related data (different formats)
• currency data (different formats)
• binary data (yes/no, 0/1, and so forth)
• nominal data (multiple unrelated values)
• ordinal data (multiple and related values)

Consider a dataset that contains real estate data, which can have thirty or more columns, often including the following features (a small code sketch of such a dataset follows the list):

• the number of bedrooms in a house: a numeric and discrete value
• the number of square feet: a numeric and (probably) continuous value
• the name of the city: character data
• the construction date: a date value
• the selling price: a currency value and probably a continuous value
• the “for sale” status: binary data (either “yes” or “no”)
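
A minimal pandas sketch of such a dataset (all of the values are made up for illustration):

import pandas as pd

homes = pd.DataFrame({
    "bedrooms": [3, 4, 2],                          # numeric, discrete
    "sq_feet":  [1450.0, 2210.5, 980.0],            # numeric, continuous
    "city":     ["Ottawa", "Toronto", "Kingston"],  # character data
    "built":    pd.to_datetime(["1995-06-01", "2008-03-15", "1978-11-20"]),
    "price":    [450000.00, 825000.00, 310000.00],  # currency value
    "for_sale": ["yes", "no", "yes"],               # binary data
})
print(homes.dtypes)  # one data type per feature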

An example of nominal data is the seasons of the year: although many (most?) countries have four distinct seasons, some countries have only two. However, keep in mind that seasons can be associated with different temperature ranges (summer versus winter). Another example of nominal data is a set of colors, such as {Red, Green, Blue}, whose values have no inherent order. An example of ordinal data is an employee pay grade: 1 = entry level, 2 = one year of experience, and so forth, where the values do have a meaningful order.

An example of binary data is the pair {Male, Female}, and some datasets contain a feature with these two values. If such a feature is required for training a model, first convert {Male, Female} to a numeric counterpart, such as {0,1}. Similarly, if you need to include a feature whose values are the previous set of colors, you can replace {Red, Green, Blue} with the values {0,1,2}.
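
A minimal sketch of both conversions in pandas (the column names are hypothetical):

import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "Female"],
                   "color":  ["Red", "Blue", "Green"]})

# Map each categorical value to its numeric counterpart
df["gender"] = df["gender"].map({"Male": 0, "Female": 1})
df["color"]  = df["color"].map({"Red": 0, "Green": 1, "Blue": 2})
print(df)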

