Sunday, May 19, 2019

Pandas - 39 Machine Learning with scikit-learn - an introduction

The scikit-learn library covers both the construction of predictive models and their validation. scikit-learn is a Python module that integrates many machine learning algorithms. This library is part of the SciPy (Scientific Python) group, a set of libraries created for scientific computing and especially for data analysis.

Generally these libraries are defined as SciKits, hence the first part of the name of this library. The second part of the library’s name is derived from machine learning, the discipline pertaining to this library. Before we proceed further and explore this library, I'd like to introduce Machine Learning.

Machine Learning

Machine Learning deals with the study of methods for pattern recognition in datasets undergoing data analysis. In particular, it deals with the development of algorithms that learn from data and make predictions. Each methodology is based on building a specific model.

Many methods belong to machine learning, each with its own characteristics, suited to the nature of the data and to the predictive model that we want to build. Choosing which method to apply is called a learning problem.

The data from which patterns are learned in the learning phase can be arrays with a single value per element, or with several values per element (multivariate). These values are often referred to as features or attributes. Depending on the type of data and the model to be built, learning problems fall into two broad categories: Supervised and Unsupervised Learning.
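To make the idea of features concrete, here is a minimal sketch. It assumes the Iris dataset bundled with scikit-learn (my choice for illustration, not something this post depends on). Each row is one sample, each column is one feature, and the target is stored in a separate array.

```python
from sklearn.datasets import load_iris

# Load a small bundled dataset (chosen here only for illustration).
iris = load_iris()

X = iris.data    # feature matrix: one row per sample, one column per feature
y = iris.target  # target vector: one class label per sample

print(X.shape)             # (number of samples, number of features)
print(y.shape)             # (number of samples,)
print(iris.feature_names)  # names of the four measured attributes
```

scikit-learn estimators generally expect data in this layout: a two-dimensional array of features and, for supervised problems, a one-dimensional array of targets.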

Supervised Learning 

Supervised Learning includes the methods in which the training set contains additional attributes that we want to predict (the target). Once trained, the model can predict these attributes for new values that we submit to it (the test set).

• Classification: The data in the training set belong to two or more classes or categories. Because the data are already labeled, they allow us to teach the system to recognize the characteristics that distinguish each class. When we need to consider a new value unknown to the system, the system assigns it a class according to its characteristics.

• Regression: The value to be predicted is a continuous variable. The simplest case to understand is finding the line that describes the trend of a series of points in a scatterplot.
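The two kinds of supervised problem can be sketched as follows. This is only an illustration under my own assumptions: a k-nearest-neighbors classifier on the bundled Iris data, and a linear regression on synthetic points that lie on the line y = 2x + 1. The post itself does not prescribe these estimators, and the flower measurements are made up.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsClassifier

# Classification: the target is a discrete class label.
iris = load_iris()
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(iris.data, iris.target)

# A new, unlabeled flower (made-up measurements for illustration).
new_flower = np.array([[5.0, 3.4, 1.5, 0.2]])
predicted_class = clf.predict(new_flower)
print(iris.target_names[predicted_class[0]])

# Regression: the target is a continuous value.
x = np.arange(10, dtype=float).reshape(-1, 1)  # one feature per sample
y = 2.0 * x.ravel() + 1.0                      # synthetic points on a line
reg = LinearRegression()
reg.fit(x, y)
print(reg.coef_[0], reg.intercept_)            # slope and intercept of the fitted line
```

Both estimators follow the same pattern: fit() learns from the training data and predict() is applied to new values.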

Unsupervised Learning

Unsupervised Learning includes the methods in which the training set consists of a series of input values x without any corresponding target value.

• Clustering: The goal of these methods is to discover groups of similar examples in a dataset.

• Dimensionality reduction: Reducing a high-dimensional dataset to one with only two or three dimensions is useful for data visualization. More generally, it converts data of very high dimensionality into data of much lower dimensionality, so that each of the remaining dimensions conveys much more information.
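Both unsupervised tasks can be sketched with the same bundled Iris data, this time ignoring the labels. KMeans and PCA are my choice of example methods here, and the number of clusters and components are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # only the input values x, no target

# Clustering: group similar samples without using any labels.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)
print(cluster_labels[:10])  # cluster index assigned to the first ten samples

# Dimensionality reduction: project four features down to two.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)
print(pca.explained_variance_ratio_)  # share of variance kept by each component
```

The two-column array X_2d can be drawn directly in a scatterplot, which is what makes this kind of reduction handy for visualization.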

In addition to these two main categories, there is a further group of methods that serve to validate and evaluate models. Machine learning lets a model learn some properties from a dataset and apply them to new data, so a common practice is to evaluate an algorithm on data it has not seen. This evaluation consists of splitting the data into two parts: one called the training set, from which we learn the properties of the data, and the other called the testing set, on which we test these properties.
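The split described above can be sketched with scikit-learn's train_test_split. The 25% test size, the fixed random_state and the k-nearest-neighbors classifier are assumptions made for this illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a quarter of the samples for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Learn from the training set only.
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)

# Evaluate on the testing set, which the model has never seen.
accuracy = clf.score(X_test, y_test)
print(X_train.shape, X_test.shape)
print(accuracy)  # fraction of test samples classified correctly
```

Scoring on the held-out part gives a fairer estimate of how the model behaves on new data than scoring on the samples it was trained on.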

Here I am ending today’s post. In the next post we'll explore Supervised Learning with scikit-learn. Until we meet again keep practicing and learning Python, as Python is easy to learn!