Wednesday, July 14, 2021

DATASETS and DATA TYPES

A dataset is a source of data (such as a text file) that contains rows and columns of data. Each row is typically called a “data point”, and each column is called a “feature”. A dataset can be a CSV (comma-separated values) file, a TSV (tab-separated values) file, an Excel spreadsheet, a table in an RDBMS (Relational Database Management System), a document in a NoSQL database, the output from a Web service, and so forth. Before training a model with a given dataset, we need to analyze the dataset to determine which features are the most important and which features can be safely ignored.
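
As a rough illustration of rows as data points and columns as features, the following sketch parses a small CSV dataset with Python's standard `csv` module. The CSV content, the column names, and the `load_dataset` helper are all made up for this example; a real dataset would normally be read from a file.

```python
import csv
import io

# Hypothetical CSV content; in practice this would come from a file on disk.
CSV_TEXT = """bedrooms,square_feet,city,price
3,1500.5,Austin,350000
2,900.0,Denver,275000
4,2200.0,Austin,510000
"""

def load_dataset(text):
    """Parse CSV text into feature names and a list of rows (one dict per data point)."""
    reader = csv.DictReader(io.StringIO(text))
    rows = list(reader)
    feature_names = reader.fieldnames or []
    return feature_names, rows

features, rows = load_dataset(CSV_TEXT)
print("features:", features)
print("number of data points:", len(rows))
```

Note that every value comes back as a string at this stage; converting each feature to a suitable data type is a separate step.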

A dataset can vary from very small (a couple of features and 100 rows) to very large (more than 1,000 features and more than one million rows). If you are unfamiliar with the problem domain, then you might struggle to determine the most important features in a large dataset. In this situation, you might need a “domain expert” who understands the importance of the features, their interdependencies (if any), and whether or not the data values for the features are valid. In addition, there are algorithms (called dimensionality reduction algorithms) that can reduce the number of features you need to work with. For example, PCA (Principal Component Analysis) is one such algorithm: it combines correlated features into a smaller set of components that capture most of the variance in the data.
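
To give a feel for what PCA computes, here is a minimal sketch in plain Python that estimates only the first principal component by power iteration on the covariance matrix. The toy data and all function names are assumptions made for illustration. A practical analysis would typically standardize the features first and rely on a numerical library rather than hand-written loops.

```python
import math

def mean_center(data):
    """Subtract each column's mean so every feature has mean zero."""
    n = len(data)
    d = len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in data]

def covariance_matrix(centered):
    """Sample covariance matrix (d x d) of mean-centered data."""
    n = len(centered)
    d = len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_principal_component(cov, iterations=200):
    """Approximate the dominant eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d  # arbitrary unit-length starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    return v

# Toy data (made up): the second feature is roughly twice the first,
# and the third feature is small noise.
data = [
    [1.0, 2.1, 0.3],
    [2.0, 3.9, 0.1],
    [3.0, 6.2, 0.2],
    [4.0, 7.8, 0.4],
    [5.0, 10.1, 0.2],
]
component = first_principal_component(covariance_matrix(mean_center(data)))
print([round(x, 3) for x in component])
```

The entries of the resulting vector (often called loadings) indicate how strongly each original feature contributes to the direction of greatest variance. In this toy data the third feature contributes very little.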

If you have written computer programs, then you know that explicit data types exist in many programming languages, such as C, C++, Java, and TypeScript. Some programming languages, such as JavaScript and awk, do not require declaring variables with an explicit type: the type of a variable is determined dynamically at run time via an implicit type system (i.e., one that the developer does not specify directly).
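
Python is another language in which variables are not declared with an explicit type. The short sketch below (the `describe` helper is a name chosen for this example) shows the same variable holding values of three different types in turn.

```python
def describe(value):
    """Return the name of the runtime type of value."""
    return type(value).__name__

quantity = 3          # no type declaration; the value is an int
first = describe(quantity)

quantity = 3.5        # the same name now refers to a float
second = describe(quantity)

quantity = "three"    # and now to a string
third = describe(quantity)

print(first, second, third)
```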


In machine learning, datasets can contain features that have different types of data, such as a combination of one or more of the following:

•numeric data (integer/floating point and discrete/continuous)
•character/categorical data (different languages)
•date-related data (different formats)
•currency data (different formats)
•binary data (yes/no, 0/1, and so forth)
•nominal data (multiple unrelated values)
•ordinal data (multiple and related values)
 

Consider a dataset that contains real estate data, which can have as many as thirty columns (or even more), often with the following features:

•the number of bedrooms in a house: a numeric value and a discrete value
•the number of square feet: a numeric value and (probably) a continuous value
•the name of the city: character data
•the construction date: a date value
•the selling price: a currency value and probably a continuous value
•the “for sale” status: binary data (either “yes” or “no”)
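
One way to picture these features is as a typed record. The sketch below converts a row of raw strings into such a record; the `Listing` class, the field names, and the input formats (ISO dates, US-style prices, “yes”/“no” status) are assumptions made for this example rather than a fixed convention.

```python
from dataclasses import dataclass
from datetime import date, datetime
from decimal import Decimal

@dataclass
class Listing:
    bedrooms: int             # numeric, discrete
    square_feet: float        # numeric, continuous
    city: str                 # character data
    construction_date: date   # date value
    selling_price: Decimal    # currency value
    for_sale: bool            # binary data

def parse_listing(raw):
    """Convert a dict of raw strings into a typed Listing.

    Assumes ISO dates (YYYY-MM-DD), prices such as "$350,000.00",
    and "yes"/"no" for the for-sale status.
    """
    price_text = raw["selling_price"].replace("$", "").replace(",", "")
    built = datetime.strptime(raw["construction_date"], "%Y-%m-%d").date()
    return Listing(
        bedrooms=int(raw["bedrooms"]),
        square_feet=float(raw["square_feet"]),
        city=raw["city"].strip(),
        construction_date=built,
        selling_price=Decimal(price_text),
        for_sale=raw["for_sale"].strip().lower() == "yes",
    )

# A made-up row, as it might appear after reading a text file.
raw_row = {
    "bedrooms": "3",
    "square_feet": "1500.5",
    "city": " Austin ",
    "construction_date": "1998-06-15",
    "selling_price": "$350,000.00",
    "for_sale": "Yes",
}
listing = parse_listing(raw_row)
print(listing)
```

Datasets that use other date or currency formats would need different parsing rules, which is one reason these data types call for extra care.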

An example of nominal data is the seasons in a year: although many countries have four distinct seasons, some countries have two distinct seasons. However, keep in mind that seasons can be associated with different temperature ranges (summer versus winter), so their values are not always entirely unrelated. Another example of nominal data is a set of colors, such as {Red, Green, Blue}. An example of ordinal data is an employee pay grade: 1=entry level, 2=one year of experience, and so forth.

An example of binary data is the pair {Male, Female}, and some datasets contain a feature with these two values. If such a feature is required for training a model, first convert {Male, Female} to a numeric counterpart, such as {0,1}. Similarly, if you need to include a feature whose values are the previous set of colors, you can replace {Red, Green, Blue} with the values {0,1,2}.
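
The replacements described above can be sketched as follows. The helper names are hypothetical. The second helper shows one-hot encoding, a commonly used alternative that is not described above: integer codes such as {0,1,2} can suggest an ordering (Red < Green < Blue) that nominal data does not actually have, and one-hot encoding avoids that.

```python
def label_encode(values, categories):
    """Map each value to its position in the given list of categories."""
    index = {category: i for i, category in enumerate(categories)}
    return [index[value] for value in values]

def one_hot_encode(values, categories):
    """Map each value to a 0/1 vector with a single 1 at its category position."""
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return encoded

# Made-up feature values for illustration.
genders = ["Male", "Female", "Female"]
colors = ["Red", "Blue", "Green", "Red"]

gender_codes = label_encode(genders, ["Male", "Female"])         # {Male, Female} -> {0, 1}
color_codes = label_encode(colors, ["Red", "Green", "Blue"])     # {Red, Green, Blue} -> {0, 1, 2}
color_one_hot = one_hot_encode(colors, ["Red", "Green", "Blue"])

print(gender_codes)
print(color_codes)
print(color_one_hot)
```

Which encoding is appropriate depends on the model being trained. Integer codes are a natural fit for ordinal data such as pay grades, where the order carries meaning.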

