Tuesday, January 17, 2023

The Challenge with Structured Databases

We are generating data at an unprecedented pace right now, and the scale of this data is mind-boggling! Just check out these numbers:

  • Facebook generates four petabytes of data in just one day
  • Google generates twenty petabytes of data every day
  • The Large Hadron Collider (a 27-kilometer-long particle accelerator, the most powerful in the world) generates one petabyte of data per second. Most importantly, this data is unstructured

Can you imagine using SQL to work with this volume of data? It’s setting yourself up for a nightmare!

SQL is a wonderful language to learn as a data scientist, and it works well when we’re dealing with structured data. But if your organization works with unstructured data, SQL databases cannot fulfill the requirements.

Structured databases have two major disadvantages:

  • Scalability: It is very difficult to scale as the database grows larger
  • Elasticity: Structured databases need data in a predefined format. If the data does not follow the predefined format, relational databases do not store it
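
The second point can be seen in a minimal sketch using Python's built-in sqlite3 module. The table name, columns, and values here are made up for illustration; the point is only that a row with a field the schema does not define is rejected.

```python
import sqlite3

# In-memory SQLite database with a fixed, predefined schema (hypothetical table).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# A row that matches the schema is accepted.
conn.execute("INSERT INTO users (id, name) VALUES (?, ?)", (1, "Asha"))

# A row carrying a field the schema does not define is rejected.
try:
    conn.execute(
        "INSERT INTO users (id, name, hobby) VALUES (?, ?, ?)",
        (2, "Ravi", "chess"),
    )
    rejected = False
except sqlite3.OperationalError as err:
    rejected = True
    print("Insert rejected:", err)

row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print("Rows stored:", row_count)
```

Storing the extra field would require changing the table definition first, which is the rigidity the bullet describes.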

So how do we solve this issue? If not SQL then what?

This is where unstructured databases come in. Among the wide range of such databases, MongoDB is widely used because of its rich query language and quick data access through features like indexing. In short, MongoDB is well suited to managing big data. Let’s see the difference between structured and unstructured databases:

| | Structured Databases | Unstructured Databases |
|---|---|---|
| Structure | Every element has the same number of attributes | Different elements can have different numbers of attributes |
| Latency | Comparatively slower storage | Faster storage |
| Ease of learning | Easy to learn | Comparatively tougher to learn |
| Storage Volume | Not appropriate for storing Big Data | Can handle Big Data as well |
| Type of Data Stored | Generally textual data is stored | Any type of data can be stored (Audio, Video, Clickstream, etc.) |
| Examples | MySQL, PostgreSQL | MongoDB, RavenDB |

This article is the ultimate guide to get started with MongoDB using Python. In the coming posts we will demonstrate various operations on MongoDB with the help of examples and the PyMongo library.



Friday, January 6, 2023

Linear Regression

Linear regression is used to estimate real values (cost of houses, number of calls, total sales, etc.) based on continuous variable(s). Here, we establish the relationship between the independent and dependent variables by fitting the best line. This best fit line is known as the regression line and is represented by the linear equation Y = a*X + b.

The best way to understand linear regression is to relive a childhood experience. Let us say you ask a child in fifth grade to arrange the people in their class in increasing order of weight, without asking them their weights! What do you think the child will do? They would likely look at (visually analyze) the height and build of each person and arrange them using a combination of these visible parameters. This is linear regression in real life! The child has actually figured out that height and build are correlated with weight by a relationship that looks like the equation above.

In this equation:

  • Y – Dependent Variable
  • a – Slope
  • X – Independent variable
  • b – Intercept

The coefficients a and b are derived by minimizing the sum of the squared differences between the observed values of Y and the values predicted by the regression line.
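
As a minimal sketch of that minimization, the closed-form least-squares solution for a single independent variable can be written in plain Python. The function name fit_line and the sample points are made up for illustration.

```python
def fit_line(xs, ys):
    """Return slope a and intercept b minimizing sum((y - (a*x + b))**2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form ordinary least squares solution for a single predictor.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b


# Made-up points lying exactly on y = 2x + 1, so the fit recovers a = 2, b = 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
a, b = fit_line(xs, ys)
print(a, b)
```

With real, noisy data the points will not sit exactly on a line, and a and b are then the values that make the total squared error as small as possible.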

Look at the example below. Here we have identified the best fit line, which has the linear equation y = 0.2811x + 13.9. Using this equation, we can estimate a person’s weight from their height.

[Figure: linear regression best fit line]
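
Applying the fitted equation is a one-line calculation. The sketch below uses the slope and intercept quoted in the text; the input value 170 and the function name predict_weight are illustrative assumptions, and the text does not state the units.

```python
SLOPE = 0.2811     # a, from the equation in the text
INTERCEPT = 13.9   # b, from the equation in the text


def predict_weight(height):
    """Apply y = 0.2811 * x + 13.9 to a height value."""
    return SLOPE * height + INTERCEPT


# 170 is an illustrative input; the text does not state the units.
print(round(predict_weight(170), 3))  # prints 61.687
```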

Linear Regression is mainly of two types: Simple Linear Regression and Multiple Linear Regression. Simple Linear Regression is characterized by one independent variable, while Multiple Linear Regression (as the name suggests) is characterized by multiple (more than one) independent variables. When looking for the best fit, you can also fit a curve instead of a straight line. This is known as polynomial or curvilinear regression.



MongoDB is an unstructured database. It stores data in the form of documents. MongoDB is able to handle huge volumes of data very efficiently and is one of the most widely used NoSQL databases, as it offers a rich query language and flexible, fast access to data.

Let’s take a moment to understand the architecture of a MongoDB database.

The Architecture of a MongoDB Database

The information in MongoDB is stored in documents. Here, a document is analogous to a row in a structured database.

  • Each document is a collection of key-value pairs
  • Each key-value pair is called a field
  • Every document has an _id field, which uniquely identifies the document
  • A document may also contain nested documents
  • Documents may have a varying number of fields (they can be blank as well)
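
The points above can be sketched with plain Python dictionaries, which is also the form in which PyMongo accepts documents. All field names and values below are made up, and the list is only a stand-in for a collection so the sketch runs without a database server.

```python
# Two hypothetical documents, written as Python dicts.
doc_a = {
    "_id": 1,                                       # unique identifier
    "name": "Asha",
    "address": {"city": "Pune", "zip": "411001"},   # nested document
}
doc_b = {
    "_id": 2,
    "name": "Ravi",
    "email": "ravi@example.com",
    "tags": ["admin", "beta"],
}

collection = [doc_a, doc_b]  # stand-in for a MongoDB collection

# Every _id is unique, but the number of fields varies per document.
ids = [doc["_id"] for doc in collection]
field_counts = [len(doc) for doc in collection]
print(ids, field_counts)
```

In a real MongoDB collection, a document inserted without an _id field is given one automatically.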

These documents are stored in a collection. A collection is literally a collection of documents in MongoDB. This is analogous to tables in traditional databases.

Unlike traditional databases, related data in MongoDB is generally stored together in a single collection, using nested documents, so joins are rarely needed. The exception is the $lookup operator, which performs an operation similar to a left outer join.
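
Here is a rough sketch of what a $lookup stage looks like when written for PyMongo, together with a pure-Python approximation of its result so the example runs without a server. The collection names, field names, and the helper emulate_lookup are hypothetical; only the "from", "localField", "foreignField", and "as" keys belong to the $lookup stage itself.

```python
# Hypothetical collections, held as lists of dicts.
orders = [
    {"_id": 1, "item": "pen", "customer_id": 10},
    {"_id": 2, "item": "ink", "customer_id": 99},  # no matching customer
]
customers = [{"_id": 10, "name": "Asha"}]

# The stage as it would be passed to db.orders.aggregate(pipeline) in PyMongo.
pipeline = [
    {
        "$lookup": {
            "from": "customers",
            "localField": "customer_id",
            "foreignField": "_id",
            "as": "customer",
        }
    }
]


def emulate_lookup(local_docs, foreign_docs, stage):
    """Pure-Python approximation of a simple equality $lookup."""
    spec = stage["$lookup"]
    out = []
    for doc in local_docs:
        matches = [
            f for f in foreign_docs
            if f.get(spec["foreignField"]) == doc.get(spec["localField"])
        ]
        # Every local document is kept, even when nothing matches.
        out.append({**doc, spec["as"]: matches})
    return out


joined = emulate_lookup(orders, customers, pipeline[0])
for doc in joined:
    print(doc)
```

The order with no matching customer is kept with an empty list, which is why the operation resembles a left outer join.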


Wednesday, January 4, 2023

Machine Learning Algorithms


1. Supervised Learning Algorithms

How it works: This type of algorithm has a target/outcome variable (or dependent variable) which is to be predicted from a given set of predictors (independent variables). Using this set of variables, we generate a function that maps inputs to desired outputs. The training process continues until the model achieves a desired level of accuracy on the training data. Examples of Supervised Learning: Regression, Decision Tree, Random Forest, KNN, Logistic Regression, etc.
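
As a small illustration of learning a mapping from labeled examples, here is a sketch of a 1-nearest-neighbor classifier (the simplest case of KNN) in plain Python. The feature values, labels, and function name are made up.

```python
def nearest_neighbor_predict(train_x, train_y, query):
    """Return the label of the training point closest to `query`."""
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    best = min(range(len(train_x)), key=lambda i: sq_dist(train_x[i], query))
    return train_y[best]


# Predictors (independent variables) and their known target labels.
train_x = [(150, 45), (155, 50), (180, 85), (185, 90)]
train_y = ["small", "small", "large", "large"]

print(nearest_neighbor_predict(train_x, train_y, (152, 48)))
print(nearest_neighbor_predict(train_x, train_y, (182, 88)))
```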


2. Unsupervised Learning Algorithms

How it works: In this type of algorithm, we do not have any target or outcome variable to predict or estimate. It is used for clustering a population into different groups, for example segmenting customers into groups for specific interventions. Examples of Unsupervised Learning: Apriori algorithm, K-means.
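
A minimal K-means sketch on one-dimensional data shows the idea of grouping without labels. The spending figures and starting centroids are made up, and a real implementation would also handle initialization and convergence checks more carefully.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Alternate between assigning points and recomputing cluster means."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(
                range(len(centroids)), key=lambda i: abs(v - centroids[i])
            )
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical monthly spend per customer; no labels are provided.
spend = [10, 12, 11, 95, 100, 105]
centroids, clusters = kmeans_1d(spend, centroids=[0, 50])
print(centroids, clusters)
```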


3. Reinforcement Learning Algorithms

How it works: Using this type of algorithm, the machine is trained to make specific decisions. It works this way: the machine is exposed to an environment where it trains itself continually using trial and error. The machine learns from past experience and tries to capture the best possible knowledge to make accurate business decisions. Example of Reinforcement Learning: Markov Decision Process
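
To make trial and error concrete, here is a small sketch using tabular Q-learning, a technique chosen for this example and not named in the text, on a tiny made-up Markov Decision Process: five positions on a line with a reward only at the last one. The constants and the environment are assumptions for illustration.

```python
import random

N_STATES = 5             # states 0..4 on a line; state 4 is the goal
ACTIONS = (-1, +1)       # move left or move right
GAMMA, ALPHA = 0.9, 0.5  # discount factor and learning rate (illustrative)


def step(state, action):
    """Deterministic environment: reward 1 only when the goal is reached."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


rng = random.Random(0)
q = [[0.0, 0.0] for _ in range(N_STATES)]  # q[state][action index]

for _ in range(500):                        # 500 episodes of trial and error
    state = 0
    while state != N_STATES - 1:
        a = rng.randrange(2)                # explore by acting at random
        nxt, reward = step(state, ACTIONS[a])
        # The goal row is never updated, so it stays at 0 (terminal state).
        target = reward + GAMMA * max(q[nxt])
        q[state][a] += ALPHA * (target - q[state][a])
        state = nxt

# Greedy policy learned for every non-goal state.
policy = ["left" if row[0] > row[1] else "right" for row in q[:-1]]
print(policy)
```

The machine is never told that moving right is correct; it discovers this from the rewards it collects while exploring.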

Here is the list of commonly used machine learning algorithms. These algorithms can be applied to almost any data problem:

  1. Linear Regression
  2. Logistic Regression
  3. Decision Tree
  4. SVM
  5. Naive Bayes
  6. kNN
  7. K-Means
  8. Random Forest
  9. Dimensionality Reduction Algorithms
  10. Gradient Boosting algorithms
    1. GBM
    2. XGBoost
    3. LightGBM
    4. CatBoost

Monday, January 2, 2023

Components of a Time Series Forecasting in Python

1. Trend: A trend is the general direction in which something is developing or changing. In this time series we see an increasing trend: the passenger count rises over the years. Let’s visualize the trend of the time series:




Here the red line represents an increasing trend of the time series.
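
One common way to expose a trend is a moving average. The sketch below uses a synthetic monthly series (made-up numbers, not the data behind the plot) with a steady rise plus a repeating 12-month pattern; averaging over 12 months cancels the repeating part and leaves the rise.

```python
def moving_average(series, window):
    """Mean of each run of `window` consecutive values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Synthetic monthly counts: +2 per month plus a repeating 12-month pattern.
pattern = [5, 3, 0, -2, -4, -5, -4, -3, -6, -1, 2, 4]
series = [100 + 2 * t + pattern[t % 12] for t in range(36)]

trend = moving_average(series, window=12)
steps = [later - earlier for earlier, later in zip(trend, trend[1:])]
print(steps[:3])  # each step is about 2, the underlying monthly increase
```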

2. Seasonality: Another clear pattern can be seen in the above time series: the pattern repeats at a regular time interval, which is known as seasonality. Any predictable change or pattern in a time series that recurs over a specific time period can be called seasonality. Let’s visualize the seasonality of the time series:




We can see that the time series repeats its pattern every 12 months, i.e., there is a peak every year in January and a trough every year in September. Hence this time series has a seasonality of 12 months.
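
A simple way to check for such a pattern is to average the values month by month. The sketch below uses synthetic data (made-up numbers, built so that January is the peak and September the trough, with only a small level shift from year to year).

```python
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Three synthetic years: a repeating monthly pattern plus a yearly level shift.
pattern = [5, 3, 0, -2, -4, -5, -4, -3, -6, -1, 2, 4]
series = [200 + pattern[t % 12] + (t // 12) for t in range(36)]

# Average of all Januaries, all Februaries, and so on.
monthly_mean = [sum(series[m::12]) / len(series[m::12]) for m in range(12)]

peak = MONTHS[monthly_mean.index(max(monthly_mean))]
trough = MONTHS[monthly_mean.index(min(monthly_mean))]
print(peak, trough)
```

When a series has a strong trend within each year, the trend should be removed first, otherwise it distorts the monthly averages.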

Difference Between a Time Series and Regression Problem

Here you might think that as the target variable is numerical it can be predicted using regression techniques, but a time series problem is different from a regression problem in the following ways:

  • The main difference is that a time series is time-dependent. So the basic assumption of a linear regression model that the observations are independent doesn’t hold in this case.
  • Along with an increasing or decreasing trend, most time series have some form of seasonality, i.e., variations specific to a particular time frame.

So, predicting a time series using regression techniques is not a good approach.
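
The dependence between observations can be measured with the lag-1 autocorrelation, i.e. the correlation between each value and the one before it. This is a minimal sketch on a made-up trending series; a value near zero would be expected if the observations were independent.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum(
        (series[t] - mean) * (series[t + lag] - mean)
        for t in range(n - lag)
    )
    return cov / var


# Made-up series with a steady upward trend.
series = [100 + 2 * t for t in range(24)]
r1 = autocorrelation(series, 1)
print(r1)
```

A value this far from zero shows that each observation carries information about the next one.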

Time series analysis comprises methods for analyzing time-series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values.
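
As one very simple example of such a model, a seasonal naive forecast predicts each future value with the value observed one season earlier. This baseline is chosen here for illustration and is not taken from the text. The history values and function name are made up.

```python
def seasonal_naive_forecast(history, season_length, horizon):
    """Forecast each future step with the value from one season earlier."""
    start = len(history) - season_length
    return [history[start + (h % season_length)] for h in range(horizon)]


# Two synthetic years of monthly values with a repeating pattern.
pattern = [5, 3, 0, -2, -4, -5, -4, -3, -6, -1, 2, 4]
history = [200 + pattern[t % 12] for t in range(24)]

forecast = seasonal_naive_forecast(history, season_length=12, horizon=3)
print(forecast)  # prints [205, 203, 200]
```

Baselines like this are useful mainly as a reference point against which more capable forecasting models can be compared.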
