Thursday, December 29, 2022

Time Series Data

This article was published as a part of the Data Science Blogathon.

Introduction to Time Series Forecasting in Python

Unicorn Investors wants to invest in a new form of transportation: JetRail. JetRail uses jet propulsion technology to run rails and move people at high speed. The investment would only make sense if JetRail can attract more than 1 million monthly users within the next 18 months. To help Unicorn Investors decide, you need to forecast the traffic on JetRail for the next 7 months. You are provided with JetRail traffic data since its inception in the test file.

You can get the dataset here 

It is advised to look at the dataset after completing the hypothesis generation part.

What Is Time Series Data?

Which of the following do you think is an example of time series? Even if you don’t know, try making a guess.

Time Series Data

A time series is generally data that is collected over time and depends on it. In the examples above, the count of cars is independent of time, so it is not a time series, whereas the CO2 level increases with time, so it is a time series.

Let us now look at the formal definition of Time Series.

A series of data points collected in time order is known as a time series. Many businesses work with time series data such as sales numbers, website traffic, traffic counts, and the number of calls received. Time series data can be used for forecasting, for example to estimate next year's sales.
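As a minimal sketch of this definition, the snippet below builds a small time series in pandas. The passenger counts and variable names are made up for illustration and are not taken from the JetRail dataset.

```python
import pandas as pd

# Hypothetical monthly passenger counts (illustrative values only)
index = pd.date_range("2022-01-01", periods=6, freq="MS")  # month-start dates
passengers = pd.Series([112, 118, 132, 129, 121, 135], index=index)

# The observations are stored in time order
print(passengers.index.is_monotonic_increasing)

# Slicing by date works because the index is made of timestamps
first_quarter = passengers.loc["2022-01":"2022-03"]
print(first_quarter)
```

The date-based slice is what distinguishes this from an ordinary list of numbers: each value is tied to the point in time at which it was observed.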

Not all data collected with respect to time represents a time series. Some examples of time series are:

Stock Price


Passenger Count of Airlines


Temperature Over Time


Number of Visitors in a Hotel

Now that we understand what a time series is and can differentiate between time series and non-time series data, let's look at the components of a time series in the next post.



Friday, December 9, 2022

What are datetimes?

You probably already understand that in computer memory, all numeric information is represented as ones and zeros, so at the most basic level, there isn't anything special about dates or times. However, when working with real data in business and technical projects, we tend to think about times and dates in their own units, differently from other numbers. Time is most often thought of as hours, minutes, or seconds, and dates are usually years, months, and days. Other common patterns are days of the week, business days, and quarters. We often group data into bins of days, weeks, months, or quarters. Within these bins, there might be data every second, minute, or hour, or at some other, even irregular, interval.

Because it is natural to think of dates and time of day together, Python in general, and pandas in particular, provide objects to make it easy to work this way. The most fundamental time component in pandas is Timestamp, which is the pandas equivalent of Python's datetime object from the datetime module. Timestamp is used as the index for pandas time series data types, as we'll see a bit later. pandas provides the Timestamp() constructor to convert various types of data into timestamps. Here, we convert a string in a familiar date format into a pandas timestamp:

import pandas as pd

my_timestamp = pd.Timestamp('12-25-2020')

my_timestamp

This code snippet produces the following output:

Timestamp('2020-12-25 00:00:00')


It's intuitive that Timestamp consists of year, month, day, hour, minute, and second. Since we did not provide any time information, Timestamp() assumes that the time portion is 00:00:00. The fact that Timestamp has these components is what makes pandas time series operations so flexible. To illustrate that pandas is already simplifying things for us, here we import Python's datetime and use it to accomplish the same conversion:

from datetime import datetime

my_datetime = datetime.strptime('12-25-2020','%m-%d-%Y')

my_datetime

This produces the following output:

datetime.datetime(2020, 12, 25, 0, 0)


You can see that when using datetime.strptime(), we have to provide a date format for Python to decode the string. The two resulting objects have very similar methods available to them.
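To make that similarity concrete, here is a small sketch that compares the two objects side by side. It recreates the same two variables as the snippets above and uses only standard attributes of both types. The choice of which attributes to compare is ours, for illustration.

```python
from datetime import datetime

import pandas as pd

my_timestamp = pd.Timestamp('12-25-2020')
my_datetime = datetime.strptime('12-25-2020', '%m-%d-%Y')

# Both objects expose the same date components
print(my_timestamp.year, my_timestamp.month, my_timestamp.day)
print(my_datetime.year, my_datetime.month, my_datetime.day)

# Both support weekday() (Monday is 0) and isoformat()
print(my_timestamp.weekday(), my_datetime.weekday())
print(my_timestamp.isoformat(), my_datetime.isoformat())

# They represent the same moment, so they compare as equal
same_moment = my_timestamp == my_datetime

# pandas adds conveniences on top, such as day_name()
day = my_timestamp.day_name()
```

Because Timestamp behaves so much like datetime, code written against the standard library object usually carries over with little change.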


Tuesday, December 6, 2022

Time series data

Time series data is nearly ubiquitous but can be a pain point in many analyses. For example, suppose you are asked to forecast sales for a retail store and are given daily sales figures for the last 6 months. When you review the data, you realize the store is usually open 5 days a week but sometimes has sales on Saturdays and even some Sundays. As a result, most weekend days have missing values, and the time interval of the data is inconsistent. Also, when you consider estimating a monthly forecast, you realize months are of different lengths and have varying numbers of sales days. As simple and obvious as these issues are, they create a number of complications when analyzing and modeling the data over time.
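The retail scenario above can be sketched with a small synthetic series. The dates, the flat sales value of 100, and the single Saturday sale are all assumptions invented for this illustration. The point is only to show how the gaps and uneven months surface in pandas.

```python
import pandas as pd

# Hypothetical data: sales on business days in Jan-Feb 2023, plus one Saturday
dates = pd.bdate_range("2023-01-02", "2023-02-28").append(
    pd.DatetimeIndex(["2023-01-14"])
)
sales = pd.Series(100.0, index=dates).sort_index()

# Reindexing to every calendar day exposes the days with no recorded sales
daily = sales.asfreq("D")
missing_days = int(daily.isna().sum())
print("Calendar days without sales:", missing_days)

# Group by calendar month to compare totals and the number of sales days
by_month = sales.groupby(sales.index.to_period("M"))
monthly_total = by_month.sum()
sales_days = by_month.count()
print(monthly_total)
print(sales_days)
```

Even with identical daily sales, the monthly totals differ because the months contain different numbers of sales days, which is exactly the complication described above.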

The machine learning literature and popular articles are heavily biased toward classification problems, with little mention of time series. Yet much of the data we deal with is time series or at least starts out that way. Time series is a general term used to refer to data that is naturally ordered by time. For example, tweets arrive as a stream of timestamped data. Similarly, store transactions or online credit card transactions are time series. The log streams from data centers are time series.

It's important to note that, unlike tabular data in classification problems, time series data is ordered. With tabular data, samples are often shuffled randomly before being used in a model. In time series, the order matters and we generally want to preserve it. The temporal relationship of events is critical; we can only recognize unusual server traffic if we analyze the sequence of data compared to normal use periods. The time sequence of store transactions can be compared day to day and over longer periods to anticipate high demand periods for inventory and staff planning. The examples are endless.
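One common way to respect this ordering is to split the data chronologically instead of shuffling it. The following is a minimal sketch under that assumption. The series values and the 80/20 split ratio are arbitrary choices for illustration.

```python
import pandas as pd

# Hypothetical daily series of 10 observations
index = pd.date_range("2023-01-01", periods=10, freq="D")
series = pd.Series(range(10), index=index)

# Chronological split: the first 80% for training, the most recent 20% for testing
split = int(len(series) * 0.8)
train = series.iloc[:split]
test = series.iloc[split:]

# Every training observation precedes every test observation
print(train.index.max(), "<", test.index.min())
```

A random shuffle would let the model see observations from after the period it is asked to predict, which a chronological split avoids.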

pandas has a wide range of features for working with time series data. As the pandas documentation notes, pandas time series objects are based on the NumPy datetime64 and timedelta64 dtypes. pandas consolidates many useful features from other libraries such as scikits.timeseries (an older library that pandas has effectively superseded) and adds a lot of functionality of its own for working with time series data. In this chapter, we'll introduce some of the more important capabilities and review how to deal with timestamps in data. The key to understanding how time series differs from other pandas data structures is that pandas provides a few additional object types, namely Timestamp, Timedelta, and Period.
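As a brief, hedged sketch of how these three object types relate to each other, the snippet below creates one of each. The specific dates and durations are arbitrary values chosen for illustration.

```python
import pandas as pd

# Timestamp: a single point in time
ts = pd.Timestamp("2022-12-06 09:30")

# Timedelta: a duration that can be added to a Timestamp
delta = pd.Timedelta(days=1, hours=2)
later = ts + delta
print(later)

# Period: a span of time, here the whole month of December 2022
period = pd.Period("2022-12", freq="M")
print(period.start_time, period.end_time)

# A Timestamp can be mapped to the Period that contains it
in_period = ts.to_period("M") == period
print(in_period)
```

In short, Timestamp answers "when", Timedelta answers "how long", and Period answers "during which interval".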


Thursday, December 1, 2022

Components and applications of pandas

The pandas library has historically been organized into the following components:

• pandas/core: This contains the implementations of the basic data structures of pandas, such as Series and DataFrames. Series and DataFrames are basic toolsets that are very handy for data manipulation and are used extensively by data scientists.

• pandas/src: This consists of algorithms that provide the basic functionalities of pandas. These functionalities are part of the architecture of pandas, which you will not be using explicitly. This layer is written in C or Cython.

• pandas/io: This comprises toolsets for the input and output of files and data. These toolsets facilitate data input from sources such as CSV and text and allow you to write data to formats such as text and CSV.

• pandas/tools: This layer contains all the code and algorithms for pandas functions and methods, such as merge, join, and concat.

• pandas/sparse: This contains sparse versions of data structures such as DataFrames and Series, which store data compactly when most values are missing.

• pandas/stats: This contains a set of tools for handling statistical functions such as regression and classification.

• pandas/util: This contains all the utilities for debugging the library.

• pandas/rpy: This is the interface for connecting to R.
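As a rough illustration of the merge and concat functions mentioned under pandas/tools in the list above, here is a minimal sketch. The customer and order tables are hypothetical data made up for this example.

```python
import pandas as pd

# Hypothetical tables for illustration
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [20.0, 35.0, 15.0]})

# merge: combine columns from two tables on a shared key
# A left merge keeps every customer, even those with no orders
merged = pd.merge(customers, orders, on="customer_id", how="left")
print(merged)

# concat: stack tables with the same columns on top of each other
more_orders = pd.DataFrame({"customer_id": [2], "amount": [50.0]})
all_orders = pd.concat([orders, more_orders], ignore_index=True)
print(all_orders)
```

In this sketch, merge widens the data by matching rows on customer_id, while concat lengthens it by appending rows.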

The versatility of its different architectural components makes pandas useful in many real-world applications. Various data-wrangling functionalities in pandas (such as merge, join, and concatenation) save time when building real-world applications. Some notable applications where the pandas library can come in handy are as follows:

• Recommendation systems

• Advertising

• Stock predictions

• Neuroscience

• Natural language processing (NLP)

The list goes on. What's more important to note is that these are applications that have an impact on people's daily lives. For this reason, learning pandas has the potential to give a fillip to your analytics career.

