Time series data is nearly ubiquitous, but it is a pain point in many analyses. For example, suppose you are asked to forecast sales for a retail store and are given daily sales figures for the last 6 months. When you review the data, you realize the store is usually open 5 days a week but sometimes has sales on Saturdays and even on some Sundays. As a result, most weekend days have missing values, and the interval between observations is inconsistent. When you then consider producing a monthly forecast, you realize that months differ in length and contain varying numbers of sales days. Simple and obvious as these problems are, they complicate analyzing and modeling the data over time.
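To make the retail example concrete, here is a minimal sketch in pandas. The sales figures and dates are invented for illustration; only the shape of the problem (weekday records, an occasional Saturday, months with different numbers of sales days) comes from the description above.

```python
import pandas as pd

# Hypothetical sales records: five weekdays, one Saturday, then the next Monday.
sales = pd.Series(
    [120.0, 135.0, 128.0, 140.0, 150.0, 90.0, 125.0],
    index=pd.to_datetime([
        "2023-01-02", "2023-01-03", "2023-01-04", "2023-01-05",
        "2023-01-06", "2023-01-07", "2023-01-09",
    ]),
    name="sales",
)

# Put the records on a regular daily grid; days with no record become NaN.
daily = sales.asfreq("D")
missing_days = daily[daily.isna()].index

# Count the weekdays (Mon-Fri) in two calendar months to show that the
# number of potential sales days differs from month to month.
weekdays = pd.bdate_range("2023-01-01", "2023-02-28")
days_per_month = pd.Series(1, index=weekdays).groupby(weekdays.to_period("M")).sum()

print(daily)
print(days_per_month)
```

Whether the gaps should stay as NaN, be filled with zero, or be dropped depends on what a missing day means for the store, which is a modeling decision rather than something pandas can decide.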
The machine learning literature and popular articles are heavily biased toward classification problems, with little mention of time series. Yet much of the data we deal with is time series or at least starts out that way. Time series is a general term used to refer to data that is naturally ordered by time. For example, tweets arrive as a stream of timestamped data. Similarly, store transactions or online credit card transactions are time series. The log streams from data centers are time series.
It's important to note that, unlike tabular data in classification problems, time series data is ordered. With tabular data, rows are commonly shuffled or randomly sampled before being used in a model. In time series, the order matters and we generally want to preserve it. The temporal relationship of events is critical: we can only recognize unusual server traffic if we compare the sequence of data against periods of normal use. Likewise, the time sequence of store transactions can be compared day to day and over longer periods to anticipate high-demand periods for inventory and staff planning. The examples are endless.
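As a small illustration of preserving order, the sketch below contrasts a chronological train/test split with a shuffled sample. The `traffic` series and the 80/20 cutoff are assumptions made up for this example, not part of any particular workflow.

```python
import numpy as np
import pandas as pd

# Hypothetical daily request counts for ten consecutive days.
idx = pd.date_range("2023-01-01", periods=10, freq="D")
traffic = pd.Series(np.arange(10), index=idx, name="requests")

# Chronological split: train on the earliest 80%, test on the most recent 20%.
cutoff = int(len(traffic) * 0.8)
train, test = traffic.iloc[:cutoff], traffic.iloc[cutoff:]

# A shuffled sample keeps the same rows but discards their ordering.
shuffled = traffic.sample(frac=1.0, random_state=0)

print(train.index.max(), "<", test.index.min())
```

With the chronological split, every training observation precedes every test observation, so the model is never evaluated on data older than what it was fitted on.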
pandas has a wide range of features for working with time series data. The pandas documentation notes that pandas time series objects are built on the NumPy datetime64 and timedelta64 dtypes. pandas consolidates useful methods from libraries such as scikits.timeseries (so much so that pandas will eventually absorb this library) and adds a lot of functionality of its own for working with time series data. In this chapter, we'll introduce some of the more important capabilities and review how to deal with timestamps in data. The key to understanding how time series handling differs from the rest of pandas is that pandas provides a few additional object types: Timestamp, Timedelta, and Period.
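A brief sketch of the three object types may help here. The dates and durations are arbitrary values chosen for illustration.

```python
import pandas as pd

# Timestamp: a single point in time.
ts = pd.Timestamp("2023-03-15 09:30")

# Timedelta: a duration. Adding it to a Timestamp yields another Timestamp.
delta = pd.Timedelta(days=2, hours=3)
later = ts + delta

# Period: a span of time, here the whole month of March 2023.
month = pd.Period("2023-03", freq="M")
in_march = month.start_time <= ts <= month.end_time

# A DatetimeIndex holds many timestamps, backed by NumPy datetime64 values.
idx = pd.date_range("2023-03-01", periods=3, freq="D")
step = idx[1] - idx[0]  # subtracting two Timestamps yields a Timedelta

print(later, in_march, month.days_in_month, idx.dtype, step)
```

Roughly speaking, a Timestamp answers "when", a Timedelta answers "how long", and a Period answers "during which interval".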