Monday, July 18, 2022

A Typical Data Science Pipeline

Data science is a vast and dynamic field, and in the modern business and technology space it has become a truly transformative force. Industries of every kind, from healthcare to transportation and from online retail to on-demand music, use DS tools and techniques in myriad ways.

Every day, exabytes of business and personal data flow through increasingly complex pipelines, built on sophisticated DataOps architectures, to be ingested, processed, and analyzed by database engines or machine learning algorithms, ultimately leading to insightful business decisions or technological breakthroughs. To illustrate the point about efficient data science practices, let's take the generic example of a typical data science task flow, shown below:

[Figure: A typical data science task flow, covering data ingestion and wrangling, visualization, statistical modeling, ML training, testing, and deployment]
You probably suspect that there is a high chance of writing inefficient code in the data wrangling or ingestion phase. However, you may wonder what could go wrong in the machine learning/statistical modeling phase, since you may be using out-of-the-box methods and routines from highly optimized Python libraries like Scikit-learn, SciPy, or TensorFlow. Furthermore, you may wonder why tasks like quality testing and app deployment should be included in a productive data science pipeline at all.
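As a minimal sketch of the kind of inefficiency that can creep into a wrangling step (the DataFrame and column names here are hypothetical, not taken from this post), compare a row-by-row loop with its vectorized equivalent:

import numpy as np
import pandas as pd

# Hypothetical wrangling task: z-score a numeric column.
df = pd.DataFrame({"price": np.random.uniform(10, 500, size=10_000)})
col_mean, col_std = df["price"].mean(), df["price"].std()

# Inefficient: a pure-Python loop over rows bypasses pandas' fast C internals.
slow = [(row["price"] - col_mean) / col_std for _, row in df.iterrows()]

# Efficient: one vectorized expression does the same work in a single pass.
fast = (df["price"] - col_mean) / col_std

Both versions produce the same values; the difference is purely in how much work happens in interpreted Python rather than in the library's optimized internals.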

Some modules of the DS pipeline shown in the figure above, such as data wrangling, visualization, statistical modeling, ML training, and testing, are more directly impacted by inefficient programming styles and practices than others.
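One simple way to see how much an inefficient style costs a given pipeline module is to time the alternatives. The sketch below reuses the hypothetical normalization task from above with Python's built-in timeit module:

import timeit

import numpy as np
import pandas as pd

# Hypothetical benchmark: how much does the row-wise style cost?
df = pd.DataFrame({"price": np.random.uniform(10, 500, size=10_000)})
col_mean, col_std = df["price"].mean(), df["price"].std()

def slow_normalize():
    return [(row["price"] - col_mean) / col_std for _, row in df.iterrows()]

def fast_normalize():
    return (df["price"] - col_mean) / col_std

# timeit runs each callable a fixed number of times and reports total seconds,
# making the gap between the two styles directly visible.
print("row-wise loop:", timeit.timeit(slow_normalize, number=5), "s")
print("vectorized   :", timeit.timeit(fast_normalize, number=5), "s")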

In the next posts, I'll show some simple examples and take you through some data science stories.
