Friday, July 22, 2022

Combinatorial Explosion

At the beginning of a data analysis task, it is tempting to visualize the pairwise relationships between all the numeric features in the dataset. This is often a necessary step in exploratory data analysis and can reveal significant insights about the general patterns in the data. However, for large datasets with hundreds of features (columns), it can put extreme pressure on the visualization routine, leading to poor plots and a slow response.

It is easy to explain why this apparently simple (pairwise) scatter plot task can quickly become intractable. The reason is combinatorial explosion. Essentially, you are trying to plot every two-way relationship, so you have nC2 = n(n-1)/2 possible combinations to plot, where n is the number of numeric features and C denotes the "choose" (combination) operator. Some concrete examples will help.

• 4C2 = 6 so you have 6 plots for pairwise plotting 4 features in a dataset

• 6C2 = 15 so you have 15 plots for pairwise plotting 6 features in a dataset

• 10C2 = 45 so you have 45 plots for pairwise plotting 10 features in a dataset

• 20C2 = 190 so you have 190 plots for pairwise plotting 20 features in a dataset
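The counts above can be checked with a few lines of Python. This is only a minimal sketch: the helper name pairwise_plot_count is made up for illustration, and it relies on math.comb from the standard library (Python 3.8 or later).

```python
from math import comb


def pairwise_plot_count(n_features: int) -> int:
    """Number of distinct feature pairs, i.e. nC2 = n * (n - 1) / 2."""
    return comb(n_features, 2)


# Reproduce the examples from the list, plus a "hundreds of features" case.
for n in (4, 6, 10, 20, 100):
    print(f"{n:>3} features -> {pairwise_plot_count(n):>5} pairwise plots")
```

Because the count grows quadratically, going from 20 to 100 features (5 times as many) raises the number of plots from 190 to 4950 (about 26 times as many).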

As you can see in the figure below, the number of plots increases rather quickly! On top of that, if you have a large dataset (with millions of samples), then each plot needs to render millions of data points on the screen. It is computationally prohibitive to render millions of points in a web browser for hundreds of plots.
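To get a feel for the rendering load, here is a rough back-of-the-envelope sketch. The function name total_points_rendered is hypothetical, and it assumes that every scatter plot draws one marker per sample with no downsampling.

```python
from math import comb


def total_points_rendered(n_features: int, n_samples: int) -> int:
    """Total markers drawn if every feature pair gets a full scatter plot.

    Assumes one marker per sample in each of the nC2 plots.
    """
    return comb(n_features, 2) * n_samples


# 20 features and one million samples: 190 plots of a million points each.
print(f"{total_points_rendered(20, 1_000_000):,}")  # 190,000,000
```

Even a modest 20-feature dataset with a million rows works out to 190 million markers, which illustrates why a browser-based renderer struggles.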


