A common task is to take a large pandas DataFrame and check some relationship between the fields in its columns, row by row. The check could be a simple logical operation or a sophisticated mathematical transformation of the raw data.
Essentially, it is a simple case of iterating over the rows of the DataFrame and doing some processing at each iteration. However, it may not be that simple in terms of choosing the most efficient method of executing this apparently simple task. For example, you can choose from the following approaches.
Brute-Force for Loop
The code for this naïve approach will go something like this:
for i in range(len(df)):
    row = df.iloc[i]
    if condition(row):    # placeholder: any per-row test
        process(row)      # placeholder: any per-row calculation
Essentially, you are iterating over each row (df.iloc[i]) using a generic for loop and processing it one at a time. There’s nothing wrong with the logic and you will get the correct result in the end.
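To make the pattern concrete, here is a minimal runnable sketch of the same loop. The column names "a" and "b", the condition (a > 0.5), and the calculation (a * b) are illustrative assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two numeric columns named "a" and "b".
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(1_000), "b": rng.random(1_000)})

def loop_with_iloc(frame):
    """Sum a * b over the rows where a > 0.5, one row at a time via iloc."""
    total = 0.0
    for i in range(len(frame)):
        row = frame.iloc[i]                # builds a new Series on every iteration
        if row["a"] > 0.5:                 # the per-row condition (assumed)
            total += row["a"] * row["b"]   # the per-row calculation (assumed)
    return total

result = loop_with_iloc(df)
```

The frame is kept small here so the example finishes quickly; the logic is unchanged for larger inputs.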
But this is inefficient: every df.iloc[i] call builds a new Series object for the row, and that overhead is paid on each iteration. If you try this approach on a DataFrame with a large number of rows, say ~1,000,000 (1 million) and 10 columns, the full iteration may run for tens of seconds or more, even on a fast machine.
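If you want a rough feel for this cost on your own machine, a small timing harness along the following lines may help. The helper name time_iloc_loop and the placeholder condition are assumptions made for illustration; actual timings depend on hardware and pandas version, so no numbers are quoted here.

```python
import time

import numpy as np
import pandas as pd

def time_iloc_loop(n_rows, n_cols=10):
    """Return (elapsed seconds, matching row count) for an iloc loop over random data."""
    frame = pd.DataFrame(np.random.default_rng(0).random((n_rows, n_cols)))
    count = 0
    start = time.perf_counter()
    for i in range(len(frame)):
        row = frame.iloc[i]           # one Series is constructed per row
        if row.iloc[0] > 0.5:         # placeholder condition on the first column
            count += 1
    elapsed = time.perf_counter() - start
    return elapsed, count

# Start small and scale n_rows up gradually toward 1_000_000.
elapsed, count = time_iloc_loop(10_000)
print(f"10,000 rows took {elapsed:.3f} s")
```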
Now, you may think that processing a million records in tens of seconds is still acceptable. But as you increase the number of columns or the complexity of the calculation (or of the condition checked at each iteration), the per-row costs quickly add up, so this approach should be avoided as much as possible when building scalable data science pipelines. On top of that, if you have to repeat such iteration tasks for hundreds of datasets on a regular basis (in a standard business/production environment), the inefficiencies will stack up over time.
Depending on the situation at hand, you have a choice of two better approaches for this iteration task.
• pandas has a dedicated method for iterating over rows, iterrows(), which yields each row as an (index, Series) pair. Depending on the DataFrame size and the complexity of the row operations, this may reduce the total execution time by ~10X compared with the for loop approach.
• pandas exposes a NumPy representation of the DataFrame through the df.values attribute (it is an attribute, not a method, so there are no parentheses; the pandas documentation now recommends the equivalent df.to_numpy()). Iterating over this array can speed things up significantly, even more than iterrows(). However, the array carries no axis labels (column names), so you must refer to fields by integer position, such as 0, 1, and so on.
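The two alternatives can be sketched side by side as follows. As before, the columns "a" and "b", the condition, and the calculation are illustrative assumptions; only iterrows() and the values attribute come from the discussion above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: two numeric columns named "a" and "b".
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.random(1_000), "b": rng.random(1_000)})

def loop_with_iterrows(frame):
    """Same per-row work, using iterrows(); fields are accessed by column name."""
    total = 0.0
    for _, row in frame.iterrows():    # yields (index label, row as a Series)
        if row["a"] > 0.5:
            total += row["a"] * row["b"]
    return total

def loop_over_values(frame):
    """Same per-row work over the NumPy array; fields are accessed by position."""
    total = 0.0
    for row in frame.values:           # attribute access, no parentheses
        if row[0] > 0.5:               # position 0 corresponds to column "a"
            total += row[0] * row[1]   # position 1 corresponds to column "b"
    return total

total_iterrows = loop_with_iterrows(df)
total_values = loop_over_values(df)
```

Both functions should produce the same total. The second relies on the column order, so it will silently compute something else if the columns are rearranged, which is the price of dropping the labels.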
We will continue our discussion in the next post.