Let’s make this simple by looking at how a common operation is handled by Apache Spark compared with ordinary serial/parallel execution pipelines.
The common operation is loading a file/table from a source, applying transformations (as per the business requirements), and saving the transformed data as output.
- Input — a file/table with one billion records.
- Transformation to apply — a filter operation that filters out the majority of the records. Only a hundred records match the filter condition.
- Output — the filtered hundred records are saved as output (file/table format).
How the scenario is handled in normal serial execution pipelines:
- Step 1: The complete set of records (one billion) is loaded.
- Step 2: The filter operation is applied, and only the hundred records that match the filter condition are retrieved for further operations.
- Step 3: The output is saved in file/table format.
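The steps above can be sketched in plain Python. This is a minimal illustration of the eager approach, not Spark code; the record count is scaled down from a billion to a million for runnability, and `load_records` and `matches` are hypothetical stand-ins for the source and the filter condition:

```python
# Eager pipeline: each step materializes its full result before the next begins.

def load_records(n):
    # Hypothetical source: stands in for reading a file/table.
    # Step 1: ALL n records are loaded into memory at once.
    return list(range(n))

def matches(record):
    # Hypothetical filter condition that only a few records satisfy.
    return record % 10_000 == 0

records = load_records(1_000_000)               # a million stand-in records in memory
filtered = [r for r in records if matches(r)]   # Step 2: filter only after loading everything
print(len(filtered))                            # Step 3: "save" the small result -> 100
```

The cost is visible in the first line: the entire source is held in memory even though only a hundred records survive the filter.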
How the scenario is handled in Apache Spark following lazy evaluation:
Spark execution begins only when an action is triggered. In this use case, loading and applying the filter condition are transformations, while saving the output is an action.
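The transformation/action split can be sketched with a toy plan-builder in plain Python. This mimics the behaviour, not Spark's API; `LazyDataset`, `filter`, and `collect` are hypothetical names for illustration:

```python
class LazyDataset:
    """Toy stand-in for an RDD/DataFrame: transformations record a plan;
    nothing executes until an action is called."""

    def __init__(self, source, plan=()):
        self.source = source  # callable producing an iterable of records
        self.plan = plan      # recorded transformations, not yet executed

    def filter(self, predicate):
        # Transformation: just record the step and return a new lazy dataset.
        return LazyDataset(self.source, self.plan + (("filter", predicate),))

    def collect(self):
        # Action: only now does data actually flow through the recorded plan.
        records = self.source()
        for kind, fn in self.plan:
            if kind == "filter":
                records = (r for r in records if fn(r))
        return list(records)

ds = LazyDataset(lambda: range(1_000_000))
small = ds.filter(lambda r: r % 10_000 == 0)  # no work happens here
result = small.collect()                      # execution starts with the action
print(len(result))                            # -> 100
```

Calling `filter` is cheap because it only extends the plan; all the actual reading and filtering is deferred until `collect`, just as Spark defers work until an action.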
When the action (saving the output) is called, Spark already knows the full plan and that the final output is only the filtered records. So instead of materializing all one billion records in memory, Spark streams records through the filter, and with sources that support predicate pushdown (such as Parquet files or database tables) it can skip reading much of the non-matching data altogether.
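The saving is easiest to see with Python generators, which are also lazy: records stream through the filter one at a time, and when the action only needs a few results (analogous to Spark's `take`), the source stops being read early. The `reads` counter is hypothetical, added only to show how much of the source is actually touched:

```python
import itertools

reads = 0  # counts how many source records were actually pulled

def stream_records(n):
    # Lazy source: yields one record at a time; never holds all n in memory.
    global reads
    for r in range(n):
        reads += 1
        yield r

# Lazy filter: a generator expression does no work until something pulls from it.
matching = (r for r in stream_records(1_000_000) if r % 10_000 == 0)

# Action analogous to take(5): pull records only until 5 matches are found.
first_five = list(itertools.islice(matching, 5))
print(first_five)  # [0, 10000, 20000, 30000, 40000]
print(reads)       # 40001 -- only a fraction of the million records were read
```

Because the pipeline is driven from the consuming end, the source is read only as far as the action requires, which is the same principle that lets Spark avoid wasted work.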
This is why lazy evaluation plays an important role in Spark's performance. Lazy evaluation is not limited to Spark: other distributed systems, and programming languages such as Haskell, use the same concept.