We live in a world where every human interaction becomes an event in a system, whether it’s purchasing clothes online or in-store, scrolling social media, or taking an Uber. Unsurprisingly, all these events are processed in one way or another. Some events expect a quick response, so they are processed immediately. For instance, when you complete a ride with Uber, you receive the receipt within a few seconds. The input and output are usually 1-to-1: one event produces one result.
Other events create more value when they are processed collectively in the background. An example is generating a monthly report, where you need to combine all the transactions of that month. The input and output are usually many-to-1: many events produce one result. This is called batch processing.
As data practitioners, we deal with batches every day. Batch processing is an old-school but still very powerful data processing method that every data person should know. Because it’s such a fundamental area, there is much to explore. In this article, I will start with the use cases of batch processing and how businesses can benefit from it, and then move on to its technical aspects. By the end of the article, you should have an idea of how to work with batches effectively in your environment.
What is batch processing, and why use it?
From the intro, we learned that batch processing means processing a group of events (a batch) in one job. It differs from transaction processing, which handles one event at a time. Events in a batch usually share the same attributes and belong to the same business context.
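To make the contrast concrete, here is a minimal Python sketch. The `RideEvent` type, the field names, and the sample values are all made up for illustration; they are not taken from any real system. `process_single` stands in for transaction processing (one event in, one receipt out), while `process_batch` stands in for batch processing (many events in, one report out).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RideEvent:
    # Hypothetical event shape, assumed for this example only.
    rider_id: str
    month: str   # e.g. "2024-05"
    fare: float


def process_single(event: RideEvent) -> str:
    """Transaction-style processing: one event in, one receipt out."""
    return f"Receipt for {event.rider_id}: {event.fare:.2f}"


def process_batch(events: list[RideEvent]) -> dict[str, float]:
    """Batch-style processing: many events in, one report of totals per month out."""
    totals: dict[str, float] = defaultdict(float)
    for event in events:
        totals[event.month] += event.fare
    return dict(totals)


# Made-up sample data.
events = [
    RideEvent("alice", "2024-05", 12.50),
    RideEvent("bob", "2024-05", 7.25),
    RideEvent("alice", "2024-05", 20.00),
]

receipts = [process_single(e) for e in events]  # 1-to-1: three events, three receipts
report = process_batch(events)                  # many-to-1: three events, one report
```

Note that the events in the batch share the same attributes and the same business context (rides in a given month), which is what makes it meaningful to aggregate them in a single job.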
In most cases, we choose batch processing for two reasons.
Business
Certain outputs can only be generated when a series of records are present. Examples are end-of-month report generation, payroll processing, billing, and invoicing systems. The…