Spring Batch Introduction
Spring Batch is a batch processing framework provided by Spring. Many applications in the enterprise domain require batch processing in order to perform business operations in mission-critical environments. These business operations include the following.
- Automated, complex processing of large amounts of information that can be processed most efficiently without user interaction. These operations often include time-based events (e.g., month-end calculations, notifications, or communications).
- Repetitive processing of periodic applications of complex business rules (e.g., insurance benefit determinations or rate adjustments) in very large data sets.
- Integration of information received from internal and external systems that often needs to be formatted, validated and processed into systems of record in a transactional manner. Batch processing is used to process billions of transactions per day for the enterprise.
Spring Batch is a lightweight, comprehensive batch framework designed for developing robust batch applications that are vital to the daily operation of enterprise systems. Spring Batch builds on the expected Spring Framework characteristics (productivity, a POJO-based development approach, and general ease of use) while making it easy for developers to access and use more advanced enterprise services when necessary. Note that Spring Batch is not a scheduling framework.
Spring Batch provides reusable features that are critical for handling large volumes of data, including logging/tracking, transaction management, job processing statistics, job restarts, skipping and resource management. It also provides more advanced technical services and features that enable extremely high volume and high performance batch jobs through optimization and partitioning techniques.
Spring Batch can be used both for simple use cases (such as reading a file into a database or running a stored procedure) and for complex, high-volume use cases (such as moving and transforming large amounts of data between databases). High-volume batch jobs can leverage the framework in a highly scalable way to process large amounts of information.
Spring Batch Architecture
A typical batch application is roughly as follows.
- Read a large number of records from a database, file, or queue.
- Process the data in some way.
- Write back the data in a modified form.
A general architecture of Spring Batch is as follows: a Job can define many Steps, and each Step can define its own ItemReader for reading data, ItemProcessor for processing data, and ItemWriter for writing data. Every defined Job is stored in the JobRepository, and we can start a Job through the JobLauncher.
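Put together, a minimal configuration might look roughly like the sketch below, written with Spring Batch 4's Java-config builders. The bean names, the String item type, and the chunk size are illustrative assumptions, not from the source.

```java
// Minimal sketch of the architecture described above: one Job with one Step,
// wired from an ItemReader, ItemProcessor, and ItemWriter.
@Configuration
@EnableBatchProcessing
public class BatchConfig {

    @Bean
    public Step copyStep(StepBuilderFactory steps,
                         ItemReader<String> reader,
                         ItemProcessor<String, String> processor,
                         ItemWriter<String> writer) {
        return steps.get("copyStep")
                .<String, String>chunk(10)   // commit every 10 items
                .reader(reader)              // where the data comes from
                .processor(processor)        // business logic on each item
                .writer(writer)              // where the data goes
                .build();
    }

    @Bean
    public Job copyJob(JobBuilderFactory jobs, Step copyStep) {
        return jobs.get("copyJob")           // registered in the JobRepository
                .start(copyStep)
                .build();
    }
}
```

The job is then started through a JobLauncher, which is introduced in the sections that follow.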
Spring Batch Core Concepts Introduction
Here are some concepts that are core to the Spring Batch framework.
What is Job
Job and Step are the two core concepts Spring Batch uses to execute batch tasks.
Job is a concept that encapsulates the entire batch process. It is reflected in the code as a top-level interface, whose code is as follows.
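Lightly simplified from the Spring Batch source, the interface looks like this:

```java
// The Job interface as defined in Spring Batch (signatures lightly simplified)
public interface Job {
    String getName();
    boolean isRestartable();
    void execute(JobExecution execution);
    JobParametersIncrementer getJobParametersIncrementer();
    JobParametersValidator getJobParametersValidator();
}
```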
There are five methods defined in the Job interface. There are two main implementations of Job: SimpleJob and FlowJob. In Spring Batch, Job is the top-level abstraction; beneath it sit two lower-level abstractions, JobInstance and JobExecution.
A Job is the basic unit of work, and it is composed of Steps. A Job combines its Steps in a specified logical order and gives us ways to set shared properties for all Steps, such as event listeners and skip policies.
Spring Batch provides a default simple implementation of the Job interface in the form of the SimpleJob class, which adds some standard functionality on top of Job. An example using Java config is as follows.
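For reference, the well-known example from the Spring Batch documentation configures a job named footballJob from three step beans; the step-definition methods playerLoad, gameLoad, and playerSummarization are assumed to be defined elsewhere in the same configuration class.

```java
@Bean
public Job footballJob(JobBuilderFactory jobBuilderFactory) {
    return jobBuilderFactory.get("footballJob")  // the job's name
            .start(playerLoad())                 // step 1
            .next(gameLoad())                    // step 2
            .next(playerSummarization())         // step 3
            .build();
}
```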
This configuration means that the job is first given the name footballJob, and then its three steps are specified, each implemented by its own step-definition method.
What is JobInstance
We have already mentioned JobInstance above; it is a lower-level abstraction of Job, and it is defined as follows.
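The following is a simplified, self-contained sketch; the real JobInstance is a public class that extends Spring Batch's Entity base class, which is what actually holds the id.

```java
// Simplified sketch of JobInstance: it only exposes the instance id and job name.
class JobInstance {
    private final long instanceId;
    private final String jobName;

    JobInstance(long instanceId, String jobName) {
        this.instanceId = instanceId;
        this.jobName = jobName;
    }

    public long getInstanceId() { return instanceId; }
    public String getJobName()  { return jobName; }
}
```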
Its methods are very simple: one returns the id of the instance, the other returns the name of the Job. A JobInstance represents one logical run of a Job at runtime (as the name suggests, it is an instance of the Job).
Say we have a batch job that runs once at the end of every day, and assume its name is EndOfDay. In that case there is one logical JobInstance per day, and we must record each run of the job.
What are JobParameters
As mentioned above, if the same job runs once a day, there is a new JobInstance every day, yet the job definition is identical each time. So how can we distinguish the different JobInstances of one job? Although the job definition behind each JobInstance is the same, each instance has something that differs, such as its run date.
What Spring Batch provides to identify a JobInstance is JobParameters. A JobParameters object holds a set of parameters used to start a batch job, and they can be used for identification or even as reference data during the run. The run date from our hypothetical example can then be used as a JobParameter.
For example, our previous EndOfDay job now has two instances, one generated on January 1 and the other on January 2, so we can define two JobParameters objects: one with the parameter 01-01, and the other with the parameter 01-02. Thus, the rule for identifying a JobInstance is: JobInstance = Job + identifying JobParameters.
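In code, the two instances can be distinguished with Spring Batch's JobParametersBuilder; the parameter key "date" and the values below are illustrative, not from the source.

```java
// Two JobParameters objects that identify two distinct JobInstances
// of the same EndOfDay job.
JobParameters jan1 = new JobParametersBuilder()
        .addString("date", "01-01")
        .toJobParameters();
JobParameters jan2 = new JobParametersBuilder()
        .addString("date", "01-02")
        .toJobParameters();
// Launching the job with jan1 and then jan2 yields two different JobInstances.
```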
So we can address the correct JobInstance by supplying the corresponding JobParameters.
What is JobExecution
JobExecution represents, at the code level, a single attempt to run a job that we have defined. A single execution may fail or succeed; only when an execution completes successfully is the corresponding JobInstance considered complete.
Again using the EndOfDay job described earlier as an example, suppose the first run for 01-01-2019 results in a failure. If the job is then run again with the same JobParameters as the first run (i.e., 01-01-2019), a new JobExecution is created that corresponds to the previous JobInstance, and there is still only one JobInstance.
The interface of
JobExecution is defined as follows.
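A simplified sketch of its most important accessors is shown below; the real JobExecution is a class with many more methods than these.

```java
// Key accessors of JobExecution (simplified sketch)
public interface JobExecution {
    JobInstance getJobInstance();     // the JobInstance this execution belongs to
    JobParameters getJobParameters(); // the parameters the execution was started with
    BatchStatus getStatus();          // current BatchStatus of this execution
    ExitStatus getExitStatus();       // exit status reported when the run ends
    Date getStartTime();
    Date getEndTime();
}
```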
The comments on each method explain them clearly, so I won't elaborate here. Just note that JobExecution provides a getStatus method for obtaining the status of a particular execution of a job.
BatchStatus is an enumeration class representing the status of a job, and is defined as follows.
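The sketch below lists the actual enum constants, in declaration order, together with simplified versions of two of its helper methods; the real enum carries a few more methods.

```java
// BatchStatus constants as declared in Spring Batch, with simplified helpers.
enum BatchStatus {
    COMPLETED, STARTING, STARTED, STOPPING, STOPPED, FAILED, ABANDONED, UNKNOWN;

    // A status counts as "running" while the job is starting or started.
    public boolean isRunning() {
        return this == STARTING || this == STARTED;
    }

    // FAILED and everything declared after it count as unsuccessful.
    public boolean isUnsuccessful() {
        return this == FAILED || this.compareTo(FAILED) > 0;
    }
}
```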
These properties are critical information about a job execution, and Spring Batch persists them to the database. When Spring Batch is in use, it automatically creates a set of tables to store job-related information; the table used to store JobExecution data is BATCH_JOB_EXECUTION.
What is a Step
Each Step object encapsulates a separate phase of a batch job. In fact, each Job is essentially composed of one or more steps. Each step contains all the information needed to define and control the actual batch process.
Any particular content is at the discretion of the developer writing the Job. A Step can be very simple or very complex. For example, a Step that loads data from a file into a database requires little or no code, given current Spring Batch support. A more complex Step may contain elaborate business logic that runs as part of its processing.
Like Job, Step has a corresponding StepExecution, analogous to JobExecution.
What is StepExecution
A StepExecution represents a single attempt to execute a Step; a new StepExecution is created each time a Step runs, similar to JobExecution. However, a Step may never execute if the Step before it fails, and a StepExecution is created only when its Step actually starts.
An instance of a step execution is represented by an object of the class StepExecution, which contains a reference to its corresponding Step, a reference to the enclosing JobExecution, and transaction-related data such as commit and rollback counts and start and finish times. In addition, each step execution contains an ExecutionContext, which holds any data the developer needs to persist across the batch run, such as statistics or state information needed for restarting.
What is ExecutionContext
ExecutionContext is the execution environment of each StepExecution (a JobExecution has one as well). It contains a collection of key-value pairs. We can get hold of an ExecutionContext with the following code.
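The calls below use the real Spring Batch API; the stepExecution variable and the key name are illustrative, assuming code that runs where a StepExecution is available (for example, in a listener).

```java
// Obtain the context of the current step execution...
ExecutionContext stepContext = stepExecution.getExecutionContext();
// ...and the context of the enclosing job execution.
ExecutionContext jobContext = stepExecution.getJobExecution().getExecutionContext();

// An ExecutionContext behaves like a map of key-value pairs:
stepContext.putLong("rows.processed", 42L);
long processed = stepContext.getLong("rows.processed");
```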
What is JobRepository
JobRepository is the mechanism used to persist the concepts described above (jobs, steps, and so on). It provides the CRUD operations needed by Job and Step implementations, as well as by the JobLauncher implementation mentioned below.
When a Job is first started, a JobExecution is obtained from the repository, and in the course of the batch process the JobExecution (and its StepExecutions) are persisted to the repository. The @EnableBatchProcessing annotation provides automatic configuration of the JobRepository and the related infrastructure.
What is JobLauncher
The function of JobLauncher is very simple: it starts a Job with a given set of JobParameters. Why emphasize JobParameters here? As mentioned earlier, a Job together with its JobParameters forms a job execution. Here is the code example.
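The JobLauncher interface itself is small; this is its actual definition in Spring Batch:

```java
public interface JobLauncher {
    JobExecution run(Job job, JobParameters jobParameters)
            throws JobExecutionAlreadyRunningException,
                   JobRestartException,
                   JobInstanceAlreadyCompleteException,
                   JobParametersInvalidException;
}
```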
The run method above obtains a JobExecution from the JobRepository and executes the Job, based on the Job and JobParameters passed in.
What is Item Reader
ItemReader is the abstraction for reading data; its role is to supply the data input for each Step. When an ItemReader has read all its data, it returns null to tell subsequent operations that no data is left.
Spring Batch provides many useful implementation classes of ItemReader, such as JdbcPagingItemReader, JdbcCursorItemReader, and so on.
The data sources supported by ItemReader are also very rich, including various kinds of databases, files, and data streams; almost every scenario we encounter is covered.
Here is an example of using JdbcPagingItemReader. A JdbcPagingItemReader must be given a PagingQueryProvider, which is responsible for producing the SQL queries used to return data page by page.
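A hypothetical bean definition might look like this; the Person type, table name, and column names are illustrative assumptions, while the builder and reader classes are real Spring Batch API.

```java
@Bean
public JdbcPagingItemReader<Person> personReader(DataSource dataSource) throws Exception {
    // The query provider builds database-specific paged SQL.
    SqlPagingQueryProviderFactoryBean provider = new SqlPagingQueryProviderFactoryBean();
    provider.setDataSource(dataSource);
    provider.setSelectClause("SELECT id, first_name, last_name");
    provider.setFromClause("FROM person");
    provider.setSortKey("id"); // paging requires a deterministic sort key

    JdbcPagingItemReader<Person> reader = new JdbcPagingItemReader<>();
    reader.setDataSource(dataSource);
    reader.setQueryProvider(provider.getObject());
    reader.setPageSize(100); // rows fetched per page
    reader.setRowMapper(new BeanPropertyRowMapper<>(Person.class));
    return reader;
}
```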
What is Item Writer
Since ItemReader is the abstraction for reading data, ItemWriter is naturally the abstraction for writing data; it provides the data-writing capability for each Step. The unit of writing is configurable: we can write one item at a time, or a chunk of items at a time (chunks are introduced below). An ItemWriter does not read or process data; it is responsible only for writing.
Spring Batch also provides many useful implementation classes of ItemWriter, and of course we can implement our own writer as well.
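A custom writer can be as simple as the hypothetical sketch below, which uses the Spring Batch 4.x ItemWriter signature and simply logs each chunk; the class name and behavior are illustrative.

```java
import java.util.List;

// Hypothetical custom ItemWriter: logs every item of each chunk it receives.
public class LoggingItemWriter implements ItemWriter<String> {
    @Override
    public void write(List<? extends String> items) {
        // Called once per chunk, with all the items of that chunk.
        for (String item : items) {
            System.out.println("writing: " + item);
        }
    }
}
```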
What is Item Processor
ItemProcessor is the abstraction for the project's business logic. After the ItemReader has read a record, and before the ItemWriter writes it, we can use an ItemProcessor to apply business logic and transform the data accordingly. If we decide inside an ItemProcessor that a piece of data should not be written, we can indicate this by returning null. ItemProcessor works very well together with ItemReader and ItemWriter; passing data between them is straightforward, so we can simply use them as they are.
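The sketch below shows a hypothetical processor that trims and upper-cases names and filters out blank ones by returning null. A local stand-in for the ItemProcessor interface is included so the example compiles on its own; the real interface lives in org.springframework.batch.item.

```java
// Stand-in for Spring Batch's ItemProcessor interface (simplified),
// included only so this sketch is self-contained.
interface ItemProcessor<I, O> {
    O process(I item) throws Exception;
}

// Hypothetical processor: trims and upper-cases names, filtering out blanks.
class NameProcessor implements ItemProcessor<String, String> {
    @Override
    public String process(String item) {
        if (item == null || item.trim().isEmpty()) {
            return null; // returning null filters the item: it will not be written
        }
        return item.trim().toUpperCase();
    }
}
```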
The processing flow of chunk
Spring Batch gives us the ability to process data in chunks. In one batch task there may be a great many read and write operations, so processing items and committing them to the database one at a time is inefficient. Spring Batch therefore provides the concept of a chunk: we set a chunk size, and Spring Batch processes the data item by item without committing; only when the number of processed items reaches the configured chunk size is the whole chunk committed together.
The Java example definition code is as follows.
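The step might be defined roughly as follows, using Spring Batch 4's StepBuilderFactory; the bean names and String item type are illustrative assumptions.

```java
// Hypothetical step definition with a chunk size of 10.
@Bean
public Step sampleStep(StepBuilderFactory stepBuilderFactory,
                       ItemReader<String> reader,
                       ItemWriter<String> writer) {
    return stepBuilderFactory.get("sampleStep")
            .<String, String>chunk(10) // read items one by one, write 10 at a time
            .reader(reader)
            .writer(writer)
            .build();
}
```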
In the step above, the chunk size is set to 10: when the number of items read by the ItemReader reaches 10, the whole chunk is passed to the ItemWriter at once, and the transaction is committed.
Skip strategy and failure handling
A batch job's Step may process a very large amount of data, and errors are inevitable; even if they are unlikely, we must account for them, because what matters most is guaranteeing the eventual consistency of the data. Spring Batch takes this into account and provides the relevant support. See the bean configuration below.
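A hypothetical fault-tolerant step matching the description that follows could look like this; the bean names and String item type are illustrative, while faultTolerant, skipLimit, skip, and noSkip are real Spring Batch builder methods.

```java
// Up to 10 exceptions may be skipped, except FileNotFoundException,
// which fails the step immediately.
@Bean
public Step skippingStep(StepBuilderFactory stepBuilderFactory,
                         ItemReader<String> reader,
                         ItemWriter<String> writer) {
    return stepBuilderFactory.get("skippingStep")
            .<String, String>chunk(10)
            .reader(reader)
            .writer(writer)
            .faultTolerant()
            .skipLimit(10)                        // allow at most 10 skips
            .skip(Exception.class)                // skip any exception...
            .noSkip(FileNotFoundException.class)  // ...except this fatal one
            .build();
}
```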
We need to pay attention to three methods: skipLimit, skip, and noSkip. The skipLimit method sets how many exceptions this step is allowed to skip; if we set it to 10, then as long as no more than 10 exceptions are thrown while the step runs, the step as a whole will not fail. Note that if skipLimit is not set, the default value is 0.
The skip method lets us specify which exceptions may be skipped, since some exceptions can safely be ignored.
The noSkip method says that we do not want to skip a given exception; that is, it excludes that exception from the set declared by skip. In the example above, this means: skip every exception except FileNotFoundException. In other words, FileNotFoundException is a fatal exception, and when it is thrown, the step fails directly.
A guide to working with batches
This section is a guide to some noteworthy points when using Spring Batch.
Batch Processing Principles
The following key principles and considerations should be considered when building a batch processing solution.
- Remember that a batch architecture typically affects the online architecture, and vice versa; design with both in mind.
- Simplify as much as possible, and avoid building complex logical structures in single-batch applications.
- Keep data processing and storage physically close together (in other words, keep the data where the processing happens).
- Minimize the use of system resources, especially I/O; perform as many operations as possible in internal memory.
- Review application I/O (analyze SQL statements) to ensure that unnecessary physical I/O is avoided. In particular, look for the following four common flaws.
- Reading data from each transaction when it can be read once and cached or saved in working storage.
- Re-reading data from a transaction that previously read data in the same transaction.
- Causes unnecessary table or index scans.
- Failure to specify a key value in the WHERE clause of an SQL statement.
- Do not do the same thing twice in a batch run. For example, if data aggregation is needed for reporting purposes, you should (if possible) increment the stored totals as you initially process the data, so your reporting application does not have to reprocess the same data.
- Allocate enough memory at the start of the batch application to avoid time-consuming reallocation later in the process.
- Assume the worst with regard to data integrity. Insert appropriate checks and record validation to maintain it.
- Implement checksums for internal validation whenever possible. For example, a data file should carry a record stating the total number of records in the file and an aggregate of the key fields.
- Plan and execute stress tests as early as possible in a production-like environment with realistic data volumes.
- In large batch systems, data backups can be challenging, especially if the system runs online 24/7. Database backups are usually well handled in the online design, but file backups should be considered just as important. If the system depends on files, the file backup process should not only be in place and documented but also tested regularly.
Disable Job from running automatically at startup
When using Java config with Spring Batch jobs, if we do not add any configuration, the project will run all defined batch jobs by default at startup. So how do we stop jobs from running automatically when the project starts? With Spring Boot, we can set the property spring.batch.job.enabled=false in the application configuration file.
Insufficient memory when reading data
When I was using Spring Batch for a data migration, I found that after the job started, execution got stuck at a certain point and the logs stopped printing; after waiting for some time, I got the following error.
Resource exhaustion event: the JVM was unable to allocate memory from the heap.
This is a resource-exhaustion event: the Java virtual machine can no longer allocate memory from the heap. The cause of the error was that the batch job's reader fetched all of the data from the database at once, without paging, so memory ran out once the data volume became too large.
There are two solutions to this problem:
- Adjust the reader logic to read the data page by page; this is more work to implement and reduces throughput.
- Increase the memory available to the service.