Spring Batch Introduction
Spring Batch is a batch processing framework provided by Spring. Many applications in the enterprise domain require batch processing in order to perform business operations in mission-critical environments. These business operations include the following.
- Automated, complex processing of large amounts of information that can be processed most efficiently without user interaction. These operations often include time-based events (e.g., month-end calculations, notifications, or communications).
- Repetitive processing of periodic applications of complex business rules (e.g., insurance benefit determinations or rate adjustments) in very large data sets.
- Integration of information received from internal and external systems that often needs to be formatted, validated and processed into systems of record in a transactional manner. Batch processing is used to process billions of transactions per day for the enterprise.
Spring Batch is a lightweight, comprehensive batch framework designed for developing robust batch applications that are vital to the daily operation of enterprise systems. Spring Batch builds on the expected Spring Framework characteristics (productivity, a POJO-based development approach, and general ease of use) while making it easy for developers to access and use more advanced enterprise services when necessary. Note that Spring Batch is not a scheduling framework.
Spring Batch provides reusable features that are critical for handling large volumes of data, including logging/tracking, transaction management, job processing statistics, job restarts, skipping and resource management. It also provides more advanced technical services and features that enable extremely high volume and high performance batch jobs through optimization and partitioning techniques.
Spring Batch can be used both for simple use cases (such as reading a file into a database or running a stored procedure) and for complex, high-volume use cases (such as moving and transforming large amounts of data between databases). High-volume batch jobs can leverage the framework in a highly scalable way to process large amounts of information.
Spring Batch Architecture
A typical batch application is roughly as follows.
- Read a large number of records from a database, file, or queue.
- Process the data in some way.
- Write back the data in a modified form.
A general architecture of Spring Batch is as follows: a Job can define many Steps, and each Step can define its own ItemReader for reading data, ItemProcessor for processing data, and ItemWriter for writing data. Every defined Job is stored in the JobRepository, and we can start a Job through the JobLauncher.
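Put together, a minimal configuration might look roughly like the sketch below, written with Spring Batch 4's Java-config builders. The bean names, the String item type, and the chunk size are illustrative assumptions, not from the source.

```java
// Minimal sketch of the architecture described above: one Job with one Step,
// wired from an ItemReader, ItemProcessor, and ItemWriter.
@Configuration
@EnableBatchProcessing
public class BatchConfig {

    @Bean
    public Step copyStep(StepBuilderFactory steps,
                         ItemReader<String> reader,
                         ItemProcessor<String, String> processor,
                         ItemWriter<String> writer) {
        return steps.get("copyStep")
                .<String, String>chunk(10)   // commit every 10 items
                .reader(reader)              // where the data comes from
                .processor(processor)        // business logic on each item
                .writer(writer)              // where the data goes
                .build();
    }

    @Bean
    public Job copyJob(JobBuilderFactory jobs, Step copyStep) {
        return jobs.get("copyJob")           // registered in the JobRepository
                .start(copyStep)
                .build();
    }
}
```

The job is then started through a JobLauncher, which is introduced in the sections that follow.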
Spring Batch Core Concepts Introduction
Here are some concepts that are core to the Spring Batch framework.
What is Job
Job and Step are the two core concepts Spring Batch uses to execute batch tasks.
Job is a concept that encapsulates the entire batch process. It is reflected in the code as a top-level interface, whose code is as follows.
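Lightly simplified from the Spring Batch source, the interface looks like this:

```java
// The Job interface as defined in Spring Batch (signatures lightly simplified)
public interface Job {
    String getName();
    boolean isRestartable();
    void execute(JobExecution execution);
    JobParametersIncrementer getJobParametersIncrementer();
    JobParametersValidator getJobParametersValidator();
}
```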
There are five methods defined in the Job interface. There are two main implementations of Job: SimpleJob and FlowJob. In Spring Batch, Job is the top-level abstraction; beneath it sit two lower-level abstractions, JobInstance and JobExecution.
A Job is the basic unit of work, and it is composed of Steps. A Job combines its Steps in a specified logical order and gives us ways to set shared properties for all Steps, such as event listeners and skip policies.
Spring Batch provides a default simple implementation of the Job interface in the form of the SimpleJob class, which adds some standard functionality on top of Job. An example using Java config is as follows.
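For reference, the well-known example from the Spring Batch documentation configures a job named footballJob from three step beans; the step-definition methods playerLoad, gameLoad, and playerSummarization are assumed to be defined elsewhere in the same configuration class.

```java
@Bean
public Job footballJob(JobBuilderFactory jobBuilderFactory) {
    return jobBuilderFactory.get("footballJob")  // the job's name
            .start(playerLoad())                 // step 1
            .next(gameLoad())                    // step 2
            .next(playerSummarization())         // step 3
            .build();
}
```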
This configuration means that the job is first given the name footballJob, and then its three steps are specified, each implemented by its own step-definition method.
What is JobInstance
We have already mentioned JobInstance above; it is a lower-level abstraction of Job, and it is defined as follows.
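The following is a simplified, self-contained sketch; the real JobInstance is a public class that extends Spring Batch's Entity base class, which is what actually holds the id.

```java
// Simplified sketch of JobInstance: it only exposes the instance id and job name.
class JobInstance {
    private final long instanceId;
    private final String jobName;

    JobInstance(long instanceId, String jobName) {
        this.instanceId = instanceId;
        this.jobName = jobName;
    }

    public long getInstanceId() { return instanceId; }
    public String getJobName()  { return jobName; }
}
```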
Its methods are very simple: one returns the id of the instance, the other returns the name of the Job. A JobInstance represents one logical run of a Job at runtime (as the name suggests, it is an instance of the Job).
Say we have a batch job that runs once at the end of every day, and assume its name is EndOfDay. In that case there is one logical JobInstance per day, and we must record each run of the job.
What are JobParameters
As mentioned above, if the same job runs once a day, there is a new JobInstance every day, yet the job definition is identical each time. So how can we distinguish the different JobInstances of one job? Although the job definition behind each JobInstance is the same, each instance has something that differs, such as its run date.
What Spring Batch provides to identify a JobInstance is JobParameters. A JobParameters object holds a set of parameters used to start a batch job, and they can be used for identification or even as reference data during the run. The run date from our hypothetical example can then be used as a JobParameter.
For example, our previous EndOfDay job now has two instances, one generated on January 1 and the other on January 2, so we can define two JobParameters objects: one with the parameter 01-01, and the other with the parameter 01-02. Thus, the rule for identifying a JobInstance is: JobInstance = Job + identifying JobParameters.
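In code, the two instances can be distinguished with Spring Batch's JobParametersBuilder; the parameter key "date" and the values below are illustrative, not from the source.

```java
// Two JobParameters objects that identify two distinct JobInstances
// of the same EndOfDay job.
JobParameters jan1 = new JobParametersBuilder()
        .addString("date", "01-01")
        .toJobParameters();
JobParameters jan2 = new JobParametersBuilder()
        .addString("date", "01-02")
        .toJobParameters();
// Launching the job with jan1 and then jan2 yields two different JobInstances.
```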
So we can address the correct JobInstance by supplying the corresponding JobParameters.
What is JobExecution
JobExecution represents, at the code level, a single attempt to run a job that we have defined. A single execution may fail or succeed; only when an execution completes successfully is the corresponding JobInstance considered complete.
Again using the EndOfDay job described earlier as an example, suppose the first run for 01-01-2019 results in a failure. If the job is then run again with the same JobParameters as the first run (i.e., 01-01-2019), a new JobExecution is created that corresponds to the previous JobInstance, and there is still only one JobInstance.
The interface of
JobExecution is defined as follows.
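A simplified sketch of its most important accessors is shown below; the real JobExecution is a class with many more methods than these.

```java
// Key accessors of JobExecution (simplified sketch)
public interface JobExecution {
    JobInstance getJobInstance();     // the JobInstance this execution belongs to
    JobParameters getJobParameters(); // the parameters the execution was started with
    BatchStatus getStatus();          // current BatchStatus of this execution
    ExitStatus getExitStatus();       // exit status reported when the run ends
    Date getStartTime();
    Date getEndTime();
}
```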
The comments on each method explain them clearly, so I won't elaborate here. Just note that JobExecution provides a getStatus method for obtaining the status of a particular execution of a job.
BatchStatus is an enumeration class representing the status of a job, and is defined as follows.
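The sketch below lists the actual enum constants, in declaration order, together with simplified versions of two of its helper methods; the real enum carries a few more methods.

```java
// BatchStatus constants as declared in Spring Batch, with simplified helpers.
enum BatchStatus {
    COMPLETED, STARTING, STARTED, STOPPING, STOPPED, FAILED, ABANDONED, UNKNOWN;

    // A status counts as "running" while the job is starting or started.
    public boolean isRunning() {
        return this == STARTING || this == STARTED;
    }

    // FAILED and everything declared after it count as unsuccessful.
    public boolean isUnsuccessful() {
        return this == FAILED || this.compareTo(FAILED) > 0;
    }
}
```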
These properties are critical information about a job execution, and Spring Batch persists them to the database. When Spring Batch is in use, it automatically creates a set of tables to store job-related information; the table used to store JobExecution data is BATCH_JOB_EXECUTION.
What is a Step
Each Step object encapsulates a separate phase of a batch job. In fact, each Job is essentially composed of one or more steps. Each step contains all the information needed to define and control the actual batch process.
Any particular content is at the discretion of the developer writing the Job. A Step can be very simple or very complex. For example, a Step that loads data from a file into a database requires little or no code, given current Spring Batch support. A more complex Step may contain elaborate business logic that runs as part of its processing.
Like Job, Step has a corresponding StepExecution, analogous to JobExecution.
What is StepExecution
A StepExecution represents a single attempt to execute a Step; a new StepExecution is created each time a Step runs, similar to JobExecution. However, a Step may never execute if the Step before it fails, and a StepExecution is created only when its Step actually starts.
An instance of a step execution is represented by an object of the class StepExecution, which contains a reference to its corresponding Step, a reference to the enclosing JobExecution, and transaction-related data such as commit and rollback counts and start and finish times. In addition, each step execution contains an ExecutionContext, which holds any data the developer needs to persist across the batch run, such as statistics or state information needed for restarting.
What is ExecutionContext
ExecutionContext is the execution environment of each StepExecution (a JobExecution has one as well). It contains a collection of key-value pairs. We can get hold of an ExecutionContext with the following code.
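The calls below use the real Spring Batch API; the stepExecution variable and the key name are illustrative, assuming code that runs where a StepExecution is available (for example, in a listener).

```java
// Obtain the context of the current step execution...
ExecutionContext stepContext = stepExecution.getExecutionContext();
// ...and the context of the enclosing job execution.
ExecutionContext jobContext = stepExecution.getJobExecution().getExecutionContext();

// An ExecutionContext behaves like a map of key-value pairs:
stepContext.putLong("rows.processed", 42L);
long processed = stepContext.getLong("rows.processed");
```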
What is JobRepository
JobRepository is the mechanism used to persist the concepts described above (jobs, steps, and so on). It provides the CRUD operations needed by Job and Step implementations, as well as by the JobLauncher implementation mentioned below.
When a Job is first started, a JobExecution is obtained from the repository, and in the course of the batch process the JobExecution (and its StepExecutions) are persisted to the repository. The @EnableBatchProcessing annotation provides automatic configuration of the JobRepository and the related infrastructure.
What is JobLauncher
The function of JobLauncher is very simple: it starts a Job with a given set of JobParameters. Why emphasize JobParameters here? As mentioned earlier, a Job together with its JobParameters forms a job execution. Here is the code example.
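The JobLauncher interface itself is small; this is its actual definition in Spring Batch:

```java
public interface JobLauncher {
    JobExecution run(Job job, JobParameters jobParameters)
            throws JobExecutionAlreadyRunningException,
                   JobRestartException,
                   JobInstanceAlreadyCompleteException,
                   JobParametersInvalidException;
}
```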
The run method above obtains a JobExecution from the JobRepository and executes the Job, based on the Job and JobParameters passed in.
What is Item Reader
ItemReader is the abstraction for reading data; its role is to supply the data input for each Step. When an ItemReader has read all its data, it returns null to tell subsequent operations that no data is left.
Spring Batch provides many useful implementation classes of ItemReader, such as JdbcPagingItemReader, JdbcCursorItemReader, and so on.
The data sources supported by ItemReader are also very rich, including various kinds of databases, files, and data streams; almost every scenario we encounter is covered.
Here is an example of using JdbcPagingItemReader. A JdbcPagingItemReader must be given a PagingQueryProvider, which is responsible for producing the SQL queries used to return data page by page.
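A hypothetical bean definition might look like this; the Person type, table name, and column names are illustrative assumptions, while the builder and reader classes are real Spring Batch API.

```java
@Bean
public JdbcPagingItemReader<Person> personReader(DataSource dataSource) throws Exception {
    // The query provider builds database-specific paged SQL.
    SqlPagingQueryProviderFactoryBean provider = new SqlPagingQueryProviderFactoryBean();
    provider.setDataSource(dataSource);
    provider.setSelectClause("SELECT id, first_name, last_name");
    provider.setFromClause("FROM person");
    provider.setSortKey("id"); // paging requires a deterministic sort key

    JdbcPagingItemReader<Person> reader = new JdbcPagingItemReader<>();
    reader.setDataSource(dataSource);
    reader.setQueryProvider(provider.getObject());
    reader.setPageSize(100); // rows fetched per page
    reader.setRowMapper(new BeanPropertyRowMapper<>(Person.class));
    return reader;
}
```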
What is Item Writer
Since ItemReader is the abstraction for reading data, ItemWriter is naturally the abstraction for writing data; it provides the data-writing capability for each Step. The unit of writing is configurable: we can write one item at a time, or a chunk of items at a time (chunks are introduced below). An ItemWriter does not read or process data; it is responsible only for writing.
Spring Batch also provides many useful implementation classes of ItemWriter, and of course we can implement our own writer as well.
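A custom writer can be as simple as the hypothetical sketch below, which uses the Spring Batch 4.x ItemWriter signature and simply logs each chunk; the class name and behavior are illustrative.

```java
import java.util.List;

// Hypothetical custom ItemWriter: logs every item of each chunk it receives.
public class LoggingItemWriter implements ItemWriter<String> {
    @Override
    public void write(List<? extends String> items) {
        // Called once per chunk, with all the items of that chunk.
        for (String item : items) {
            System.out.println("writing: " + item);
        }
    }
}
```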
What is Item Processor
ItemProcessor is the abstraction for the project's business logic. After the ItemReader has read a record, and before the ItemWriter writes it, we can use an ItemProcessor to apply business logic and transform the data accordingly. If we decide inside an ItemProcessor that a piece of data should not be written, we can indicate this by returning null. ItemProcessor works very well together with ItemReader and ItemWriter; passing data between them is straightforward, so we can simply use them as they are.
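The sketch below shows a hypothetical processor that trims and upper-cases names and filters out blank ones by returning null. A local stand-in for the ItemProcessor interface is included so the example compiles on its own; the real interface lives in org.springframework.batch.item.

```java
// Stand-in for Spring Batch's ItemProcessor interface (simplified),
// included only so this sketch is self-contained.
interface ItemProcessor<I, O> {
    O process(I item) throws Exception;
}

// Hypothetical processor: trims and upper-cases names, filtering out blanks.
class NameProcessor implements ItemProcessor<String, String> {
    @Override
    public String process(String item) {
        if (item == null || item.trim().isEmpty()) {
            return null; // returning null filters the item: it will not be written
        }
        return item.trim().toUpperCase();
    }
}
```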
The processing flow of chunk
Spring Batch gives us the ability to process data in chunks. In one batch task there may be a great many read and write operations, so processing items and committing them to the database one at a time is inefficient. Spring Batch therefore provides the concept of a chunk: we set a chunk size, and Spring Batch processes the data item by item without committing; only when the number of processed items reaches the configured chunk size is the whole chunk committed together.
The Java example definition code is as follows.
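The step might be defined roughly as follows, using Spring Batch 4's StepBuilderFactory; the bean names and String item type are illustrative assumptions.

```java
// Hypothetical step definition with a chunk size of 10.
@Bean
public Step sampleStep(StepBuilderFactory stepBuilderFactory,
                       ItemReader<String> reader,
                       ItemWriter<String> writer) {
    return stepBuilderFactory.get("sampleStep")
            .<String, String>chunk(10) // read items one by one, write 10 at a time
            .reader(reader)
            .writer(writer)
            .build();
}
```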
In the step above, the chunk size is set to 10: when the number of items read by the ItemReader reaches 10, the whole chunk is passed to the ItemWriter at once, and the transaction is committed.
Skip strategy and failure handling
A batch job's Step may process a very large amount of data, and errors are inevitable; even if they are unlikely, we must account for them, because what matters most is guaranteeing the eventual consistency of the data. Spring Batch takes this into account and provides the relevant support. See the bean configuration below.
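A hypothetical fault-tolerant step matching the description that follows could look like this; the bean names and String item type are illustrative, while faultTolerant, skipLimit, skip, and noSkip are real Spring Batch builder methods.

```java
// Up to 10 exceptions may be skipped, except FileNotFoundException,
// which fails the step immediately.
@Bean
public Step skippingStep(StepBuilderFactory stepBuilderFactory,
                         ItemReader<String> reader,
                         ItemWriter<String> writer) {
    return stepBuilderFactory.get("skippingStep")
            .<String, String>chunk(10)
            .reader(reader)
            .writer(writer)
            .faultTolerant()
            .skipLimit(10)                        // allow at most 10 skips
            .skip(Exception.class)                // skip any exception...
            .noSkip(FileNotFoundException.class)  // ...except this fatal one
            .build();
}
```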
We need to pay attention to three methods: skipLimit, skip, and noSkip. The skipLimit method sets how many exceptions this step is allowed to skip; if we set it to 10, then as long as no more than 10 exceptions are thrown while the step runs, the step as a whole will not fail. Note that if skipLimit is not set, the default value is 0.
The skip method lets us specify which exceptions may be skipped, since some exceptions can safely be ignored.
The noSkip method says that we do not want to skip a given exception; that is, it excludes that exception from the set declared by skip. In the example above, this means: skip every exception except FileNotFoundException. In other words, FileNotFoundException is a fatal exception, and when it is thrown, the step fails directly.
A guide to working with batches
This section is a guide to some noteworthy points when using Spring Batch.
Batch Processing Principles
The following key principles and considerations should be considered when building a batch processing solution.
- Remember that a batch architecture typically affects the online architecture, and vice versa; design with both in mind.
- Simplify as much as possible, and avoid building complex logical structures in single-batch applications.
- Keep data processing and storage physically close together (in other words, keep the data where the processing happens).
- Minimize the use of system resources, especially I/O; perform as many operations as possible in internal memory.
- Review application I/O (analyze SQL statements) to ensure that unnecessary physical I/O is avoided. In particular, look for the following four common flaws.
- Reading data from each transaction when it can be read once and cached or saved in working storage.
- Re-reading data from a transaction that previously read data in the same transaction.
- Causes unnecessary table or index scans.
- Failure to specify a key value in the WHERE clause of an SQL statement.
- Do not do the same thing twice in a batch run. For example, if data aggregation is needed for reporting purposes, you should (if possible) increment the stored totals as you initially process the data, so your reporting application does not have to reprocess the same data.
- Allocate enough memory at the start of the batch application to avoid time-consuming reallocation later in the process.
- Assume the worst with regard to data integrity. Insert appropriate checks and record validation to maintain it.
- Implement checksums for internal validation whenever possible. For example, a data file should carry a record stating the total number of records in the file and an aggregate of the key fields.
- Plan and execute stress tests as early as possible in a production-like environment with realistic data volumes.
- In large batch systems, data backups can be challenging, especially if the system runs online 24/7. Database backups are usually well handled in the online design, but file backups should be considered just as important. If the system depends on files, the file backup process should not only be in place and documented but also tested regularly.
Disable Job from running automatically at startup
When using Java config with Spring Batch jobs, if we do not add any configuration, the project will run all defined batch jobs by default at startup. So how do we stop jobs from running automatically when the project starts? With Spring Boot, we can set the property spring.batch.job.enabled=false in the application configuration file.
Insufficient memory when reading data
When I was using Spring Batch for a data migration, I found that after the job started, execution got stuck at a certain point and the logs stopped printing; after waiting for some time, I got the following error.
Resource exhaustion event: the JVM was unable to allocate memory from the heap.
This is a resource-exhaustion event: the Java virtual machine can no longer allocate memory from the heap. The cause of the error was that the batch job's reader fetched all of the data from the database at once, without paging, so memory ran out once the data volume became too large.
There are two solutions to this problem:
- Adjust the reader logic to read the data page by page; this is more work to implement and reduces throughput.
- Increase the memory available to the service.