Background

Our services are currently deployed with the Spring Cloud framework on Kubernetes (k8s). Even though we already use rolling updates, each upgrade leaves a gap of 30 seconds to 1 minute while the new instances register with Eureka, and during that time the online service is briefly unreachable. The goal is therefore to make service upgrades smooth enough that users do not notice them.

Cause Analysis

In Spring Cloud services, users generally reach internal services through a gateway (Gateway or Zuul), which forwards each request. That forwarding depends on service discovery, and the general process is as follows: after a service starts, it first registers (reports) its registration information (service name -> ip:port) with the Eureka registry. Other services, including the gateway, then poll the registry periodically (the default fetch interval is 30s) to get the latest service registration list from Eureka.

If k8s then updates the services in a rolling update fashion, the following situation can occur:

At time T, serverA_1 (the old instance) has been stopped and serverA_2 (the new instance) has started and registered with Eureka, but the registration list cached in the gateway still contains serverA_1. When a user request for serverA is routed to serverA_1, an exception is thrown because the container that hosted serverA_1 has already been stopped.
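To make the size of that window concrete, here is a minimal sketch. It assumes the gateway refreshes its cached list at fixed multiples of the refresh interval and that the registry itself is already up to date at each refresh; the class and method names are made up for illustration.

```java
// Hypothetical helper: how long a stopped instance stays in the gateway's cached list.
// Assumption: the cache is refreshed at t = interval, 2 * interval, 3 * interval, ...
final class StaleWindow {

    /** Seconds between the instance stopping and the next cache refresh. */
    static int staleSeconds(int stopAtSec, int refreshIntervalSec) {
        int nextRefresh = ((stopAtSec / refreshIntervalSec) + 1) * refreshIntervalSec;
        return nextRefresh - stopAtSec;
    }

    public static void main(String[] args) {
        // serverA_1 stops 10 seconds into a 30-second refresh cycle
        System.out.println(staleSeconds(10, 30)); // prints 20
        // same stop time with a 5-second refresh interval
        System.out.println(staleSeconds(10, 5));  // prints 5
    }
}
```

Any request routed to serverA_1 inside that window fails, which is why the rest of the article works on shrinking the window and on retrying around it.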

Solution

1. Eureka parameter optimization

Client side

eureka:
  client:
    # Indicates how often the eureka client will pull service registration information, default is 30 seconds
    registryFetchIntervalSeconds: 5
ribbon:
  # ribbon local service list refresh interval, default is 30 seconds
  ServerListRefreshInterval: 5000

Server side

eureka:
  server:
    # The time interval for eureka server to clean up invalid nodes, default 60 seconds
    eviction-interval-timer-in-ms: 5000
    # Time for eureka server to refresh the readCacheMap (secondary cache), default time 30 seconds
    response-cache-update-interval-ms: 5000

These two sets of settings shorten the time it takes for a service going online or offline to become visible, by refreshing the cached service registration list on both the Eureka client side and the server side as quickly as possible.
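As a back-of-the-envelope check on the effect of these settings, the sketch below adds up the refresh intervals of the caches a registration change passes through on its way to a caller. It assumes the three caches refresh independently, so their intervals simply add up, and it ignores the eviction interval and heartbeat timing; the class and method names are hypothetical.

```java
// Rough upper bound on how long a registry change takes to reach a caller.
// Assumption: server response cache, client fetch and Ribbon list refresh are independent.
final class PropagationDelay {

    /** Sums the refresh intervals of the caches between Eureka's registry and the caller. */
    static int worstCaseSeconds(int serverCacheUpdateSec, int clientFetchSec, int ribbonRefreshSec) {
        return serverCacheUpdateSec + clientFetchSec + ribbonRefreshSec;
    }

    public static void main(String[] args) {
        System.out.println("defaults: " + worstCaseSeconds(30, 30, 30) + "s"); // defaults: 90s
        System.out.println("tuned:    " + worstCaseSeconds(5, 5, 5) + "s");    // tuned:    15s
    }
}
```

The trade-off is more frequent polling of the Eureka server, so very small intervals may not suit large clusters.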

2. Enable the retry mechanism on the gateway

Since we use the Zuul gateway, we enable its retry mechanism so that a request forwarded to a node that has already been taken offline during a rolling update is not simply lost: a failed request is automatically retried once on another available node. Note that Zuul retries only take effect when Spring Retry is on the classpath.

ribbon:
  # Maximum number of retries for the same instance, not including the first call
  MaxAutoRetries: 0
  # Maximum number of retries to retry other instances, not including the first selected server
  MaxAutoRetriesNextServer: 1
  # Whether all operations are retried
  OkToRetryOnAllOperations: false
zuul:
  # Enable Zuul retry function
  retryable: true

The OkToRetryOnAllOperations property defaults to false, which means only GET requests are retried. If it is set to true, requests of every method (GET, POST, PUT, DELETE, etc.) are retried, so the server side must guarantee that its interfaces are idempotent. Otherwise a retry after a read timeout, for example, may produce dirty data. This is a point that needs attention!
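The interplay of these three properties can be sketched as a small model. This is a simplification of the semantics described above, not Ribbon's actual implementation: instances are tried in list order instead of being picked by a load balancer, and the class and method names are invented for the example.

```java
import java.util.List;
import java.util.function.Predicate;

// Simplified model of the retry settings above (not Ribbon's real code).
final class RetrySketch {

    /**
     * Returns the instance that served the request, or null if every attempt failed.
     * maxAutoRetries: extra attempts on the same instance.
     * maxAutoRetriesNextServer: how many further instances may be tried.
     */
    static String call(String method, List<String> instances, Predicate<String> isUp,
                       int maxAutoRetries, int maxAutoRetriesNextServer,
                       boolean okToRetryOnAllOperations) {
        // Only GET is retried unless retries are allowed for all operations
        boolean retryable = okToRetryOnAllOperations || "GET".equals(method);
        int serversToTry = retryable ? 1 + maxAutoRetriesNextServer : 1;
        int attemptsPerServer = retryable ? 1 + maxAutoRetries : 1;
        for (int s = 0; s < serversToTry && s < instances.size(); s++) {
            String instance = instances.get(s);
            for (int a = 0; a < attemptsPerServer; a++) {
                if (isUp.test(instance)) {
                    return instance;
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // The cached list still contains serverA_1, which has already been stopped
        List<String> cached = List.of("serverA_1", "serverA_2");
        Predicate<String> isUp = "serverA_2"::equals;
        System.out.println(call("GET", cached, isUp, 0, 1, false));  // prints serverA_2
        System.out.println(call("POST", cached, isUp, 0, 1, false)); // prints null
    }
}
```

With the configuration from this section, a GET that first lands on the stopped instance fails over to the next one, while a POST is not retried and surfaces the error to the caller.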

3. Actively remove services that are going down from the registry

Use the k8s PreStop container lifecycle hook to remove a service that is going down from the registry before its container is stopped and terminated. Two types of hook handlers are available for containers.

  • Exec - Executes specific commands in the container’s cgroups and namespaces, and the resources consumed by the commands count towards the container’s resource consumption.

    lifecycle:
        preStop:
            exec:
                command:
                    - bash
                    - -c
                    - 'curl -X "POST" "http://127.0.0.1:9401/ticket/actuator/service-registry?status=DOWN" -H "Content-Type: application/vnd.spring-boot.actuator.v2+json;charset=UTF-8";sleep 90'
    

    Also set the k8s graceful termination period with terminationGracePeriodSeconds: 90. The sleep in the command acts as a buffer while the service stops, so that requests still being processed are not cut off. Here we use the Eureka Client's own forced-offline endpoint. Note that this approach requires the service to include the spring-boot-starter-actuator component, to whitelist /actuator/service-registry, and to use a base image that has the curl command installed. Because the grace period also covers the time spent in the preStop hook, in practice the sleep should be shorter than terminationGracePeriodSeconds, so that the application still has time to handle the termination signal afterwards.

  • HTTP - Performs HTTP requests to a specific endpoint on the container.

    lifecycle:
        preStop:
            httpGet:
                path: /eureka/stop/client
                port: 8080
    

    With the HTTP approach, each service has to remove itself from the registry in its own code, by exposing an endpoint such as the following.

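    import com.netflix.discovery.EurekaClient;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.web.bind.annotation.GetMapping;
    import org.springframework.web.bind.annotation.RestController;

    // ResultDto and Consts are project-specific classes
    @RestController
    public class EurekaShutdownController {

        @Autowired
        private EurekaClient eurekaClient;

        // Called by the preStop hook: deregisters this instance from Eureka
        @GetMapping("/eureka/stop/client")
        public ResultDto stopEurekaClient() {
            eurekaClient.shutdown();
            return new ResultDto(Consts.ErrCode.SUCCESS, "Service taken offline successfully!");
        }
    }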
    

    Note that if the service uses a blacklist/whitelist, remember to add /eureka/stop/client to the whitelist. If a service sets a context-path, the path must also carry that prefix; otherwise the request is blocked and the hook has no effect.

4. Delay the first check of the readiness probe

Add a readinessProbe and a livenessProbe to the service's k8s deployment configuration file. What is the difference between the two?

  • LivenessProbe: checks, in the configured way, whether the application in the container is running properly. If the check fails, the container is considered unhealthy, and the kubelet decides whether to restart it based on the restartPolicy set in the Pod. If no livenessProbe is configured for the container, the kubelet assumes that the liveness check always succeeds.

    livenessProbe:
        initialDelaySeconds: 35
        periodSeconds: 5
        timeoutSeconds: 10
        httpGet:
            scheme: HTTP
            port: 8081
            path: /actuator/health
    

    The container started in the Pod above is a Spring Boot application that includes the Actuator component, which provides the /actuator/health health check endpoint. The liveness probe sends an HTTP GET request to the /actuator/health path on port 8081 to decide whether the container is alive.

  • ReadinessProbe: determines whether the application in the container has finished starting. Only when the probe succeeds does the Pod start serving network traffic: the container's Ready state is set to true on success and to false on failure. For Pods managed by a Service, the association between the Service and the Pod's Endpoint also follows the Ready state. A Pod that is no longer Ready is automatically removed from the Endpoints list associated with the Service, and it is added back to the list once it returns to the Ready state. This mechanism prevents traffic from being forwarded to an unavailable Pod.

    readinessProbe:
        initialDelaySeconds: 30
        periodSeconds: 10
        httpGet:
            scheme: HTTP
            port: 8081
            path: /actuator/health
    

    The periodSeconds parameter sets how often the probe runs, here every 10s. The initialDelaySeconds parameter sets the delay before the first probe; 30 means the readiness check starts 30 seconds after the container starts. Like the liveness probe, it sends an HTTP GET request to the /actuator/health path on port 8081, and a successful response means the service is ready. With this configuration, the new Pod is considered ready only after at least 30 seconds, and only then does k8s take the old one down. By that time, with the optimized Eureka configuration, basically all services have already fetched the registration information of the new service from Eureka.

In practice, the initialDelaySeconds of the LivenessProbe should be greater than the initialDelaySeconds of the ReadinessProbe. Otherwise the Pod may never come up: if the liveness probe starts while the application is still not ready, it fails, k8s concludes that the container is no longer alive, and it kills and restarts it.
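The restart risk described above can be estimated with a little arithmetic. The sketch below is only an approximation: it assumes probes fire exactly on schedule, ignores timeoutSeconds, and uses the Kubernetes default failureThreshold of 3, which the configuration in this article does not set explicitly. The class and method names are hypothetical.

```java
// Approximate timing model for the liveness probe (ignores timeouts and jitter).
final class ProbeTiming {

    /**
     * Earliest time, in seconds after container start, at which the liveness probe
     * can have failed failureThreshold times in a row.
     */
    static int earliestRestartSeconds(int initialDelaySec, int periodSec, int failureThreshold) {
        return initialDelaySec + (failureThreshold - 1) * periodSec;
    }

    /** True if the application finishes starting before liveness failures can trigger a restart. */
    static boolean survivesStartup(int startupSec, int initialDelaySec, int periodSec, int failureThreshold) {
        return startupSec < earliestRestartSeconds(initialDelaySec, periodSec, failureThreshold);
    }

    public static void main(String[] args) {
        // livenessProbe from this article: initialDelaySeconds 35, periodSeconds 5
        System.out.println(earliestRestartSeconds(35, 5, 3)); // prints 45
        System.out.println(survivesStartup(40, 35, 5, 3));    // prints true
        System.out.println(survivesStartup(50, 35, 5, 3));    // prints false
    }
}
```

Under these assumptions, an application that needs longer than roughly 45 seconds to start would be restarted before it ever becomes healthy, so the liveness delay has to be sized against the real startup time.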

5. Graceful shutdown, so that in-flight business operations are not affected

First of all, let's clarify how the old Pod is taken offline. On a Linux system, kill -15 (SIGTERM) is sent by default to tell the web application to stop, and finally the Pod is deleted. What, then, is meant by graceful shutdown, and what does it do? Simply put, once the stop signal has been sent to the application process, the business operations already in progress must not be affected. After receiving the signal, the application should stop accepting new requests, wait until the requests it has already received are processed and returned successfully, and only then actually stop.

Spring Boot 2.3 and later support graceful shutdown directly: with server.shutdown=graceful enabled, the web server stops accepting new requests when it is shut down and waits for a buffer period so that active requests can complete. However, our company uses Spring Boot 2.1.5.RELEASE, so we need to write some extra code to achieve graceful shutdown. The solution depends on the web container; versions for Tomcat and Undertow follow.
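Independent of any web container, the stop-accepting-then-drain sequence can be sketched with a plain thread pool. This is an illustration of the idea only, using the JDK's ExecutorService in place of a real web server; the class and method names are made up.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Container-independent sketch of graceful shutdown using a worker pool.
final class GracefulStopSketch {

    /** Stops accepting new tasks, then waits up to waitSeconds for in-flight ones. */
    static boolean stopGracefully(ExecutorService workers, long waitSeconds) {
        workers.shutdown(); // new submissions are rejected from here on
        try {
            return workers.awaitTermination(waitSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Submits simulated requests, shuts down gracefully, and returns how many completed. */
    static int demo(int requests) {
        ExecutorService workers = Executors.newFixedThreadPool(2);
        AtomicInteger completed = new AtomicInteger();
        for (int i = 0; i < requests; i++) {
            workers.submit(() -> {
                try {
                    Thread.sleep(100); // simulated request processing
                    completed.incrementAndGet();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        stopGracefully(workers, 5);
        return completed.get();
    }

    public static void main(String[] args) {
        System.out.println("completed: " + demo(4)); // completed: 4
    }
}
```

All four simulated requests finish even though shutdown is requested right after they are submitted, which is the behavior the container-specific code has to reproduce.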

Tomcat

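import java.util.concurrent.Executor;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

import lombok.extern.slf4j.Slf4j;
import org.apache.catalina.connector.Connector;
import org.springframework.boot.web.embedded.tomcat.TomcatConnectorCustomizer;
import org.springframework.context.ApplicationListener;
import org.springframework.context.event.ContextClosedEvent;
import org.springframework.stereotype.Component;

@Slf4j
@Component
public class GracefulShutdownTomcat implements TomcatConnectorCustomizer, ApplicationListener<ContextClosedEvent> {
    private volatile Connector connector;
    // Maximum time, in seconds, to wait for in-flight requests to finish
    private final int waitTime = 30;

    @Override
    public void customize(Connector connector) {
        this.connector = connector;
    }

    @Override
    public void onApplicationEvent(ContextClosedEvent contextClosedEvent) {
        // Stop accepting new requests
        this.connector.pause();
        Executor executor = this.connector.getProtocolHandler().getExecutor();
        if (executor instanceof ThreadPoolExecutor) {
            try {
                // Let the worker threads finish the requests already received
                ThreadPoolExecutor threadPoolExecutor = (ThreadPoolExecutor) executor;
                threadPoolExecutor.shutdown();
                if (!threadPoolExecutor.awaitTermination(waitTime, TimeUnit.SECONDS)) {
                    log.warn("Tomcat thread pool did not shut down gracefully within " + waitTime + " seconds. Proceeding with forceful shutdown");
                }
            } catch (InterruptedException ex) {
                Thread.currentThread().interrupt();
            }
        }
    }
}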
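
Then register the customizer with the embedded Tomcat factory in the application class:

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.web.embedded.tomcat.TomcatServletWebServerFactory;
import org.springframework.boot.web.servlet.server.ServletWebServerFactory;
import org.springframework.cloud.client.discovery.EnableDiscoveryClient;
import org.springframework.context.annotation.Bean;

@EnableDiscoveryClient
@SpringBootApplication
public class ShutdownApplication {

    public static void main(String[] args) {
        SpringApplication.run(ShutdownApplication.class, args);
    }

    @Autowired
    private GracefulShutdownTomcat gracefulShutdownTomcat;

    // Replace the default factory so the connector customizer is applied
    @Bean
    public ServletWebServerFactory servletContainer() {
        TomcatServletWebServerFactory tomcat = new TomcatServletWebServerFactory();
        tomcat.addConnectorCustomizers(gracefulShutdownTomcat);
        return tomcat;
    }
}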

Undertow

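import java.lang.reflect.Field;
import java.util.List;

import io.undertow.Undertow;
import io.undertow.server.ConnectorStatistics;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.web.embedded.undertow.UndertowServletWebServer;
import org.springframework.boot.web.servlet.context.ServletWebServerApplicationContext;
import org.springframework.context.ApplicationListener;
import org.springframework.context.event.ContextClosedEvent;
import org.springframework.stereotype.Component;

@Component
public class GracefulShutdownUndertow implements ApplicationListener<ContextClosedEvent> {

    @Autowired
    private GracefulShutdownUndertowWrapper gracefulShutdownUndertowWrapper;

    @Autowired
    private ServletWebServerApplicationContext context;

    @Override
    public void onApplicationEvent(ContextClosedEvent contextClosedEvent) {
        // Reject new requests; requests already in flight continue
        gracefulShutdownUndertowWrapper.getGracefulShutdownHandler().shutdown();
        try {
            // The Undertow instance is a private field, so read it via reflection
            UndertowServletWebServer webServer = (UndertowServletWebServer) context.getWebServer();
            Field field = webServer.getClass().getDeclaredField("undertow");
            field.setAccessible(true);
            Undertow undertow = (Undertow) field.get(webServer);
            List<Undertow.ListenerInfo> listenerInfo = undertow.getListenerInfo();
            Undertow.ListenerInfo listener = listenerInfo.get(0);
            ConnectorStatistics connectorStatistics = listener.getConnectorStatistics();
            // Wait until all active connections have drained
            while (connectorStatistics.getActiveConnections() > 0) {
                Thread.sleep(100); // poll instead of busy-spinning
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (Exception e) {
            // Application Shutdown
        }
    }
}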
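
The handler wrapper that exposes Undertow's GracefulShutdownHandler:

import io.undertow.server.HandlerWrapper;
import io.undertow.server.HttpHandler;
import io.undertow.server.handlers.GracefulShutdownHandler;
import org.springframework.stereotype.Component;

@Component
public class GracefulShutdownUndertowWrapper implements HandlerWrapper {
    private GracefulShutdownHandler gracefulShutdownHandler;

    @Override
    public HttpHandler wrap(HttpHandler handler) {
        // Create the shutdown handler once and reuse it
        if (gracefulShutdownHandler == null) {
            this.gracefulShutdownHandler = new GracefulShutdownHandler(handler);
        }
        return gracefulShutdownHandler;
    }

    public GracefulShutdownHandler getGracefulShutdownHandler() {
        return gracefulShutdownHandler;
    }
}

Finally, register the wrapper and enable connector statistics in the application class:

import io.undertow.UndertowOptions;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.boot.web.embedded.undertow.UndertowServletWebServerFactory;
import org.springframework.context.annotation.Bean;

// Assumed: the original listing omitted the annotation
@SpringBootApplication
public class UnipayProviderApplication {
    public static void main(String[] args) {
        SpringApplication.run(UnipayProviderApplication.class, args);
    }

    @Autowired
    private GracefulShutdownUndertowWrapper gracefulShutdownUndertowWrapper;

    @Bean
    public UndertowServletWebServerFactory servletWebServerFactory() {
        UndertowServletWebServerFactory factory = new UndertowServletWebServerFactory();
        factory.addDeploymentInfoCustomizers(deploymentInfo -> deploymentInfo.addOuterHandlerChainWrapper(gracefulShutdownUndertowWrapper));
        // Statistics are required for getActiveConnections() in the shutdown listener
        factory.addBuilderCustomizers(builder -> builder.setServerOption(UndertowOptions.ENABLE_STATISTICS, true));
        return factory;
    }
}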

OK, after the optimizations above, rolling updates can basically be performed without users noticing.

Reference https://blog.leeyom.top/#/posts/27