为什么会出现随机 504,408 错误代码?(一)

Posted by Luffyao on Thursday, June 3, 2021

问题描述

这是一个我们系统测试人员对我们的微服务请求的 Server 端做模拟断网的时候,发生的一种异常处理的情况。 由于不是在 overload 情况下,所以如果微服务请求的 Server 断网,正常处理逻辑,我们的微服务会回复 500 错误代码给微服务的请求端。但是测试人员发现,这种情况下,我们的微服务会出现回复 504,408 的错误代码给请求端的异常情况。

问题分析

我们的测试人员提供了下面可参考信息: istio 回复 408,504 的具体错误信息 408 downstream duration timeout504 upstream request timeout,以及提供了 ss 命令的监控结果。

Mon May 31 06:20:36 UTC 2021
tcp    ESTAB      0      5809   192.168.209.20:41914                10.139.17.19:opsmessaging 
...
Mon May 31 06:22:12 UTC 2021
tcp    ESTAB      0      55766  192.168.209.20:41914                10.139.17.19:opsmessaging         
Mon May 31 06:22:17 UTC 2021
tcp    ESTAB      0      57413  192.168.209.20:41914                10.139.17.19:opsmessaging         
Mon May 31 06:22:22 UTC 2021
tcp    ESTAB      0      57413  192.168.209.20:41914                10.139.17.19:opsmessaging         

192.168.209.20 这个 IP 是我们的微服务的地址,10.139.17.19 则是断网的那个 Server 的地址。由于看到这个 tcp 连接是建立的 (ESTAB),所以看到的 SEND-Q 那么大的数值,表述我们发送的但是没有收到 ACK 的数据包。

问题复现

部署两个虚拟机

具体如何部署,参考我之前的这篇文章 本地 ubuntu18.04 VM 安装 k8s

目前我是两台虚拟机,下面测试情况下 192.168.56.21 为服务端,192.168.56.22 为客户端。

编写测试程序

服务端程序

package main

import (
    "fmt"
    "log"
    "net/http"
    //"time"

    "golang.org/x/net/http2"
    "golang.org/x/net/http2/h2c"
)

func main() {

    handlerGet := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        fmt.Println("receive a new request")
        w.WriteHeader(http.StatusNoContent)

    })

    mux := http.NewServeMux()
    mux.Handle("/", handlerGet)

    h2s := &http2.Server{}

    h1s := &http.Server{
        Addr:    ":10001",
        Handler: h2c.NewHandler(mux, h2s),
    }
    log.Fatal(h1s.ListenAndServe())
}

客户端程序

package main

import (
    "bytes"
    "strings"

    //"context"
    "bufio"
    "crypto/tls"
    "log"
    "net"
    "net/http"
    "os"
    "strconv"
    "time"

    "golang.org/x/net/http2"
)

const uri = "http://192.168.56.21:10001"

func createHTTPClient() *http.Client {

    var tr *http2.Transport

    tr = &http2.Transport{
        TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
        AllowHTTP:       true,
        DialTLS: func(network, addr string, cfg *tls.Config) (net.Conn, error) {
            d := &net.Dialer{
                Timeout: time.Duration(2 * time.Second),
            }
            return d.Dial(network, addr)
        },
        //ReadIdleTimeout: time.Duration(15 * time.Second),
    }
    return &http.Client{
        Transport: tr,
        Timeout:   time.Duration(2 * time.Second)}
}

func main() {
    cli := createHTTPClient()
    reader := bufio.NewReader(os.Stdin)
    for {
        text, _ := reader.ReadString('\n')
        text = strings.Replace(text, "\n", "", -1)
        if text == "" {
            continue
        }
        log.Println("input text ", text)
        body := []byte(text)
        req, _ := http.NewRequest("POST", uri, bytes.NewBuffer(body))
        if len(body) > 0 {
            req.Header.Set("Content-Type", "application/json")
            req.Header.Set("Content-Length", strconv.Itoa(len(body)))
        }

        log.Println("start time", time.Now())

        _, err := cli.Do(req)

        log.Println("end time", time.Now())

        if err != nil {
            log.Println("error", err)
        }
    }
}

模拟测试

预备操作

保持网络正常情况下:

  • 开启服务端程序;
  • 开启 ss 命令监控 socket 状态;
  • 开启 tcpdump 查看数据包传输情况(由于 Server 端网络后面会被关闭,所以在 client 端监听)。
  • 运行客户端程序,并开启 GODEBUG=http2debug=2

客户端输出日志如下:

shenjie@shenjie-VirtualBox:~/go/src/mockhttp2client$ go run client.go 
2021/06/03 12:49:33 input text  abc
2021/06/03 12:49:33 start time 2021-06-03 12:49:33.2222355 +0800 CST m=+3.233222828
2021/06/03 12:49:33 http2: Transport creating client conn 0xc00010a000 to 192.168.56.21:10001
2021/06/03 12:49:33 http2: Framer 0xc00011a000: wrote SETTINGS len=18, settings: ENABLE_PUSH=0, INITIAL_WINDOW_SIZE=4194304, MAX_HEADER_LIST_SIZE=10485760
2021/06/03 12:49:33 http2: Framer 0xc00011a000: wrote WINDOW_UPDATE len=4 (conn) incr=1073741824
2021/06/03 12:49:33 http2: Transport encoding header ":authority" = "192.168.56.21:10001"
2021/06/03 12:49:33 http2: Transport encoding header ":method" = "POST"
2021/06/03 12:49:33 http2: Transport encoding header ":path" = "/"
2021/06/03 12:49:33 http2: Transport encoding header ":scheme" = "http"
2021/06/03 12:49:33 http2: Transport encoding header "content-type" = "application/json"
2021/06/03 12:49:33 http2: Transport encoding header "content-length" = "3"
2021/06/03 12:49:33 http2: Transport encoding header "accept-encoding" = "gzip"
2021/06/03 12:49:33 http2: Transport encoding header "user-agent" = "Go-http-client/2.0"
2021/06/03 12:49:33 http2: Framer 0xc00011a000: wrote HEADERS flags=END_HEADERS stream=3 len=55
2021/06/03 12:49:33 http2: Framer 0xc00011a000: wrote DATA flags=END_STREAM stream=3 len=3 data="abc"
2021/06/03 12:49:33 http2: Framer 0xc00011a000: read SETTINGS len=24, settings: MAX_FRAME_SIZE=1048576, MAX_CONCURRENT_STREAMS=250, MAX_HEADER_LIST_SIZE=1048896, INITIAL_WINDOW_SIZE=1048576
2021/06/03 12:49:33 http2: Transport received SETTINGS len=24, settings: MAX_FRAME_SIZE=1048576, MAX_CONCURRENT_STREAMS=250, MAX_HEADER_LIST_SIZE=1048896, INITIAL_WINDOW_SIZE=1048576
2021/06/03 12:49:33 http2: Framer 0xc00011a000: wrote SETTINGS flags=ACK len=0
2021/06/03 12:49:33 http2: Framer 0xc00011a000: read SETTINGS flags=ACK len=0
2021/06/03 12:49:33 http2: Transport received SETTINGS flags=ACK len=0
2021/06/03 12:49:33 http2: Framer 0xc00011a000: read WINDOW_UPDATE len=4 (conn) incr=983041
2021/06/03 12:49:33 http2: Transport received WINDOW_UPDATE len=4 (conn) incr=983041
2021/06/03 12:49:33 http2: Framer 0xc00011a000: read HEADERS flags=END_STREAM|END_HEADERS stream=3 len=26
2021/06/03 12:49:33 http2: decoded hpack field header field ":status" = "204"
2021/06/03 12:49:33 http2: decoded hpack field header field "date" = "Thu, 03 Jun 2021 04:49:34 GMT"
2021/06/03 12:49:33 http2: Transport received HEADERS flags=END_STREAM|END_HEADERS stream=3 len=26
2021/06/03 12:49:33 http2: Framer 0xc00011a000: read WINDOW_UPDATE len=4 (conn) incr=3
2021/06/03 12:49:33 http2: Transport received WINDOW_UPDATE len=4 (conn) incr=3
2021/06/03 12:49:33 end time 2021-06-03 12:49:33.225610556 +0800 CST m=+3.236597863

ss 命令输出结果:

shenjie@shenjie-VirtualBox:~$ while true;do date;ss|grep 10001;sleep 5;done
20210603 日 星期四 12:49:24 CST
20210603 日 星期四 12:49:29 CST
20210603 日 星期四 12:49:34 CST
tcp               ESTAB               0                    0       192.168.56.22:47238        192.168.56.21:10001  

tcpdump 命令显示建立连接信息:

12:49:33.222612 IP shenjie-VirtualBox.47238 > 192.168.56.21.10001: Flags [S], seq 1563218660, win 64240, options [mss 1460,sackOK,TS val 1794053933 ecr 0,nop,wscale 7], length 0
12:49:33.223026 IP 192.168.56.21.10001 > shenjie-VirtualBox.47238: Flags [S.], seq 1065696065, ack 1563218661, win 65160, options [mss 1460,sackOK,TS val 292741325 ecr 1794053933,nop,wscale 7], length 0
12:49:33.223074 IP shenjie-VirtualBox.47238 > 192.168.56.21.10001: Flags [.], ack 1, win 502, options [nop,nop,TS val 1794053934 ecr 292741325], length 0
12:49:33.223642 IP shenjie-VirtualBox.47238 > 192.168.56.21.10001: Flags [P.], seq 1:65, ack 1, win 502, options [nop,nop,TS val 1794053935 ecr 292741325], length 64
12:49:33.223934 IP 192.168.56.21.10001 > shenjie-VirtualBox.47238: Flags [.], ack 65, win 509, options [nop,nop,TS val 292741326 ecr 1794053935], length 0
12:49:33.224130 IP shenjie-VirtualBox.47238 > 192.168.56.21.10001: Flags [P.], seq 65:129, ack 1, win 502, options [nop,nop,TS val 1794053935 ecr 292741326], length 64
12:49:33.224172 IP 192.168.56.21.10001 > shenjie-VirtualBox.47238: Flags [P.], seq 1:34, ack 65, win 509, options [nop,nop,TS val 292741326 ecr 1794053935], length 33
12:49:33.224186 IP shenjie-VirtualBox.47238 > 192.168.56.21.10001: Flags [.], ack 34, win 502, options [nop,nop,TS val 1794053935 ecr 292741326], length 0
12:49:33.224326 IP shenjie-VirtualBox.47238 > 192.168.56.21.10001: Flags [P.], seq 129:141, ack 34, win 502, options [nop,nop,TS val 1794053935 ecr 292741326], length 12
12:49:33.224335 IP 192.168.56.21.10001 > shenjie-VirtualBox.47238: Flags [P.], seq 34:56, ack 65, win 509, options [nop,nop,TS val 292741326 ecr 1794053935], length 22
12:49:33.224336 IP 192.168.56.21.10001 > shenjie-VirtualBox.47238: Flags [.], ack 129, win 509, options [nop,nop,TS val 292741326 ecr 1794053935], length 0
12:49:33.224698 IP 192.168.56.21.10001 > shenjie-VirtualBox.47238: Flags [P.], seq 56:104, ack 141, win 509, options [nop,nop,TS val 292741327 ecr 1794053935], length 48

模拟 Server 断网

  • 在 Server 机器上 输入该命令进行断网 sudo ifdown enp0s8

  • 客户端程序发送消息

此时我们可以看到如下日志:

客户端程序输出下面日志(由于我们设置了 Timeout=2s, 所以可以看到后面我们的 Stream 被 cancel 了,但是这个 Frame 已经写了,从 ss 命令输出结果中也能看到):

2021/06/03 12:49:47 input text  abcdgagd
2021/06/03 12:49:47 start time 2021-06-03 12:49:47.924363864 +0800 CST m=+17.935351175
2021/06/03 12:49:47 http2: Transport encoding header ":authority" = "192.168.56.21:10001"
2021/06/03 12:49:47 http2: Transport encoding header ":method" = "POST"
2021/06/03 12:49:47 http2: Transport encoding header ":path" = "/"
2021/06/03 12:49:47 http2: Transport encoding header ":scheme" = "http"
2021/06/03 12:49:47 http2: Transport encoding header "content-type" = "application/json"
2021/06/03 12:49:47 http2: Transport encoding header "content-length" = "8"
2021/06/03 12:49:47 http2: Transport encoding header "accept-encoding" = "gzip"
2021/06/03 12:49:47 http2: Transport encoding header "user-agent" = "Go-http-client/2.0"
2021/06/03 12:49:47 http2: Framer 0xc00011a000: wrote HEADERS flags=END_HEADERS stream=5 len=10
2021/06/03 12:49:47 http2: Framer 0xc00011a000: wrote DATA flags=END_STREAM stream=5 len=8 data="abcdgagd"
2021/06/03 12:49:49 http2: Framer 0xc00011a000: wrote RST_STREAM stream=5 len=4 ErrCode=CANCEL
2021/06/03 12:49:49 RoundTrip failure: context deadline exceeded
2021/06/03 12:49:49 end time 2021-06-03 12:49:49.931512103 +0800 CST m=+19.942499411
2021/06/03 12:49:49 error Post "http://192.168.56.21:10001": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

ss 命令结果如下, 我们可以看到 SEND-Q 增加到 36 了。TODO: 这里为啥是涨了 36????

20210603 日 星期四 12:49:49 CST
tcp               ESTAB               0                    36         192.168.56.22:47238      192.168.56.21:10001 

tcpdump 输出如下: 看到我们发送的数据包 seq 150:186 没有收到 ACK,由于 TCP 重传机制,所以它在程序 timeout 之前会触发重传,但最后程序 cancel 了这个 stream,所以没有继续重传。

12:49:47.924574 IP shenjie-VirtualBox.47238 > 192.168.56.21.10001: Flags [P.], seq 150:169, ack 104, win 502, options [nop,nop,TS val 1794068635 ecr 292741327], length 19
12:49:47.924712 IP shenjie-VirtualBox.47238 > 192.168.56.21.10001: Flags [P.], seq 169:186, ack 104, win 502, options [nop,nop,TS val 1794068636 ecr 292741327], length 17
12:49:47.940110 IP shenjie-VirtualBox.47238 > 192.168.56.21.10001: Flags [P.], seq 169:186, ack 104, win 502, options [nop,nop,TS val 1794068651 ecr 292741327], length 17
12:49:48.152971 IP shenjie-VirtualBox.47238 > 192.168.56.21.10001: Flags [P.], seq 150:186, ack 104, win 502, options [nop,nop,TS val 1794068864 ecr 292741327], length 36
12:49:48.584546 IP shenjie-VirtualBox.47238 > 192.168.56.21.10001: Flags [P.], seq 150:186, ack 104, win 502, options [nop,nop,TS val 1794069295 ecr 292741327], length 36
12:49:49.416558 IP shenjie-VirtualBox.47238 > 192.168.56.21.10001: Flags [P.], seq 150:186, ack 104, win 502, options [nop,nop,TS val 1794070127 ecr 292741327], length 36
12:49:51.085043 IP shenjie-VirtualBox.47238 > 192.168.56.21.10001: Flags [P.], seq 150:186, ack 104, win 502, options [nop,nop,TS val 1794071796 ecr 292741327], length 36
12:49:54.376554 IP shenjie-VirtualBox.47238 > 192.168.56.21.10001: Flags [P.], seq 150:186, ack 104, win 502, options [nop,nop,TS val 1794075087 ecr 292741327], length 36
12:50:01.032263 IP shenjie-VirtualBox.47238 > 192.168.56.21.10001: Flags [P.], seq 150:186, ack 104, win 502, options [nop,nop,TS val 1794081743 ecr 292741327], length 36
12:50:14.345196 IP shenjie-VirtualBox.47238 > 192.168.56.21.10001: Flags [P.], seq 150:186, ack 104, win 502, options [nop,nop,TS val 1794095056 ecr 292741327], length 36

再次测试一遍,结果如下:

2021/06/03 12:49:57 input text  defagagrg
2021/06/03 12:49:57 start time 2021-06-03 12:49:57.233483241 +0800 CST m=+27.244470546
2021/06/03 12:49:57 http2: Transport encoding header ":authority" = "192.168.56.21:10001"
2021/06/03 12:49:57 http2: Transport encoding header ":method" = "POST"
2021/06/03 12:49:57 http2: Transport encoding header ":path" = "/"
2021/06/03 12:49:57 http2: Transport encoding header ":scheme" = "http"
2021/06/03 12:49:57 http2: Transport encoding header "content-type" = "application/json"
2021/06/03 12:49:57 http2: Transport encoding header "content-length" = "9"
2021/06/03 12:49:57 http2: Transport encoding header "accept-encoding" = "gzip"
2021/06/03 12:49:57 http2: Transport encoding header "user-agent" = "Go-http-client/2.0"
2021/06/03 12:49:57 http2: Framer 0xc00011a000: wrote HEADERS flags=END_HEADERS stream=7 len=10
2021/06/03 12:49:57 http2: Framer 0xc00011a000: wrote DATA flags=END_STREAM stream=7 len=9 data="defagagrg"
2021/06/03 12:49:59 http2: Framer 0xc00011a000: wrote RST_STREAM stream=7 len=4 ErrCode=CANCEL
2021/06/03 12:49:59 RoundTrip failure: context deadline exceeded
2021/06/03 12:49:59 end time 2021-06-03 12:49:59.234699462 +0800 CST m=+29.245686766
2021/06/03 12:49:59 error Post "http://192.168.56.21:10001": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
20210603 日 星期四 12:49:59 CST
tcp               ESTAB               0                    99        192.168.56.22:47238    192.168.56.21:10001 

我们看到程序中写了 HEADERS 和 Data,而从 ss 看到 SEND-Q 又增加了。

为了模拟 SEND-Q 增加到最大情况下,我修改了客户端程序,每次输入都开启 100 个 goroutine 去发送消息。

再多次输入数据后,是的 SEND-Q 达到最大,不再增加,测试观察到程序新上来的请求不能写数据了,并且阻塞在那里,不会 timeout 出来。

2021/06/03 13:04:00 http2: Framer 0xc000104000: wrote HEADERS flags=END_HEADERS stream=1191 len=8
2021/06/03 13:04:00 http2: Transport encoding header ":authority" = "192.168.56.21:10001"
2021/06/03 13:04:00 http2: Transport encoding header ":method" = "POST"
2021/06/03 13:04:00 http2: Transport encoding header ":path" = "/"
2021/06/03 13:04:00 http2: Transport encoding header ":scheme" = "http"
2021/06/03 13:04:00 http2: Transport encoding header "content-type" = "application/json"
2021/06/03 13:04:00 http2: Transport encoding header "content-length" = "96"
2021/06/03 13:04:00 http2: Transport encoding header "accept-encoding" = "gzip"
2021/06/03 13:04:00 http2: Transport encoding header "user-agent" = "Go-http-client/2.0"
2021/06/03 13:04:00 http2: Framer 0xc000104000: wrote HEADERS flags=END_HEADERS stream=1193 len=8

gghshthdhjyjfjagjalgapjagoalmdlfhaogfajgohanldkmflangufhaufodanflafjagoihanjdaghaugalgjagianjdal
2021/06/03 13:04:27 input text  gghshthdhjyjfjagjalgapjagoalmdlfhaogfajgohanldkmflangufhaufodanflafjagoihanjdaghaugalgjagianjdal
2021/06/03 13:04:27 start time 2021-06-03 13:04:27.083829364 +0800 CST m=+119.981749071 in goroutine  30
2021/06/03 13:04:27 start time 2021-06-03 13:04:27.083909942 +0800 CST m=+119.981829639 in goroutine  0
2021/06/03 13:04:27 start time 2021-06-03 13:04:27.083934273 +0800 CST m=+119.981853968 in goroutine  1
20210603 日 星期四 13:04:58 CST
tcp               ESTAB               0                    68891      192.168.56.22:47272    192.168.56.21:10001  
20210603 日 星期四 13:05:54 CST
tcp              ESTAB                  0                  68891      192.168.56.22:47272    192.168.56.21:10001 

这个时候基本上模拟出来 TCP 上 SEND-Q 满的情况,然后得到的结论是,此时发送出来的测试消息不能够被处理,并且一直阻塞,不能 timeout 出来。

而之所以真实环境上,请求端收到了 408,504 的错误代码,是因为我们在 VS 里面设置了 Timeout,所以如果请求端发送的请求由于微服务接收,但是在处理消息的时候,阻塞住了,导致了 Timeout 了。 这里下面会具体将为啥会回复 408 的错误代码,因为正常 Timeout,我们理解为,会回复 504 才对。对于 Istio 回复 504/408 的问题,我将在(二)中具体分析。

程序阻塞分析

添加下面 log

//add log in net/http2/transport.go
func (cc *ClientConn) idleState() clientConnIdleState {
    log.Println("start idleState")
    cc.mu.Lock()
    log.Println("end idleState")
    defer cc.mu.Unlock()
    return cc.idleStateLocked()
}
//add following logs in RoundTripOpt in net/http2/transport.go
        log.Println("start t.connPool().GetClientConn")
        cc, err := t.connPool().GetClientConn(req, addr)
        log.Println("end t.connPool().GetClientConn")
        if err != nil {
            t.vlogf("http2: Transport failed to get client conn for %s: %v", addr, err)
            return nil, err
        }
//add logs in Write in net/http2/transport.go

func (sew stickyErrWriter) Write(p []byte) (n int, err error) {
    if *sew.err != nil {
        return 0, *sew.err
    }
    log.Println("start sew.w.Write")
    n, err = sew.w.Write(p)
    log.Println("end sew.w.Write")
    *sew.err = err
    return
}

发送直到阻塞:

2021/06/03 17:59:03 end in idleStateLocked freshConn false
2021/06/03 17:59:03 http2: Transport encoding header ":authority" = "192.168.56.21:10001"
2021/06/03 17:59:03 http2: Transport encoding header ":method" = "POST"
2021/06/03 17:59:03 http2: Transport encoding header ":path" = "/"
2021/06/03 17:59:03 http2: Transport encoding header ":scheme" = "http"
2021/06/03 17:59:03 http2: Transport encoding header "content-type" = "application/json"
2021/06/03 17:59:03 http2: Transport encoding header "content-length" = "2555"
2021/06/03 17:59:03 http2: Transport encoding header "accept-encoding" = "gzip"
2021/06/03 17:59:03 http2: Transport encoding header "user-agent" = "Go-http-client/2.0"
2021/06/03 17:59:03 start cc.writeHeaders
2021/06/03 17:59:03 http2: Framer 0xc000184460: wrote HEADERS flags=END_HEADERS stream=1969 len=8
2021/06/03 17:59:03 start sew.w.Write

下一条请求:

receive a new request
2021/06/03 17:59:35 start time 2021-06-03 17:59:35.070256042 +0800 CST m=+97.009212543 count 985
2021/06/03 17:59:35 start t.connPool().GetClientConn
2021/06/03 17:59:35 start idleState

再一条请求:

receive a new request
2021/06/03 17:59:42 start time 2021-06-03 17:59:42.93241995 +0800 CST m=+104.871376490 count 986
2021/06/03 17:59:42 start t.connPool().GetClientConn

所以分析下来,当 Write 的不了的时候,就不会释放 cc.mu,导致下一条请求拿不到 cc.mu 锁,从而阻塞了。 其实在这里,我加了 Pprof 去分析 goroutine 的调度栈,也发现了死锁的问题,但是还是加了点 log 看的更明显一点。

然后发现也有人遇到这个问题:x/net/http2: Blocked Write on single connection causes all future calls to block indefinitely

番外

此时如果将 Sever 端网络重新开启 sudo ifup enp0s8, 我们将看到,之前阻塞的消息能够被处理,从 ss 看到 SEND-Q 也被处理掉了。

20210603 日 星期四 13:05:54 CST
tcp              ESTAB                  0                   68891        192.168.56.22:47272       192.168.56.21:10001                              
20210603 日 星期四 13:05:59 CST
tcp              ESTAB                  0                   13           192.168.56.22:47280       192.168.56.21:10001                              
20210603 日 星期四 13:06:04 CST
tcp              ESTAB                  0                   0             192.168.56.22:47280      192.168.56.21:10001 

问题定位

好了,问题复现出来了,我们抛开 istio 的处理,我们只需要关心如何修改程序去避免这个问题。

其实从上面的复现步骤,我们最终能发现,这个 Server 断网的操作,客户端并没有感知到,所以建立的 TCP 连接仍然是 ESTAB 的状态,这从而导致了上层 HTTP 并不能感知到这个 TCP 是坏的,它收到消息,仍然会往这个连接上发送,所以最终导致这个 TCP 上的 SEND-Q 满了,再上来的消息则没法再写,就一直阻塞在这里了。

所以这个问题的根本原因则是在 Server 端断网情况下,客户端并没有去关闭这个连接。所以如果想要修复这个问题,客户端程序则需要感知 Server 是否是正常的,或者客户端收到了某种情况下,可以主动去关闭这个连接。综合考虑下来,及参考 net/http, x/net/http2: TCP connection is not closed after http client request times out after missing RST (TCP) or GOAWAY(HTTP/2) frame 这个 issue,我们决定启用 Ping 消息,去感知这种异常状态。

解决方案

在客户端程序中添加 ReadIdleTimeout,从而开启 Ping 消息。

    tr = &http2.Transport{
        TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
        AllowHTTP:       true,
        DialTLS: func(network, addr string, cfg *tls.Config) (net.Conn, error) {
            d := &net.Dialer{
                Timeout: time.Duration(2 * time.Second),
            }
            return d.Dial(network, addr)
        },
        ReadIdleTimeout: time.Duration(15 * time.Second),
    }

这样我们就能够解决这个问题。

总结

这个问题从系统测试人员提供的信息上,我们大概能分析出来是我们的 HTTP2 客户端程序出现了问题,也就是对于 Server 端断网的情况下,我们的程序没有主动去关闭连接,或者说没有处理这种异常情况,导致连接的 SEND-Q 满了,还在发送消息。从而是后面的消息阻塞在那里,从而导致练级的问题。

该问题的出现,也是由于我们对 golang net/HTTP2 库的理解不够,对其中的很多参数并没有关注。