案例学习-Sysbench压测故障

背景:

故障源于知识星球的一个案例:

我们的数据库需要做在线升级丝滑的验证,所以构造了一个测试环境,客户端Sysbench 用长连接一直打压力,同时Server 端的数据库做在线升级,这个在线升级会让 Server进程重启,所以毫无疑问连接会断开重连,所以期望升级的时候 Sysbench端 QPS 跌0几秒钟然后快速恢复。

但是每次升级都是 Sysbench端 QPS永久跌0,再也不能恢复,所以需要分析为什么,问题出在哪里?有人说是服务端的问题因为只有服务端做了变更

整个测试过程中 Sysbench 是配置的2个连接去压 Server

故障重现:

https://articles.zsxq.com/id_zj5qazqa4odz.html 下面也保留一份。

  • 1. 数据库
docker run -it -d --net=host -e MYSQL_ROOT_PASSWORD=123 --name=plantegg mysql
  • 2. 安装依赖包
yum install -y java-1.8.0-openjdk.x86_64 java-1.8.0-openjdk-devel.x86_64  podman-docker.noarch wireshark sysbench mysql.x86_64

mysql -h127.1 -uroot -p123

  • 3. 生成数据库
mysql -h127.1  -uroot -p123 -e "create database test"
  • 4. 生成测试数据
sysbench --mysql-user='root' --mysql-password='123' --mysql-db='test' --mysql-host='127.0.0.1' --mysql-port='3306' --tables='16'  --table-size='10000' --range-size='5' --db-ps-mode='disable' --skip-trx='on' --mysql-ignore-errors='all' --time='1180' --report-interval='1' --histogram='on' --threads=1 oltp_read_only prepare
  • 5. 压测
sysbench --mysql-user='root' --mysql-password='123' --mysql-db='test' --mysql-host='127.0.0.1' --mysql-port='3306' --tables='16'  --table-size='10000' --range-size='5' --db-ps-mode='disable' --skip-trx='on' --mysql-ignore-errors='all' --time='1180' --report-interval='1' --histogram='on' --threads=1 oltp_read_only run
  • 6. kill,mysql进程,复现故障
#  mysql -h127.1  -uroot -p123
Welcome to the MariaDB monitor.  Commands end with ; or \g.
Your MySQL connection id is 14
Server version: 8.3.0 MySQL Community Server - GPL

Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

MySQL [(none)]> show processlist;
+----+-----------------+-----------------+------+---------+------+------------------------+------------------+
| Id | User            | Host            | db   | Command | Time | State                  | Info             |
+----+-----------------+-----------------+------+---------+------+------------------------+------------------+
|  5 | event_scheduler | localhost       | NULL | Daemon  |   94 | Waiting on empty queue | NULL             |
| 13 | root            | 127.0.0.1:33492 | test | Sleep   |    0 |                        | NULL             |
| 14 | root            | 127.0.0.1:51398 | NULL | Query   |    0 | init                   | show processlist |
+----+-----------------+-----------------+------+---------+------+------------------------+------------------+
MySQL [(none)]> kill 13;
Query OK, 0 rows affected (0.000 sec)

 

故障现象

  • 1.sysbench tps归 0

这个地方开始没有注意到, 由于已经给出分析结果的案例, 自己没有去主动分析。后面还是要先自己分析一下再看分析结果。

[ 20s ] thds: 1 tps: 620.98 qps: 8699.68 (r/w/o: 8699.68/0.00/0.00) lat (ms,95%): 2.03 err/s: 0.00 reconn/s: 0.00
[ 21s ] thds: 1 tps: 657.00 qps: 9196.98 (r/w/o: 9196.98/0.00/0.00) lat (ms,95%): 1.86 err/s: 0.00 reconn/s: 0.00
[ 22s ] thds: 1 tps: 660.02 qps: 9238.30 (r/w/o: 9238.30/0.00/0.00) lat (ms,95%): 1.79 err/s: 0.00 reconn/s: 0.00
[ 23s ] thds: 1 tps: 509.99 qps: 7136.82 (r/w/o: 7136.82/0.00/0.00) lat (ms,95%): 1.67 err/s: 0.00 reconn/s: 0.00
[ 24s ] thds: 1 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
  • 2. 使用访问3306无法链接, 使用mysql访问返回99

plantegg 还验证了不同错误的返回信息,举一反三的能力

# mysql -h127.1 -uroot -p123
ERROR 2002 (HY000): Can't connect to MySQL server on '127.1' (99)
# perror 99
OS error code 99: Cannot assign requested address
  • 3. tcp链接不停增长, 大量CLOSE_WAIT状态
# ss -s
Total: 21548
TCP:   26392 (estab 298, closed 5002, orphaned 12, timewait 5000)

Transport Total     IP        IPv6
RAW       0         0         0
UDP       8         5         3
TCP       21390     21226     164
INET      21398     21231     167
FRAG      0         0         0

# netstat -an|grep 3306
tcp      132      0 127.0.0.1:60010         127.0.0.1:3306          CLOSE_WAIT
tcp      132      0 127.0.0.1:60002         127.0.0.1:3306          CLOSE_WAIT
tcp        0      0 127.0.0.1:34984         127.0.0.1:3306          ESTABLISHED
。。。
  • 4. sysbench cpu增长
# top
4156 root      20   0  545220 461792   6272 S  84.7  23.9   2:24.99 sysbench
。。。
#perf top -p 4156  (sysbench进程号)
Samples: 50K of event 'cpu-clock:pppH', 4000 Hz, Event count (approx.): 8617272536 lost: 0/0 drop: 0/0
Overhead  Shared Object       Symbol
  76.10%  [kernel]            [k] __inet_check_established
  11.45%  [kernel]            [k] __inet_hash_connect
。。。

 

故障分析

 

减少本地端口分配快速进入”状态”:

echo "60000 60010" > /proc/sys/net/ipv4/ip_local_port_range

使用tcpdump抓包查看

tcpdump -i lo port 3306 -w server.pcap

kill mysql进程后可以看到,3306端口发送FIN要求主动关闭,客户端没有响应,并再次发查询时,mysql回复RST。

从后面抓一个包看下,systbench使用了新的本地端口60004重新发起请求,这是MySQL返回1159

#perror 1159
MariaDB error code 1159 (ER_NET_READ_INTERRUPTED): Got timeout reading communication packet

mariadb发送 Greeting 10秒后,返回1159

server/sql/sql_const.h
/*
  Default time to wait before aborting a new client connection
  that does not respond to "initial server greeting" timely
*/
#define CONNECT_TIMEOUT        10

客户端这种处理导致大量链接处于 CLOSE WAIT状态。

CLOSE WAIT状态:

被动关闭时出现的状态,当状态由ESTABLEISTED收到FIN时进入CLOSE WAIT状态, 被动关闭一方没有等待调用close()系统调用,出现此状态。此案例中是客户端未关闭。

图片from wiki

 

Recv-Q

Recv-Q队列中有132字节数据待处理,这时的状态是CLOSE_WAIT。可以看到最后一个ACK报文中Ack = 133, 也就是在等待第133字节报文。那么前面已经收到了132字节数据。

man netstat 中对有Recv-Q的说明

Recv-Q
    Established:  The  count of bytes not copied by the user program connected to this socket.  Listening: Since Kernel
    2.6.18 this column contains the current syn backlog.

Send-Q
    Established: The count of bytes not acknowledged by the remote host.  Listening: Since Kernel  2.6.18  this  column
    contains the maximum size of the syn backlog.

 

Recv-Q计算方法

http://blog.chinaunix.net/uid-20662820-id-3509532.html

//net/ipv4/tcp_diag.c

static void tcp_diag_get_info(struct sock *sk, struct inet_diag_msg *r,
                  void *_info)
{
    struct tcp_info *info = _info;

    if (inet_sk_state_load(sk) == TCP_LISTEN) {
        r->idiag_rqueue = READ_ONCE(sk->sk_ack_backlog);
        r->idiag_wqueue = READ_ONCE(sk->sk_max_ack_backlog);
    } else if (sk->sk_type == SOCK_STREAM) {
        const struct tcp_sock *tp = tcp_sk(sk);

        r->idiag_rqueue = max_t(int, READ_ONCE(tp->rcv_nxt) -
                         READ_ONCE(tp->copied_seq), 0);
        r->idiag_wqueue = READ_ONCE(tp->write_seq) - tp->snd_una;
    }
    if (info)
        tcp_get_info(sk, info);
}

打印一下tp->rcv_nxt, tp->copied_seq 期待要接收的下一个序号,已经拷贝到用户的的序号

[root@iZ8vbd88lmglnbsnad85q3Z bpf]# vi state.t
#include <linux/skbuff.h>
#include <linux/ip.h>
#include <linux/socket.h>
#include <linux/netdevice.h>
#include <linux/tcp.h>

BEGIN {
  @tcp_stat[TCP_ESTABLISHED] = "TCP_ESTABLISHED";
  @tcp_stat[TCP_SYN_SENT] = "TCP_SYN_SENT";
  @tcp_stat[TCP_SYN_RECV] = "TCP_SYN_RECV";
  @tcp_stat[TCP_FIN_WAIT1] = "TCP_FIN_WAIT1";
  @tcp_stat[TCP_FIN_WAIT2] = "TCP_FIN_WAIT2";
  @tcp_stat[TCP_TIME_WAIT] = "TCP_TIME_WAIT";
  @tcp_stat[TCP_CLOSE] = "TCP_CLOSE";
  @tcp_stat[TCP_CLOSE_WAIT] = "TCP_CLOSE_WAIT";
  @tcp_stat[TCP_LAST_ACK] = "TCP_LAST_ACK";
  @tcp_stat[TCP_LISTEN] = "TCP_LISTEN";
  @tcp_stat[TCP_CLOSING] = "TCP_CLOSING";
  @tcp_stat[TCP_NEW_SYN_RECV] = "TCP_NEW_SYN_RECV";
}

kprobe:tcp_set_state
{
  // 1. 第一个参数是 struct sock
  $sk = (struct sock*)arg0;

  // 2. 获取struct tcp_sock
  $tp = (struct tcp_sock *)($sk);

  // 3. 打印 等待接收的下一个tcp段的序号 尚未从内核空间copy到用户空间的段最小的一个序号
  printf("comm %d %s %s : rcv_ntx %lu copied_seq %lu\n", pid, comm, @tcp_stat[$sk->sk_state], $tp->rcv_nxt, $tp->copied_seq);
}

END{
  clear(@tcp_stat);
}

参考链接 https://github.com/bpftrace/bpftrace/tree/master/tools

bpftrace state.t  | tee bpftrace.log

同时进行抓包

当时的疑惑点在 Server Greeting 77个字节, Response Error 1159 54个字节 , 131个字节,怎么Recv-Q显示是132呢? 后来翻到上面链接才看到这个值是一个差值,(等待接收下一个tcp序号 – 尚未从内核空间拷贝到用户空间的最小的序号)

Server Greeting Response Error 1159

FIN

sysbench最后一个ACK

# 对应的值
comm 7830 sysbench TCP_CLOSE_WAIT : rcv_ntx 949563644 copied_seq 949563512

往前找到949563512 ,Server Greeting 的报文开始序号。 

sysbench 一直再重连但是未发送mysql相关报文

相关故障处理方法:

https://www.ctyun.cn/developer/article/405333884604485

https://articles.zsxq.com/id_d27pgzmhuq08.html

https://github.com/akopytov/sysbench/pull/528

static int mysql_drv_reconnect(db_conn_t *sb_con)
{
  db_mysql_conn_t *db_mysql_con = (db_mysql_conn_t *) sb_con->ptr;
  MYSQL *con = db_mysql_con->mysql;

  log_text(LOG_DEBUG, "Reconnecting");

  DEBUG("mysql_close(%p)", con);
  mysql_close(con);
  mysql_init(con);

  while (mysql_drv_real_connect(db_mysql_con))
  {
    if (sb_globals.error)
      return DB_ERROR_FATAL;

    usleep(1000);
    mysql_close(con);
    mysql_init(con);
  }

  log_text(LOG_DEBUG, "Reconnected");

  return DB_ERROR_IGNORABLE;
}

 

ip_local_port_range

/proc/sys/net/ipv4/ip_local_port_range

另外一个现象就是本地端口使用,ss -s tcp链接不停增长导致本地端口用尽。__inet_check_established

//net/ipv4/inet_hashtables.c
/* called with local bh disabled */
static int __inet_check_established(struct inet_timewait_death_row *death_row,
                    struct sock *sk, __u16 lport,
                    struct inet_timewait_sock **twp)
{
...

    spin_lock(lock);
  
    sk_nulls_for_each(sk2, node, &head->chain) {
        if (sk2->sk_hash != hash)
            continue;
        // 四元组匹配
        if (likely(inet_match(net, sk2, acookie, ports, dif, sdif))) {
            if (sk2->sk_state == TCP_TIME_WAIT) {
                tw = inet_twsk(sk2);
                if (twsk_unique(sk, sk2, twp))
                    break;
            }
            goto not_unique;
        }
    }

...

not_unique:
    spin_unlock(lock);
    return -EADDRNOTAVAIL;
}

由于这个测试场景,源地址,目标地址,目标端口都是固定的, 那么只有源端口变化。所以当源端口用尽, 返回99

An application using the connect() call fails with an error code EADDRNOTAVAIL [99], 也就是mysql返回的那个返回码

最后发送的RST
bpftrace -e 'kprobe:tcp_send_active_reset {printf("kstack: %s\n", kstack);}'

kstack:
tcp_send_active_reset+1
tcp_close+317
inet_release+66
__sock_release+61
sock_close+17
__fput+150
task_work_run+92
do_exit+549
do_group_exit+51
get_signal+337
arch_do_signal_or_restart+236
exit_to_user_mode_loop+179
exit_to_user_mode_prepare+110
syscall_exit_to_user_mode+18
entry_SYSCALL_64_after_hwframe+97

手工终止sysbench后 通过bpftrace可以看到,在程序退出时如果有未处理的数据, 操作系统发送了RST包。

/net/ipv4/tcp.c

/* As outlined in RFC 2525, section 2.17, we send a RST here because
 * data was lost. To witness the awful effects of the old behavior of
 * always doing a FIN, run an older 2.1.x kernel or 2.0.x, start a bulk
 * GET in an FTP client, suspend the process, wait for the client to
 * advertise a zero window, then kill -9 the FTP client, wheee...
 * Note: timeout is always zero in such a case.
 */
2558         } else if (data_was_unread) {
2559                 /* Unread data was tossed, zap the connection. */
2560                 NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPABORTONCLOSE);
2561                 tcp_set_state(sk, TCP_CLOSE);
2562                 tcp_send_active_reset(sk, sk->sk_allocation);

不太熟悉bpf实现原理,这里tcp_close+317 定位到。

2600         sk_stream_wait_close(sk, timeout);

后面可以梳理一下tcp_close的处理流程

参考链接 https://www.iteye.com/blog/simohayha-503856

 

我正在「程序员案例分享圈」和朋友们讨论有趣的话题,你⼀起来吧?https://t.zsxq.com/18VYd8fPm

 

参考及引用

http://blog.chinaunix.net/uid-20662820-id-3509532.html

https://www.ctyun.cn/developer/article/405333884604485

https://plantegg.github.io/2021/04/06/%E4%B8%BA%E4%BB%80%E4%B9%88%E8%BF%99%E4%B9%88%E5%A4%9ACLOSE_WAIT/

https://plantegg.github.io/2019/04/21/netstat%E5%AE%9A%E4%BD%8D%E6%80%A7%E8%83%BD%E6%A1%88%E4%BE%8B/

https://plantegg.github.io/2020/11/30/%E4%B8%80%E5%8F%B0%E6%9C%BA%E5%99%A8%E4%B8%8A%E6%9C%80%E5%A4%9A%E8%83%BD%E5%88%9B%E5%BB%BA%E5%A4%9A%E5%B0%91%E4%B8%AATCP%E8%BF%9E%E6%8E%A5/

图片from chatgpt

Comments are closed.