
    The vhost-user-nvme connection problem during QEMU live migration

    Posted by OenHan on 2018-07-14 12:48:33

    I spent a while developing vhost-user-nvme on top of SPDK, i.e. exposing NVMe storage to the VM. While working on QEMU live migration / hot upgrade for the vhost-user-nvme device, I hit an interesting problem that is worth writing up.

    QEMU version: https://github.com/spdk/qemu.git, preferably with https://review.gerrithub.io/#/c/spdk/qemu/+/406011/

    SPDK version: https://github.com/spdk/spdk

    Symptom: after roughly 1000 QEMU migrations, the vhost process started by SPDK crashes with a segmentation fault.

    The immediate cause is in the SPDK_NVME_OPC_ABORT path under spdk_vhost_nvme_admin_passthrough: the segfault happens when QEMU sends that NVMe admin command down to vhost.

    case SPDK_NVME_OPC_DOORBELL_BUFFER_CONFIG:
        ret = vhost_nvme_doorbell_buffer_config(nvme, req, cpl);
        break;
    case SPDK_NVME_OPC_ABORT:
        /* the faulting code is here */
        sq_tail = nvme->dbbuf_dbs[sq_offset(1, 1)] & 0xffffu;
        cq_head = nvme->dbbuf_dbs[cq_offset(1, 1)] & 0xffffu;
        SPDK_NOTICELOG("ABORT: CID %u, SQ_TAIL %u, CQ_HEAD %u\n",
                       (req->cdw10 >> 16) & 0xffffu, sq_tail, cq_head);

    Clearly, dbbuf_dbs is already NULL at this point. A careful read of the code makes it obvious: dbbuf_dbs is only released once destroy_device_poller_cb has run. spdk_vhost_nvme_stop_device calls destroy_device_poller_cb, and spdk_vhost_nvme_stop_device is what the spdk_vhost_nvme_device_backend structure assigns to stop_device:

    static const struct spdk_vhost_dev_backend spdk_vhost_nvme_device_backend = {
        .start_device = spdk_vhost_nvme_start_device,
        .stop_device = spdk_vhost_nvme_stop_device,
        .dump_info_json = spdk_vhost_nvme_dump_info_json,
        .write_config_json = spdk_vhost_nvme_write_config_json,
        .remove_device = spdk_vhost_nvme_dev_remove,
    };

    Going further up the call chain is stop_device, i.e. spdk_vhost_event_send(vdev, vdev->backend->stop_device, 3, "stop device"); stop_device is likewise a function pointer defined in another structure:

    const struct vhost_device_ops g_spdk_vhost_ops = {
        .new_device = start_device,
        .destroy_device = stop_device,
        .get_config = get_config,
        .set_config = set_config,
        .new_connection = new_connection,
        .destroy_connection = destroy_connection,
        .vhost_nvme_admin_passthrough = spdk_vhost_nvme_admin_passthrough,
        .vhost_nvme_set_cq_call = spdk_vhost_nvme_set_cq_call,
        .vhost_nvme_get_cap = spdk_vhost_nvme_get_cap,
    };

    One level further up is vhost_destroy_device, which finally leads back to vhost_user_read_cb. I had made a change here: upstream, one vhost_user_connection maps to one virtio_net, and I had turned that into a many-to-one mapping, so when vhost_user_read_cb fails to receive a message from QEMU it has to release the vhost_user_connection, and dropping the last reference also releases the virtio_net.

    static void
    vhost_user_read_cb(int connfd, void *dat, int *remove)
    {
        struct vhost_user_connection *conn = dat;
        struct vhost_user_socket *vsocket = conn->vsocket;
        int ret;

        ret = vhost_user_msg_handler(conn->vid, connfd);
        if (ret < 0) {
            close(connfd);
            *remove = 1;
            vhost_destroy_device(conn->vid);

            if (vsocket->notify_ops->destroy_connection)
                vsocket->notify_ops->destroy_connection(conn->vid);

            pthread_mutex_lock(&vsocket->conn_mutex);
            TAILQ_REMOVE(&vsocket->conn_list, conn, next);
            pthread_mutex_unlock(&vsocket->conn_mutex);

            free(conn);

            if (vsocket->reconnect) {
                create_unix_socket(vsocket);
                vhost_user_start_client(vsocket);
            }
        }
    }

    After verification, the problem turned out not to be in the many-to-one mechanism. Tracing along the vhost message-receive path, from vhost_user_msg_handler into read_vhost_message, look at the error-return handling:

    ret = read_vhost_message(fd, &msg);
    if (ret <= 0 || msg.request >= VHOST_USER_MAX) {
        if (ret < 0)
            /* this is the error being returned */
            RTE_LOG(ERR, VHOST_CONFIG, "vhost read message failed\n");
        else if (ret == 0)
            /* normal disconnect notice when the VM shuts down */
            RTE_LOG(INFO, VHOST_CONFIG, "vhost peer closed\n");
        else
            RTE_LOG(ERR, VHOST_CONFIG, "vhost read incorrect message\n");
        return -1;
    }

    Stepping into read_vhost_message, and then into read_fd_message:

    msgh.msg_iov = &iov;
    msgh.msg_iovlen = 1;
    msgh.msg_control = control;
    msgh.msg_controllen = sizeof(control);

    ret = recvmsg(sockfd, &msgh, 0);
    if (ret <= 0) {
        RTE_LOG(ERR, VHOST_CONFIG, "recvmsg failed\n");
        return ret;
    }
    if (msgh.msg_flags & (MSG_TRUNC | MSG_CTRUNC)) {
        RTE_LOG(ERR, VHOST_CONFIG, "truncted msg\n");
        return -1;
    }

    The error is the "truncted msg" one, and the flag set is MSG_CTRUNC. The recvmsg man page describes it as follows:

    MSG_CTRUNC
    indicates that some control data were discarded due to lack of space in the buffer for ancillary data.

    This forced me to brush up on send/receive internals. The msghdr consists of two parts, msg_iov and msg_control, and both can carry information through sendmsg/recvmsg. In read_vhost_message the lengths of both parts are fixed, by VHOST_USER_HDR_SIZE and VHOST_MEMORY_MAX_NREGIONS respectively. So what if the sender's buffer is longer than that? Time to look at the QEMU side.

    Take vhost_set_vring_file as an example: the msg and fds sent by vhost_user_write go through qemu_chr_fe_set_msgfds, where CHARDEV_GET_CLASS(s)->set_msgfds resolves to tcp_set_msgfds, and the fds are copied into write_msgfds:

    s->write_msgfds = g_new(int, num);
    memcpy(s->write_msgfds, fds, num * sizeof(int));

    Back in vhost_user_write, the path goes qemu_chr_fe_write_all, then qemu_chr_write, then qemu_chr_write_buffer, and on to tcp_chr_write, where write_msgfds is finally consumed:

    int ret = io_channel_send_full(s->ioc, buf, len,
                                   s->write_msgfds, s->write_msgfds_num);

    if (s->write_msgfds_num /* && errno != EAGAIN */) {
        g_free(s->write_msgfds);
        s->write_msgfds = NULL;
        s->write_msgfds_num = 0;
    }

    There is also a retry issue here: when tcp_chr_write returns EAGAIN to qemu_chr_write_buffer, tcp_chr_write is retried, but by then write_msgfds has already been freed, so the condition shown commented out above needs to be added. This retry issue matched my symptoms perfectly; yet after testing, the problem was still there, and still reproduced reliably.

    retry:
        res = cc->chr_write(s, buf + *offset, len - *offset);
        if (res < 0 && errno == EAGAIN && write_all) {
            g_usleep(100);
            goto retry;
        }

    Reading through the code, and adding logs in qio_channel_socket_writev to record every field of the msg passed to sendmsg, revealed no abnormal state or length, and sendmsg never returned an error.

    memset(control, 0, CMSG_SPACE(sizeof(int) * SOCKET_MAX_FDS));

    msg.msg_iov = (struct iovec *)iov;
    msg.msg_iovlen = niov;

    if (nfds) {
        if (nfds > SOCKET_MAX_FDS) {
            error_setg_errno(errp, EINVAL,
                             "Only %d FDs can be sent, got %zu",
                             SOCKET_MAX_FDS, nfds);
            return -1;
        }

        msg.msg_control = control;
        msg.msg_controllen = CMSG_SPACE(sizeof(int) * nfds);

        cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_len = CMSG_LEN(fdsize);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        memcpy(CMSG_DATA(cmsg), fds, fdsize);
    }

    retry:
        ret = sendmsg(sioc->fd, &msg, 0);

    Gradually narrowing the scope confirmed it: at around 1000 QEMU migrations, a sendmsg on one side and recvmsg on the other hit the MSG_CTRUNC error. I asked colleagues; everyone's reading matched the man page.

    With no other option, I was thrown into the networking code to read the kernel implementation of sendmsg/recvmsg. Since the problem reproduced once every 5 hours or so of testing, the cause was actually easy to spot:

    In the kernel function scm_detach_fds, the last few lines:

    /* when i == 0 here, msg_flags gets MSG_CTRUNC */
    if (i < fdnum || (fdnum && fdmax <= 0)) {
        msg->msg_flags |= MSG_CTRUNC;

    /* going back up in the code, inside the loop: i can only be 0 if we
       break out on the very first iteration */
    for (i = 0, cmfptr = (__force int __user *)CMSG_DATA(cm); i < fdmax;
         i++, cmfptr++) {
        struct socket *sock;
        int new_fd;

        err = security_file_receive(fp[i]);
        if (err)
            break;
        /* get_unused_fd_flags is confirmed to fail here */
        err = get_unused_fd_flags(MSG_CMSG_CLOEXEC & msg->msg_flags
                                  ? O_CLOEXEC : 0);
        if (err < 0)
            break;
    }

    So the kernel code analysis shows: when the process calling recvmsg has exhausted its fds, it also receives the MSG_CTRUNC flag.

    Running ll /proc/21383/fd/ makes it visible:

    lrwx------ 1 root root 64 Aug 16 23:08 952 -> anon_inode:[eventfd]
    lrwx------ 1 root root 64 Aug 16 23:08 953 -> anon_inode:[eventfd]
    lrwx------ 1 root root 64 Aug 16 23:09 954 -> anon_inode:[eventfd]
    lrwx------ 1 root root 64 Aug 16 23:09 955 -> anon_inode:[eventfd]
    lrwx------ 1 root root 64 Aug 16 23:09 956 -> anon_inode:[eventfd]
    lrwx------ 1 root root 64 Aug 16 23:09 957 -> anon_inode:[eventfd]
    lrwx------ 1 root root 64 Aug 16 23:10 958 -> anon_inode:[eventfd]
    lrwx------ 1 root root 64 Aug 16 23:10 959 -> anon_inode:[eventfd]
    lrwx------ 1 root root 64 Aug 16 20:03 96 -> anon_inode:[eventfd]
    lrwx------ 1 root root 64 Aug 16 23:10 960 -> anon_inode:[eventfd]
    lrwx------ 1 root root 64 Aug 16 23:10 961 -> anon_inode:[eventfd]
    lrwx------ 1 root root 64 Aug 16 23:10 962 -> anon_inode:[eventfd]
    lrwx------ 1 root root 64 Aug 16 23:11 963 -> anon_inode:[eventfd]

    Where do all these fds in vhost come from? QEMU uses nvme_admin_cmd to send NVME_ADM_CMD_DB_BUFFER_CFG to vhost, and spdk_vhost_nvme_set_cq_call, when installing the virq, needs to release the previous value; it did not, so every migration round leaked one eventfd.

    As for why the crash happens at around 1000 iterations, that is because:

    ulimit -a | grep files
    open files (-n) 1024

    The end.


    "The vhost-user-nvme connection problem during QEMU live migration" comes from OenHan.

    Link: http://oenhan.com/qemu-spdk-vhost-user-nvme


