Why does a nova compute node report negative free disk space?

Note: this article applies to the Kilo release.

While using OpenStack, I ran into a compute node that reported a negative amount of available disk space. Let's walk through the code to find out why.

In the nova-compute service running on the compute node, a periodic task, update_available_resource, is responsible for collecting and reporting resource statistics:

    @periodic_task.periodic_task
    def update_available_resource(self, context):
        """See driver.get_available_resource()

        Periodic process that keeps that the compute host's understanding of
        resource availability and usage in sync with the underlying hypervisor.

        :param context: security context
        """

This function calls the ResourceTracker interface to obtain the available resources:

            rt = self._get_resource_tracker(nodename)
            rt.update_available_resource(context)

ResourceTracker, in turn, calls the libvirt driver to actually collect the resource statistics:

    def update_available_resource(self, context):
        """Override in-memory calculations of compute node resource usage based
        on data audited from the hypervisor layer.

        Add in resource claims in progress to account for operations that have
        declared a need for resources, but not necessarily retrieved them from
        the hypervisor layer yet.
        """
        LOG.info(_LI("Auditing locally available compute resources for "
                     "node %(node)s"),
                 {'node': self.nodename})
        resources = self.driver.get_available_resource(self.nodename)

The function that collects these statistics is defined in virt/libvirt/driver.py:

    def get_available_resource(self, nodename):
        """Retrieve resource information.

        This method is called when nova-compute launches, and
        as part of a periodic task that records the results in the DB.

        :param nodename: will be put in PCI device
        :returns: dictionary containing resource info
        """

        disk_info_dict = self._get_local_gb_info()
        data = {}

        # NOTE(dprince): calling capabilities before getVersion works around
        # an initialization issue with some versions of Libvirt (1.0.5.5).
        # See: https://bugzilla.redhat.com/show_bug.cgi?id=1000116
        # See: https://bugs.launchpad.net/nova/+bug/1215593

        # Temporary convert supported_instances into a string, while keeping
        # the RPC version as JSON. Can be changed when RPC broadcast is removed
        data["supported_instances"] = jsonutils.dumps(
            self._get_instance_capabilities())

        data["vcpus"] = self._get_vcpu_total()
        data["memory_mb"] = self._get_memory_mb_total()
        data["local_gb"] = disk_info_dict['total']
        data["vcpus_used"] = self._get_vcpu_used()
        data["memory_mb_used"] = self._get_memory_mb_used()
        data["local_gb_used"] = disk_info_dict['used']
        data["hypervisor_type"] = self._host.get_driver_type()
        data["hypervisor_version"] = self._host.get_version()
        data["hypervisor_hostname"] = self._host.get_hostname()
        # TODO(berrange): why do we bother converting the
        # libvirt capabilities XML into a special JSON format ?
        # The data format is different across all the drivers
        # so we could just return the raw capabilities XML
        # which 'compare_cpu' could use directly
        #
        # That said, arch_filter.py now seems to rely on
        # the libvirt drivers format which suggests this
        # data format needs to be standardized across drivers
        data["cpu_info"] = jsonutils.dumps(self._get_cpu_info())

        disk_free_gb = disk_info_dict['free']
        disk_over_committed = self._get_disk_over_committed_size_total()
        available_least = disk_free_gb * units.Gi - disk_over_committed
        data['disk_available_least'] = available_least / units.Gi

        data['pci_passthrough_devices'] = \
            self._get_pci_passthrough_devices()

        numa_topology = self._get_host_numa_topology()
        if numa_topology:
            data['numa_topology'] = numa_topology._to_json()
        else:
            data['numa_topology'] = None

        return data

Let's look at the disk-related parts. First, it calls the following static method of the libvirt driver, which returns three values (total, free and used) in gigabytes:

    @staticmethod
    def get_local_gb_info():
        """Get local storage info of the compute node in GB.

        :returns: A dict containing:
             :total: How big the overall usable filesystem is (in gigabytes)
             :free: How much space is free (in gigabytes)
             :used: How much space is used (in gigabytes)
        """

        if CONF.libvirt.images_type == 'lvm':
            info = libvirt_utils.get_volume_group_info(
                CONF.libvirt.images_volume_group)
        else:
            info = libvirt_utils.get_fs_info(CONF.instances_path)

        for (k, v) in info.iteritems():
            info[k] = v / units.Gi  # Note: every value is converted to GB here!

        return info

As get_local_gb_info shows, if instances are stored on a filesystem rather than on LVM, the following function is called to get the data:

    def get_fs_info(path):
        """Get free/used/total space info for a filesystem

        :param path: Any dirent on the filesystem
        :returns: A dict containing:

                 :free: How much space is free (in bytes)
                 :used: How much space is used (in bytes)
                 :total: How big the filesystem is (in bytes)
        """
        hddinfo = os.statvfs(path)
        total = hddinfo.f_frsize * hddinfo.f_blocks
        free = hddinfo.f_frsize * hddinfo.f_bavail
        used = hddinfo.f_frsize * (hddinfo.f_blocks - hddinfo.f_bfree)
        return {'total': total,
                'free': free,
                'used': used}

The information returned by get_fs_info is essentially the same as what the df command shows:

[root@host123 ~]# python
Python 2.7.5 (default, Feb 11 2014, 07:46:25)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-13)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> hddinfo = os.statvfs("/var/lib/nova")
>>> total = hddinfo.f_frsize * hddinfo.f_blocks
>>> free = hddinfo.f_frsize * hddinfo.f_bavail
>>> used = hddinfo.f_frsize * (hddinfo.f_blocks - hddinfo.f_bfree)
>>>
>>> print total/1024/1024/1024
254
>>> print free/1024/1024/1024
194
>>> print used/1024/1024/1024
46

[root@host123 ~]# df -h
Filesystem                   Size  Used Avail Use% Mounted on
/dev/mapper/vg_sys-lv_root    20G  3.6G   16G  20% /
devtmpfs                      11G     0   11G   0% /dev
tmpfs                         12G     0   12G   0% /dev/shm
tmpfs                         12G   83M   12G   1% /run
tmpfs                         12G     0   12G   0% /sys/fs/cgroup
/dev/sda1                    380M   96M  260M  27% /boot
/dev/mapper/vg_nova-lv_nova  255G   47G  195G  20% /var/lib/nova
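
In the session above, free plus used (194 + 46 = 240) is smaller than total (254). One likely reason is that free is computed from f_bavail, the blocks available to unprivileged users, while used is computed from f_bfree, which also counts blocks reserved for root. The sketch below repeats the same statvfs arithmetic in Python 3 and reports the reserved part separately. The function name fs_info_gb and the "reserved" key are illustrative additions, not part of nova.

```python
import os

GIB = 1024 ** 3  # bytes per GiB, the same value as nova's units.Gi


def fs_info_gb(path):
    """Return total/free/used/reserved space of the filesystem holding `path`.

    All values are in whole GiB. "reserved" is a hypothetical extra field
    that get_fs_info does not report.
    """
    st = os.statvfs(path)
    sizes = {
        "total": st.f_frsize * st.f_blocks,
        # f_bavail: free blocks available to unprivileged users
        "free": st.f_frsize * st.f_bavail,
        # f_bfree: all free blocks, including those reserved for root
        "used": st.f_frsize * (st.f_blocks - st.f_bfree),
        "reserved": st.f_frsize * (st.f_bfree - st.f_bavail),
    }
    return {key: value // GIB for key, value in sizes.items()}


info = fs_info_gb("/")  # on a compute node you would pass the instances path
print(info)
```

On the compute node in this article the path would be /var/lib/nova; "/" is used here only so the snippet runs anywhere.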

get_available_resource uses the total and used values directly, but note that free is not reported as-is. Instead it is turned into disk_available_least:

        disk_free_gb = disk_info_dict['free']
        disk_over_committed = self._get_disk_over_committed_size_total()
        available_least = disk_free_gb * units.Gi - disk_over_committed
        data['disk_available_least'] = available_least / units.Gi

As you can see, it takes the disk_free_gb reported by the operating system and then subtracts disk_over_committed from it.
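
To make the arithmetic concrete, here is a minimal Python 3 sketch of the same formula. The free space (194 GB) is taken from the statvfs session above. The per-instance over-commit of 15 GiB (a 20 GiB virtual disk backed by a 5 GiB file) and the instance counts are assumptions chosen for illustration. Python 3 needs `//` where the Python 2 original uses `/` on integers.

```python
GIB = 1024 ** 3  # same value as nova's units.Gi


def disk_available_least_gb(disk_free_gb, disk_over_committed_bytes):
    """Mirror the driver's formula: free space minus total over-commit, in GB."""
    available_least = disk_free_gb * GIB - disk_over_committed_bytes
    return available_least // GIB


# Assumed example: one qcow2 disk with a 20 GiB virtual size and a 5 GiB file.
one_instance = 21474836480 - 5 * GIB  # 15 GiB of over-commit

print(disk_available_least_gb(194, one_instance))       # 179
print(disk_available_least_gb(194, 13 * one_instance))  # -1
```

With thirteen such instances the over-commit (195 GiB) exceeds the 194 GB that is free, so the reported value drops below zero even though the filesystem itself still has plenty of room.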

Let's see how _get_disk_over_committed_size_total obtains its value. This function is also a member of the libvirt driver:

    def _get_disk_over_committed_size_total(self):
        """Return total over committed disk size for all instances."""
        # Disk size that all instance uses : virtual_size - disk_size
        disk_over_committed_size = 0
        for dom in self._host.list_instance_domains():
            try:
                xml = dom.XMLDesc(0)
                disk_infos = jsonutils.loads(
                        self._get_instance_disk_info(dom.name(), xml))
                for info in disk_infos:
                    disk_over_committed_size += int(
                        info['over_committed_disk_size'])
            except ...  # (exception handling omitted here)
            # NOTE(gtt116): give other tasks a chance.
            greenthread.sleep(0)
        return disk_over_committed_size

It fetches over_committed_disk_size for each instance in turn and adds them all up.

This implies that some instances are already over-committing disk space. So where does the over-commit come from?

For each instance, over_committed_disk_size is obtained by the following function:

    def _get_instance_disk_info(self, instance_name, xml,
                                block_device_info=None):
        block_device_mapping = driver.block_device_info_get_mapping(
            block_device_info)

        volume_devices = set()
        for vol in block_device_mapping:
            disk_dev = vol['mount_device'].rpartition("/")[2]
            volume_devices.add(disk_dev)

        disk_info = []
        doc = etree.fromstring(xml)
        disk_nodes = doc.findall('.//devices/disk')
        path_nodes = doc.findall('.//devices/disk/source')
        driver_nodes = doc.findall('.//devices/disk/driver')
        target_nodes = doc.findall('.//devices/disk/target')

        for cnt, path_node in enumerate(path_nodes):
            disk_type = disk_nodes[cnt].get('type')
            path = path_node.get('file') or path_node.get('dev')
            target = target_nodes[cnt].attrib['dev']

            if not path:
                LOG.debug('skipping disk for %s as it does not have a path',
                          instance_name)
                continue

            if disk_type not in ['file', 'block']:
                LOG.debug('skipping disk because it looks like a volume', path)
                continue

            if target in volume_devices:
                LOG.debug('skipping disk %(path)s (%(target)s) as it is a '
                          'volume', {'path': path, 'target': target})
                continue

            # get the real disk size or
            # raise a localized error if image is unavailable
            if disk_type == 'file':
                dk_size = int(os.path.getsize(path))
            elif disk_type == 'block':
                dk_size = lvm.get_volume_size(path)

            disk_type = driver_nodes[cnt].get('type')
            if disk_type == "qcow2":
                backing_file = libvirt_utils.get_disk_backing_file(path)
                virt_size = disk.get_disk_size(path)
                over_commit_size = int(virt_size) - dk_size
            else:
                backing_file = ""
                virt_size = dk_size
                over_commit_size = 0

            disk_info.append({'type': disk_type,
                              'path': path,
                              'virt_disk_size': virt_size,
                              'backing_file': backing_file,
                              'disk_size': dk_size,
                              'over_committed_disk_size': over_commit_size})
        return jsonutils.dumps(disk_info)

For example, for a qcow2 image, the over-commit size equals virt_size minus dk_size:

[root@host123 ~]# ll -h /var/lib/nova/instances/109291c0-0bf0-412c-9e87-6ab01e16bc06/disk
-rw-r--r-- 1 root root 5.0G Feb 25 11:41 /var/lib/nova/instances/109291c0-0bf0-412c-9e87-6ab01e16bc06/disk

The actual size of the image file, dk_size, is 5.0G. Now let's use the qemu-img command to look at the qcow2 details:

[root@host123 ~]# qemu-img info /var/lib/nova/instances/109291c0-0bf0-412c-9e87-6ab01e16bc06/disk
image: /var/lib/nova/instances/109291c0-0bf0-412c-9e87-6ab01e16bc06/disk
file format: qcow2
virtual size: 20G (21474836480 bytes)
disk size: 4.9G
cluster_size: 65536
backing file: /var/lib/nova/instances/_base/afd631de55a9b7026775a4a1ada098a9ae6888c7
Format specific information:
compat: 0.10

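The virtual size minus the disk size is the over_commit_size. Note that the code takes dk_size from os.path.getsize, that is, the 5.0G file size shown by ll rather than the 4.9G "disk size" printed by qemu-img, so for this image the over-commit is roughly 20G - 5.0G = 15G.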

As you can see, only qcow2 images get this over-commit accounting. For every other file, over_commit_size is 0.
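
The per-disk rule and the summation can be condensed into a small standalone Python 3 sketch. It is not the nova code: the helper names over_committed_size and total_over_committed are made up for illustration, and the disk records are sample data shaped like the dicts that _get_instance_disk_info builds.

```python
GIB = 1024 ** 3


def over_committed_size(driver_type, virt_size, dk_size):
    """Only qcow2 disks count as over-committed; everything else yields 0."""
    if driver_type == "qcow2":
        return int(virt_size) - dk_size
    return 0


def total_over_committed(disk_infos):
    """Add up the over-commit of every disk, as the driver does per instance."""
    return sum(
        over_committed_size(d["type"], d["virt_disk_size"], d["disk_size"])
        for d in disk_infos
    )


# Sample data (assumed): one qcow2 disk like the one above, plus one raw disk.
disks = [
    {"type": "qcow2", "virt_disk_size": 20 * GIB, "disk_size": 5 * GIB},
    {"type": "raw", "virt_disk_size": 10 * GIB, "disk_size": 10 * GIB},
]

print(total_over_committed(disks) // GIB)  # 15
```

The raw disk contributes nothing because its file already occupies its full size, so only the qcow2 disk's 15 GiB ends up in the total.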

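As we know, the DiskFilter in the nova scheduler uses disk_allocation_ratio to over-subscribe disk resources. That is not the same concept as the over-commit discussed here. disk_allocation_ratio is over-subscription as seen from the controller node, and the compute node is not aware of it. The over-commit here is what the compute node sees once it accounts for qcow2 images occupying less space on disk than their virtual size: the free space it finally reports is what would remain if every qcow2 image file grew to its full virtual size. This is why the reported free space can be smaller than the free space you see with your own eyes.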
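
The difference between the two views can be sketched with a deliberately simplified Python 3 model. This is not the actual DiskFilter or driver code. Both function names are hypothetical, the ratio of 1.5 and the 150 GB of over-commit are assumed values, and 254/194 are the total and free figures from the statvfs session earlier.

```python
def scheduler_usable_gb(total_gb, free_gb, disk_allocation_ratio):
    """Controller-side view: capacity scaled by the ratio, minus what is used."""
    limit_gb = total_gb * disk_allocation_ratio
    used_gb = total_gb - free_gb
    return limit_gb - used_gb


def compute_available_least_gb(free_gb, over_committed_gb):
    """Compute-node view: free space minus the room qcow2 disks may still claim."""
    return free_gb - over_committed_gb


# Assumed inputs: ratio 1.5 and 150 GB of accumulated qcow2 over-commit.
print(scheduler_usable_gb(254, 194, 1.5))    # 321.0
print(compute_available_least_gb(194, 150))  # 44
```

In this model the ratio inflates the capacity the scheduler is willing to hand out, while the compute node's figure only ever shrinks as qcow2 disks are created.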

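If the administrator names a specific compute node at deployment time, the scheduling flow is skipped and the virtual machine is forced onto that node, taking space that has already been set aside for over-committed disks. This can eventually make the disk space reported by the compute node negative. Moreover, as the virtual machines' real disk usage keeps growing, the compute node may in the end actually run out of disk space.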
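
A small Python 3 simulation shows how forced placement pushes the reported value below zero. All numbers except the 194 GB of free space are assumptions: each forced instance gets a 20 GB qcow2 disk whose file initially occupies 1 GB, and the function name available_least_after is made up for this sketch.

```python
def available_least_after(n_instances, free_gb=194, virtual_gb=20,
                          initial_file_gb=1):
    """Reported disk_available_least after forcing n instances onto the node."""
    # Each new disk file consumes a little real space right away...
    free_after = free_gb - n_instances * initial_file_gb
    # ...and the rest of its virtual size is booked as over-commit.
    over_committed = n_instances * (virtual_gb - initial_file_gb)
    return free_after - over_committed


for n in (5, 9, 10):
    print(n, available_least_after(n))
# 5 94
# 9 14
# 10 -6
```

Under these assumptions the tenth forced instance turns the figure negative while df would still show 184 GB free, which matches the symptom described at the start of the article.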

0 0