Notes on "Improving Linux kernel networking performance"


Original article:
https://lwn.net/Articles/629155/
By Jonathan Corbet
January 13, 2015

    • Time budgets
      • Only 67.2 ns
      • Feasibility analysis
      • Possible approaches
        • Batching operations
        • Lock avoidance
        • Fewer system calls
        • Cache optimization
      • Improving batching: the latency/throughput tradeoff
      • Memory management: bypass it for performance

Time budgets

Only 67.2 ns

The smallest Ethernet frame that can be sent is 84 bytes; on a 10G adapter, Jesper said, there are 67.2ns between minimally-sized packets.
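
The 67.2 ns figure follows from line-rate arithmetic: the 84 bytes count the minimal 64-byte frame plus the 8-byte preamble/SFD and the 12-byte inter-frame gap that also occupy the wire. A quick check of that arithmetic (an illustration, not kernel code; the function name is made up):

```c
/* Time between back-to-back minimal packets on a link of the given
 * bit rate, in nanoseconds. */
static double ns_per_min_packet(double link_bps)
{
    /* 64-byte frame + 8-byte preamble/SFD + 12-byte inter-frame gap */
    const double wire_bytes = 84.0;
    return wire_bytes * 8.0 / link_bps * 1e9;
}
```

At `link_bps = 10e9` this yields 67.2 ns, matching the budget quoted above.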

Feasibility analysis

  • A cache miss on Jesper’s 3GHz processor takes about 32ns to resolve;
  • it thus takes only two misses to wipe out the entire time budget for processing a packet.
  • Given that a socket buffer (“SKB”) occupies four cache lines on a 64-bit system, and that much of the SKB is written during packet processing, cache misses alone can easily consume the budget.
  • The x86 LOCK prefix for atomic operations takes about 8.25ns, so the shortest spinlock lock/unlock cycle takes a little over 16ns. So there is not room for a lot of locking within the time budget.
  • The cost of performing a system call is about 75ns.

Possible approaches

Batching operations

Lock avoidance

Fewer system calls

Cache optimization

The key appears to be batching of operations, along with preallocation and prefetching of resources. These solutions keep work CPU-local and avoid locking. It is also important to shrink packet metadata and reduce the number of system calls. Faster, cache-optimal data structures also help. Of all of these techniques, batching of operations is the most important. A cost that is intolerable on a per-packet basis is easier to absorb if it is incurred once per dozens of packets. 16ns of locking per packet hurts; if sixteen packets are processed at once, that overhead drops to 1ns per packet.
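
The amortization argument can be sketched in code: take the queue lock once per batch instead of once per packet, so one lock/unlock cycle covers n packets. A simplified user-space illustration (function and struct names are hypothetical, not kernel APIs; a mutex stands in for a spinlock):

```c
#include <pthread.h>
#include <stddef.h>

struct pkt { int len; /* ... packet data ... */ };

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static size_t enqueued;   /* packets handed to the device queue */

/* Per-packet locking: pays the ~16 ns lock/unlock cost for every packet. */
void xmit_one(struct pkt *p)
{
    pthread_mutex_lock(&queue_lock);
    enqueued++;                     /* enqueue p to the device queue */
    pthread_mutex_unlock(&queue_lock);
}

/* Batched locking: one lock/unlock cycle covers n packets, so the
 * per-packet locking overhead drops to roughly 16/n ns. */
void xmit_batch(struct pkt **pkts, size_t n)
{
    pthread_mutex_lock(&queue_lock);
    for (size_t i = 0; i < n; i++)
        enqueued++;                 /* enqueue pkts[i] to the device queue */
    pthread_mutex_unlock(&queue_lock);
}
```

With a batch of sixteen, `xmit_batch()` performs one lock cycle where `xmit_one()` would perform sixteen, which is exactly the 16ns-to-1ns reduction described above.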

Stay within two cache misses per packet, and avoid spinlocks.

Improving batching: the latency/throughput tradeoff

The tricky part, he said, is adding batching APIs to the networking stack without increasing the latency of the system. Latency and throughput must often be traded off against each other; here the objective is to optimize both.
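
One common way to bound the latency cost of batching (a sketch of the general technique, not the networking stack's actual API; all names and the 10 µs bound are made up) is to flush either when the batch fills or when the oldest queued packet has waited past a deadline:

```c
#include <stddef.h>
#include <stdint.h>

#define BATCH_MAX    16
#define MAX_DELAY_NS 10000   /* hypothetical latency bound: 10 us */

struct batcher {
    void    *pkts[BATCH_MAX];
    size_t   count;
    uint64_t first_ns;   /* arrival time of the oldest queued packet */
};

static size_t last_flush_count;

/* Stand-in for handing the whole batch to the driver in one call. */
static void flush_batch(void **pkts, size_t n)
{
    (void)pkts;
    last_flush_count = n;
}

/* Queue a packet; flush when the batch is full (throughput) or the
 * oldest packet has waited past the budget (latency). */
void enqueue(struct batcher *b, void *pkt, uint64_t now_ns)
{
    if (b->count == 0)
        b->first_ns = now_ns;
    b->pkts[b->count++] = pkt;

    if (b->count == BATCH_MAX || now_ns - b->first_ns >= MAX_DELAY_NS) {
        flush_batch(b->pkts, b->count);
        b->count = 0;
    }
}
```

Under load the batch fills quickly and flushes at full size (best throughput); under light load the deadline fires and a small batch goes out (bounded latency).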

TCP bulk transmission work:
Bulk network packet transmission [LWN.net]
https://lwn.net/Articles/615238/

Memory management: bypass it for performance

Jesper implemented a subsystem called qmempool; it does bulk allocation and free operations in a lockless manner.

[RFC PATCH 0/3] Faster than SLAB caching of SKBs with qmempool (backed by alf_queue) [LWN.net]
https://lwn.net/Articles/625427/
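
qmempool's real implementation lives in the patch set above; as a rough single-threaded illustration of the underlying idea (a local cache of free objects refilled and drained in bulk, so the common alloc/free path touches no shared state), one might write something like the following. All names are invented, and plain `malloc`/`free` stand in for the shared lockless queue:

```c
#include <stdlib.h>
#include <stddef.h>

#define LOCAL_CACHE  64
#define BULK         16
#define OBJ_SIZE     256   /* hypothetical object size */

/* A per-CPU (here: per-thread) cache of free objects. The fast path
 * pushes/pops this local array with no locking; only the bulk
 * refill/drain path would touch the shared pool. */
struct obj_cache {
    void  *objs[LOCAL_CACHE];
    size_t count;
};

/* Bulk refill: grab BULK objects at once (stand-in for a dequeue-many
 * operation on the shared alf_queue in the real qmempool). */
static void cache_refill(struct obj_cache *c)
{
    for (int i = 0; i < BULK; i++)
        c->objs[c->count++] = malloc(OBJ_SIZE);
}

void *cache_alloc(struct obj_cache *c)
{
    if (c->count == 0)
        cache_refill(c);          /* slow path, amortized over BULK allocs */
    return c->objs[--c->count];   /* fast path: no lock, no atomic */
}

void cache_free(struct obj_cache *c, void *obj)
{
    if (c->count == LOCAL_CACHE) {
        /* Drain BULK objects back to the shared pool in one operation. */
        for (int i = 0; i < BULK; i++)
            free(c->objs[--c->count]);
    }
    c->objs[c->count++] = obj;
}
```

The point is the same as with transmission batching: the expensive shared-pool interaction happens once per BULK objects, so its per-object cost shrinks by that factor.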
