CPU and Memory Tuning (comprehensive)

http://www.amd5.cn/atang_3537.html
Monitor:
Process: an independent unit of execution
System resources: CPU time, storage space
OS: VM
CPU:
Time: time slices
Cache: caches the current program's data
Process switch: save the context, restore the context
Memory: linear address --> physical address
Space: mapping
I/O:
Kernel --> process
Process descriptor:
process metadata
doubly linked list
Linux: preemptive
System clock: clock
tick
Time resolution:
100Hz
1000Hz
Clock interrupt
A: 5ms, 1ms
C:
Process categories:
Interactive processes (I/O-bound)
Batch processes (CPU-bound)
Real-time processes
CPU-bound: long time slices, low priority
I/O-bound: short time slices, high priority
Linux priority:
Real-time priority: 1-99; the larger the number, the higher the priority
Static priority: 100-139; the smaller the number, the higher the priority (nice -20..19 maps to 100..139)
nice 0: 120
Real-time priorities rank above static priorities
nice value: adjusts the static priority
Scheduling classes:
Real-time processes:
SCHED_FIFO: First In, First Out
SCHED_RR: Round Robin
SCHED_OTHER: schedules the processes with priorities 100-139
100-139
10: 110
30: 115
50: 120
2: 130
Dynamic priority:
dynamic priority = max(100, min(static priority - bonus + 5, 139))
bonus: 0-10
e.g. static priority 110, bonus 10:
105
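The clamping formula above can be sketched in shell arithmetic. The function name is ours; the bonus (0-10) is in reality derived by the kernel from the process's average sleep time, so here it is simply an input:

```shell
# dynamic priority = max(100, min(static_priority - bonus + 5, 139))
dynamic_priority() {
    local static=$1 bonus=$2
    local p=$(( static - bonus + 5 ))
    if (( p > 139 )); then p=139; fi
    if (( p < 100 )); then p=100; fi
    echo "$p"
}

dynamic_priority 110 10   # interactive process: 110 - 10 + 5 = 105
dynamic_priority 139 0    # 144 would exceed the range, so it is clamped to 139
```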
Adjusting priority manually:
100-139: nice
nice -n N COMMAND
renice -n N PID
chrt -p [prio] PID
1-99:
chrt -f -p [prio] PID
chrt -r -p [prio] PID
chrt -f [prio] COMMAND
ps -e -o class,rtprio,pri,nice,cmd
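A quick illustration of nice and renice (unprivileged users may only raise the nice value, i.e. lower the priority; sleep is just a placeholder workload):

```shell
# start a low-priority background job, then lower its priority further
nice -n 10 sleep 60 &
pid=$!
ps -o pid,ni,cmd -p "$pid"     # NI column shows 10
renice -n 15 -p "$pid"
ps -o pid,ni,cmd -p "$pid"     # NI column now shows 15
kill "$pid"
```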
CPU affinity
numastat
numactl
taskset: bind a process to a given CPU (or set of CPUs)
mask:
0x00000001
0001: CPU 0
0x00000003
0011: CPUs 0 and 1
0x00000005
0101: CPUs 0 and 2
0x00000007
0111: CPUs 0-2
# taskset -p mask pid
e.g. bind PID 101 to CPUs 0 and 1:
# taskset -p 0x00000003 101
# taskset -p -c 0-2,7 101
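The mask arithmetic above can be reproduced in shell (the helper name is ours): each CPU number sets one bit, and the result is printed the way taskset expects it:

```shell
# build a hexadecimal affinity mask from a list of CPU numbers
cpus_to_mask() {
    local mask=0 cpu
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '0x%08x\n' "$mask"
}

cpus_to_mask 0       # 0x00000001 -> CPU 0
cpus_to_mask 0 1     # 0x00000003 -> CPUs 0 and 1
cpus_to_mask 0 2     # 0x00000005 -> CPUs 0 and 2
cpus_to_mask 0 1 2   # 0x00000007 -> CPUs 0-2
```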
Interrupts should be bound to the non-isolated CPUs, so that the isolated CPUs never have to handle interrupt service routines:
echo CPU_MASK > /proc/irq/<irq number>/smp_affinity
sar -w
shows the average number of context switches and the average rate of process creation;
Viewing CPU-related information:
sar -q
vmstat 1 5
mpstat 1 2
sar -P 0 1
iostat -c
dstat -c
/etc/grub.conf
isolcpus=cpu_number,...,cpu_number
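Putting the two pieces together, a minimal sketch (the isolated CPU list and the IRQ number 30 are illustrative assumptions, and the echo requires root):

```shell
# /etc/grub.conf, kernel line - keep CPUs 1-3 away from the scheduler:
#   kernel /vmlinuz-... ro root=... isolcpus=1,2,3

# then steer a given interrupt to CPU 0 only (mask 00000001),
# so the isolated CPUs never service it:
echo 00000001 > /proc/irq/30/smp_affinity
```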
slab allocator:
buddy system:
memcached:
MMU: Memory Management Unit
address translation
memory protection
Process: linear address --> physical address
Physical memory: page frames
addresses
Process: pages
pages
TLB: Translation Lookaside Buffer
sar -R: observe memory allocation and release activity
dstat --vm:
Given this performance penalty, performance-sensitive applications should avoid regularly accessing remote memory in a NUMA topology system. The application should be set up so that it stays on a particular node and allocates memory from that node.
To do this, there are a few things that applications will need to know:
What is the topology of the system?
Where is the application currently executing?
Where is the closest memory bank?
CPU affinity is represented as a bitmask. The lowest-order bit corresponds to the first logical CPU, and the highest-order bit corresponds to the last logical CPU. These masks are typically given in hexadecimal, so that 0x00000001 represents processor 0, and 0x00000003 represents processors 0 and 1.
# taskset -p mask pid
To launch a process with a given affinity, run the following command, replacing mask with the mask of the processor or processors you want the process bound to, and program with the program, options, and arguments of the program you want to run.
# taskset mask -- program
Instead of specifying the processors as a bitmask, you can also use the -c option to provide a comma-delimited list of separate processors, or a range of processors, like so:
# taskset -c 0,5,7-9 -- myprogram
numactl can also set a persistent policy for shared memory segments or files, and set the CPU affinity and memory affinity of a process. It uses the /sys file system to determine system topology.
The /sys file system contains information about how CPUs, memory, and peripheral devices are connected via NUMA interconnects. Specifically, the /sys/devices/system/cpu directory contains information about how a system's CPUs are connected to one another. The /sys/devices/system/node directory contains information about the NUMA nodes in the system, and the relative distances between those nodes.
numactl allows you to bind an application to a particular core or NUMA node, and to allocate the memory associated with a core or set of cores to that application.
numastat displays memory statistics (such as allocation hits and misses) for processes and the operating system on a per-NUMA-node basis. By default, running numastat displays how many pages of memory are occupied by the following event categories for each node. Optimal CPU performance is indicated by low numa_miss and numa_foreign values.
numad is an automatic NUMA affinity management daemon. It monitors NUMA topology and resource usage within a system in order to dynamically improve NUMA resource allocation and management (and therefore system performance).
Depending on system workload, numad can provide benchmark performance improvements of up to 50%. To achieve these performance gains, numad periodically accesses information from the /proc file system to monitor available system resources on a per-node basis. The daemon then attempts to place significant processes on NUMA nodes that have sufficient aligned memory and CPU resources for optimum NUMA performance. Current thresholds for process management are at least 50% of one CPU and at least 300 MB of memory. numad attempts to maintain a resource utilization level, and rebalances allocations when necessary by moving processes between NUMA nodes.
To restrict numad management to a specific process, start it with the following options.
# numad -S 0 -p pid
-p pid
Adds the specified pid to an explicit inclusion list. The process specified will not be managed until it meets the numad process significance threshold.
-S mode
The -S parameter specifies the type of process scanning. Setting it to 0 as shown limits numad management to explicitly included processes.
To stop numad, run:
# numad -i 0
Stopping numad does not remove the changes it has made to improve NUMA affinity. If system use changes significantly, running numad again will adjust affinity to improve performance under the new conditions.
Using Valgrind to Profile Memory Usage
Profiling Memory Usage with Memcheck
Memcheck is the default Valgrind tool, and can be run with valgrind program, without specifying --tool=memcheck. It detects and reports on a number of memory errors that can be difficult to detect and diagnose, such as memory access that should not occur, the use of undefined or uninitialized values, incorrectly freed heap memory, overlapping pointers, and memory leaks. Programs run ten to thirty times more slowly with Memcheck than when run normally.
Profiling Cache Usage with Cachegrind
Cachegrind simulates your program's interaction with a machine's cache hierarchy and (optionally) branch predictor. It tracks usage of the simulated first-level instruction and data caches to detect poor code interaction with this level of cache; and the last-level cache, whether that is a second- or third-level cache, in order to track access to main memory. As such, programs run with Cachegrind run twenty to one hundred times slower than when run normally.
Profiling Heap and Stack Space with Massif
Massif measures the heap space used by a specified program; both the useful space, and any additional space allocated for book-keeping and alignment purposes. It can help you reduce the amount of memory used by your program, which can increase your program's speed, and reduce the likelihood that your program will exhaust the swap space of the machine on which it executes. Massif can also provide details about which parts of your program are responsible for allocating heap memory. Programs run with Massif run about twenty times more slowly than their normal execution speed.
Capacity-related Kernel Tunables
1. Memory zone layout:
32-bit: ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM
64-bit: ZONE_DMA, ZONE_DMA32, ZONE_NORMAL
2. MMU:
10-bit + 10-bit + 12-bit split of the linear address (two-level paging, PTE)
3. TLB
HugePage
CPU
O(1) scheduler: 100-139
SCHED_OTHER: CFS
1-99
SCHED_FIFO
SCHED_RR
Dynamic priority:
sar -p
mpstat
iostat -c
dstat -c
--top-cpu
top
sar -q
vmstat
uptime
Memory subsystem components:
slab allocator
buddy system
kswapd
pdflush
mmu
Virtualized environments:
PA --> HA --> MA
virtual machine translation: PA --> HA
GuestOS, OS
Shadow PT
Memory:
TLB: improves performance
Enable huge pages in /etc/sysctl.conf:
vm.nr_hugepages=n
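A minimal sketch of reserving huge pages (the count 128 is an arbitrary example; the sysctl write needs root and enough contiguous free memory):

```shell
# persistent: add to /etc/sysctl.conf
#   vm.nr_hugepages = 128
# runtime equivalent:
sysctl -w vm.nr_hugepages=128
# verify the reservation
grep -i '^HugePages' /proc/meminfo
```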
strace:
strace COMMAND: view the syscalls made by a command
strace -p PID: view the syscalls of an already-running process
-c: print only a summary;
-o FILE: save the trace to a file for later analysis;
1. Reduce the overhead of tiny in-kernel memory objects:
slab
2. Reduce the service time of slow subsystems:
use the buffer cache to cache file metadata;
use the page cache to cache disk I/O;
use shm for inter-process communication;
use the buffer cache, arp cache, and connection tracking to improve network I/O performance;
Overcommitment:
2,2,2,2: 8
Overcommitting physical memory presumes swap as a backstop:
a certain amount beyond physical memory may be committed:
Swap
/proc/slabinfo
slabtop
vmstat -m
vfs_cache_pressure:
0: never reclaim dentries and inodes;
1-99: prefer not to reclaim them;
100: no preference relative to page cache and swap cache;
100+: prefer to reclaim them;
Tuning approach: pick performance metrics, locate the bottleneck
Process management, CPU
Memory tuning
I/O tuning
Filesystem tuning
Network subsystem tuning
Setting the /proc/sys/vm/panic_on_oom parameter to 0 instructs the kernel to call the oom_killer function when OOM occurs.
oom_adj
Defines a value from -16 to 15 that helps determine the oom_score of a process. The higher the oom_score value, the more likely the process will be killed by the oom_killer.
-16 to 15: contribute to the computed oom_score
Setting an oom_adj value of -17 disables the oom_killer for that process.
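The scoring can be inspected per process. Note that recent kernels deprecate oom_adj in favour of oom_score_adj (range -1000 to 1000, with -1000 disabling the OOM killer), so this sketch reads whichever interface exists:

```shell
# inspect the OOM badness score of the current shell
cat /proc/self/oom_score
# legacy knob (-17 disables the OOM killer for the process), or its successor
cat /proc/self/oom_adj 2>/dev/null || cat /proc/self/oom_score_adj
```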
Commands for managing inter-process communication:
ipcs
ipcrm
shm:
shmmni: system-wide upper limit on the number of shared memory segments;
shmall: system-wide maximum number of pages that may be allocated to shared memory;
shmmax: upper limit on the size of a single shared memory segment;
messages:
msgmnb: upper limit on the size of a single message queue, in bytes;
msgmni: system-wide upper limit on the number of message queues;
msgmax: upper limit on the size of a single message, in bytes;
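All six limits live under /proc/sys/kernel and can be read directly (or written as root, or via sysctl):

```shell
# current System V IPC limits
for k in shmmni shmall shmmax msgmnb msgmni msgmax; do
    printf '%-7s %s\n' "$k" "$(cat /proc/sys/kernel/$k)"
done
```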
Manually flushing dirty pages and caches:
sync
echo s > /proc/sysrq-trigger
vm.dirty_background_ratio
percentage of total memory that may be dirty before background writeback starts
vm.dirty_ratio
per-process counterpart: percentage at which a writing process must flush synchronously
vm.dirty_expire_centisecs
how long (in 1/100 s) a page may stay dirty in memory before it must be written back
vm.dirty_writeback_centisecs
interval (in 1/100 s) at which the writeback threads wake up and run
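The current values of these writeback knobs can be listed on any Linux host:

```shell
# dump the dirty-page writeback tunables
for f in dirty_background_ratio dirty_ratio dirty_expire_centisecs \
         dirty_writeback_centisecs swappiness; do
    printf '%-26s %s\n' "$f" "$(cat /proc/sys/vm/$f)"
done
```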
vm.swappiness
Memory tuning:
HugePage:TLB
IPC:
pdflush
slab
swap
oom
I/O, Filesystem, Network
Note that the I/O numbers reported by vmstat are aggregations of all I/O to all devices. Once you have determined that there may be a performance gap in the I/O subsystem, you can examine the problem more closely with iostat, which will break down the I/O reporting by device. You can also retrieve more detailed information, such as the average request size, the number of reads and writes per second, and the amount of I/O merging going on.
vmstat and dstat -r show overall I/O activity;
iostat shows per-device I/O activity;
slice_idle = 0
quantum = 64
group_idle = 1
blktrace
blkparse
btt
fio
io-stress
iozone
iostat
ext3
ext4: 16TB
xfs:
mount -o nobarrier,noatime 
ext3: noatime
data=ordered, journal, writeback
ext2, ext3
Tuning Considerations for File Systems
 Formatting Options:
  File system block size
Mount Options
Barrier: guarantees that metadata writes reach stable storage in order; nobarrier disables it
Access Time (noatime)
Historically, when a file is read, the access time (atime) for that file must be updated in the inode metadata, which involves additional write I/O.
Increased read-ahead support
# blockdev --getra device
# blockdev --setra N device
Ext4 is supported for a maximum file system size of 16 TB and a single file maximum size of 16TB. It also removes the 32000 sub-directory limit present in ext3.
Optimizing Ext4:
1. Defer inode table initialization when formatting a large file system:
-E lazy_itable_init=1
# mkfs.ext4 -E lazy_itable_init=1 /dev/sda5
2. Turn off Ext4's automatic fsync() behaviour:
-o noauto_da_alloc
mount -o noauto_da_alloc
3. Lower the journal I/O priority to the same level as data I/O:
-o journal_ioprio=n
valid values of n are 0-7; the default is 3;
Optimizing XFS:
XFS is a very stable and highly scalable 64-bit journaling file system, so it supports very large individual files and file systems. On RHEL 6.4 its default format and mount options already work close to optimally.
dd
iozone
bonnie++
I/O:
I/O scheduler: CFQ, deadline, NOOP
EXT4:
net.ipv4.tcp_window_scaling = 1
net.ipv4.tcp_syncookies = 1
net.core.rmem_max = 12582912
net.core.rmem_default 
net.core.netdev_max_backlog = 5000
net.core.wmem_max = 12582912
net.core.wmem_default 
net.ipv4.tcp_rmem= 10240 87380 12582912
net.ipv4.tcp_wmem= 10240 87380 12582912
net.ipv4.tcp_tw_reuse=1
Set the max OS send buffer size (wmem) and receive buffer size (rmem) to 12 MB for queues on all protocols. In other words, set the amount of memory that is allocated for each TCP socket when it is opened or created while transferring files.
netstat -an
ss
lsof
ethtool 
Systemtap
Oprofile
Valgrind
perf 
perf stat
Task-clock-msecs: CPU utilization; a high value means the program spends most of its time on CPU computation rather than I/O.
Context-switches: the number of process switches that occurred while the program ran; frequent switching should be avoided.
Cache-misses: the program's overall cache behaviour; if this value is too high, the program is using the cache poorly.
CPU-migrations: how many times the process was moved by the scheduler from one CPU to another during its run.
Cycles: processor clock cycles; a single machine instruction may need multiple cycles.
Instructions: the number of machine instructions executed.
IPC: the Instructions/Cycles ratio; the larger, the better - it means the program exploits the processor's features well.
Cache-references: the number of cache accesses.
Cache-misses: the number of cache misses.
With the -e option you can change perf stat's default events (see the previous section, or perf list, for the available events). Once you have some tuning experience, -e lets you watch exactly the events you are interested in.
perf top
When you use perf stat, you usually already have a tuning target. Sometimes, though, you only notice that system performance has dropped for no apparent reason and do not know which process has become the hog. Then you need a top-like command that lists all the suspicious processes, so you can pick the one that deserves further scrutiny, rather than interrogating every process on the system.
perf top displays the system's performance statistics in real time. It is mainly used to observe the current state of the whole system, for example to find the currently most expensive kernel functions or user processes.
Using perf record and reading the report
After top and stat you probably have a rough picture. To go further you need finer-grained information. Say you have concluded that the target program is compute-heavy, perhaps because some code is not written tightly enough: facing a long source file, which lines should be changed? perf record collects statistics at the level of individual functions, and perf report displays the results.
Focus your tuning on the hot code with a high percentage: if a piece of code accounts for only 0.1% of the total run time, then even optimizing it down to a single machine instruction would improve overall performance by only 0.1%.
Disk:
IO Scheduler:
CFQ
deadline
anticipatory
NOOP
/sys/block/<device>/queue/scheduler
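The active scheduler is the bracketed entry in that sysfs file; switching is a plain write (root required; sda below is a placeholder device name):

```shell
# list the available and active scheduler for every block device
for f in /sys/block/*/queue/scheduler; do
    [ -e "$f" ] || continue
    printf '%s: %s\n' "$f" "$(cat "$f")"
done
# switch sda to the deadline scheduler (as root):
# echo deadline > /sys/block/sda/queue/scheduler
```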
Memory:
MMU
TLB
vm.swappiness={0..100}: tendency to use swap; default 60
overcommit_memory=2: controlled overcommitment
overcommit_ratio=50:
committable memory = swap + RAM * ratio
swap: 2G
RAM: 8G
memory = 2G + 4G = 6G
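The commit-limit arithmetic above can be checked with a small helper (the function name is ours; sizes are in GB for readability, whereas the kernel's CommitLimit in /proc/meminfo is in kB):

```shell
# CommitLimit = swap + ram * overcommit_ratio / 100
commit_limit() {
    local swap_gb=$1 ram_gb=$2 ratio=$3
    echo $(( swap_gb + ram_gb * ratio / 100 ))
}

commit_limit 2 8 50    # 2 + 8*50/100 = 6 (GB), the example above
commit_limit 2 8 100   # ratio=100: 2 + 8 = 10 (GB)
```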
To make full use of physical memory:
1. make swap as large as RAM, and set swappiness=0;
2. overcommit_memory=2, overcommit_ratio=100, swappiness=0;
committable memory: swap + RAM
tcp_max_tw_buckets: increase it
tw: number of connections in the TIME_WAIT state
established --> tw
IPC: 
message
msgmni
msgmax
msgmnb
shm
shmall
shmmax
shmmni
Common commands:
sar, dstat, vmstat, mpstat, iostat, top, free, iotop, uptime, cat /proc/meminfo, ss, netstat, lsof, time, perf, strace 
blktrace, blkparse, btt
dd, iozone, io-stress, fio