redis-aof-latency


Source: http://idning.github.io/redis-aof-latency.html


Table of Contents

  • 1   Some analysis
    • 1.1   Why doesn't the slow log show it?
    • 1.2   Observation
    • 1.3   Why appendfsync no doesn't help
  • 2   Some ideas
  • 3   About the page cache
    • 3.1   Checking current page cache status
    • 3.2   Parameters
    • 3.3   Stable Page Write
    • 3.4   Checking production
      • 3.4.1   Adjusting dirty_ratio
      • 3.4.2   Adjusting dirty_expire_centisecs
  • 4   Summary
  • 5   Related

My Redis AOF configuration is as follows:

appendonly yes
appendfsync everysec

redis-mgr is configured to run aof_rewrite and rdb dumps between 6:00 and 8:00 every morning, so during that window we receive twemproxy forward_err alerts, losing roughly 5000 requests per minute.

The failure rate is 10/10000 (0.1%), which implies on the order of 5,000,000 requests per minute going through the proxy.

Testing in production, a 10 GB file write is enough to trigger the problem above:

dd if=/dev/zero of=xxxxx bs=1M count=10000 &

We changed the setting to appendfsync no, which alleviated the problem but did not eliminate it.

As for the various sources of Redis latency, Redis author antirez has already explained them very clearly in this article.

What we hit here is the case where the AOF is affected when there is heavy disk I/O.

1   Some analysis

1.1   Why doesn't the slow log show it?

  • The slow log only measures the CPU time spent executing a command; the AOF write is not included in that measurement (nor should it be), as sketched below.
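
To make the scope of that measurement concrete, here is a schematic sketch of the pattern (not actual Redis source; handle_command, slowlog_add and flush_aof are hypothetical names): the timer wraps only the command handler, while the AOF flush runs outside it.

#include <stdio.h>
#include <time.h>

/* Hypothetical stand-ins for the command handler, slow-log recorder and
 * AOF flush; real Redis uses different names and structures. */
static void handle_command(void) { /* pure CPU work */ }
static void flush_aof(void)      { /* write() + fdatasync() happen here */ }
static void slowlog_add(long long us) {
    if (us > 10000) printf("slowlog entry: %lld us\n", us);
}

static long long now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000LL + ts.tv_nsec / 1000;
}

/* Only the command handler sits inside the timed region, so a multi-second
 * stall in flush_aof() never produces a slow-log entry. */
int main(void) {
    long long start = now_us();
    handle_command();
    slowlog_add(now_us() - start);

    flush_aof();
    return 0;
}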

1.2   Observation

Using the following command:

strace -f -p $(pidof redis-server) -T -e trace=fdatasync,write 2>&1 | grep -v '0.0' | grep -v unfinished

While the copy (the dd write above) is running, we can observe:

[pid 24734] write(42, "*4\r\n$5\r\nhmset\r\n$37\r\np-lc-d687791"..., 272475) = 272475 <0.036430>
[pid 24738] <... fdatasync resumed> )   = 0 <2.030435>
[pid 24738] <... fdatasync resumed> )   = 0 <0.012418>
[pid 24734] write(42, "*4\r\n$5\r\nHMSET\r\n$37\r\np-lc-6787211"..., 73) = 73 <0.125906>
[pid 24738] <... fdatasync resumed> )   = 0 <4.476948>
[pid 24734] <... write resumed> )       = 294594 <2.477184>   (2.47s)

Meanwhile, the latency measurement shows:

$ ./_binaries/redis-cli --latency-history -h 10.38.114.60 -p 2000
min: 0, max: 223, avg: 1.24 (1329 samples) -- 15.01 seconds range
min: 0, max: 2500, avg: 3.46 (1110 samples) -- 15.00 seconds range   (the 2.5s spike observed here)
min: 0, max: 5, avg: 1.01 (1355 samples) -- 15.01 seconds range

Watchdog output:

[24734] 07 Jul 10:54:41.006 * Asynchronous AOF fsync is taking too long (disk is busy?). Writing the AOF buffer without waiting for fsync to complete, this may slow down Redis.
[24734 | signal handler] (1404701682)
--- WATCHDOG TIMER EXPIRED ---
bin/redis-server *:2000(logStackTrace+0x4b)[0x443bdb]
/lib64/tls/libpthread.so.0(__write+0x4f)[0x302b80b03f]
/lib64/tls/libpthread.so.0[0x302b80c420]
/lib64/tls/libpthread.so.0(__write+0x4f)[0x302b80b03f]
bin/redis-server *:2000(flushAppendOnlyFile+0x76)[0x43f616]
bin/redis-server *:2000(serverCron+0x325)[0x41b5b5]
bin/redis-server *:2000(aeProcessEvents+0x2b2)[0x416a22]
bin/redis-server *:2000(aeMain+0x3f)[0x416bbf]
bin/redis-server *:2000(main+0x1c8)[0x41dcd8]
/lib64/tls/libc.so.6(__libc_start_main+0xdb)[0x302af1c4bb]
bin/redis-server *:2000[0x415b1a]
[24734 | signal handler] (1404701682) --------

So it is confirmed that write() is what hangs.

1.3   Why appendfsync no doesn't help

When the kernel's dirty-page buffer fills up, write() blocks until some of it has been flushed and space is freed.

So even if the program never calls sync itself, the kernel will sync at times it chooses, and write() can hang at exactly those moments.
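
A way to see this effect in isolation (a minimal sketch, assuming a Linux box with default vm.dirty_* settings and a disk slow enough to fall behind): keep issuing plain write() calls, never call fsync, and time each one. Once the kernel starts throttling dirty pages, individual write() calls stall.

/* write_stall.c: write 1 MB chunks in a loop without ever calling fsync,
 * and report write() calls that take longer than 100 ms.
 * Illustrative sketch only; the file name and thresholds are arbitrary. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    char *buf = malloc(1 << 20);           /* 1 MB of zeroes */
    if (!buf) return 1;
    memset(buf, 0, 1 << 20);
    int fd = open("write_stall.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 20000; i++) {      /* up to ~20 GB of dirty data */
        double t0 = now_sec();
        if (write(fd, buf, 1 << 20) < 0) { perror("write"); break; }
        double dt = now_sec() - t0;
        if (dt > 0.1)                      /* stalls appear once the kernel
                                              starts throttling dirty pages */
            printf("write #%d blocked for %.3f s\n", i, dt);
    }
    close(fd);
    return 0;
}

Running it alongside the dd command from earlier should make the stalls appear sooner, since dd dirties pages toward the same global thresholds.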

2   Some ideas

  • Could we rate-limit rdb / aof_rewrite / cp and similar operations?
    • We cannot realistically throttle every process that writes (there are other log-writing processes, for example), so this is better avoided.
  • Increase the proxy timeout, currently 400 ms, to 2000 ms?
    • Timing out at 400 ms amounts to failing fast, and a client retry achieves the same effect, so there is no need to change it.
  • Disable AOF on the master
    • This needs no code changes, costs little and works best; the downsides are higher operational complexity and weaker data durability. redis-mgr can support this.
  • The blocking inside write() seems unavoidable in the main thread; could the write be done from a separate thread? (see the sketch after this list)
    • I wrote a patch for this idea and submitted it upstream: https://github.com/antirez/redis/pull/1862
    • The author did not seem very enthusiastic about it, though.
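
The idea behind that patch, in rough outline, is sketched below. This is a minimal illustration, not the actual pull request: the event loop only appends to an in-memory buffer, and a dedicated writer thread drains it with write()/fdatasync(), so a slow disk stalls only the writer thread.

/* aof_bg_write.c: sketch of handing AOF writes to a background thread.
 * The main thread appends to a buffer under a mutex; the writer thread
 * swaps the buffer out and performs write() + fdatasync() on its own,
 * so a slow disk blocks only the writer thread. Not the real patch. */
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static char   buf[1 << 20];          /* pending AOF data */
static size_t buflen = 0;
static int    aof_fd;

/* Called from the event loop: never touches the disk. */
void aof_append(const char *data, size_t len) {
    pthread_mutex_lock(&lock);
    if (buflen + len <= sizeof(buf)) {   /* drop-on-overflow keeps the sketch short */
        memcpy(buf + buflen, data, len);
        buflen += len;
    }
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}

static void *writer_thread(void *arg) {
    (void)arg;
    char local[sizeof(buf)];
    for (;;) {
        pthread_mutex_lock(&lock);
        while (buflen == 0)
            pthread_cond_wait(&cond, &lock);
        size_t n = buflen;               /* take the buffer, release the lock */
        memcpy(local, buf, n);
        buflen = 0;
        pthread_mutex_unlock(&lock);

        if (write(aof_fd, local, n) < 0) /* may block for seconds; only this */
            perror("write");             /* thread waits, not the event loop */
        fdatasync(aof_fd);
    }
    return NULL;
}

int main(void) {
    aof_fd = open("appendonly.aof", O_WRONLY | O_CREAT | O_APPEND, 0644);
    pthread_t tid;
    pthread_create(&tid, NULL, writer_thread, NULL);

    aof_append("*1\r\n$4\r\nPING\r\n", 14);   /* pretend command traffic */
    sleep(1);
    return 0;
}

The obvious cost is that data can sit unsynced in memory for longer and the buffer needs an overflow policy; the sketch simply drops data when full, which a real implementation could not do.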

3   About the page cache

  1. I/O scheduling is generally optimized for reads, because reads are synchronous: if the data is not there, the process sleeps. Writes are asynchronous; they only go to the page cache.

3.1   Checking current page cache status

grep ^Cached: /proc/meminfo     # page cache size
grep ^Dirty: /proc/meminfo      # total size of all dirty pages
grep ^Writeback: /proc/meminfo  # total size of actively processed dirty pages

3.2   Parameters

ning@ning-laptop ~/test$ sysctl -a | grep dirty
vm.dirty_background_ratio = 10
vm.dirty_background_bytes = 0
vm.dirty_ratio = 20
vm.dirty_bytes = 0
vm.dirty_writeback_centisecs = 1500
vm.dirty_expire_centisecs = 3000

For details see: https://www.kernel.org/doc/Documentation/sysctl/vm.txt

/proc/sys/vm/dirty_expire_centisecs     # 3000, i.e. 3000*0.01s = 30s: dirty pages queued longer than 30s get flushed to disk.
/proc/sys/vm/dirty_writeback_centisecs  # 1500, i.e. 1500*0.01s = 15s: the kernel flusher (pdflush) wakes up every 15s.
/proc/sys/vm/dirty_background_ratio
/proc/sys/vm/dirty_ratio

Both values are expressed as a percentage of RAM. When the amount of dirty pages reaches the first threshold (dirty_background_ratio), write-outs begin in the background via the “flush” kernel threads. When the second threshold is reached, processes will block, flushing in the foreground.

The problem with these variables is their minimum value: even 1% can be too much. This is why another two controls were introduced in 2.6.29:

/proc/sys/vm/dirty_background_bytes
/proc/sys/vm/dirty_bytes

The *_bytes and *_ratio variants are mutually exclusive: when dirty_bytes is set, dirty_ratio is cleared to 0:

root@ning-laptop:~# cat /proc/sys/vm/dirty_bytes
0
root@ning-laptop:~# cat /proc/sys/vm/dirty_ratio
20
root@ning-laptop:~# echo '5000000' > /proc/sys/vm/dirty_bytes
root@ning-laptop:~# cat /proc/sys/vm/dirty_bytes
5000000
root@ning-laptop:~# cat /proc/sys/vm/dirty_ratio
0

Lower values generate more I/O requests (and more interrupts) and significantly decrease sequential I/O bandwidth, but they also decrease random I/O latency.
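
To get a feel for what these percentages mean on a particular machine, the thresholds can be estimated in bytes from MemTotal. This is a rough sketch only: the kernel actually applies the ratios to "dirtyable" memory, which is somewhat less than MemTotal.

/* dirty_thresholds.c: rough estimate of the dirty-page thresholds in bytes. */
#include <stdio.h>
#include <string.h>

static long read_long(const char *path) {
    FILE *f = fopen(path, "r");
    long v = -1;
    if (f) { fscanf(f, "%ld", &v); fclose(f); }
    return v;
}

static long meminfo_kb(const char *key) {
    FILE *f = fopen("/proc/meminfo", "r");
    char line[128];
    long kb = -1;
    while (f && fgets(line, sizeof line, f)) {
        if (strncmp(line, key, strlen(key)) == 0) {
            sscanf(line + strlen(key), "%ld", &kb);
            break;
        }
    }
    if (f) fclose(f);
    return kb;
}

int main(void) {
    long total_kb = meminfo_kb("MemTotal:");
    long bg_ratio = read_long("/proc/sys/vm/dirty_background_ratio");
    long ratio    = read_long("/proc/sys/vm/dirty_ratio");
    long dirty_kb = meminfo_kb("Dirty:");

    printf("background flush starts around: %ld MB\n", total_kb * bg_ratio / 100 / 1024);
    printf("writers start blocking around:  %ld MB\n", total_kb * ratio / 100 / 1024);
    printf("currently dirty:                %ld MB\n", dirty_kb / 1024);
    return 0;
}

On the 128 GB machine discussed in section 3.4 below, the background threshold works out to roughly 12-13 GB, which is why the ~500 MB of dirty pages seen during aof_rewrite never comes close to it.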

http://monolight.cc/2011/06/barriers-caches-filesystems/

3.3   Stable Page Write

http://yoshinorimatsunobu.blogspot.com/2014/03/why-buffered-writes-are-sometimes.html

When a dirty page is written to disk, write() to the same dirty page is blocked until flushing to disk is done. This is called Stable Page Write.

This may cause write() stalls, especially when using slower disks. Without write cache, flushing to disk takes ~10ms usually, ~100ms in bad cases.
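
A rough way to observe this (a sketch, assuming a kernel with stable page writes and no battery-backed write cache): keep rewriting the same 4 KB region of a file while writeback of that file is being forced, and time each rewrite.

/* stable_page.c: repeatedly overwrite the same 4 KB page and report slow
 * rewrites. Run something like "while true; do sync; done" in another shell
 * to force writeback; on kernels with stable page writes, some pwrite()
 * calls will stall until the in-flight flush of that page completes.
 * Illustrative sketch only. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    char page[4096];
    memset(page, 'x', sizeof page);
    int fd = open("stable_page.dat", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < 1000000; i++) {
        double t0 = now_sec();
        if (pwrite(fd, page, sizeof page, 0) < 0) {   /* always the same page */
            perror("pwrite");
            break;
        }
        double dt = now_sec() - t0;
        if (dt > 0.01)                  /* > 10 ms: likely waited on writeback */
            printf("rewrite #%d took %.3f s\n", i, dt);
    }
    close(fd);
    return 0;
}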

There is a patch in newer kernels that mitigates this; the idea is to reduce how often write() ends up calling wait_on_page_writeback:

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=1d1d1a767206fbe5d4c69493b7e6d2a8d08cc0a0

Here's the result of using dbench to test latency on ext2:

3.8.0-rc3:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 WriteX        109347     0.028    59.817
 ReadX         347180     0.004     3.391
 Flush          15514    29.828   287.283
Throughput 57.429 MB/sec  4 clients  4 procs  max_latency=287.290 ms

3.8.0-rc3 + patches:
 WriteX        105556     0.029     4.273
 ReadX         335004     0.005     4.112
 Flush          14982    30.540   298.634
Throughput 55.4496 MB/sec  4 clients  4 procs  max_latency=298.650 ms

As you can see, the maximum write latency drops considerably with this patch enabled.

XFS is also said to avoid this problem.

3.4   Checking production

$ cat /proc/sys/vm/dirty_background_ratio
10
$ cat /proc/sys/vm/dirty_ratio
20

Dirty pages under normal load:

$ grep ^Dirty: /proc/meminfo
Dirty:            104616 kB

The machine has 128 GB of RAM.

During the morning rdb/aof_rewrite window, dirty pages reach:

500,000 kB (500M)

That is still far below the configured dirty_background_ratio and dirty_ratio thresholds (10% of 128 GB is already roughly 12.8 GB), so tuning these two parameters is unlikely to help.

Test:

# 1. Raise the expiry to at most 90s.
vm.dirty_expire_centisecs = 9000
echo '9000' > /proc/sys/vm/dirty_expire_centisecs

# 2. Raise dirty_ratio
echo '80' > /proc/sys/vm/dirty_ratio

3.4.1   Adjusting dirty_ratio

On a 48 GB machine with poor I/O, setting dirty_ratio = 80 lets Dirty climb very high, but there is no obvious improvement in Redis latency:

$ grep ^Dirty: /proc/meminfo
Dirty:           8598180 kB

=> echo '80' > /proc/sys/vm/dirty_ratio

$ grep ^Dirty: /proc/meminfo
Dirty:          11887180 kB
$ grep ^Dirty: /proc/meminfo
Dirty:          21295624 kB

3.4.2   Adjusting dirty_expire_centisecs

This also appears to have no effect, and if anything trends worse. My offline stress test uses a sustained dd, which is not quite the same as the production workload.

It looks like this will have to be tested in production.

4   Summary

  • Disabling AOF on the master is currently the most acceptable approach
  • antirez is working on latency sampling
  • XFS / Solaris do not seem to have this problem.

5   Related

  • http://redis.io/topics/latency

  • A discussion going back to 2011: https://groups.google.com/forum/#!msg/redis-db/jgGuGngDEb0/ZwnvUdx-gdAJ
    • The author originally wanted to move both write and fsync to another thread; the outcome was that only fsync was moved to a separate thread.
  • A LinkedIn engineer ran an experiment measuring the latency produced when writing at 1/4 of the disk bandwidth:
    • http://blog.empathybox.com/post/35088300798/why-does-fwrite-sometimes-block
