How to check the health of a Linux OS


I gained some experience when Nanjing found the 0x03 error. At the beginning, we did not know why our GSRM (a Linux process) sometimes hung for a short time (5 seconds). It did not handle any messages during that time, and the interruptions did not occur at regular intervals. So we assumed we had a Linux OS problem and ran the following checks:

1. Turn off the iptables service (list the current rules first).
[root@Motorola-SRM-1A ~]# iptables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination        
ACCEPT     tcp  --  Motorola-SRM-1A      anywhere            tcp dpt:glrpc flags:FIN,SYN,RST,ACK/SYN
ACCEPT     tcp  --  10.0.0.2             anywhere            tcp dpt:glrpc flags:FIN,SYN,RST,ACK/SYN
ACCEPT     tcp  --  Motorola-SRM-1B      anywhere            tcp dpt:glrpc flags:FIN,SYN,RST,ACK/SYN
DROP       tcp  --  anywhere             anywhere            tcp dpt:glrpc flags:FIN,SYN,RST,ACK/SYN
ACCEPT     tcp  --  Motorola-SRM-1A      anywhere            tcp dpt:sqlexec flags:FIN,SYN,RST,ACK/SYN
ACCEPT     tcp  --  10.0.0.2             anywhere            tcp dpt:sqlexec flags:FIN,SYN,RST,ACK/SYN
ACCEPT     tcp  --  Motorola-SRM-1B      anywhere            tcp dpt:sqlexec flags:FIN,SYN,RST,ACK/SYN
DROP       tcp  --  anywhere             anywhere            tcp dpt:sqlexec flags:FIN,SYN,RST,ACK/SYN
ACCEPT     tcp  --  10.0.0.2             anywhere            tcp dpt:9070 flags:FIN,SYN,RST,ACK/SYN
ACCEPT     tcp  --  Motorola-SRM-1B      anywhere            tcp dpt:9070 flags:FIN,SYN,RST,ACK/SYN
DROP       tcp  --  anywhere             anywhere            tcp dpt:9070 flags:FIN,SYN,RST,ACK/SYN
ACCEPT     tcp  --  Motorola-SRM-1A      anywhere            tcp dpt:9085 flags:FIN,SYN,RST,ACK/SYN
ACCEPT     tcp  --  10.0.0.2             anywhere            tcp dpt:9085 flags:FIN,SYN,RST,ACK/SYN
ACCEPT     tcp  --  Motorola-SRM-1B      anywhere            tcp dpt:9085 flags:FIN,SYN,RST,ACK/SYN
DROP       tcp  --  anywhere             anywhere            tcp dpt:9085 flags:FIN,SYN,RST,ACK/SYN
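
The listing above only shows the rules. Turning the firewall off could look like the sketch below. It assumes a RHEL/CentOS-style system with SysV init scripts and root access; the function name disable_iptables and the backup path are made up for illustration.

```shell
# Sketch: back up the current rules, then stop iptables now and at boot.
# With DRY_RUN=1 the commands are only printed, not executed.
disable_iptables() {
    backup="${1:-/root/iptables.rules.bak}"   # hypothetical backup path
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "iptables-save > $backup"
        echo "service iptables stop"
        echo "chkconfig iptables off"
        return 0
    fi
    iptables-save > "$backup" || return 1   # keep a copy of the rules for later restore
    service iptables stop                   # stop filtering now
    chkconfig iptables off                  # keep it off across reboots
}

# Usage:
#   DRY_RUN=1 disable_iptables    # preview the commands
#   disable_iptables              # apply; restore later with iptables-restore < /root/iptables.rules.bak
```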

2. Change Linux kernel parameters:
(1) Add the following lines to /etc/sysctl.conf
net.core.rmem_default=4096000
net.core.wmem_default=4096000
net.core.rmem_max=8192000
net.core.wmem_max=8192000
(2) Make the change effective: run sysctl -p
(3) Check the result: run sysctl -a | grep 'net.core'
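
To confirm that the running kernel really uses the new buffer sizes, the values can also be read back from /proc/sys and compared with what was configured. A minimal sketch; the helper names sysctl_path and check_sysctl are hypothetical, and the simple dot-to-slash mapping is only meant for keys like the net.core ones above.

```shell
# Map a sysctl key to its /proc/sys path,
# e.g. net.core.rmem_max -> /proc/sys/net/core/rmem_max
sysctl_path() {
    echo "/proc/sys/$(echo "$1" | tr '.' '/')"
}

# Print OK or MISMATCH for one key and its expected value.
check_sysctl() {
    key="$1"; want="$2"
    have="$(cat "$(sysctl_path "$key")" 2>/dev/null)"
    if [ "$have" = "$want" ]; then
        echo "OK       $key=$have"
    else
        echo "MISMATCH $key=$have (expected $want)"
    fi
}

# Usage:
#   check_sysctl net.core.rmem_default 4096000
#   check_sysctl net.core.wmem_default 4096000
#   check_sysctl net.core.rmem_max 8192000
#   check_sysctl net.core.wmem_max 8192000
```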

3. Check for network errors (UDP statistics).
[root@Motorola-SRM-1A ~]# netstat -us
Udp:
    1924142763 packets received
    2047410 packets to unknown port received.
    347842 packet receive errors
    1582986591 packets sent
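
The absolute counter says little by itself; what matters is whether "packet receive errors" keeps growing while the problem occurs. This counter often increases when a socket receive buffer is full, which is one reason to look at the rmem settings from step 2. A rough sketch for extracting the counter, where the function name udp_rx_errors is made up:

```shell
# Read `netstat -us` output on stdin and print the UDP
# "packet receive errors" counter.
udp_rx_errors() {
    awk '/^Udp:/      { in_udp = 1; next }   # start of the Udp section
         /^[A-Za-z]/  { in_udp = 0 }         # any other section header ends it
         in_udp && /packet receive errors/ { print $1 }'
}

# Usage: sample twice and print how many errors occurred in between.
#   before=$(netstat -us | udp_rx_errors); sleep 10
#   after=$(netstat -us | udp_rx_errors)
#   echo "UDP receive errors in 10 s: $((after - before))"
```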

4. Check services on Linux
[root@Moto-SRM-C ~]#  chkconfig --list
NetworkManager  0:off   1:off   2:off   3:off   4:off   5:off   6:off
NetworkManagerDispatcher        0:off   1:off   2:off   3:off   4:off   5:off   6:off
acpid           0:off   1:off   2:off   3:on    4:on    5:on    6:off
anacron         0:off   1:off   2:on    3:on    4:on    5:on    6:off
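
The full list is long, so it can help to filter it down to the services that are actually enabled in the runlevel the server boots into. A small sketch, assuming the chkconfig --list layout shown above; the function name services_on is hypothetical.

```shell
# Read `chkconfig --list` output on stdin and print the names of the
# services enabled in the given runlevel (default: 3).
services_on() {
    awk -v tag="${1:-3}:on" '{ for (i = 2; i <= NF; i++) if ($i == tag) print $1 }'
}

# Usage:
#   chkconfig --list | services_on 3
```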

5. Check Linux version
[root@Moto-SRM-C ~]# lsb_release -a

[root@Moto-SRM-C ~]# uname -a

6. Check MySQL
mysql> show variables like '%timeout';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| connect_timeout            | 10    |
| delayed_insert_timeout     | 300   |
| innodb_lock_wait_timeout   | 120   |
| innodb_rollback_on_timeout | OFF   |
| interactive_timeout        | 28800 |
| net_read_timeout           | 30    |
| net_write_timeout          | 60    |
| slave_net_timeout          | 3600  |
| table_lock_wait_timeout    | 50    |
| wait_timeout               | 28800 |
+----------------------------+-------+
10 rows in set (0.00 sec)

7. Capture packets with tcpdump
tcpdump -i eth0 -nn -X 'port 13819 and udp' -s 0 -w tcpdump.log -W 40 -C 10
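-W 40: keep at most 40 capture files, overwriting the oldest (rotating buffer)
-C 10: start a new file after about 10 MB (the unit is 1,000,000 bytes)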

8. Check ulimits
[root@Motorola-SRM-1A ~]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
max nice                        (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 65536
max locked memory       (kbytes, -l) 32
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
max rt priority                 (-r) 0
stack size              (kbytes, -s) 10240
cpu time               (seconds, -t) unlimited
max user processes              (-u) 65536
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
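
Note that ulimit -a reports the limits of the current shell, which may differ from those of an already running process. On kernels that provide /proc/<pid>/limits, the limits of the process itself can be read directly. A minimal sketch; the function name soft_nofile is hypothetical, and using GSRM as the name to pass to pidof is an assumption based on the process name mentioned earlier.

```shell
# Read a /proc/<pid>/limits file on stdin and print the soft
# "Max open files" limit.
soft_nofile() {
    awk '/^Max open files/ { print $4 }'
}

# Usage (pidof may return several PIDs):
#   for pid in $(pidof GSRM); do soft_nofile < "/proc/$pid/limits"; done
```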

9. Check vmstat (the usual syntax is vmstat <delay> <count>, for example vmstat 3 3)
[root@Moto-SRM-C B_IPRM_SANDBOX]# vmstat 3,3
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
1  0      0 1456912 226008 1130316    0    0    68    22  338  148  1  0 98  1  0
1  0      0 1456440 226008 1130820    0    0   225     5 1334 2468 15  9 73  3  0
1  0      0 1455968 226016 1131544    0    0   243   233 1345 2515 16  9 73  3  0
1  0      0 1454816 226016 1132300    0    0   245     0 1338 2533 16  9 73  2  0
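When the system is very busy, cs (context switches per second) and us (user CPU time, in percent) are noticeably higher, as in the last three samples above.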
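
To spot busy intervals in a longer vmstat run, the samples can be filtered by context-switch rate. Keep in mind that the first sample reports averages since boot. A rough sketch; the function name busy_samples and the threshold of 2000 are arbitrary example choices.

```shell
# Read vmstat output on stdin and print the samples whose context-switch
# rate (cs, column 12) exceeds the threshold given as $1 (default 2000).
busy_samples() {
    awk -v limit="${1:-2000}" '$1 ~ /^[0-9]+$/ && $12 + 0 > limit + 0'
}

# Usage:
#   vmstat 3 10 | busy_samples 2000
```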

I turned off a lot of services and changed the kernel parameters, but the problem was not resolved. Later, KunZhong checked the messages from the STB and found that the message length was wrong: it was much larger than the actual message length, so many messages were treated as one message.