关于collectl工具

Bored, I searched out a few web pages and read through them. These must be quite old articles, but they are very readable, unlike many Chinese-language articles, which tend to be written obscurely; perhaps that is simply because English is the native language of computing.
"If you can measure it, you can manage it." Very true. Many things should be quantified at the design stage, which keeps the future under control. Unfortunately, many applications in production today fail to do this; perhaps they will in the future.
As for IO monitoring, Linux really does not make it easy. With the OS's existing monitoring tools it is hard to tell which process is doing heavy reads and writes; you can only make rough, approximate guesses. There is a tool called iotop, but it has kernel and OS requirements, which makes it inconvenient to deploy at scale. That is why I lean toward deploying a single application per host; it also makes troubleshooting and performance tuning easier.
Storage is often mythologized, yet IO really is a scarce resource. No matter how you RAID 1+0 your hard disks or disk arrays, the results are underwhelming. Back at the factory we used the highest-end minicomputer, the E25K, and the highest-end storage, the 9970 (an HDS OEM), so I never felt any IO pressure; but in my current company's applications, IO pressure is everywhere.
Two of the more important reasons why I’d like to see better IO load monitoring are:
The mechanical drives have high latency. In general the CPU feels much better than the disks when overloaded. For example, if a load average of 10 is caused by CPU bound processes, the system feels much more responsive than under the same load caused by IO bound processes. A CPU load average of 10 on a server system with two processors isn't very noticeable. At the same time, an IO load average of 10 on the same system with 2x 7200 rpm disk drives in RAID1 feels very sluggish.
Hard disk drives have failed to keep up with the performance improvements in microprocessor technology. Disk capacity has grown quite well, but speed, and especially access times, are far behind. IO performance is the most common bottleneck and the most precious resource in today's systems. Or at least in the systems I work with.
At the beginning of my Linux career, ten years ago, there was only one metric: blocks read/written. And that was it. You could only guess how busy the disk was by looking at the load average and checking how many processes were stuck in D state. I wish there were separate load average readings for CPU and IO…
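As a quick illustration (my own sketch, not from the article): the "check how many processes are stuck in D state" trick is just a scan of /proc, looking at the state field of each process's stat file:

```python
import os

def dstate_processes():
    """Return (pid, name) pairs for processes in uninterruptible sleep (D state)."""
    stuck = []
    for pid in os.listdir("/proc"):
        if not pid.isdigit():
            continue
        try:
            with open(f"/proc/{pid}/stat") as f:
                # format: pid (comm) state ...; comm may contain spaces,
                # so split on the last ')' before reading the state field
                fields = f.read().rsplit(")", 1)[1].split()
            if fields[0] == "D":
                with open(f"/proc/{pid}/comm") as f:
                    stuck.append((int(pid), f.read().strip()))
        except OSError:
            continue  # the process exited while we were scanning
    return stuck

print(dstate_processes())
```

On a healthy system this list is usually empty; a persistently long list is the classic sign of an overloaded IO subsystem.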
At some point (Linux 2.5 times?) extended statistics were added, and things like queue size, utilization in %, etc. became available. Much better. Still, it was hard to tell who exactly was causing the load. On a multi-user system all you can see is multiple processes in D state. It's unclear whether these are the ones causing the IO havoc or just victims of an already overloaded IO subsystem, waiting their turn.
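Those extended statistics live in /proc/diskstats. As a rough sketch of my own (the field layout is documented in the kernel's iostats documentation), the familiar %util figure is just the delta of the "milliseconds spent doing I/Os" counter between two samples:

```python
import time

def disk_util(dev, interval=1.0):
    """Approximate %util for a block device over one sampling interval,
    from /proc/diskstats field 10 of the stats (ms spent doing I/Os)."""
    def io_ms():
        with open("/proc/diskstats") as f:
            for line in f:
                parts = line.split()  # major minor name then the stat fields
                if parts[2] == dev:
                    return int(parts[12])
        raise ValueError(f"device {dev!r} not found")

    before = io_ms()
    time.sleep(interval)
    after = io_ms()
    return 100.0 * (after - before) / (interval * 1000.0)
```

This is exactly the kind of per-device view iostat gives you; what it still cannot tell you is which process is responsible.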
In Linux 2.6.20 another step was made by adding per-process IO accounting. I was very excited when I heard about this feature and eager to try it. It turned out that this per-process IO accounting counts only the bytes read/written by a process. Not that much better. A modern 7200 rpm SATA drive is only capable of about 90 IOPS, so it could be choked while transferring a pathetic 90 bytes per second…
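For the curious, those 2.6.20 counters are exposed in /proc/&lt;pid&gt;/io. A small sketch of my own that reads them back as a dict:

```python
def proc_io(pid="self"):
    """Parse the per-process IO counters added in Linux 2.6.20
    from /proc/<pid>/io (rchar, wchar, read_bytes, write_bytes, ...)."""
    stats = {}
    with open(f"/proc/{pid}/io") as f:
        for line in f:
            key, _, value = line.partition(":")
            stats[key] = int(value)
    return stats

io = proc_io()
# read_bytes/write_bytes count what actually hit the block layer;
# rchar/wchar also include data served from the page cache
print(io["read_bytes"], io["write_bytes"])
```

As the article points out, bytes alone are misleading: 90 tiny random reads per second can saturate a drive while the byte counters barely move.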
Then there are the atop patches. These add a per-process IO occupation percentage. That sounds great, but… when you have a lot of small random writes, they go to the page cache first and only then are periodically flushed to the physical device. This is a performance feature and is generally a (very) good thing, as it allows the elevators to group writes together, etc. Unfortunately, atop ends up accounting all these writes and IO utilization to pdflush and kjournald.
Ok, let's see what the state of affairs is in some other operating systems. Everybody talks about DTrace, so it's time to check it out. Linux doesn't have DTrace. At least not yet; there is work in progress by Paul Fox. On the other hand, Linux has SystemTap, but it doesn't look very mature to me. Anyway, there are a number of operating systems that support DTrace: as it was created by Sun engineers, first come Solaris and OpenSolaris. Then there is the FreeBSD port, and Apple OS X. I'm familiar with FreeBSD, but I wanted to check the current state of the OpenSolaris kernel. At the same time I wanted to keep the learning curve less steep, so I opted for Nexenta Core 2 RC1. Nexenta is GNU userspace (Debian/Ubuntu) on an OpenSolaris kernel.
Download, install – everything was smooth. The install defaulted to root fs on ZFS. Good! I was thinking about playing with ZFS these days anyway.
....
The post goes on to introduce a small tool. I downloaded it and tried it out: a great tool!
If you're a performance monitoring junkie I'll bet you'd like collectl – http://collectl.sourceforge.net/. Its goal is to be able to monitor everything from one tool so that you can actually correlate what is going on, not just with your storage subsystem but with everything. For example, if your disk is slow, it could be related to memory fragmentation (buddyinfo), slab activity or other resources. collectl does it all and you can even run it at sub-second monitoring levels, synchronized to the nearest second within microseconds!
But don't take my word for it; download and check it out for yourself.
-mark
 
This document is also very well written, though it's quite an old article.
Performance Monitoring Tools for Linux
 http://www.linuxjournal.com/article/2396 
 
Performance Monitoring Tools for Linux
December 1st, 1998, by David Gavin
Mr. Gavin provides tools for systems data collection and display and discusses what information is needed and why.
For the last few years, I have been supporting users on various flavors of UNIX systems and have found the System Accounting Reports data invaluable for performance analysis. When I began using Linux for my personal workstation, the lack of a similar performance data collection and reporting tool set was a real problem. It's hard to get management to upgrade your system when you have no data to back up your claims of “I need more POWER!”. Thus, I started looking for a package to get the information I needed, and found out there wasn't any. I fell back on the last resort—I wrote my own, using as many existing tools as possible. I came up with scripts that collect data and display it graphically in an X11 window or hard copy.
What Do We Want to Know?
To get a good idea of how a system is performing, watch key system resources over a period of time to see how their usage and availability changes depending upon what's running on the system. The following categories of system resources are ones I wished to track.
CPU Utilization: The central processing unit, as viewed from Linux, is always in one of the following states:
idle: available for work, waiting
user: high-level functions, data movement, math, etc.
system: performing kernel functions, I/O and other hardware interaction
nice: like user, a job with low priority will yield the CPU to another task with a higher priority
By noting the percentage of time spent in each state, we can discover overloading of one state or another. Too much idle means nothing is being done; too much system time indicates a need for faster I/O or additional devices to spread the load. Each system will have its own profile when running its workload, and by watching these numbers over time, we can determine what's normal for that system. Once a baseline is established, we can easily detect changes in the profile.
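The four-state breakdown above comes straight from the first line of /proc/stat. As a rough sketch of my own (not from the article), turning two samples into the percentages Mr. Gavin describes looks like this:

```python
import time

def cpu_state_percent(interval=0.5):
    """Percentage of CPU time spent in user/nice/system/idle
    between two samples of the aggregate 'cpu' line in /proc/stat."""
    def sample():
        with open("/proc/stat") as f:
            # first line: cpu  user nice system idle iowait irq softirq ...
            return [int(x) for x in f.readline().split()[1:5]]

    before = sample()
    time.sleep(interval)
    after = sample()
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas) or 1  # avoid dividing by zero on a tiny interval
    names = ("user", "nice", "system", "idle")
    return {n: 100.0 * d / total for n, d in zip(names, deltas)}

print(cpu_state_percent())
```

Sampling this periodically and plotting it over a day is exactly the kind of baseline profile the article recommends establishing.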
Interrupts: Most I/O devices use interrupts to signal the CPU when there is work for it to do. For example, SCSI controllers will raise an interrupt to signal that a requested disk block has been read and is available in memory. A serial port with a mouse on it will generate an interrupt each time a button is pressed/released or when the mouse is moved. Watching the count of each interrupt can give you a rough idea of how much load the associated device is handling.
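Those per-device interrupt counts are visible in /proc/interrupts, one row per IRQ with a column per CPU. A small parsing sketch of my own:

```python
def interrupt_counts():
    """Total interrupt count per IRQ line, summed across CPUs,
    parsed from /proc/interrupts."""
    counts = {}
    with open("/proc/interrupts") as f:
        ncpu = len(f.readline().split())  # header row: CPU0 CPU1 ...
        for line in f:
            parts = line.split()
            name = parts[0].rstrip(":")
            # summary rows like ERR/MIS have fewer columns; keep only numbers
            per_cpu = [int(x) for x in parts[1:1 + ncpu] if x.isdigit()]
            counts[name] = sum(per_cpu)
    return counts
```

Watching the deltas of these counters over time gives the "rough idea of how much load the associated device is handling" that the article mentions.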
Context Switching: Time slicing is the term often used to describe how computers can appear to be doing multiple jobs at once. Each task is given control of the system for a certain “slice” of time, and when that time is up, the system saves the state of the running process and gives control of the system to another process, making sure that the necessary resources are available. This administrative process is called context switching. In some operating systems, the cost of this switching can be fairly expensive, sometimes using more resources than the processes it is switching. Linux is very good in this respect, but by watching the amount of this activity, you will learn to recognize when a system has a lot of tasks actively consuming resources.
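The kernel keeps a cumulative context-switch counter on the "ctxt" line of /proc/stat; a quick sketch of my own for turning it into a per-second rate:

```python
import time

def context_switch_rate(interval=0.5):
    """Context switches per second, from the cumulative
    'ctxt' counter in /proc/stat."""
    def ctxt():
        with open("/proc/stat") as f:
            for line in f:
                if line.startswith("ctxt"):
                    return int(line.split()[1])
        raise RuntimeError("no ctxt line in /proc/stat")

    before = ctxt()
    time.sleep(interval)
    return (ctxt() - before) / interval

print(context_switch_rate())
```

A sudden, sustained jump in this rate is the signal the article describes: many tasks actively competing for the CPU.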
Memory: When many processes are running and using up available memory, the system will slow down as processes get paged or swapped out to make room for other processes to run. When the time slice is exhausted, that task may have to be written out to the paging device to make way for the next process. Memory-utilization graphs help point out memory problems.
Paging: As mentioned above, when available memory begins to get scarce, the virtual memory system will start writing pages of real memory out to the swap device, freeing up space for active processes. Disk drives are fast, but when paging gets beyond a certain point, the system can spend all of its time shuttling pages in and out. Paging on a Linux system can also be increased by the loading of programs, as Linux “demand pages” each portion of an executable as needed.
Swapping: Swapping is much like paging. However, it migrates entire process images, consisting of many pages of memory, from real memory to the swap devices rather than the usual page-by-page mechanism normally used for paging.
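On a modern system the paging and swapping activity described in the last two sections can be read from the cumulative counters in /proc/vmstat (my own addition; the article predates this file). A minimal sketch:

```python
def paging_counters():
    """Cumulative paging and swapping counters from /proc/vmstat:
    pgpgin/pgpgout (kB paged in/out) and pswpin/pswpout (pages swapped)."""
    wanted = {"pgpgin", "pgpgout", "pswpin", "pswpout"}
    counters = {}
    with open("/proc/vmstat") as f:
        for line in f:
            key, value = line.split()
            if key in wanted:
                counters[key] = int(value)
    return counters

print(paging_counters())
```

Sampling these and charting the deltas shows exactly the "beyond a certain point" thrashing behavior the article warns about: pswpin/pswpout climbing steadily while the system crawls.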
Disk I/O: Linux keeps statistics on the first four disks: total I/O, reads, writes, block reads and block writes. These numbers can show uneven loading of multiple disks and show the balance of reads versus writes.
Network I/O: Network I/O can be used to diagnose problems and examine loading of the network interface(s). The statistics show traffic in and out, collisions, and errors encountered in both directions.
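The per-interface traffic, error and collision statistics mentioned here are exposed in /proc/net/dev. A parsing sketch of my own:

```python
def net_io():
    """Per-interface RX/TX bytes, errors and collisions from /proc/net/dev."""
    stats = {}
    with open("/proc/net/dev") as f:
        lines = f.readlines()[2:]  # skip the two header lines
    for line in lines:
        iface, data = line.split(":", 1)
        fields = [int(x) for x in data.split()]
        # receive fields come first, transmit fields start at index 8
        stats[iface.strip()] = {
            "rx_bytes": fields[0], "rx_errs": fields[2],
            "tx_bytes": fields[8], "tx_errs": fields[10],
            "collisions": fields[13],
        }
    return stats

print(net_io())
```

Nonzero error or collision counts that keep growing are the usual first clue when "examining loading of the network interface(s)" as the article suggests.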
These charts can also help in the following instances:
The system is running jobs you aren't aware of during hours when you are not present.
Someone is logging on or remotely running commands on the system without your knowledge.
This sort of information will often show up as a spike in the charts at times when the system should have been idle. Sudden increases in activity can also be due to jobs run by crontab.
......
All listings referred to in this article are available by anonymous download in the file ftp.ssc.com/pub/lj/listings/issue56/2396.tgz.
 
David Gavin (dgavin@unifi.com) has worked in various support environments since 1977, when after COBOL training, he had the good fortune to be assigned to the TSO (Time Sharing Option) support group. From there he moved to MVS technical support, to VM and to UNIX. He has worked with UNIX from mainframes to desktops, baby-sitting Microsoft systems only when he couldn't avoid it. He started using Linux back when it meant downloading twenty-five disks over a 2400 BAUD dial-up line.
Don't forget to use collectl
On May 6th, 2008, Mark Seger says:
Even though this is a pretty old article it seemed that there should be a reference to collectl for completeness. http://collectl.sourceforge.net/
-mark
 
http://sourceforge.net/projects/collectl/files/ 
Collectl is a light-weight performance monitoring tool capable of reporting interactively as well as logging to disk. It reports statistics on cpu, disk, infiniband, lustre, memory, network, nfs, process, quadrics, slabs and more in an easy-to-read format.