Text Processing Tools Series (1): Viewing, Analyzing, and Counting Text, and the Text Filter grep

Source: Internet   Editor: 程序博客网   Time: 2024/05/17 09:23

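1. Tools for viewing, counting, and analyzing text

<1> Text viewing tools: cat tac rev more less head tail cut paste

cat

-A: show all non-printing characters (line ends appear as $, tabs as ^I)

-n: number the output lines

-s: squeeze consecutive blank lines into one

[root@centos6 testdir]# cat -A -n f4
     1  a  $
     2  ^I$
     3  b$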
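The effect of these options is easiest to see on a file built for the purpose. The following is a minimal sketch, assuming GNU cat (the -A option is not available in every implementation); the file name f4 and its contents are made up for illustration.

```shell
# Hypothetical sample: a line with trailing spaces, a tab-only line,
# two consecutive empty lines, and a final line.
tmpdir=$(mktemp -d)
printf 'a  \n\t\n\n\nb\n' > "$tmpdir/f4"

# -A marks each line end with $ and each tab with ^I;
# -s squeezes the two empty lines into one.
marked=$(cat -A -s "$tmpdir/f4")
printf '%s\n' "$marked"

rm -rf "$tmpdir"
```

The trailing spaces on the first line, which are invisible in plain cat output, show up because the $ marker is pushed away from the letter.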

paste: merge the corresponding lines (same line number) of several files into one line

paste [OPTION]... [FILE]...

-d DELIM: use DELIM as the delimiter; the default is TAB

-s: join all the lines of each file into a single line

[root@centos6 testdir]# paste -d* f1 f2
*
CentOS release  6.8(Final)*CentOS release  6.8(Final)
Kernel \r on an \m*Kernel \r on an \m
\l*\l
\n*\n
\t*\t
[root@centos6 testdir]# paste -s f1 f2
        CentOS release  6.8(Final)      Kernel \r on an \m      \l      \n      \t
        CentOS release  6.8(Final)      Kernel \r on an \m      \l      \n      \t
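The two modes of paste are easier to tell apart on short inputs. A small sketch, with two invented sample files named nums and letters:

```shell
tmpdir=$(mktemp -d)
printf '1\n2\n3\n' > "$tmpdir/nums"      # hypothetical sample files
printf 'a\nb\nc\n' > "$tmpdir/letters"

# Default mode: line N of each file side by side, joined with ':' instead of TAB.
side_by_side=$(paste -d: "$tmpdir/nums" "$tmpdir/letters")
printf '%s\n' "$side_by_side"

# -s: each file is collapsed into one output line of its own.
serial=$(paste -s -d, "$tmpdir/nums" "$tmpdir/letters")
printf '%s\n' "$serial"

rm -rf "$tmpdir"
```

The first command prints 1:a, 2:b, 3:c on three lines; the second prints 1,2,3 and a,b,c on two lines.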

tac: print the lines of a file in reverse order (last line first)

rev: reverse the characters within each line

[root@centos6 testdir]# tac f1
abc
cba
[root@centos6 testdir]# rev f1
abc
cba
[root@centos6 testdir]# cat f1
cba
abc

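more: a pager for viewing text

less: an enhanced version of more with many more features; it is the pager used by the man command

Paging

space: forward one page

b: back one page

ctrl+d: forward half a page

ctrl+u: back half a page

enter: forward one line

Commands

!COMMAND: run a shell command without leaving the pager

Searching

/KEYWORD: search forward (toward the end of the file)

n: repeat the search in the same direction

N: repeat the search in the opposite direction

?KEYWORD: search backward (toward the start of the file)

n: repeat the search in the same direction

N: repeat the search in the opposite direction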

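head: shows the first ten lines by default

-#: show the first # lines

tail: shows the last ten lines by default

-#: show the last # lines

Note: combining the two lets you view one specific line

Follow a log in the background: tail -n 0 -f /var/log/messages &

List background jobs: jobs

Bring a background job to the foreground: fg %1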
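The note about combining head and tail can be sketched as follows. The input is generated with seq so that the line contents equal the line numbers; -n 5 is used here as the more portable spelling of -5.

```shell
# Print line 5 of a 20-line input: keep the first 5 lines, then the last of those.
fifth=$(seq 1 20 | head -n 5 | tail -n 1)
echo "$fifth"

# Lines 5 through 7: keep the first 7 lines, then the last 3 of those.
range=$(seq 1 20 | head -n 7 | tail -n 3)
echo "$range"
```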

cut [OPTION]... [FILE]...

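-d DELIM: specify the field delimiter

-f: specify the fields to extract

#: field number #

#,#: several separate fields, e.g. 1,3,6

#-#: a range of consecutive fields, e.g. 1-6

Combined: 1-3,7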
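The field lists above can be tried on a single colon-separated record. This sketch uses a typical /etc/passwd entry as sample input:

```shell
line='root:x:0:0:root:/root:/bin/bash'   # sample record, 7 fields separated by ':'

# Separate fields 1, 3, and 7.
picked=$(printf '%s\n' "$line" | cut -d: -f1,3,7)
# A range plus one extra field.
mixed=$(printf '%s\n' "$line" | cut -d: -f1-3,7)

echo "$picked"
echo "$mixed"
```

cut keeps the original delimiter between the selected fields, so the results are root:0:/bin/bash and root:x:0:/bin/bash.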

<2> Text statistics tools: wc sort uniq

wc [OPTION]... [FILE]...

Common options:

-l: count lines

-w: count words

-c: count bytes (use -m to count characters)
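A quick check of the three counters on a two-line sample. Some wc implementations pad the number with spaces, so the sketch strips them before printing:

```shell
text='hello world
foo'

lines=$(printf '%s\n' "$text" | wc -l | tr -d ' ')
words=$(printf '%s\n' "$text" | wc -w | tr -d ' ')
bytes=$(printf '%s\n' "$text" | wc -c | tr -d ' ')   # counts the newlines too

echo "lines=$lines words=$words bytes=$bytes"
```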

sort [OPTION]... [FILE]...

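Purpose: sort lines of text

Common options

-n: sort by numeric value

-r: reverse the order

-t: specify the field delimiter

-k: specify the field to sort on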
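The options are usually combined. The sketch below sorts three invented records on their third colon-separated field, first as text and then numerically:

```shell
data='b:x:10
a:x:2
c:x:100'

textual=$(printf '%s\n' "$data" | sort -t: -k3)         # compares "10", "100", "2" as text
numeric=$(printf '%s\n' "$data" | sort -t: -k3 -n)      # compares 2, 10, 100 as numbers
reversed=$(printf '%s\n' "$data" | sort -t: -k3 -n -r)  # largest number first

printf '%s\n' "$numeric"
```

Without -n the record ending in 2 sorts last, because the character 2 is greater than the character 1.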

uniq [OPTION]... [INPUT [OUTPUT]]

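Purpose: report or filter adjacent duplicate lines

Common options

-c: prefix each line with the number of times it occurs consecutively

-d: show only lines that are repeated consecutively

-u: show only lines that are not repeated consecutively

[root@centos6 testdir]# cat f1
a
a
c
a
[root@centos6 testdir]# uniq -d f1
a
[root@centos6 testdir]# uniq -u f1
c
a

Note: uniq only compares adjacent lines, so sort | uniq -c is the usual combination for counting duplicate lines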
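The difference that sorting makes can be sketched on the same four lines used in the session above. The sed call only strips the padding that uniq -c puts in front of each count:

```shell
data='a
a
c
a'

# Without sort, the last "a" is not adjacent to the first two and is counted separately.
adjacent=$(printf '%s\n' "$data" | uniq -c | sed 's/^ *//')
# With sort first, all identical lines become adjacent; sort -rn puts the largest count first.
counted=$(printf '%s\n' "$data" | sort | uniq -c | sort -rn | sed 's/^ *//')

printf '%s\n' "$counted"
```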

<3> Text analysis tools

diff FILE1 FILE2

[root@centos6 testdir]# cat f1
a
b
c
[root@centos6 testdir]# cat f2
a
[root@centos6 testdir]# diff f1 f2
2,3d1
< b
< c
[root@centos6 testdir]# diff f2 f1
1a2,3
> b
> c
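The same comparison can be scripted. In the sketch below the file names are placeholders; the point is that diff reports differences through its exit status as well as its output, which matters in scripts.

```shell
tmpdir=$(mktemp -d)
printf 'a\nb\nc\n' > "$tmpdir/f1"
printf 'a\n' > "$tmpdir/f2"

# diff exits with status 1 when the files differ, so capture that explicitly.
changes=$(diff "$tmpdir/f1" "$tmpdir/f2") && status=0 || status=$?
printf '%s\n' "$changes"
echo "exit status: $status"

rm -rf "$tmpdir"
```

The first output line, 2,3d1, reads as: lines 2 to 3 of the first file would have to be deleted, after line 1 of the second file, to make the two files equal. Lines starting with < come from the first file.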

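2. Text filtering tools: grep and egrep

<1> Introduction to regular expressions

A regular expression is a pattern that describes a set of strings. Many text editors use regular expressions to search for and replace text that matches a pattern, and because they operate on text they appear wherever text is edited, from small editors such as EditPlus to large ones such as Microsoft Word and Visual Studio, as well as in most programming languages. They are hard to pick up at first, but once you really understand them you will see how powerful they are. grep is a tool that filters text line by line using regular expressions. egrep is equivalent to grep -E: it uses extended regular expressions (ERE) instead of basic ones (BRE), and the two differ in syntax rather than in substance.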
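
<2> grep usage

grep [OPTIONS] PATTERN [FILE...]

Common options

--color=auto: highlight the matched text

-v: show the lines that the pattern does not match

-o: show only the matched part of each line

-i: ignore case

-q: quiet mode, print nothing; commonly used in conditional tests

-A #: after, also show the # lines after each match

[root@centos6 testdir]# grep -A1 ^root /etc/passwd
root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin

-B #: before, also show the # lines before each match

-C #: context, also show # lines before and after each match

-E: use ERE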
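Several of these options can be tried without touching the real /etc/passwd. The sketch below runs them on three sample records held in a variable:

```shell
sample='root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin'

# -q prints nothing; only the exit status says whether a match was found.
if printf '%s\n' "$sample" | grep -q '^root:'; then found=yes; else found=no; fi
echo "root present: $found"

# -v keeps the lines that do NOT match.
not_nologin=$(printf '%s\n' "$sample" | grep -v 'nologin$')
# -o with -i prints only the matched text, ignoring case.
matched=$(printf '%s\n' "$sample" | grep -o -i 'BASH')
# -A1 prints each matching line plus the line after it.
with_context=$(printf '%s\n' "$sample" | grep -A1 '^root')

printf '%s\n' "$not_nologin" "$matched" "$with_context"
```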
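
<3> Matching patterns in basic (BRE) and extended (ERE) regular expressions

Category             BRE           ERE           Meaning
Character matching   .             .             any single character
Character matching   *             *             the preceding character, any number of times
Character matching   .*            .*            any characters, any length
Character matching   []            []            any one character inside the given set
Character matching   [^]           [^]           any one character outside the given set
Character matching   [0-9]         [0-9]         any one digit
Character matching   [a-z]         [a-z]         any one lowercase letter
Character matching   [A-Z]         [A-Z]         any one uppercase letter
Character matching   [[:digit:]]   [[:digit:]]   any one digit
Character matching   [[:lower:]]   [[:lower:]]   any one lowercase letter
Character matching   [[:upper:]]   [[:upper:]]   any one uppercase letter
Character matching   [[:alpha:]]   [[:alpha:]]   any one letter
Character matching   [[:alnum:]]   [[:alnum:]]   any one letter or digit
Character matching   [[:space:]]   [[:space:]]   any one whitespace character
Character matching   [[:punct:]]   [[:punct:]]   any one punctuation character
Position anchoring   ^             ^             start of line
Position anchoring   $             $             end of line
Position anchoring   \<            \<            start of word
Position anchoring   \>            \>            end of word
Repetition           \?            ?             the preceding character 0 or 1 times
Repetition           \+            +             the preceding character at least once
Repetition           \{m\}         {m}           exactly m times
Repetition           \{m,\}        {m,}          at least m times
Repetition           \{0,n\}       {0,n}         at most n times
Repetition           \{m,n\}       {m,n}         between m and n times
Grouping             \(\)          ()            group; \1, \2 refer back to the text matched by the first and second group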
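The practical difference between the two columns is where the backslashes go. A minimal sketch, assuming GNU grep and four invented input lines:

```shell
data='ac
abc
abbc
abbbbc'

# "b one or two times", written in BRE and in ERE syntax.
bre=$(printf '%s\n' "$data" | grep 'ab\{1,2\}c')
ere=$(printf '%s\n' "$data" | grep -E 'ab{1,2}c')

# Without -E an unescaped brace is a literal character, so nothing matches here.
# grep -c prints the count and exits non-zero when it is 0, hence the "|| true".
literal=$(printf '%s\n' "$data" | grep -c 'ab{1,2}c' || true)

printf '%s\n' "$bre"
echo "literal-brace matches: $literal"
```

Both spellings select the lines abc and abbc; mixing them up tends to fail silently rather than with an error.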

An example of using a group back-reference

[root@centos7 tmp]# cat f1
He likes his lover.
He loves his liker.
She likes her lover.
She loves her lover.
[root@centos7 tmp]# grep "\(l..e\).*\1" f1
She loves her lover.

Ways to get the first character of each line

[root@localhost /tmp]#cat f1
abcdefg
[root@localhost /tmp]#grep "^." f1
abcdefg
[root@localhost /tmp]#grep -o "^." f1
a
[root@localhost /tmp]#cut -c1 f1
a

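Regular expression practice

1. Find the highest usage percentage among the disk partitions

[root@centos7 ~]# df | grep "^/dev" | grep -v "cdrom$" | tr ' ' ':' | tr -s ':' | cut -d: -f5 | sort -n
1%
4%
73%

The last line is the maximum; append | tail -1 to print only that value.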
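Whether sort compares the percentages as text or as numbers decides which value ends up last. The sketch below uses three made-up usage values to show the difference:

```shell
usage='9%
73%
100%'

# Plain sort compares character by character, so "9%" sorts after "100%".
textual_max=$(printf '%s\n' "$usage" | sort | tail -n 1)
# sort -n compares the leading number of each line.
numeric_max=$(printf '%s\n' "$usage" | sort -n | tail -n 1)

echo "text sort picks $textual_max, numeric sort picks $numeric_max"
```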

2. Find the user name, UID, and shell of the user with the largest UID

[root@centos7 ~]# cat /etc/passwd | sort -n -t: -k3 | cut -d: -f1,3,7 | tail -1
nfsnobody:65534:/sbin/nologin

3. Show the permissions of /tmp in numeric form

[root@centos7 ~]# stat /tmp | head -4 | tail -1 | tr '(/)' ':' | tr ' ' ':' | tr -s ':' | cut -d: -f2
0777

Or

[root@centos7 ~]# stat /tmp | grep "^A.*)$" | tr ' ' '\n' | head -2 | tail -1 | tr -cd '[:digit:]'
0777
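Outside of a text-processing exercise, the octal mode can be requested directly. This is a sketch that assumes the GNU coreutils version of stat, whose -c option takes a format string; the file name demo is invented.

```shell
tmpdir=$(mktemp -d)
touch "$tmpdir/demo"
chmod 640 "$tmpdir/demo"

# %a is the access mode in octal (GNU stat only).
mode=$(stat -c '%a' "$tmpdir/demo")
echo "$mode"

rm -rf "$tmpdir"
```

Unlike the pipelines above, this does not depend on the exact layout of stat's default output.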

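4. Count the connections per host address (including the port number) and sort the counts from largest to smallest

[root@centos7 ~]# netstat -tan | grep "^tcp\>" | tr ' ' '*' | tr -s '*' | cut -d'*' -f4 | sort | uniq -c | sort -rn
      3 10.1.0.17:22
      1 192.168.122.1:53
      1 127.0.0.1:631
      1 127.0.0.1:25
      1 0.0.0.0:22

Field 4 of the netstat -tan output is the Local Address column, which is what the output above shows; use -f5 to count by the foreign (remote) address instead.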
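To count by remote host rather than by local address, the foreign-address column has to be selected and its port stripped. The sketch below does this on canned input so that it runs without netstat; every address in it is invented for illustration.

```shell
# Hypothetical netstat -tan lines.
sample='tcp        0      0 10.1.0.17:22            10.1.250.29:51234       ESTABLISHED
tcp        0      0 10.1.0.17:22            10.1.250.29:51240       ESTABLISHED
tcp        0      0 10.1.0.17:22            172.18.19.143:40022     ESTABLISHED
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN'

# Field 5 is the foreign address; the second cut drops the port.
per_host=$(printf '%s\n' "$sample" \
  | grep 'ESTABLISHED' \
  | tr -s ' ' \
  | cut -d' ' -f5 \
  | cut -d: -f1 \
  | sort | uniq -c | sort -rn \
  | sed 's/^ *//')
printf '%s\n' "$per_host"
```

The sort before uniq -c groups identical addresses, and sort -rn orders the counts numerically.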

5. Show the lines in /proc/meminfo that begin with an uppercase or lowercase s (requirement: use two different methods)

Method 1

[root@centos7 ~]# grep "^[sS]" /proc/meminfo
SwapCached:         6928 kB
SwapTotal:       2097148 kB
SwapFree:        2051836 kB
Shmem:             20884 kB
Slab:             150348 kB
SReclaimable:      84320 kB
SUnreclaim:        66028 kB

Method 2

[root@centos7 ~]# grep -i "^s" /proc/meminfo
SwapCached:         6928 kB
SwapTotal:       2097148 kB
SwapFree:        2051836 kB
Shmem:             20884 kB
Slab:             150348 kB
SReclaimable:      84320 kB
SUnreclaim:        66028 kB

6. Show the lines in /etc/passwd that do not end with /bin/bash

[root@centos7 ~]# grep -v "/bin/bash$" /etc/passwd
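Where the anchors sit changes the result of this exercise. The sketch below compares a pattern anchored only at the end of the line with one anchored at both ends, on two sample records:

```shell
sample='root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin'

# Anchored at the end only: drops every line that ends with /bin/bash.
end_anchored=$(printf '%s\n' "$sample" | grep -v '/bin/bash$')
# Anchored at both ends: would only drop a line that is exactly "/bin/bash".
fully_anchored=$(printf '%s\n' "$sample" | grep -v '^/bin/bash$')

printf '%s\n' "$end_anchored"
```

With both anchors no passwd record can match, so grep -v returns the whole file.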

7. Show the default shell of the user rpc

[root@centos7 ~]# grep "^rpc\>" /etc/passwd | cut -d: -f1,3,7
rpc:32:/sbin/nologin

8. Find the two- or three-digit numbers in /etc/passwd; they must be positive integers

[root@centos7 ~]# grep -E "\<[1-9][0-9]{1,2}\>" /etc/passwd
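The choice of the leading digit class decides which numbers are found. The sketch below runs two patterns with -o on three invented passwd-style records, one allowing any leading digit from 1 to 9 and one requiring a leading 1:

```shell
sample='games:x:12:100:games:/usr/games:/sbin/nologin
ftp:x:14:50:FTP User:/var/ftp:/sbin/nologin
big:x:1000:7:Big:/home/big:/bin/bash'

# Any two- or three-digit number without a leading zero.
any_lead=$(printf '%s\n' "$sample" | grep -o -E '\<[1-9][0-9]{1,2}\>')
# Only numbers whose first digit is 1.
one_lead=$(printf '%s\n' "$sample" | grep -o -E '\<1[0-9]{1,2}\>')

printf '%s\n' "$any_lead"
```

The word anchors keep 1000 and 7 out of both results; the second pattern also misses 50.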

9. Show the lines in /etc/grub2.cfg that begin with at least one whitespace character followed by a non-whitespace character

[root@centos7 ~]# grep "^[[:space:]]\+[^[:space:]]" /etc/grub2.cfg

10. From the output of "netstat -tan", find the lines that end with 'LISTEN' followed by any number of whitespace characters

[root@centos7 ~]# netstat -tan | grep -E "LISTEN[[:space:]]*$"
tcp        0      0 192.168.122.1:53        0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:22              0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:631           0.0.0.0:*               LISTEN
tcp        0      0 127.0.0.1:25            0.0.0.0:*               LISTEN
tcp6       0      0 :::22                   :::*                    LISTEN
tcp6       0      0 ::1:631                 :::*                    LISTEN
tcp6       0      0 ::1:25                  :::*                    LISTEN

11. Add the users bash, testbash, basher, and nologin (whose shell is /sbin/nologin), then find the lines in /etc/passwd where the user name is the same as the shell name

[root@centos7 ~]# grep -E "^\<(.*)\>.*\<\1\>$" /etc/passwd
sync:x:5:0:sync:/sbin:/bin/sync
shutdown:x:6:0:shutdown:/sbin:/sbin/shutdown
halt:x:7:0:halt:/sbin:/sbin/halt
nologin:x:4346:4346::/home/nologin:/sbin/nologin
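A variant of this pattern ties the captured group to the passwd field layout: the name is everything before the first colon, and the shell's base name is whatever follows the last slash. This is a sketch of an alternative, not the author's solution, and the UIDs of the bash, testbash, and basher records are invented.

```shell
sample='sync:x:5:0:sync:/sbin:/bin/sync
bash:x:4343:4343::/home/bash:/bin/bash
testbash:x:4344:4344::/home/testbash:/bin/bash
basher:x:4345:4345::/home/basher:/bin/bash
nologin:x:4346:4346::/home/nologin:/sbin/nologin'

# ([^:]+) captures the user name; /\1$ requires the line to end with "/" plus that name.
same=$(printf '%s\n' "$sample" | grep -E '^([^:]+):.*/\1$' | cut -d: -f1)
printf '%s\n' "$same"
```

testbash and basher are rejected because /bin/bash does not end with a slash followed by their full names.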

12. Show the user name, UID, and default shell of the three users root, mage, and wang

[root@centos7 ~]# grep -E "^(root|mage|wang)\>" /etc/passwd | cut -d: -f1,3,7
root:0:/bin/bash
mage:4347:/bin/bash
wang:4348:/bin/bash

13. Find the lines in /etc/rc.d/init.d/functions that begin with a word (underscores allowed) followed by a pair of parentheses

grep -E "^[[:alpha:]_]+\(\)" /etc/rc.d/init.d/functions

14. Use egrep to extract the base name of /etc/rc.d/init.d/functions

[root@centos7 ~]# echo /etc/rc.d/init.d/functions | grep -E -o "[^/]+/?$"
functions

15. Use egrep to extract the directory name of a path like the one above

[root@centos7 ~]# echo /etc/rc.d/initd/function | grep -E -o "^/.*/"
/etc/rc.d/initd/
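For comparison, the standard basename and dirname utilities do the same job as the two patterns. The sketch below runs both approaches on the path from exercise 14:

```shell
path=/etc/rc.d/init.d/functions

base_re=$(echo "$path" | grep -E -o '[^/]+/?$')   # last component, optional trailing slash
dir_re=$(echo "$path" | grep -E -o '^/.*/')       # everything up to the last slash
base_cmd=$(basename "$path")
dir_cmd=$(dirname "$path")

echo "$base_re $base_cmd"
echo "$dir_re $dir_cmd"
```

The only visible difference is that the regular expression keeps the trailing slash of the directory while dirname drops it.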

16. Count the number of logins per remote host IP address for the user root

[root@centos7 ~]# last | grep -E "^root\>.*([[:digit:]]{1,3}\.){3}[[:digit:]]{1,3}" | tr -s ' ' | cut -d' ' -f3 | sort | uniq -c
     75 10.1.250.29
      1 172.18.19.143

Or

[root@centos7 ~]# last | grep -E "^root\>.*pts" | tr -s ' ' | cut -d' ' -f3 | sort | uniq -c
     75 10.1.250.29
      1 172.18.19.143
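In an IP address pattern the dot has to be escaped, because an unescaped dot matches any character. The sketch below tests both spellings against a plain run of digits that is clearly not an address:

```shell
pattern_loose='([[:digit:]]{1,3}.){3}[[:digit:]]{1,3}'    # "." matches any character
pattern_strict='([[:digit:]]{1,3}\.){3}[[:digit:]]{1,3}'  # "\." matches a literal dot
not_an_ip='12345678'

# grep -c prints the number of matching lines; "|| true" absorbs the non-zero exit on 0.
loose=$(echo "$not_an_ip" | grep -E -c "$pattern_loose" || true)
strict=$(echo "$not_an_ip" | grep -E -c "$pattern_strict" || true)

echo "loose=$loose strict=$strict"
```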

17. Show all the IPv4 addresses in the output of the ifconfig command

[root@centos7 ~]# ifconfig |grep -E -o  '(([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])\.){3}([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])'
10.1.0.17
255.255.0.0
10.1.255.255
127.0.0.1
255.0.0.0
192.168.122.1
255.255.255.0
192.168.122.255
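The long pattern is easier to read when the octet part is named once and reused. The sketch below builds the same expression from a shell variable and applies it to a few made-up candidates; -x makes grep accept a line only if the whole line matches.

```shell
octet='([0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])'   # 0 to 255
ipv4="($octet\.){3}$octet"

candidates='10.1.0.17
256.1.1.1
192.168.122.255
1.2.3'

# Whole-line match: 256.1.1.1 and 1.2.3 are rejected.
valid=$(printf '%s\n' "$candidates" | grep -E -x "$ipv4")
# With -o and no anchoring, a valid-looking tail of an invalid address is still extracted.
partial=$(echo '256.1.1.1' | grep -E -o "$ipv4")

printf '%s\n' "$valid"
echo "extracted from 256.1.1.1: $partial"
```

This is worth keeping in mind when using -o on free-form text: the pattern limits each octet to 255, but it does not check what surrounds the match.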

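18. Extract all the ID card numbers from a text

[root@centos6 testdir]# grep -o -E "[1-9][0-9]{16}[0-9Xx]" f1
123456789123456789

An 18-character ID number starts with a nonzero digit and may end in X, so the pattern does not require a leading 1.

19. Extract all the mobile phone numbers from a text

[root@centos6 testdir]# grep -o -E "1[0-9]{10}" f2
15335699718

20. Extract all the email addresses from a text

[root@centos6 testdir]# grep -o -E "([0-9]{5,}|[a-z])@([1-9]{1,}|[a-z]{1,}).com" f3
9687765@qq.com
12345@yahu.com
o@sina.com
o@163.com

Note that [a-z] matches only a single letter, so an alphabetic user name is cut down to its last letter (as in o@sina.com above), and the unescaped . before com matches any character. A more general pattern is [[:alnum:]_.+-]+@[[:alnum:]-]+(\.[[:alnum:]-]+)+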
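The effect of the single-letter [a-z] can be seen by running the pattern next to a broader one. In this sketch the sample sentence and the address hello.world@sina.com are invented, and the broader pattern is only an approximation of common addresses, not a full validator of email syntax.

```shell
text='write to 9687765@qq.com or hello.world@sina.com today'

# The exercise's pattern: the part before @ is 5+ digits or exactly one letter.
original=$(echo "$text" | grep -o -E '([0-9]{5,}|[a-z])@([1-9]{1,}|[a-z]{1,}).com')
# Broader sketch: letters, digits, and a few symbols before @, dotted labels after it.
broader=$(echo "$text" | grep -o -E '[[:alnum:]_.+-]+@[[:alnum:]-]+(\.[[:alnum:]-]+)+')

printf '%s\n' "$original"
printf '%s\n' "$broader"
```

The first pattern reports d@sina.com for the second address, while the broader one returns it whole.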

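One more way to get the directory name, left by the author for readers to explain:

[root@centos7 ~]# echo /etc/rc.d/initd/function/ | grep -E -o "^/.*/\b"
/etc/rc.d/initd/

It works because \b matches only at a word boundary. Nothing follows the final /, so there is no boundary after it; the longest match that still satisfies \b therefore ends at the previous /, which is followed by the word character f.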

0 0