awk Summary

Source: Internet. Editor: 程序博客网. Time: 2024/04/30 06:40

http://www.gnu.org/software/gawk/manual/gawk.html

1. Shell quoting rules

Quoted items can be concatenated with nonquoted items as well as with other quoted items. The shell turns everything into one argument for the command.

Preceding any single character with a backslash (‘\’) quotes that character. The shell removes the backslash and passes the quoted character on to the command.

Single quotes protect everything between the opening and closing quotes. The shell does no interpretation of the quoted text, passing it on verbatim to the command. It is impossible to embed a single quote inside single-quoted text.

Double quotes protect most things between the opening and closing quotes. The shell does at least variable and command substitution on the quoted text.

If you really need both single and double quotes in your awk program, it is probably best to move it into a separate file, where the shell won't be part of the picture, and you can say what you mean.
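As a rough illustration of these quoting rules (a sketch assuming a POSIX shell and any awk; the variable names first, quoted, and octal are made up for the example), a single-quoted awk program can still produce a single quote, either by concatenating quoted pieces or by using an octal escape:

```shell
# Single quotes pass the program to awk verbatim, so $1 is the awk field,
# not a shell variable.
first=$(echo 'alpha beta' | awk '{ print $1 }')

# Quoted and unquoted pieces concatenate into one argument: close the single
# quotes, add a backslash-escaped quote, then reopen the single quotes.
quoted=$(awk 'BEGIN { print "don'\''t panic" }')

# The octal escape \047 is another way to get a single quote in the output.
octal=$(awk 'BEGIN { print "don\047t panic" }')

echo "$first"
echo "$quoted"
echo "$octal"
```

Both workarounds get hard to read quickly, which is why moving the program into a separate file is usually the cleaner choice.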

2. Running awk

awk 'program' input-file1 input-file2 ...

awk -f program-file input-file1 input-file2 ...
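The two invocation forms can be compared with a small sketch. The temporary data file, the program file, and their contents are hypothetical and created only for this example:

```shell
# Hypothetical sample data and program file, created only for this sketch.
data=$(mktemp)
prog=$(mktemp)
printf 'apple 3\nbanana 5\n' > "$data"
printf '%s\n' '{ total += $2 } END { print total }' > "$prog"

# Form 1: the program text is given on the command line.
inline=$(awk '{ total += $2 } END { print total }' "$data")

# Form 2: the same program is read from a file with -f.
from_file=$(awk -f "$prog" "$data")

echo "$inline $from_file"
rm -f "$data" "$prog"
```

Both forms should give the same result, since only the place where awk reads the program from differs.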

3. Program format

program = pattern {action}
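A minimal sketch of the pattern {action} form, using made-up input lines. It also shows that either part may be left out: a missing action defaults to printing the record, and a missing pattern matches every record.

```shell
# Pattern only: the default action prints each matching record.
pattern_only=$(printf 'foo 1\nbar 2\nfoo 3\n' | awk '/foo/')

# Action only: the action runs for every record.
action_only=$(printf 'foo 1\nbar 2\nfoo 3\n' | awk '{ print $2 }')

# Pattern and action together: the action runs only for matching records.
both=$(printf 'foo 1\nbar 2\nfoo 3\n' | awk '$2 > 1 { print $1 }')

echo "$pattern_only"
echo "$action_only"
echo "$both"
```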

3. Regular expressions

Escape Sequence

\\
A literal backslash, ‘\’.

\a
The “alert” character, Ctrl-g, ASCII code 7 (BEL). (This usually makes some sort of audible noise.)

\b
Backspace, Ctrl-h, ASCII code 8 (BS).

\f
Formfeed, Ctrl-l, ASCII code 12 (FF).

\n
Newline, Ctrl-j, ASCII code 10 (LF).

\r
Carriage return, Ctrl-m, ASCII code 13 (CR).

\t
Horizontal TAB, Ctrl-i, ASCII code 9 (HT).

\v
Vertical tab, Ctrl-k, ASCII code 11 (VT).

\nnn
The octal value nnn, where nnn stands for 1 to 3 digits between ‘0’ and ‘7’. For example, the code for the ASCII ESC (escape) character is ‘\033’.

\xhh...
The hexadecimal value hh, where hh stands for a sequence of hexadecimal digits (‘0’–‘9’, and either ‘A’–‘F’ or ‘a’–‘f’). Like the same construct in ISO C, the escape sequence continues until the first nonhexadecimal digit is seen (this is a common extension). However, using more than two hexadecimal digits produces undefined results. (The ‘\x’ escape sequence is not allowed in POSIX awk.)

\/
A literal slash (necessary for regexp constants only). This sequence is used when you want to write a regexp constant that contains a slash. Because the regexp is delimited by slashes, you need to escape the slash that is part of the pattern, in order to tell awk to keep processing the rest of the regexp.

\"
A literal double quote (necessary for string constants only). This sequence is used when you want to write a string constant that contains a double quote. Because the string is delimited by double quotes, you need to escape the quote that is part of the string, in order to tell awk to keep processing the rest of the string.
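A few of the escape sequences above can be tried out as follows. This is only a sketch with made-up strings; it sticks to escapes that POSIX awk accepts, so ‘\x’ is left out.

```shell
# \t inside a string constant becomes a TAB.
tabbed=$(awk 'BEGIN { print "a\tb" }')

# \" embeds a double quote inside a string constant.
dq=$(awk 'BEGIN { print "say \"hi\"" }')

# \/ embeds a slash inside a regexp constant.
slash=$(echo 'usr/bin' | awk '/usr\/bin/ { print "matched" }')

# \033 is the octal escape for ESC; length() confirms it is one character.
esc_len=$(awk 'BEGIN { print length("\033") }')

echo "$tabbed"
echo "$dq"
echo "$slash"
echo "$esc_len"
```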

Regular Expression Operators

\
This is used to suppress the special meaning of a character when matching. For example, ‘\$’ matches the character ‘$’.

^
This matches the beginning of a string. For example, ‘^@chapter’ matches ‘@chapter’ at the beginning of a string and can be used to identify chapter beginnings in Texinfo source files. The ‘^’ is known as an anchor, because it anchors the pattern to match only at the beginning of the string.
It is important to realize that ‘^’ does not match the beginning of a line embedded in a string. The condition is not true in the following example:

          if ("line1\nLINE 2" ~ /^L/) ...

$
This is similar to ‘^’, but it matches only at the end of a string. For example, ‘p$’ matches a record that ends with a ‘p’. The ‘$’ is an anchor and does not match the end of a line embedded in a string. The condition in the following example is not true:
          if ("line1\nLINE 2" ~ /1$/) ...

. (period)
This matches any single character, including the newline character. For example, ‘.P’ matches any single character followed by a ‘P’ in a string. Using concatenation, we can make a regular expression such as ‘U.A’, which matches any three-character sequence that begins with ‘U’ and ends with ‘A’.
In strict POSIX mode (see Options), ‘.’ does not match the nul character, which is a character with all bits equal to zero. Otherwise, nul is just another character. Other versions of awk may not be able to match the nul character.

[...]
This is called a bracket expression. It matches any one of the characters that are enclosed in the square brackets. For example, ‘[MVX]’ matches any one of the characters ‘M’, ‘V’, or ‘X’ in a string. A full discussion of what can be inside the square brackets of a bracket expression is given in “Using Bracket Expressions” below.

[^ ...]
This is a complemented bracket expression. The first character after the ‘[’ must be a ‘^’. It matches any characters except those in the square brackets. For example, ‘[^awk]’ matches any character that is not an ‘a’, ‘w’, or ‘k’.

|
This is the alternation operator and it is used to specify alternatives. The ‘|’ has the lowest precedence of all the regular expression operators. For example, ‘^P|[[:digit:]]’ matches any string that matches either ‘^P’ or ‘[[:digit:]]’. This means it matches any string that starts with ‘P’ or contains a digit.
The alternation applies to the largest possible regexps on either side.

(...)
Parentheses are used for grouping in regular expressions, as in arithmetic. They can be used to concatenate regular expressions containing the alternation operator, ‘|’. For example, ‘@(samp|code)\{[^}]+\}’ matches both ‘@code{foo}’ and ‘@samp{bar}’. (These are Texinfo formatting control sequences. The ‘+’ is explained further on in this list.)

*
This symbol means that the preceding regular expression should be repeated as many times as necessary to find a match. For example, ‘ph*’ applies the ‘*’ symbol to the preceding ‘h’ and looks for matches of one ‘p’ followed by any number of ‘h’s. This also matches just ‘p’ if no ‘h’s are present.
The ‘*’ repeats the smallest possible preceding expression. (Use parentheses if you want to repeat a larger expression.) It finds as many repetitions as possible. For example, ‘awk '/\(c[ad][ad]*r x\)/ { print }' sample’ prints every record in sample containing a string of the form ‘(car x)’, ‘(cdr x)’, ‘(cadr x)’, and so on. Notice the escaping of the parentheses by preceding them with backslashes.

+
This symbol is similar to ‘*’, except that the preceding expression must be matched at least once. This means that ‘wh+y’ would match ‘why’ and ‘whhy’, but not ‘wy’, whereas ‘wh*y’ would match all three of these strings. The following is a simpler way of writing the last ‘*’ example:
          awk '/\(c[ad]+r x\)/ { print }' sample

?
This symbol is similar to ‘*’, except that the preceding expression can be matched either once or not at all. For example, ‘fe?d’ matches ‘fed’ and ‘fd’, but nothing else.

{n}
{n,}
{n,m}
One or two numbers inside braces denote an interval expression. If there is one number in the braces, the preceding regexp is repeated n times. If there are two numbers separated by a comma, the preceding regexp is repeated n to m times. If there is one number followed by a comma, then the preceding regexp is repeated at least n times:
wh{3}y
Matches ‘whhhy’, but not ‘why’ or ‘whhhhy’.
wh{3,5}y
Matches ‘whhhy’, ‘whhhhy’, or ‘whhhhhy’, only.
wh{2,}y
Matches ‘whhy’ or ‘whhhy’, and so on.
Interval expressions were not traditionally available in awk. They were added as part of the POSIX standard to make awk and egrep consistent with each other.

Initially, because old programs may use ‘{’ and ‘}’ in regexp constants, gawk did not match interval expressions in regexps.

However, beginning with version 4.0, gawk does match interval expressions by default. This is because compatibility with POSIX has become more important to most gawk users than compatibility with old programs.

For programs that use ‘{’ and ‘}’ in regexp constants, it is good practice to always escape them with a backslash. Then the regexp constants are valid and work the way you want them to, using any version of awk.
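The operators in the list above can be checked with a short sketch. The test strings are taken from the examples in the text; the variable name regex_demo is made up. Interval expressions are deliberately left out, because support for them varies between awk versions. Each printed 1 means "matches" and each 0 means "does not match".

```shell
regex_demo=$(awk 'BEGIN {
    s = "line1\nLINE 2"
    # ^ and $ anchor to the whole string, not to lines embedded in it.
    printf "%d %d %d %d\n", (s ~ /^L/), (s ~ /^l/), (s ~ /1$/), (s ~ /2$/)
    # Alternation has the lowest precedence: starts with P, or contains a digit.
    printf "%d %d %d\n", ("Pat" ~ /^P|[[:digit:]]/), ("a1" ~ /^P|[[:digit:]]/), ("aP" ~ /^P|[[:digit:]]/)
    # Grouping limits the alternation to samp|code; braces are escaped.
    printf "%d %d\n", ("@code{foo}" ~ /@(samp|code)\{[^}]+\}/), ("@var{foo}" ~ /@(samp|code)\{[^}]+\}/)
    # * allows zero or more, + one or more, ? zero or one.
    printf "%d %d %d %d\n", ("wy" ~ /^wh*y$/), ("wy" ~ /^wh+y$/), ("whhy" ~ /^wh+y$/), ("whhy" ~ /^wh?y$/)
    # Escaped parentheses match literal parentheses.
    printf "%d %d\n", ("(cadr x)" ~ /\(c[ad][ad]*r x\)/), ("(cr x)" ~ /\(c[ad][ad]*r x\)/)
}')

echo "$regex_demo"
```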

Using Bracket Expressions

Class   Meaning
[:alnum:]   Alphanumeric characters.
[:alpha:]   Alphabetic characters.
[:blank:]   Space and TAB characters.
[:cntrl:]   Control characters.
[:digit:]   Numeric characters.
[:graph:]   Characters that are both printable and visible. (A space is printable but not visible, whereas an ‘a’ is both.)
[:lower:]   Lowercase alphabetic characters.
[:print:]   Printable characters (characters that are not control characters).
[:punct:]   Punctuation characters (characters that are not letters, digits, control characters, or space characters).
[:space:]   Space characters (such as space, TAB, and formfeed, to name a few).
[:upper:]   Uppercase alphabetic characters.
[:xdigit:]   Characters that are hexadecimal digits.

For example, you can write /[[:alnum:]]/ to match the alphabetic and numeric characters in your character set.
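A small sketch of character classes in use, with a made-up input line. It assumes an awk that supports POSIX character classes, which gawk does; some older awk implementations may not.

```shell
classes=$(echo 'Room 42, Floor 7' | awk '{
    digits = $0; gsub(/[^[:digit:]]/, "", digits)            # keep only digits
    uppers = $0; gsub(/[^[:upper:]]/, "", uppers)            # keep only uppercase letters
    words  = $0; gsub(/[[:space:][:punct:]]+/, "-", words)   # collapse spaces and punctuation
    print digits
    print uppers
    print words
}')

echo "$classes"
```

Note that a class name such as [:digit:] is only valid inside a bracket expression, hence the doubled brackets.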

Collating symbols
Multicharacter collating elements enclosed between ‘[.’ and ‘.]’. For example, if ‘ch’ is a collating element, then [[.ch.]] is a regexp that matches this collating element, whereas [ch] is a regexp that matches either ‘c’ or ‘h’.

Equivalence classes
Locale-specific names for a list of characters that are equal. The name is enclosed between ‘[=’ and ‘=]’. For example, the name ‘e’ might be used to represent all of “e,” “è,” and “é.” In this case, [[=e=]] is a regexp that matches any of ‘e’, ‘é’, or ‘è’.

gawk-Specific Regexp Operators

The following operators are specific to gawk:

\s
Matches any whitespace character. Think of it as shorthand for [[:space:]].

\S
Matches any character that is not whitespace. Think of it as shorthand for [^[:space:]].

\w
Matches any word-constituent character—that is, it matches any letter, digit, or underscore. Think of it as shorthand for [[:alnum:]_].

\W
Matches any character that is not word-constituent. Think of it as shorthand for [^[:alnum:]_].

\<
Matches the empty string at the beginning of a word. For example, /\<away/ matches ‘away’ but not ‘stowaway’.

\>
Matches the empty string at the end of a word. For example, /stow\>/ matches ‘stow’ but not ‘stowaway’.

\y
Matches the empty string at either the beginning or the end of a word (i.e., the word boundary). For example, ‘\yballs?\y’ matches either ‘ball’ or ‘balls’, as a separate word.

\B
Matches the empty string that occurs between two word-constituent characters. For example, /\Brat\B/ matches ‘crate’ but it does not match ‘dirty rat’. ‘\B’ is essentially the opposite of ‘\y’.

There are two other operators that work on buffers. In Emacs, a buffer is, naturally, an Emacs buffer. For other programs, gawk's regexp library routines consider the entire string to match as the buffer. The operators are:

\`
Matches the empty string at the beginning of a buffer (string).

\'
Matches the empty string at the end of a buffer (string).

Case Sensitivity in Matching

One approach is to convert everything to lowercase or uppercase before matching:

tolower($1) ~ /foo/ { ... }
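A runnable version of this idea, with made-up input lines, might look like the sketch below. The comparison is done on a converted copy, so the original field keeps its case.

```shell
# Lowercase the first field before matching against a lowercase regexp.
lower=$(printf 'Foo 1\nbar 2\nFOOD 3\n' | awk 'tolower($1) ~ /foo/ { print $2 }')

# The mirror image: uppercase the field and match an uppercase regexp.
upper=$(printf 'Foo 1\nbar 2\nFOOD 3\n' | awk 'toupper($1) ~ /FOO/ { print $2 }')

echo "$lower"
echo "$upper"
```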

How Much Text Matches

4. Processing input files

How records are split

The variable RS (record separator) defines the character or string that separates records:

awk 'BEGIN { RS = "/" }
{ print $0 }' BBS-list

This makes ‘/’ the record separator.
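Since the BBS-list file is not reproduced here, the same effect can be sketched with a made-up input string. The input deliberately has no trailing newline, so that the last record does not carry one along.

```shell
# With RS = "/", each slash ends a record; NR counts the records.
records=$(printf 'a b/c d/e f' | awk 'BEGIN { RS = "/" } { print NR ": " $0 }')

echo "$records"
```

With RS set this way a newline is no longer special, so any newline in the input becomes part of the record that contains it.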

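gawk also allows RS to be a regular expression. The variable RT then holds the input text that matched RS at the end of the current record.

$ echo record 1 AAAA record 2 BBBB record 3 | gawk 'BEGIN { RS = "\n|( *[[:upper:]]+ *)" } { print "Record =", $0, "and RT =", RT }'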

     -| Record = record 1 and RT = AAAA
     -| Record = record 2 and RT = BBBB
     -| Record = record 3 and RT =
     -|

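Fields in a record

Each record is in turn split into smaller fields.

$0 refers to the entire record.

$1 and $2 refer to the first and second fields.

Changing a field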

awk '{ $2 = $2 - 10; print $0 }' inventory-shipped

This sets $2 to its original value minus 10. The original file inventory-shipped is not modified.
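The inventory-shipped file is not included here, so the following sketch uses two made-up lines in its place. It also shows a side effect worth knowing: assigning to a field makes awk rebuild $0, joining the fields with single spaces (the default output field separator).

```shell
# Subtract 10 from the second field; $0 is rebuilt with single spaces.
changed=$(printf 'Jan   13   25\nFeb   15   32\n' | awk '{ $2 = $2 - 10; print $0 }')

# Without an assignment, $0 keeps its original spacing.
untouched=$(printf 'Jan   13   25\n' | awk '{ print $0 }')

echo "$changed"
echo "$untouched"
```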

How fields are split

http://www.gnu.org/software/gawk/manual/gawk.html#Field-Separators
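The linked manual section covers the details. As a brief sketch with made-up input, the most common ways to control field splitting are the default whitespace rule, the -F option, and assigning FS in a BEGIN rule:

```shell
# Default splitting: runs of spaces and TABs separate fields, and leading
# and trailing whitespace is ignored.
default_fs=$(echo '  a   b  c ' | awk '{ print NF ":" $2 }')

# -F sets the field separator from the command line.
colon=$(echo 'root:x:0:0' | awk -F: '{ print $1 "," $3 }')

# Assigning FS in a BEGIN rule has the same effect.
comma=$(echo 'x,y,z' | awk 'BEGIN { FS = "," } { print $2 }')

echo "$default_fs"
echo "$colon"
echo "$comma"
```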

5. Debugging awk with dgawk