Regular Expression Matching

Source: Internet | Published by: ce6.0软件 | Editor: 程序博客网 | Time: 2024/06/06 03:02

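Regular expressions are highly useful in practice. The best site for learning them in Java is http://download.oracle.com/javase/tutorial/essential/regex/test_harness.html

Below is an example. Export it as a Runnable JAR, then run java -jar XXX.jar to try out all kinds of regexes.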

import java.io.Console;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexTestHarness {

    public static void main(String[] args) {
        // System.console() returns null when no interactive console is attached.
        Console console = System.console();
        if (console == null) {
            System.err.println("No console.");
            System.exit(1);
        }
        // Loop forever: read a regex, read an input string, report every match.
        while (true) {
            Pattern pattern =
                Pattern.compile(console.readLine("%nEnter your regex: "));
            Matcher matcher =
                pattern.matcher(console.readLine("Enter input string to search: "));

            boolean found = false;
            while (matcher.find()) {
                console.format("I found the text \"%s\" starting at " +
                    "index %d and ending at index %d.%n",
                    matcher.group(), matcher.start(), matcher.end());
                found = true;
            }
            if (!found) {
                console.format("No match found.%n");
            }
        }
    }
}

Positional relationships:

Character-matching expressions:

Character Classes
[abc]     a, b, or c (simple class)
[^abc]     Any character except a, b, or c (negation)
[a-zA-Z]     a through z, or A through Z, inclusive (range)
[a-d[m-p]]     a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]     d, e, or f (intersection)
[a-z&&[^bc]]     a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]     a through z, and not m through p: [a-lq-z] (subtraction)
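
The union, intersection, and subtraction rows above can be tried directly. The following is a minimal sketch; the class name CharClassDemo and the helper matchedChars are made up for this illustration and are not part of the original harness.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CharClassDemo {

    // Hypothetical helper: concatenates every character of input
    // that the given character class matches.
    static String matchedChars(String charClass, String input) {
        Matcher m = Pattern.compile(charClass).matcher(input);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            sb.append(m.group());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String input = "abcdefmnopxyz";
        System.out.println(matchedChars("[a-d[m-p]]", input));   // union: abcdmnop
        System.out.println(matchedChars("[a-z&&[def]]", input)); // intersection: def
        System.out.println(matchedChars("[a-z&&[^bc]]", input)); // subtraction: adefmnopxyz
    }
}
```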


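Whitespace metacharacters:

\s   matches any whitespace character, such as a space, tab, or newline
\n   matches a newline (line feed) character
\r   matches a carriage return
\t   matches a tab
\f   matches a form feed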

Predefined character classes

.     Any character (may or may not match line terminators)
\d     A digit: [0-9]
\D     A non-digit: [^0-9]
\s     A whitespace character: [ \t\n\x0B\f\r]
\S     A non-whitespace character: [^\s]
\w     A word character: [a-zA-Z_0-9]
\W     A non-word character: [^\w]

\d matches all digits
\s matches whitespace
\w matches word characters

Alternatively, a capital letter means the opposite:

\D matches non-digits
\S matches non-whitespace
\W matches non-word characters
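
In Java source code each of these classes is written with a doubled backslash, because the backslash is also the escape character of string literals. The sketch below illustrates this; PredefinedClassDemo and its strip helper are hypothetical names chosen for the example.

```java
class PredefinedClassDemo {

    // Hypothetical helper: deletes everything in input that matches regex.
    static String strip(String regex, String input) {
        return input.replaceAll(regex, "");
    }

    public static void main(String[] args) {
        // "\\D" in source code is the two-character regex \D (non-digit).
        System.out.println(strip("\\D", "Tel: 555-0199")); // 5550199
        // \s removes the space, the tab, and the newline.
        System.out.println(strip("\\s", "a b\tc\n"));      // abc
        // \W removes the comma, the space, and the exclamation mark.
        System.out.println(strip("\\W", "hi, there!"));    // hithere
    }
}
```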

Matching strategies (quantifiers):

Greedy     Reluctant     Possessive     Meaning
X?         X??           X?+            X, once or not at all
X*         X*?           X*+            X, zero or more times
X+         X+?           X++            X, one or more times
X{n}       X{n}?         X{n}+          X, exactly n times
X{n,}      X{n,}?        X{n,}+         X, at least n times
X{n,m}     X{n,m}?       X{n,m}+        X, at least n but not more than m times
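
The three strategies differ in how much input they consume and whether they give any of it back. As a rough sketch (QuantifierDemo and findAll are illustrative names, not from the original article), the same input can be searched with a greedy, a reluctant, and a possessive version of .*foo:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {

    // Hypothetical helper: collects the text of every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: .* takes the whole input, then backs off until "foo" fits.
        System.out.println(findAll(".*foo", input));  // [xfooxxxxxxfoo]
        // Reluctant: .*? takes as little as possible before each "foo".
        System.out.println(findAll(".*?foo", input)); // [xfoo, xxxxxxfoo]
        // Possessive: .*+ takes the whole input and never backs off.
        System.out.println(findAll(".*+foo", input)); // []
    }
}
```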

Capturing groups:

Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses. For example, the regular expression (dog) creates a single group containing the letters "d", "o", and "g". The portion of the input string that matches the capturing group is saved in memory for later recall via backreferences (discussed in the Backreferences section below).

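There are two forms of capturing group.

One is the ordinary capturing group (called simply "capturing group" below wherever that is unambiguous), with the syntax: (expression)

The other is the named capturing group, with the syntax: (?<name>expression) or (?'name'expression). The two forms are equivalent. The quoted form is .NET syntax; java.util.regex accepts only (?<name>expression).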

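1. Numbering rules

If capturing groups are not named explicitly, that is, no named capturing groups are used, all groups must be accessed by number.

When only ordinary capturing groups are present, they are numbered from left to right in the order in which their "(" appears. The digits under the expression mark which group each parenthesis belongs to:

(\d{4})-(\d{2}-(\d\d))
1     1 2      3    32

The regular expression above matches dates in the format yyyy-MM-dd. The two notations \d{2} and \d\d are used only so that the groups can be told apart in the table below.

There is also a default group numbered 0, which stands for the regular expression as a whole.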

Matching the string 2008-12-31 with the regular expression above gives:

Number  Name  Capturing group           Matched content
0             (\d{4})-(\d{2}-(\d\d))    2008-12-31
1             (\d{4})                   2008
2             (\d{2}-(\d\d))            12-31
3             (\d\d)                    31
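
The numbering can be checked from Java with Matcher.group(int). This is a small sketch; GroupNumberDemo and its groups helper are hypothetical names used only here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupNumberDemo {

    // Hypothetical helper: returns group 0..groupCount() for a full match,
    // or an empty array when the input does not match.
    static String[] groups(String input) {
        Matcher m = Pattern.compile("(\\d{4})-(\\d{2}-(\\d\\d))").matcher(input);
        if (!m.matches()) {
            return new String[0];
        }
        String[] result = new String[m.groupCount() + 1];
        for (int i = 0; i <= m.groupCount(); i++) {
            result[i] = m.group(i);
        }
        return result;
    }

    public static void main(String[] args) {
        String[] g = groups("2008-12-31");
        for (int i = 0; i < g.length; i++) {
            System.out.println(i + ": " + g[i]);
        }
        // 0: 2008-12-31
        // 1: 2008
        // 2: 12-31
        // 3: 31
    }
}
```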

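If a group is named explicitly, that is, a named capturing group is used, its captured content can be referenced by the group name.

Take particular care with group numbers when a regular expression mixes ordinary and named capturing groups. In the .NET regex engine, whose rule is described here, ordinary capturing groups are numbered first and named capturing groups afterwards. java.util.regex does not do this: it numbers named groups in the same left-to-right sequence as ordinary groups, so in Java the group named date below is group 2 and (\d\d) is group 3.

(\d{4})-(?<date>\d{2}-(\d\d))
1     1 3             2    23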

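Matching the string 2008-12-31 with the regular expression above gives, under the numbering rule just described:

Number  Name  Capturing group                  Matched content
0             (\d{4})-(?<date>\d{2}-(\d\d))    2008-12-31
1             (\d{4})                          2008
2             (\d\d)                           31
3       date  (?<date>\d{2}-(\d\d))            12-31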
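
The table above follows the "ordinary groups first" rule stated in the text. java.util.regex instead numbers named groups together with ordinary ones, strictly by the position of the opening parenthesis, so the same pattern yields different numbers in Java. A short sketch (NamedGroupDemo and its helpers are illustrative names):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NamedGroupDemo {

    static final Pattern DATE =
        Pattern.compile("(\\d{4})-(?<date>\\d{2}-(\\d\\d))");

    // Hypothetical helper: content of the group called "date", or null if no match.
    static String byName(String input) {
        Matcher m = DATE.matcher(input);
        return m.matches() ? m.group("date") : null;
    }

    // Hypothetical helper: content of group number n, or null if no match.
    static String byNumber(String input, int n) {
        Matcher m = DATE.matcher(input);
        return m.matches() ? m.group(n) : null;
    }

    public static void main(String[] args) {
        String input = "2008-12-31";
        System.out.println(byName(input));      // 12-31
        System.out.println(byNumber(input, 1)); // 2008
        System.out.println(byNumber(input, 2)); // 12-31 (the named group is group 2 in Java)
        System.out.println(byNumber(input, 3)); // 31
    }
}
```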

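2. Referencing capturing groups

Capturing groups are generally referenced in the following ways:

a) Inside the regular expression, referring to the content captured by an earlier group. This is called a backreference.

b) Inside the regular expression, in a conditional expression of the form (?(expression)true|false). java.util.regex has no conditional construct; this form belongs to other engines such as .NET.

c) In program code, referring to the content captured by a group.

Backreferences

For an ordinary capturing group the syntax is \k<num>, usually shortened to \num, where num is a decimal number, namely the group number.

For a named capturing group the syntax is \k<name> or \k'name'. Of these forms, java.util.regex accepts \num and \k<name>.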
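
In Java the two backreference forms that java.util.regex supports are \num for numbered groups and \k<name> for named groups. The sketch below uses both to find doubled characters; BackreferenceDemo and findAll are hypothetical names for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BackreferenceDemo {

    // Hypothetical helper: collects the text of every match of regex in input.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "aabbcd book";
        // \1 must repeat exactly what group 1 captured.
        System.out.println(findAll("(\\w)\\1", input));           // [aa, bb, oo]
        // The same pattern with a named group and a named backreference.
        System.out.println(findAll("(?<ch>\\w)\\k<ch>", input));  // [aa, bb, oo]
    }
}
```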

Boundary Matchers

^     The beginning of a line
$     The end of a line
\b     A word boundary
\B     A non-word boundary
\A     The beginning of the input
\G     The end of the previous match
\Z     The end of the input but for the final terminator, if any
\z     The end of the input

To check if a pattern begins and ends on a word boundary (as opposed to a substring within a longer string), just use \b on either side; for example, \bdog\b

Enter your regex: \bdog\b
Enter input string to search: The dog plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.

Enter your regex: \bdog\b
Enter input string to search: The doggie plays in the yard.
No match found.

To match the expression on a non-word boundary, use \B instead:

Enter your regex: \bdog\B
Enter input string to search: The dog plays in the yard.
No match found.

Enter your regex: \bdog\B
Enter input string to search: The doggie plays in the yard.
I found the text "dog" starting at index 4 and ending at index 7.

To require the match to occur only at the end of the previous match, use \G:

Enter your regex: dog
Enter input string to search: dog dog
I found the text "dog" starting at index 0 and ending at index 3.
I found the text "dog" starting at index 4 and ending at index 7.

Enter your regex: \Gdog
Enter input string to search: dog dog
I found the text "dog" starting at index 0 and ending at index 3.

Here the second example finds only one match, because the second occurrence of "dog" does not start at the end of the previous match.
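
The same sessions can be reproduced without the interactive harness. The following is a sketch under that assumption; BoundaryDemo and its starts helper are names invented for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {

    // Hypothetical helper: start index of every match of regex in input.
    static List<Integer> starts(String regex, String input) {
        List<Integer> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            result.add(m.start());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(starts("\\bdog\\b", "The dog plays in the yard."));    // [4]
        System.out.println(starts("\\bdog\\b", "The doggie plays in the yard.")); // []
        System.out.println(starts("\\bdog\\B", "The doggie plays in the yard.")); // [4]
        System.out.println(starts("dog", "dog dog"));                             // [0, 4]
        // \G anchors each attempt to the end of the previous match.
        System.out.println(starts("\\Gdog", "dog dog"));                          // [0]
    }
}
```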

Java Pattern Class

Flags

Constant                     Equivalent Embedded Flag Expression
Pattern.CANON_EQ             None
Pattern.CASE_INSENSITIVE     (?i)
Pattern.COMMENTS             (?x)
Pattern.MULTILINE            (?m)
Pattern.DOTALL               (?s)
Pattern.LITERAL              None
Pattern.UNICODE_CASE         (?u)
Pattern.UNIX_LINES           (?d)
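
A flag can be passed as the second argument of Pattern.compile or embedded at the start of the expression, and several constants can be combined with the bitwise OR operator. A minimal sketch (FlagDemo and found are hypothetical names):

```java
import java.util.regex.Pattern;

class FlagDemo {

    // Hypothetical helper: true if regex, compiled with flags, occurs anywhere in input.
    static boolean found(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "The DOG barks";
        System.out.println(found("dog", 0, input));                        // false
        System.out.println(found("dog", Pattern.CASE_INSENSITIVE, input)); // true
        System.out.println(found("(?i)dog", 0, input));                    // true

        // With MULTILINE, ^ and $ also match at line breaks inside the input.
        String lines = "cat\nDOG\nbird";
        int both = Pattern.CASE_INSENSITIVE | Pattern.MULTILINE;
        System.out.println(found("^dog$", Pattern.CASE_INSENSITIVE, lines)); // false
        System.out.println(found("^dog$", both, lines));                     // true
    }
}
```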

Using the `matches(String,CharSequence)` Method

The Pattern class defines a convenient matches method that lets you quickly check whether an entire input string matches a pattern. As with all public static methods, you should invoke matches by its class name, such as Pattern.matches("\\d","1");. In this example, the method returns true, because the digit "1" matches the regular expression \d.
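
Note that Pattern.matches succeeds only when the whole input matches, not when the pattern merely occurs somewhere inside it. A brief sketch of the difference (MatchesDemo and wholeMatch are illustrative names):

```java
import java.util.regex.Pattern;

class MatchesDemo {

    // Hypothetical wrapper around Pattern.matches, kept only so the calls are easy to reuse.
    static boolean wholeMatch(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(wholeMatch("\\d", "1"));   // true
        System.out.println(wholeMatch("\\d", "12"));  // false: one digit does not cover "12"
        System.out.println(wholeMatch("\\d+", "12")); // true
        System.out.println(wholeMatch("\\d", "a1"));  // false: "a" is not part of a match
    }
}
```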

Using the `split(String)` Method

import java.util.regex.Pattern;

public class SplitDemo {

    private static final String REGEX = ":";
    private static final String INPUT = "one:two:three:four:five";

    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        // Split the input at every match of the pattern.
        String[] items = p.split(INPUT);
        for (String s : items) {
            System.out.println(s);
        }
    }
}

OUTPUT:
one
two
three
four
five

Another demo, this time splitting on any digit:

import java.util.regex.Pattern;

public class SplitDemo2 {

    private static final String REGEX = "\\d";
    private static final String INPUT = "one9two4three7four1five";

    public static void main(String[] args) {
        Pattern p = Pattern.compile(REGEX);
        // Every single digit acts as a separator.
        String[] items = p.split(INPUT);
        for (String s : items) {
            System.out.println(s);
        }
    }
}

OUTPUT:
one
two
three
four
five

Index Methods

Index methods provide useful index values that show precisely where the match was found in the input string:

public int start(): Returns the start index of the previous match.
public int start(int group): Returns the start index of the subsequence captured by the given group during the previous match operation.
public int end(): Returns the offset after the last character matched.
public int end(int group): Returns the offset after the last character of the subsequence captured by the given group during the previous match operation.
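
As an illustration of the group-specific overloads, the sketch below reports the start and end offsets of each group for one match. IndexDemo and span are hypothetical names, and the input string is made up for the example.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IndexDemo {

    // Hypothetical helper: {start, end} of the given group in the first match,
    // or null when the pattern is not found.
    static int[] span(String regex, String input, int group) {
        Matcher m = Pattern.compile(regex).matcher(input);
        if (!m.find()) {
            return null;
        }
        return new int[] { m.start(group), m.end(group) };
    }

    public static void main(String[] args) {
        String regex = "(\\d{4})-(\\d{2})-(\\d{2})";
        String input = "Date: 2008-12-31";
        System.out.println(Arrays.toString(span(regex, input, 0))); // [6, 16]
        System.out.println(Arrays.toString(span(regex, input, 1))); // [6, 10]
        System.out.println(Arrays.toString(span(regex, input, 2))); // [11, 13]
        System.out.println(Arrays.toString(span(regex, input, 3))); // [14, 16]
    }
}
```

The end offset is exclusive, so input.substring(start, end) returns exactly the text of the group.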

Study Methods

Study methods review the input string and return a boolean indicating whether or not the pattern is found.

public boolean lookingAt(): Attempts to match the input sequence, starting at the beginning of the region, against the pattern.
public boolean find(): Attempts to find the next subsequence of the input sequence that matches the pattern.
public boolean find(int start): Resets this matcher and then attempts to find the next subsequence of the input sequence that matches the pattern, starting at the specified index.
public boolean matches(): Attempts to match the entire region against the pattern.
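
The four methods differ mainly in where a match may begin and whether it must reach the end of the input. A small comparison, written as a sketch (StudyDemo and its helpers are names chosen for this example):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StudyDemo {

    // Must match at the start of the input, but need not consume all of it.
    static boolean lookingAt(String regex, String input) {
        return Pattern.compile(regex).matcher(input).lookingAt();
    }

    // Must match the entire input.
    static boolean matchesAll(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // May match anywhere in the input.
    static boolean findAny(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    // Start index of the first match at or after 'from', or -1 if there is none.
    static int findFrom(String regex, String input, int from) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find(from) ? m.start() : -1;
    }

    public static void main(String[] args) {
        System.out.println(lookingAt("foo", "fooooo"));   // true
        System.out.println(matchesAll("foo", "fooooo"));  // false
        System.out.println(lookingAt("foo", "barfoo"));   // false
        System.out.println(findAny("foo", "barfoo"));     // true
        System.out.println(findFrom("foo", "foofoo", 1)); // 3
    }
}
```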

Replacement Methods

Replacement methods are useful methods for replacing text in an input string.

public Matcher appendReplacement(StringBuffer sb, String replacement): Implements a non-terminal append-and-replace step.
public StringBuffer appendTail(StringBuffer sb): Implements a terminal append-and-replace step.
public String replaceAll(String replacement): Replaces every subsequence of the input sequence that matches the pattern with the given replacement string.
public String replaceFirst(String replacement): Replaces the first subsequence of the input sequence that matches the pattern with the given replacement string.
public static String quoteReplacement(String s): Returns a literal replacement String for the specified String. This method produces a String that will work as a literal replacement s in the appendReplacement method of the Matcher class. The String produced will match the sequence of characters in s treated as a literal sequence. Backslashes ('\') and dollar signs ('$') will be given no special meaning.
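
appendReplacement and appendTail are typically used together in a find loop when each replacement has to be computed from the match, while quoteReplacement protects replacement text that contains $ or \. The following is a sketch of both uses; ReplaceDemo, doubleNumbers, and replaceLiteral are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplaceDemo {

    // Replaces every run of digits with twice its numeric value.
    static String doubleNumbers(String input) {
        Matcher m = Pattern.compile("\\d+").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            int doubled = Integer.parseInt(m.group()) * 2;
            // Copies the text before the match, then the computed replacement.
            m.appendReplacement(sb, Integer.toString(doubled));
        }
        // Copies whatever follows the last match.
        m.appendTail(sb);
        return sb.toString();
    }

    // Replaces every match of regex with 'literal' taken verbatim.
    static String replaceLiteral(String input, String regex, String literal) {
        // Without quoting, "$5" would be read as a reference to group 5.
        return Pattern.compile(regex).matcher(input)
                      .replaceAll(Matcher.quoteReplacement(literal));
    }

    public static void main(String[] args) {
        System.out.println(doubleNumbers("3 apples and 12 pears"));       // 6 apples and 24 pears
        System.out.println(replaceLiteral("cost: PRICE", "PRICE", "$5")); // cost: $5
        System.out.println(Pattern.compile("-").matcher("a-b-c").replaceFirst("+")); // a+b-c
    }
}
```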