Thinking in Java-36: Regular expressions

Source: Internet | Publisher: API data interface | Editor: 程序博客网 | Time: 2024/05/19 23:17

1. What is a regular expression?

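Parts of this post are adapted from an external source; see that source for the full details.
A regular expression defines a search pattern for strings.
The search pattern can be a single character, a fixed string, or a complex expression containing special characters that describe the pattern.
In other words: if a string fits the pattern described by the regular expression, it matches what we are looking for. ("If a string has these things in it, then it matches what I'm looking for.")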

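2. Common regular expression rules

This section gives an overview of the available metacharacters. Regular expressions are not specific to Java; most other languages, such as Perl and Groovy, support them as well, although the details of that support differ slightly from language to language. The rules in this section are general and largely language-independent.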

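2.1. Common matching symbols

Regular expression    Description
.                     Matches any single character (by default, any character except a line terminator)
^regex                Finds regex only where it matches at the beginning of a line
regex$                Finds regex only where it matches at the end of a line
[abc]                 Set definition: matches the letter a, b or c
[abc][vz]             Set definition: matches a, b or c, followed by either v or z
[^abc]                When the caret ^ is the first character inside the square brackets, it negates the set: matches any character except a, b or c
[a-d1-7]              Ranges: matches one character that is a letter from a to d or a digit from 1 to 7 (it does not match the two characters d1)
X|Z                   Matches either X or Z
XZ                    Matches X directly followed by Z
$                     Checks whether a line end follows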
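The symbols in the table can be tried out with a few lines of Java. The following is only a minimal sketch: the class name BasicSymbolsDemo and the helpers matchesAll and found are made up for this illustration. It relies on String.matches, which requires the whole input to match, and on Matcher.find, which searches for the pattern anywhere in the input.

```java
import java.util.regex.Pattern;

class BasicSymbolsDemo {
    // True if the ENTIRE input matches the regex (String.matches semantics).
    static boolean matchesAll(String regex, String input) {
        return input.matches(regex);
    }

    // True if the regex is found ANYWHERE in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesAll(".", "x"));           // true
        System.out.println(matchesAll("[abc][vz]", "bz"));  // true
        System.out.println(matchesAll("[abc][vz]", "dz"));  // false
        System.out.println(matchesAll("[^abc]", "d"));      // true
        System.out.println(matchesAll("[a-d1-7]", "5"));    // true
        System.out.println(matchesAll("[a-d1-7]", "d1"));   // false: a set matches one character
        System.out.println(matchesAll("X|Z", "Z"));         // true
        System.out.println(found("^The", "The end"));       // true
        System.out.println(found("^end", "The end"));       // false
        System.out.println(found("end$", "The end"));       // true
    }
}
```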

2.2 Metacharacters

The following metacharacters have a predefined meaning. They make common patterns easier to write, for example \d instead of [0-9].

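Regular expression    Description
\d                    Any digit, short for [0-9]
\D                    Any non-digit, short for [^0-9]
\s                    A whitespace character, short for [ \t\n\x0b\r\f]
\S                    A non-whitespace character, short for [^\s]
\w                    A word character, short for [a-zA-Z_0-9]
\W                    A non-word character, short for [^\w]
\S+                   One or more non-whitespace characters
\b                    Matches a word boundary, where a word character is [a-zA-Z0-9_]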
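To see how these metacharacters behave, it helps to count how often each one is found in a sample string. This is an illustrative sketch only; the class MetaCharDemo, the helper count and the sample strings are assumptions introduced here, not part of the original post.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MetaCharDemo {
    // Counts how many non-overlapping matches of regex occur in input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String text = "Order 66 shipped to dock_7 at 09:30";
        System.out.println(count("\\d", text));    // 7 single digits
        System.out.println(count("\\d+", text));   // 4 runs of digits: 66, 7, 09, 30
        System.out.println(count("\\S+", text));   // 7 whitespace-separated tokens

        String phrase = "to stop or not to stop";
        System.out.println(count("to", phrase));        // 4: also counts the "to" inside "stop"
        System.out.println(count("\\bto\\b", phrase));  // 2: only the standalone word "to"
    }
}
```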

2.3 Quantifiers

A quantifier defines how often an element may occur. The symbols ?, *, + and {} specify how many times the preceding element is allowed to appear.

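Regular expression    Description                                        Examples
*                     Occurs zero or more times, short for {0,}          X* matches zero or more letters X; .* matches any character sequence
+                     Occurs one or more times, short for {1,}           X+ matches one or more letters X
?                     Occurs zero times or once, short for {0,1}         X? matches no X or exactly one X
{X}                   Occurs exactly X times                             \d{3} finds three digits; .{10} finds any character sequence of length 10
{X,Y}                 Occurs between X and Y times (inclusive)           \d{1,4} means a digit occurs at least once and at most four times
*?                    A ? after a quantifier makes it a reluctant (lazy) quantifier: it tries to find the smallest match and stops as soon as it has found one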
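The difference between a greedy quantifier and its reluctant form is easiest to see on a concrete input. The sketch below is for illustration only; the class QuantifierDemo and the helper firstMatch are hypothetical names chosen for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Returns the first substring matched by regex, or null if there is no match.
    static String firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String html = "<b>bold</b>";
        System.out.println(firstMatch("<.*>", html));    // <b>bold</b>  (greedy: as much as possible)
        System.out.println(firstMatch("<.*?>", html));   // <b>          (reluctant: as little as possible)

        System.out.println(firstMatch("\\d{3}", "tel 12345"));    // 123
        System.out.println(firstMatch("\\d{1,4}", "tel 12345"));  // 1234
        System.out.println(firstMatch("colou?r", "color"));       // color
    }
}
```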

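2.4 Grouping and back-references

Parts of a regular expression can be grouped. The syntax is simple: parentheses () define a group. In a replacement string, a group is referenced with $, for example $0, $1, $2, where $0 stands for the text matched by the whole regular expression, $1 for the content of the first pair of parentheses, and so on.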

package com.fqy.blog;

import org.junit.Test;

public class StrRegex {
    public static final String EXAMPLE_TEST = "The extra space is for testing removal  , hha .";

    @Test
    public void testBackRef() {
        // Removes whitespace between a word character and . or ,
        String pattern = "(\\w)(\\s+)([\\.,])";
        System.out.println(EXAMPLE_TEST.replaceAll(pattern, "$1$3"));
        // Extract the text between the opening and closing title elements
        String titleStr = "<Title title1 in> Header</title>";
        pattern = "(?i)(<title.*?>)(.+?)(</title>)";
        String updated = titleStr.replaceAll(pattern, "$2");
        System.out.println(updated);
    }
}

//Running result:
The extra space is for testing removal, hha.
 Header
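Groups can also be referenced inside the pattern itself: \1 (written "\\1" in Java source) matches the same text that the first group captured. As a small sketch of that idea, the hypothetical helper dedupe below collapses a word that is immediately repeated; the class name and sample sentence are assumptions made for this illustration.

```java
class BackReferenceDemo {
    // Collapses an immediately repeated word ("is is", "test test test")
    // into a single occurrence. \\1 refers back to the text captured by group 1.
    static String dedupe(String input) {
        return input.replaceAll("\\b(\\w+)(\\s+\\1\\b)+", "$1");
    }

    public static void main(String[] args) {
        System.out.println(dedupe("This is is a test test test."));
        // prints: This is a test.
    }
}
```

The \b anchors keep the pattern from treating the "is" inside "This" as the start of a repeated word.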

2.5 Negative lookahead

A negative lookahead is written as (?!pattern). For example, the following expression matches 'a' only if the 'a' is not followed by 'b'.

a(?!b)
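A lookahead only checks what follows; it does not consume any characters, so the character after 'a' is left untouched. A minimal sketch (the class LookaheadDemo and the helper mark are names invented for this example):

```java
class LookaheadDemo {
    // Replaces every 'a' that is NOT directly followed by 'b' with '#'.
    // The lookahead (?!b) checks the next character without consuming it.
    static String mark(String input) {
        return input.replaceAll("a(?!b)", "#");
    }

    public static void main(String[] args) {
        System.out.println(mark("ab ac a"));  // ab #c #
        System.out.println(mark("banana"));   // b#n#n#
    }
}
```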

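2.6 Specifying modes inside the regular expression

Mode modifiers can be added at the beginning of a regular expression. Several modes can be specified at once by combining them in one group, e.g. (?ismx).
- (?i): makes the regular expression case-insensitive.
- (?s): single-line mode, makes '.' match every character, including line terminators.
- (?m): multi-line mode, makes '^' and '$' match at the beginning and end of every line in the string, not only at the beginning and end of the whole input.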
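The effect of the three modifiers above can be checked with a short program. This is a sketch under the assumption that embedded flags are used rather than the equivalent Pattern.compile flag constants; the class InlineFlagsDemo and the helper countMatches are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InlineFlagsDemo {
    // Counts how many times regex is found in input.
    static int countMatches(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // (?i): case-insensitive matching
        System.out.println("HELLO".matches("hello"));      // false
        System.out.println("HELLO".matches("(?i)hello"));  // true

        // (?s): '.' also matches the line terminator
        System.out.println("a\nb".matches("a.b"));         // false
        System.out.println("a\nb".matches("(?s)a.b"));     // true

        // (?m): '^' matches at the start of every line
        String lines = "x1\ny2\nx3";
        System.out.println(countMatches("^x", lines));      // 1
        System.out.println(countMatches("(?m)^x", lines));  // 2
    }
}
```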

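2.7 About '\'

In a Java string literal the backslash \ is an escape character, which means '\' already has a predefined meaning in Java source code. To put a single \ into a regular expression, we have to write \\ in the string literal. For example, to use the regex \w we must write "\\w". To match a literal backslash, the regex itself must be \\ (because \ is also the escape character of the regex syntax), so the Java string literal becomes "\\\\".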
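The two levels of escaping (Java string literal first, regex syntax second) can be seen in a small example. The class BackslashDemo, the helper splitOnBackslash and the sample path are assumptions used only for illustration.

```java
import java.util.Arrays;

class BackslashDemo {
    // Splits a Windows-style path on the backslash character.
    // The regex is \\ (one escaped backslash), which needs the literal "\\\\".
    static String[] splitOnBackslash(String path) {
        return path.split("\\\\");
    }

    public static void main(String[] args) {
        // The literal "\\w\\d" is the regex \w\d: one word character, one digit.
        System.out.println("a1".matches("\\w\\d"));          // true

        // "\\." is the regex \. and matches a literal dot only.
        System.out.println("3.14".matches("\\d\\.\\d+"));    // true
        System.out.println("3x14".matches("\\d\\.\\d+"));    // false

        // The literal "C:\\temp\\file.txt" is the string C:\temp\file.txt
        System.out.println(Arrays.toString(splitOnBackslash("C:\\temp\\file.txt")));
        // prints: [C:, temp, file.txt]
    }
}
```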

3. Regular expression support in the String class

boolean java.lang.String.matches(String regex)
String[] java.lang.String.split(String regex)
String java.lang.String.replaceFirst(String regex, String replacement)
String java.lang.String.replaceAll(String regex, String replacement)


package com.fqy.blog;

import org.junit.Test;

public class StrRegex {
    @Test
    public void testBackRef() {
        // Removes whitespace between a word character and . or ,
        final String EXAMPLE_TEST = "The extra space is for testing removal  , hha .";
        String pattern = "(\\w)(\\s+)([\\.,])";
        System.out.println(EXAMPLE_TEST.replaceAll(pattern, "$1$3"));
        // Extract the text between the opening and closing title elements
        String titleStr = "<Title title1 in> Header</title>";
        pattern = "(?i)(<title.*?>)(.+?)(</title>)";
        String updated = titleStr.replaceAll(pattern, "$2");
        System.out.println(updated);
    }

    @Test
    public void strRegTest() {
        String EXAMPLE_TEST = "This is my small example " + "string which I'm going to " + "use for pattern matching.";
        System.out.println(EXAMPLE_TEST.matches("\\w.*")); // true
        String[] splitString = (EXAMPLE_TEST.split("\\s+"));
        System.out.println(splitString.length); // should be 14
        for (String string : splitString) {
            System.out.print(string + " ");
        }
        System.out.println();
        // replace all whitespace with tabs
        System.out.println(EXAMPLE_TEST.replaceAll("\\s+", "\t"));
    }
}

//Running result:
true
14
This is my small example string which I'm going to use for pattern matching.
This	is	my	small	example	string	which	I'm	going	to	use	for	pattern	matching.
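The listing above does not exercise replaceFirst. The following sketch contrasts it with replaceAll and split on the same regex; the class StringRegexMethodsDemo and the sample input are made up for this example.

```java
import java.util.Arrays;

class StringRegexMethodsDemo {
    public static void main(String[] args) {
        String input = "a1b22c333";

        // replaceFirst replaces only the first match of the regex.
        System.out.println(input.replaceFirst("\\d+", "#"));  // a#b22c333

        // replaceAll replaces every match.
        System.out.println(input.replaceAll("\\d+", "#"));    // a#b#c#

        // split uses the matches as separators; trailing empty strings are dropped.
        System.out.println(Arrays.toString(input.split("\\d+")));  // [a, b, c]
    }
}
```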

Some more examples:

// StringMatcher.java
public class StringMatcher {
    // returns true if the string matches exactly "true"
    public boolean isTrue(String s) {
        return s.matches("true");
    }

    // returns true if the string matches exactly "true" or "True"
    public boolean isTrueVersion2(String s) {
        return s.matches("[tT]rue");
    }

    // returns true if the string matches exactly "true" or "True"
    // or "yes" or "Yes"
    public boolean isTrueOrYes(String s) {
        return s.matches("[tT]rue|[yY]es");
    }

    // returns true if the string contains "true" anywhere
    public boolean containsTrue(String s) {
        return s.matches(".*true.*");
    }

    // returns true if the string consists of exactly three letters
    public boolean isThreeLetters(String s) {
        return s.matches("[a-zA-Z]{3}");
        // longer equivalent form:
        // return s.matches("[a-zA-Z][a-zA-Z][a-zA-Z]");
    }

    // returns true if the string does not have a number at the beginning
    public boolean isNoNumberAtBeginning(String s) {
        return s.matches("^[^\\d].*");
    }

    // returns true if the string consists of an arbitrary number of
    // word characters, none of which is b
    public boolean isIntersection(String s) {
        return s.matches("([\\w&&[^b]])*");
    }

    // returns true if the string contains a number less than 300
    public boolean isLessThenThreeHundred(String s) {
        return s.matches("[^0-9]*[12]?[0-9]{1,2}[^0-9]*");
    }
}

// StringMatcherTest.java
import org.junit.Before;
import org.junit.Test;
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

public class StringMatcherTest {
    private StringMatcher m;

    @Before
    public void setup() {
        m = new StringMatcher();
    }

    @Test
    public void testIsTrue() {
        assertTrue(m.isTrue("true"));
        assertFalse(m.isTrue("true2"));
        assertFalse(m.isTrue("True"));
    }

    @Test
    public void testIsTrueVersion2() {
        assertTrue(m.isTrueVersion2("true"));
        assertFalse(m.isTrueVersion2("true2"));
        assertTrue(m.isTrueVersion2("True"));
    }

    @Test
    public void testIsTrueOrYes() {
        assertTrue(m.isTrueOrYes("true"));
        assertTrue(m.isTrueOrYes("yes"));
        assertTrue(m.isTrueOrYes("Yes"));
        assertFalse(m.isTrueOrYes("no"));
    }

    @Test
    public void testContainsTrue() {
        assertTrue(m.containsTrue("thetruewithin"));
    }

    @Test
    public void testIsThreeLetters() {
        assertTrue(m.isThreeLetters("abc"));
        assertFalse(m.isThreeLetters("abcd"));
    }

    @Test
    public void testisNoNumberAtBeginning() {
        assertTrue(m.isNoNumberAtBeginning("abc"));
        assertFalse(m.isNoNumberAtBeginning("1abcd"));
        assertTrue(m.isNoNumberAtBeginning("a1bcd"));
        assertTrue(m.isNoNumberAtBeginning("asdfdsf"));
    }

    @Test
    public void testisIntersection() {
        assertTrue(m.isIntersection("1"));
        assertFalse(m.isIntersection("abcksdfkdskfsdfdsf"));
        assertTrue(m.isIntersection("skdskfjsmcnxmvjwque484242"));
    }

    @Test
    public void testLessThenThreeHundred() {
        assertTrue(m.isLessThenThreeHundred("288"));
        assertFalse(m.isLessThenThreeHundred("3288"));
        assertFalse(m.isLessThenThreeHundred("328 8"));
        assertTrue(m.isLessThenThreeHundred("1"));
        assertTrue(m.isLessThenThreeHundred("99"));
        assertFalse(m.isLessThenThreeHundred("300"));
    }
}

4. The String methods above are not optimized for performance; for that, Java provides Pattern & Matcher

java.util.regex.Pattern
java.util.regex.Matcher

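Pattern: the Pattern class defines (compiles) the regular expression.
Matcher: from a Pattern object we can create a Matcher object for a given string; the Matcher object is then used to perform the regular expression operations on that string.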

package com.fqy.blog;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.junit.Test;

public class StrRegex {
    @Test
    public void strPatternMatcher() {
        String EXAMPLE_TEST = "This is my small example string which I'm going to use for pattern matching.";
        Pattern pattern = Pattern.compile("\\w+");
        // in case you would like to ignore case sensitivity,
        // you could use this statement:
        // Pattern pattern = Pattern.compile("\\s+", Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(EXAMPLE_TEST);
        // check all occurrences
        while (matcher.find()) {
            System.out.print("Start index: " + matcher.start());
            System.out.print(" End index: " + matcher.end() + " ");
            System.out.println(matcher.group());
        }
        // now create a new pattern and matcher to replace whitespace with tabs
        Pattern replace = Pattern.compile("\\s+");
        Matcher matcher2 = replace.matcher(EXAMPLE_TEST);
        System.out.println(matcher2.replaceAll("\t"));
    }
}

//Running results:
Start index: 0 End index: 4 This
Start index: 5 End index: 7 is
Start index: 8 End index: 10 my
Start index: 11 End index: 16 small
Start index: 17 End index: 24 example
Start index: 25 End index: 31 string
Start index: 32 End index: 37 which
Start index: 38 End index: 39 I
Start index: 40 End index: 41 m
Start index: 42 End index: 47 going
Start index: 48 End index: 50 to
Start index: 51 End index: 54 use
Start index: 55 End index: 58 for
Start index: 59 End index: 66 pattern
Start index: 67 End index: 75 matching
This	is	my	small	example	string	which	I'm	going	to	use	for	pattern	matching.
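One common way to benefit from Pattern is to compile the expression once and reuse it, instead of passing the regex string to a String method on every call. The sketch below illustrates that idea; the class PrecompiledPatternDemo and the helper normalize are hypothetical names, and no performance figures are claimed here.

```java
import java.util.regex.Pattern;

class PrecompiledPatternDemo {
    // Compiled once and reused for every call. Pattern instances are
    // immutable and safe to share between threads; Matcher instances are not.
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    // Trims the input and collapses every run of whitespace into one space.
    static String normalize(String input) {
        return WHITESPACE.matcher(input.trim()).replaceAll(" ");
    }

    public static void main(String[] args) {
        System.out.println(normalize("  a \t b\n c "));  // a b c
    }
}
```

Each call creates a new, cheap Matcher from the shared Pattern, so the regex is not recompiled.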
