Note1: Basic Text Processing

Source: Internet. Editor: 程序博客网. Time: 2024/05/18 00:26

Regular Expressions: Disjunctions

Online regular expression test

1. Letters inside square brackets []

Pattern           Matches
[wW]oodchuck      Woodchuck, woodchuck
[1234567890]      Any digit
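
As a minimal sketch of the bracket patterns above, using Python's standard re module (the sample sentence is made up for illustration and is not from the notes):

```python
import re

# Made-up sample text, for illustration only.
text = "The Woodchuck met another woodchuck in 2024."

# [wW] matches exactly one character: either 'w' or 'W'.
woodchucks = re.findall(r"[wW]oodchuck", text)

# [1234567890] matches any single digit, one character per match.
digits = re.findall(r"[1234567890]", text)

print(woodchucks)  # ['Woodchuck', 'woodchuck']
print(digits)      # ['2', '0', '2', '4']
```

Note that a bracket expression always consumes a single character, which is why the year comes back as four separate matches.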


2. Ranges [A-Z]

Pattern    Matches                 Example
[A-Z]      An upper case letter    Drenched Blossoms
[a-z]      A lower case letter     my beans were impatient
[0-9]      A single digit          Chapter 1: Down the Rabbit Hole
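
The range patterns can be tried on the example strings above. This sketch assumes Python's re module and uses re.search, which returns the first match in each string:

```python
import re

# (pattern, example string) pairs taken from the table above.
examples = [
    (r"[A-Z]", "Drenched Blossoms"),
    (r"[a-z]", "my beans were impatient"),
    (r"[0-9]", "Chapter 1: Down the Rabbit Hole"),
]

# re.search scans left to right and stops at the first matching character.
first_matches = [re.search(pattern, s).group() for pattern, s in examples]

print(first_matches)  # ['D', 'm', '1']
```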


3. Negations [^Ss]

    A caret means negation only when it is the first character inside [].

Pattern    Matches                     Example
[^A-Z]     Not an upper case letter    Oyfn pripetchik
[^Ss]      Neither ‘S’ nor ‘s’         “I have no exquisite reason”
[^e^]      Neither e nor ^             Look here
a^b        The pattern a caret b       Look up a^b now
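
A short sketch of the negation patterns, assuming Python's re module. One caveat specific to that module: outside brackets an unescaped ^ is still treated as an anchor, so the literal "a caret b" pattern is written here as a\^b. Other regex flavors may behave differently.

```python
import re

# A leading ^ inside brackets negates the set.
not_upper = re.search(r"[^A-Z]", "Oyfn pripetchik").group()
not_s = re.search(r"[^Ss]", "I have no exquisite reason").group()

# A ^ that is not first inside brackets is a literal caret.
not_e_or_caret = re.findall(r"[^e^]", "e^x")

# In Python's re, the caret outside brackets must be escaped to be literal.
literal = re.search(r"a\^b", "Look up a^b now").group()
unescaped = re.search(r"a^b", "Look up a^b now")

print(not_upper)       # y
print(not_s)           # I
print(not_e_or_caret)  # ['x']
print(literal)         # a^b
print(unescaped)       # None
```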


4. The pipe | for disjunction

Pattern                      Matches
groundhog|woodchuck          groundhog, woodchuck
yours|mine                   yours, mine
a|b|c                        Same as [abc]
[gG]roundhog|[Ww]oodchuck    Groundhog, groundhog, Woodchuck, woodchuck
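
The pipe patterns above can be checked with a small sketch, again assuming Python's re module and a made-up sample sentence:

```python
import re

# Made-up sample text, for illustration only.
text = "The groundhog, also called a Woodchuck, is yours, not mine."

# The plain disjunction is case-sensitive, so it misses "Woodchuck".
plain = re.findall(r"groundhog|woodchuck", text)

# Combining brackets with the pipe covers both capitalizations.
either_case = re.findall(r"[gG]roundhog|[Ww]oodchuck", text)

owners = re.findall(r"yours|mine", text)

# For single characters, a|b|c and [abc] find the same matches.
same = re.findall(r"a|b|c", "cab") == re.findall(r"[abc]", "cab")

print(plain)        # ['groundhog']
print(either_case)  # ['groundhog', 'Woodchuck']
print(owners)       # ['yours', 'mine']
print(same)         # True
```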



Regular Expressions: ? * + .

Pattern    Matches                          Examples
colou?r    Optional previous char           color  colour
oo*h!      0 or more of previous char       oh!  ooh!  oooh!  ooooh!
o+h!       1 or more of previous char       oh!  ooh!  oooh!  ooooh!
baa+       1 or more of previous char       baa  baaa  baaaa  baaaaa
beg.n      Any single char in place of .    begin  begun  beg3n
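
A minimal sketch of the four operators, assuming Python's re module. The test strings include a few deliberate non-matches ("h!", "ba", "begn") that are not from the notes:

```python
import re

# ? makes the previous character optional.
optional = re.findall(r"colou?r", "color colour")

# oo*h! and o+h! both require at least one 'o' before "h!".
star = re.findall(r"oo*h!", "h! oh! ooh! oooh!")
plus = re.findall(r"o+h!", "h! oh! ooh! oooh!")

# baa+ needs "ba" followed by one or more further 'a' characters.
sheep = re.findall(r"baa+", "ba baa baaa")

# . stands for exactly one character, so "begn" does not match.
dot = re.findall(r"beg.n", "begin begun beg3n begn")

print(optional)  # ['color', 'colour']
print(star)      # ['oh!', 'ooh!', 'oooh!']
print(plus)      # ['oh!', 'ooh!', 'oooh!']
print(sheep)     # ['baa', 'baaa']
print(dot)       # ['begin', 'begun', 'beg3n']
```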


Anchors: ^ $

^ : the match must begin at the start of the line

$ : the match must finish at the end of the line

Pattern       Matches
^[A-Z]        Palo Alto
^[^A-Za-z]    1    “Hello”
\.$           The end.
.$            The end?  The end!
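
The anchors can be sketched as follows, assuming Python's re module (where, without extra flags, ^ and $ refer to the start and end of the whole string). The strings are close to the examples above, with straight quotes used for simplicity:

```python
import re

# ^[A-Z]: the string must start with an upper case letter.
starts_upper = re.search(r"^[A-Z]", "Palo Alto").group()

# ^[^A-Za-z]: the string must start with a non-letter (digit, quote, ...).
starts_digit = re.search(r"^[^A-Za-z]", "1").group()
starts_quote = re.search(r"^[^A-Za-z]", '"Hello"').group()

# \.$ requires a literal period at the end; .$ accepts any final character.
ends_period = re.search(r"\.$", "The end.").group()
no_period = re.search(r"\.$", "The end?")
ends_any = [re.search(r".$", s).group() for s in ("The end?", "The end!")]

print(starts_upper)  # P
print(starts_digit)  # 1
print(starts_quote)  # "
print(ends_period)   # .
print(no_period)     # None
print(ends_any)      # ['?', '!']
```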


Two kinds of errors:

Type I (false positives): matching strings that we should not have matched

Type II (false negatives): not matching strings that we should have matched

Two antagonistic efforts:

Increasing accuracy or precision (minimizing false positives)

Increasing coverage or recall (minimizing false negatives)
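
To make the two error types concrete, here is a hedged sketch of a hypothetical task: finding every occurrence of the word "the" in a made-up sentence. The task, the sentence, and the use of the \b word-boundary anchor (a Python re feature not covered in the notes above) are all assumptions added for illustration.

```python
import re

# Made-up sample text; the goal is to find the word "the" in any capitalization.
text = "The other day the cat sat there, then left."

# Misses "The" (false negative) and also hits "other", "there", "then"
# (false positives).
narrow = re.findall(r"the", text)

# Fixes the false negative, but the three false positives remain.
wider = re.findall(r"[tT]he", text)

# \b requires a word boundary on each side, removing the false positives.
precise = re.findall(r"\b[tT]he\b", text)

print(len(narrow))  # 4
print(len(wider))   # 5
print(precise)      # ['The', 'the']
```

Widening the pattern improves recall and tightening it improves precision. The final pattern does both for this one sentence, which will not always be possible on real text.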


Word Tokenization





