regex(compilation flag, sub(), non-greedy match)


Reposted from: http://www.thegeekstuff.com/2014/07/python-regex-examples/

6. Group by Number Using match.group

As I mentioned earlier, match objects come in very handy when working with grouping.

Grouping is the ability to address certain sub-parts of the entire regex match. We can define a group as a piece of the regular expression search string, and then individually address the corresponding content that was matched by this piece.

Let’s look at an example to see how this works:

  >>> import re
  >>> contactInfo = 'Doe, John: 555-1212'

The string I just created resembles a snippet taken out of someone's address book. We can match the line with a regular expression like this one:

  >>> re.search(r'\w+, \w+: \S+', contactInfo)
  <_sre.SRE_Match object at 0xb74e1ad8>

By surrounding certain parts of the regular expression in parentheses (the ‘(’ and ‘)’ characters), we can group the content and then work with these individual groups.

  >>> match = re.search(r'(\w+), (\w+): (\S+)', contactInfo)

These groups can be fetched using the match object’s group() method. The groups are addressable numerically in the order that they appear, from left to right, in the regular expression (starting with group 1):

  >>> match.group(1)
  'Doe'
  >>> match.group(2)
  'John'
  >>> match.group(3)
  '555-1212'

Group numbering starts at 1 because group 0 is reserved to hold the entire match (we saw this earlier when we were learning about the match() and search() methods):

  >>> match.group(0)
  'Doe, John: 555-1212'
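The match object offers a few related accessors beyond group(N). The following is a minimal sketch using the same contact string; the variable names (contact_info, all_groups, and so on) are my own and are not part of the original example.

```python
import re

contact_info = 'Doe, John: 555-1212'
match = re.search(r'(\w+), (\w+): (\S+)', contact_info)

# search() returns None when nothing matches, so guard before using the match.
if match is not None:
    all_groups = match.groups()          # every numbered group, as a tuple
    first_then_last = match.group(2, 1)  # several groups in one call
    phone_span = match.span(3)           # (start, end) offsets of group 3
    print(all_groups)
    print(first_then_last)
    print(phone_span)
```

Checking for None first matters in real code, because calling group() on a failed search raises an AttributeError.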

7. Grouping by Name Using match.group

Sometimes, especially when a regular expression has a lot of groups, it is impractical to address each group by its number. Python also allows you to assign a name to a group using the following syntax:

  >>> match = re.search(r'(?P<last>\w+), (?P<first>\w+): (?P<phone>\S+)', contactInfo)

We can still fetch the grouped content using the group() method, but this time we specify the names we assigned to the groups instead of the numbers we used before:

  >>> match.group('last')
  'Doe'
  >>> match.group('first')
  'John'
  >>> match.group('phone')
  '555-1212'

This makes for much more explicit and readable code. You can imagine that as the regular expression became more and more complicated, understanding what was being captured by a grouping would get harder and harder. Assigning names to your groups explicitly informs you and your readers of your intentions.
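When a pattern names all of its groups, it can be convenient to pull them out as a dictionary in one step. Here is a small sketch of that idea using groupdict(); the names contact_info, pattern, and fields are illustrative only.

```python
import re

contact_info = 'Doe, John: 555-1212'
pattern = re.compile(r'(?P<last>\w+), (?P<first>\w+): (?P<phone>\S+)')
match = pattern.search(contact_info)

# groupdict() maps each group name to the text it captured.
fields = match.groupdict() if match else {}
print(fields)

# Named groups are still numbered, so both styles address the same content.
same_content = match.group('last') == match.group(1)
print(same_content)
```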

Grouping can be used with the findall() method too, even though it doesn’t return match objects. Instead, when the pattern contains more than one group, findall() returns a list of tuples, where the Nth element of each tuple corresponds to the Nth group of the regex pattern:

  >>> re.findall(r'(\w+), (\w+): (\S+)', contactInfo)
  [('Doe', 'John', '555-1212')]

However, the group names are not available in the results of findall(). Named groups are still captured, but they come back as plain tuple positions, so you cannot look them up by name.
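If you do need the names for every match, one option is finditer(), which yields match objects instead of tuples. The sketch below assumes a two-entry address book; the second entry is made up for illustration.

```python
import re

# Hypothetical two-line address book; only the first entry is from the article.
address_book = 'Doe, John: 555-1212\nRoe, Jane: 555-3434'
pattern = re.compile(r'(?P<last>\w+), (?P<first>\w+): (?P<phone>\S+)')

# findall() gives positional tuples, names are not visible here.
tuples = pattern.findall(address_book)
print(tuples)

# finditer() yields match objects, so groupdict() and group('name') work.
records = [m.groupdict() for m in pattern.finditer(address_book)]
print(records)
```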

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

http://www.thegeekstuff.com/2014/07/advanced-python-regex/

1. Working with Multi-line Strings

There are a couple of scenarios that may arise when you are working with a multi-line string (that is, one separated by newline characters, ‘\n’). One case is that you may want to match something that spans more than one line. Consider this snippet of HTML:

  >>> paragraph = \
  ... '''
  ... <p>
  ... This is a paragraph.
  ... It has multiple lines.
  ... </p>
  ... '''
  >>>

We may want to grab the entire paragraph tag (contents and all). We might expect the following search to find it, but as we see below, it returns nothing:

  >>> re.search(r'<p>.*</p>', paragraph)
  >>>

The problem with this regular expression search is that, by default, the ‘.’ special character does not match newline characters.

There is an easy fix for this though. The re package’s query methods can optionally accept some predefined flags which modify how special characters behave.

The re.DOTALL flag tells Python to make the ‘.’ special character match all characters, including newline characters. Let’s try it out:

  >>> match = re.search(r'<p>.*</p>', paragraph, re.DOTALL)
  >>> match.group(0)
  '<p>\nThis is a paragraph.\nIt has multiple lines.\n</p>'
  >>>

Perfect, using the re.DOTALL flag, we can match patterns that span multiple lines.
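The same behavior can also be requested from inside the pattern with the inline flag (?s), which is handy when you cannot pass a flags argument. A short sketch comparing the three cases follows; the variable names are mine.

```python
import re

paragraph = '\n<p>\nThis is a paragraph.\nIt has multiple lines.\n</p>\n'

# Default: '.' stops at newlines, so the search fails and returns None.
without_flag = re.search(r'<p>.*</p>', paragraph)

# re.DOTALL (alias re.S) lets '.' match newlines as well.
with_flag = re.search(r'<p>.*</p>', paragraph, re.DOTALL)

# (?s) at the start of the pattern switches DOTALL on inline.
inline_flag = re.search(r'(?s)<p>.*</p>', paragraph)

print(without_flag)
print(with_flag.group(0) == inline_flag.group(0))
```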

Another scenario that could arise when working with multi-line strings is that we may only want to pick out lines that start or end with a certain pattern. Using our same paragraph, we might expect the following search to find the third line of text (the line ‘It has multiple lines.’). However, again, as shown below, this is not the case.

  >>> re.search(r'^It has.*', paragraph)
  >>>

By default in Python, the ‘^’ and ‘$’ special characters (these characters match the start and end of a line, respectively) only apply to the start and end of the entire string.

Thankfully, there is a flag to modify this behavior as well. The re.MULTILINE flag tells Python to make the ‘^’ and ‘$’ special characters match the start or end of any line within a string. Using this flag:

  >>> match = re.search(r'^It has.*', paragraph, re.MULTILINE)
  >>> match.group(0)
  'It has multiple lines.'
  >>>

We get the behavior we expect.

From the Python documentation, on the re.M / re.MULTILINE flag:

Usually ^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string. When this flag is specified, ^ matches at the beginning of the string and at the beginning of each line within the string, immediately following each newline. Similarly, the $ metacharacter matches both at the end of the string and at the end of each line (immediately preceding each newline).

  >>> string = '\ndas is the first\nThis is the second line\n'
  >>> re.search(r'\w+ne$', string)
  <_sre.SRE_Match object; span=(37, 41), match='line'>
  >>> re.search(r'\w+st$', string)
  >>> re.search(r'\w+st$', string, re.MULTILINE)
  <_sre.SRE_Match object; span=(12, 17), match='first'>
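To see the line-by-line effect more directly, here is a small sketch that collects matches from every line of the paragraph, and then combines re.MULTILINE with re.DOTALL using the | operator. The variable names are illustrative.

```python
import re

paragraph = '\n<p>\nThis is a paragraph.\nIt has multiple lines.\n</p>\n'

# With MULTILINE, '^' anchors at the start of every line.
line_starts = re.findall(r'^\w+', paragraph, re.MULTILINE)
print(line_starts)

# With MULTILINE, '$' anchors at the end of every line.
line_ends = re.findall(r'\w+\.$', paragraph, re.MULTILINE)
print(line_ends)

# Without the flag, '$' only applies at the very end of the string.
no_flag = re.findall(r'\w+\.$', paragraph)
print(no_flag)

# Flags are combined with the bitwise OR operator.
both = re.search(r'^<p>.*</p>$', paragraph, re.DOTALL | re.MULTILINE)
print(both.group(0))
```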

2. Greedy vs Non-Greedy Matches

Sometimes, if we are not careful with the use of special characters, our regular expressions find more than we expected them to.

This is because by default, regular expressions are greedy (i.e. they will match as much as possible). Consider this next example:

  >>> htmlSnippet = '<h1>This is the Title</h1>'
  >>>

If we were to write a regular expression query to pick out only the HTML tags from this snippet, we might first naively try the following:

  >>> re.findall(r'<.*>', htmlSnippet)
  ['<h1>This is the Title</h1>']

However, we see that (perhaps unexpectedly) this matched the entire snippet.

This is a good example of how regular expressions are greedy by default: the ‘.*’ portion of the regular expression expanded as much as it possibly could while still satisfying the match. We can tell Python to not be greedy (i.e. to stop expanding special characters once the smallest matching substring is found) by using a ‘?’ character after the expansion character (after the ‘*’, ‘+’, …):

  >>> re.findall(r'<.*?>', htmlSnippet)
  ['<h1>', '</h1>']

By following the ‘*’ expansion character with a ‘?’, we are telling Python to only expand to the smallest possible match, and we get the behavior we were looking for.

  >>> contactInfo = 'Doe, John: 555-1212'
  >>> match = re.search(r'(\w+), (\w+): (\S+)', contactInfo)  # three (...+) groups altogether
  >>> match
  <_sre.SRE_Match object; span=(0, 19), match='Doe, John: 555-1212'>
  >>> match.group(1), match.group(2), match.group(3)
  ('Doe', 'John', '555-1212')
  >>> strline = 'this is the first line'
  >>> match = re.search(r'\w+ne$', strline)  # no parentheses, so no groups
  >>> match
  <_sre.SRE_Match object; span=(18, 22), match='line'>
  >>> m = re.match(r'([abc])+', 'abc')  # (...)+ is only 1 group; it keeps the last repetition
  >>> m.groups()
  ('c',)
  >>> m = re.match(r'(?:[abc])+', 'abc')  # (?:...) is non-capturing, so no groups
  >>> m.groups()
  ()
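A common alternative to a non-greedy quantifier is a negated character class, which simply refuses to cross the closing ‘>’. The sketch below compares the approaches on the same snippet; html_snippet and the other names are my own.

```python
import re

html_snippet = '<h1>This is the Title</h1>'

greedy = re.findall(r'<.*>', html_snippet)      # expands as far as possible
lazy = re.findall(r'<.*?>', html_snippet)       # stops at the first '>'
negated = re.findall(r'<[^>]*>', html_snippet)  # never matches a '>' inside the tag

# '+?' works the same way: capture as little as possible between the tags.
title = re.search(r'<h1>(.+?)</h1>', html_snippet).group(1)

print(greedy)
print(lazy)
print(negated)
print(title)
```

Both the lazy and the negated-class patterns give the same result here; which one reads better is largely a matter of taste.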

3. Substitution with Regular Expressions

Another task that the re package lets you do using regular expressions is to do substitutions within a string. The sub() function takes a regular expression and a phrase just like the query methods we’ve seen so far, but we also hand in the string to replace each match with. You can do straightforward substitutions like this:

  >>> re.sub(r'\w+', 'word', 'This phrase contains 5 words')
  'word word word word word'

This replaces every found word with the literal string ‘word’. You can also reference the match in the replace string using grouping (we learned about grouping in the previous article):

  >>> re.sub(r'(?P<firstLetter>\w)\w*', r'\g<firstLetter>', 'This phrase contains 5 words')
  'T p c 5 w'

In this case, we capture the first letter of each word in the ‘firstLetter’ group, and then call upon it in the replace string using the ‘\g<name>’ syntax. Had we not been using named groups, we could have specified the group number instead of the group name:

  >>> re.sub(r'(\w)\w*', r'\g<1>', 'This phrase contains 5 words')
  'T p c 5 w'
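Group references also make it easy to reorder parts of a match. Here is a brief sketch that swaps the last and first names from the earlier contact string, limits the number of replacements with count, and shows one case where the \g<1> form is needed. All variable names are illustrative.

```python
import re

contact_info = 'Doe, John: 555-1212'

# \1 and \2 refer to the first and second groups in the replacement string.
swapped = re.sub(r'(\w+), (\w+)', r'\2 \1', contact_info)
print(swapped)

# The same swap, written with named groups.
named = re.sub(r'(?P<last>\w+), (?P<first>\w+)', r'\g<first> \g<last>', contact_info)
print(named)

# count limits how many matches are replaced.
limited = re.sub(r'\w+', 'word', 'This phrase contains 5 words', count=2)
print(limited)

# \g<1>0 means "group 1 followed by a literal 0"; \10 would be read as group 10.
padded = re.sub(r'(\d)', r'\g<1>0', 'a1 b2')
print(padded)
```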

Sometimes, our replacement needs are more complex than what can be specified in a simple replacement string.

For this, the sub() method can also accept a replacement function instead of a replacement string literal.

The replacement function should accept a single argument, which will be a match object, and return a string. The sub() method will call this function on each match found, and replace the matching content with the function’s return value.

To demonstrate this, let’s write a function that will allow us to make an arbitrary string more URL-friendly (i.e. we will convert all characters to lowercase and replace series of spaces with a single ‘_’ character).

  >>> def slugify(matchObj):
  ...     matchString = matchObj.group(0)
  ...     if matchString.isalnum():
  ...         return matchString.lower()
  ...     else:
  ...         return '_'
  ...
  >>>

Our function accepts a match object and returns a string, just as is required by the sub function.

Now we can use this function to ‘slugify’ an arbitrary string. We match either a series of word characters or a series of spaces (the ‘|’ special character is essentially the OR operator for regular expressions. To be a valid match, the content must either match the pattern to the left of the ‘|’, or the pattern to the right of the ‘|’). The sub() method will pass each match object to our slugify() function:

  >>> re.sub(r'\w+|\s+', slugify, 'This iS a   NAME')
  'this_is_a_name'

Notice that we pass a reference to the function object into the sub() method (i.e. we don’t invoke the function). Remember that the sub() method is going to invoke the slugify function on each match object for us.

Taking a minute to understand the flow of this last example will not only teach you how the sub() method works, but also about some fundamentals of Python. Python treats functions as first-class citizens. They can be handed around just like any other object can be (in fact, functions are objects in Python).
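Because any callable will do, the replacement can also compute a new value from the match, or be written inline as a lambda. The sketch below is one possible illustration; double_number and the sample sentence are hypothetical and not from the original article.

```python
import re

def double_number(match_obj):
    """Return the matched integer, doubled, as a string."""
    return str(int(match_obj.group(0)) * 2)

# Hypothetical input sentence used only for this illustration.
sentence = 'This phrase contains 5 words and 12 letters'
doubled = re.sub(r'\d+', double_number, sentence)
print(doubled)

# A lambda works too; this mirrors the slugify() logic in a single expression.
slug = re.sub(
    r'\w+|\s+',
    lambda m: m.group(0).lower() if m.group(0).isalnum() else '_',
    'This iS a   NAME',
)
print(slug)

# subn() behaves like sub() but also reports how many replacements were made.
result, replacements = re.subn(r'\d+', double_number, sentence)
print(replacements)
```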
