Python: Web Crawlers

Source: Internet | Editor: 程序博客网 | Time: 2024/06/06 05:50

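1. Basics:

findall: returns every piece of content that matches the pattern.

search: extracts the first piece of content that matches the pattern.

sub: replaces the content that matches the pattern and returns the resulting string.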
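To see the three functions side by side, here is a minimal sketch that applies each of them to the same string. The sample string and the pattern `\d+` (one or more digits) are chosen only for illustration and are not from the original post.

```python
import re

text = 'xxy123xx465xx789xx'  # sample string, chosen for illustration

# findall returns a list with every non-overlapping match
all_digits = re.findall(r'\d+', text)

# search returns a match object for the first match only, or None
first = re.search(r'\d+', text)
first_digits = first.group(0) if first else None

# sub returns a new string; the original string is left unchanged
masked = re.sub(r'\d+', '#', text)

print(all_digits)    # ['123', '465', '789']
print(first_digits)  # 123
print(masked)        # xxy#xx#xx#xx
```

Note that findall hands back plain strings, while search hands back a match object that has to be unpacked with group.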

. matches any single character (by default, any character except a newline).

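import re  # re is the regular expression library

a = 'abcdefg'
b = re.findall('a.', a)  # b is ['ab']
for each in b:
    print(each)  # prints ab

import re  # re is the regular expression library

a = 'xxy123xx465xx789xx'
b = re.findall('x..', a)  # b is ['xxy', 'xx4', 'xx7']
for each in b:
    print(each)  # prints xxy, xx4, xx7 on separate lines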
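One detail worth checking when scraping multi-line pages: by default `.` does not match a newline. The sketch below, using a made-up two-line string, compares the default behavior with the `re.DOTALL` flag.

```python
import re

text = 'a\nb'  # 'a', a newline, then 'b' (made-up sample)

# By default '.' refuses to match '\n', so 'a.' finds nothing here
default = re.findall('a.', text)

# With re.DOTALL, '.' also matches the newline
dotall = re.findall('a.', text, re.DOTALL)

print(default)  # []
print(dotall)   # ['a\n']
```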

* matches the preceding character zero or more times.

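import re  # re is the regular expression library

a = 'abbcdefg'
b = re.findall('a.*', a)  # b is ['abbcdefg']
for each in b:
    print(each)  # prints abbcdefg

import re  # re is the regular expression library

a = 'abcdefg'
b = re.findall('a.*', a)  # b is ['abcdefg']
for each in b:
    print(each)  # prints abcdefg

import re  # re is the regular expression library

a = 'xxy123xx465xx789xx'
b = re.findall('xx.*xx', a)  # b is ['xxy123xx465xx789xx']
for each in b:
    print(each)  # prints xxy123xx465xx789xx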
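The examples above always put `*` after `.`, which can hide the fact that `*` applies only to the single character in front of it. A small sketch with a made-up string shows `*` attached to a literal `b`, including the zero-occurrence case.

```python
import re

text = 'a ab abb abbb'  # made-up sample

# 'ab*' means: one 'a' followed by zero or more 'b'
matches = re.findall('ab*', text)

print(matches)  # ['a', 'ab', 'abb', 'abbb']
```

The first result, a bare `a`, is the "zero times" case.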

? matches the preceding character zero or one time.
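As a quick illustration of `?` used as a quantifier, the sketch below makes the letter `u` optional. The word list is invented for the example.

```python
import re

text = 'color colour colouur'  # made-up sample

# 'u?' allows zero or one 'u', so 'colouur' (two u's) does not match
matches = re.findall('colou?r', text)

print(matches)  # ['color', 'colour']
```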

2. Greedy matching:

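xx.*xx: matches as much as possible, so everything between the first xx and the last xx is returned.

import re  # re is the regular expression library

a = 'xxy123xx465xx789xx'
b = re.findall('xx.*xx', a)  # b is ['xxy123xx465xx789xx']
for each in b:
    print(each)  # prints xxy123xx465xx789xx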

3. Non-greedy matching:
(1) xx.*?xx: matches as little as possible, stopping at the nearest closing xx.

import re  # re is the regular expression library

a = 'xxy123xx465xx789xx'
b = re.findall('xx.*?xx', a)  # b is ['xxy123xx', 'xx789xx']
for each in b:
    print(each)  # prints xxy123xx, then xx789xx
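One way to compare greedy and non-greedy matching directly is to look at how much of the string each pattern consumes. This sketch reuses the string from the examples above and reads the matched range with `span()`.

```python
import re

a = 'xxy123xx465xx789xx'

greedy = re.search('xx.*xx', a)   # '.*' grabs as much as it can
lazy = re.search('xx.*?xx', a)    # '.*?' stops at the nearest 'xx'

print(greedy.span())  # (0, 18)
print(lazy.span())    # (0, 8)
```

This also explains why `465` never shows up in the non-greedy findall result: matches do not overlap, and the `xx` in front of `465` was already consumed as the end of the first match.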

(2) xx(.*?)xx: with parentheses, findall returns only the captured part inside them.

import re  # re is the regular expression library

a = 'xxy123xx465xx789xx'
b = re.findall('xx(.*?)xx', a)  # b is ['y123', '789']
for each in b:
    print(each)  # prints y123, then 789
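In a crawler, the same capture-group pattern is typically pointed at page markup. The following is a rough sketch only: the HTML fragment is hypothetical, and a pattern this simple assumes the markup is as regular as the sample. When a pattern has more than one group, findall returns a list of tuples.

```python
import re

# Hypothetical HTML fragment, made up for this sketch
html = '<a href="/page1">First</a><a href="/page2">Second</a>'

# Two groups: the link target and the link text
links = re.findall('<a href="(.*?)">(.*?)</a>', html)

for url, title in links:
    print(url, title)
# links is [('/page1', 'First'), ('/page2', 'Second')]
```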

4. Combining search and group: pulling out each captured element:

import re  # re is the regular expression library

a = 'xxy123xx465xx789xx'
m = re.search('xx(.*?)xx(.*?)xx', a)  # run the search once, then read each group
b = m.group(1)
c = m.group(2)
print(b)  # prints y123
print(c)  # prints 465
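Calling `.group()` directly on the result of `re.search` raises an AttributeError when nothing matches, because search returns None in that case. A minimal defensive sketch follows; the helper name `first_between` is hypothetical and not part of the original post.

```python
import re

def first_between(text):
    """Return the text between the first pair of 'xx' markers, or None."""
    m = re.search('xx(.*?)xx', text)
    # search returns None when there is no match, so check before using group
    return m.group(1) if m else None

print(first_between('xxy123xx465xx'))  # y123
print(first_between('no markers'))     # None
```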

5. Using sub: replacing the matched content:

import re  # re is the regular expression library

a = 'xxy123xx'
b = re.sub('xx(.*?)xx', 'xx%dxx' % 789, a)  # replace the whole match with xx789xx
print(b)  # prints xx789xx
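Beyond a fixed replacement string, `re.sub` also accepts a `count` argument and a function as the replacement. The sketch below shows both on a made-up string; the doubling logic is only an illustration.

```python
import re

a = 'xx1xx xx2xx xx3xx'  # made-up sample

# count=1 replaces only the first match
once = re.sub('xx(.*?)xx', 'xx0xx', a, count=1)

# A function replacement receives each match object and returns the new text
doubled = re.sub('xx(.*?)xx', lambda m: 'xx%dxx' % (int(m.group(1)) * 2), a)

print(once)     # xx0xx xx2xx xx3xx
print(doubled)  # xx2xx xx4xx xx6xx
```

A function replacement is handy when the new value depends on what was captured, as with the page numbers in a crawler's URL.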
