[Notes from a struggling student] String Matching with a Finite State Automaton: Matching URLs in a Web Crawler

Source: Internet  Editor: 程序博客网  Time: 2024/06/05 09:24

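Plenty of articles already explain how automata work, and many blogs do it very well, so I won't repeat the theory here.

I'll just write about applying one to matching URLs.

In practice, most people would probably reach for a regular expression instead.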
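
For comparison, here is a rough sketch of what the regular-expression route might look like in C. It assumes a POSIX system, because <regex.h> (regcomp, regexec, regfree) is not part of standard C. The function name find_href_regex and the pattern itself are only illustrative, and the pattern is deliberately loose (it would also accept tags such as <abbr href="...">).

```c
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies the value of the first <a ... href="...">
   found in html into out. Returns 1 on a match, 0 otherwise. */
int find_href_regex(const char *html, char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m[2];   /* m[0] = whole match, m[1] = text between the quotes */
    int found = 0;

    if (outsz == 0 ||
        regcomp(&re, "<a[^>]*href=\"([^\"]*)\"", REG_EXTENDED) != 0)
        return 0;

    if (regexec(&re, html, 2, m, 0) == 0) {
        size_t n = (size_t)(m[1].rm_eo - m[1].rm_so);
        if (n >= outsz)
            n = outsz - 1;             /* truncate rather than overflow */
        memcpy(out, html + m[1].rm_so, n);
        out[n] = '\0';
        found = 1;
    }
    regfree(&re);
    return found;
}
```

Calling find_href_regex(line, buf, sizeof buf) on one line of page source fills buf with the link target if there is one. The rest of this post does the same job by hand with an automaton.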

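If the pattern were regular, there should be a state transition function that generates the transitions. I couldn't find such a rule for the automaton in the figure below, so I used the dumbest method: writing out every transition by hand.

If the pattern is one continuous string, such as abababcrfjg, and the input text is continuous as well, you can write a function that looks a lot less dumb.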
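
As a rough idea of what such a function could look like for a pattern like abababcrfjg, here is a minimal sketch of the textbook construction: state q means "the last q characters read equal the first q characters of the pattern", and the next state is the longest pattern prefix that is a suffix of what has been read. The names build_delta, fsm_search, MAX_PAT and the 256-entry byte alphabet are my assumptions, and the table is built the slow, obvious way.

```c
#include <string.h>

#define ALPHABET 256
#define MAX_PAT  64   /* assumed upper bound on the pattern length */

/* delta[q][c]: state reached from state q after reading byte c. */
static int delta[MAX_PAT + 1][ALPHABET];

/* Fills delta for pat. Returns the pattern length, or -1 if it is too long. */
int build_delta(const char *pat)
{
    int m = (int)strlen(pat);
    if (m > MAX_PAT)
        return -1;
    for (int q = 0; q <= m; q++) {
        for (int c = 0; c < ALPHABET; c++) {
            /* Largest k such that pat[0..k) is a suffix of pat[0..q) + c. */
            int k = (q + 1 < m) ? q + 1 : m;
            while (k > 0) {
                if ((unsigned char)pat[k - 1] == c &&
                    memcmp(pat, pat + q - (k - 1), (size_t)(k - 1)) == 0)
                    break;
                k--;
            }
            delta[q][c] = k;
        }
    }
    return m;
}

/* Returns the index where the first occurrence of pat starts in text, or -1. */
int fsm_search(const char *text, const char *pat)
{
    int m = build_delta(pat);
    if (m <= 0)
        return -1;
    int q = 0;
    for (int i = 0; text[i] != '\0'; i++) {
        q = delta[q][(unsigned char)text[i]];
        if (q == m)                /* reached the accepting state */
            return i - m + 1;
    }
    return -1;
}
```

Once the table is built, the search loop is a single lookup per input character and never backs up in the text. This works only because the pattern is one fixed string; the href pattern below has "anything goes here" gaps, which is why I fell back to a hand-written table.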




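Open any web page and look at its source.

You will find that most links look like <a href="//www.baidu.com/more/".

Then, following the figure above,

we can write out a matrix, or call it a table: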

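state   <   >   a   h   r   e   f   =   "   /   (space)   other
0       1   0   0   0   0   0   0   0   0   0   0         0
1       0   0   2   0   0   0   0   0   0   0   0         0
2       2   0   2   3   2   2   2   2   2   2   2         2
3       2   0   2   2   4   2   2   2   2   2   2         2
4       2   0   2   2   2   5   2   2   2   2   2         2

The table lists only states 0 to 4. The column labelled (space) is inferred: the original header names 11 inputs, but each row has 12 entries.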
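
The table above can be used directly as a lookup array, which is a bit tidier than a chain of if statements. The sketch below encodes only the five rows shown (states 0 to 4) and stops as soon as the automaton leaves them. The names table, column_of and run_prefix are made up for this example, and treating the eleventh column as the space character is an assumption.

```c
#include <string.h>

enum { NCOLS = 12, NSTATES = 5 };

/* Rows are states 0..4; columns are the input classes of the table. */
static const int table[NSTATES][NCOLS] = {
    /*        <  >  a  h  r  e  f  =  "  /  sp other */
    /* 0 */ { 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 },
    /* 1 */ { 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0 },
    /* 2 */ { 2, 0, 2, 3, 2, 2, 2, 2, 2, 2, 2, 2 },
    /* 3 */ { 2, 0, 2, 2, 4, 2, 2, 2, 2, 2, 2, 2 },
    /* 4 */ { 2, 0, 2, 2, 2, 5, 2, 2, 2, 2, 2, 2 },
};

/* Maps a character to its column; anything unlisted goes to "other". */
int column_of(char c)
{
    static const char cols[] = "<>ahref=\"/ ";   /* last one is the space */
    const char *p = (c != '\0') ? strchr(cols, c) : NULL;
    return p ? (int)(p - cols) : NCOLS - 1;
}

/* Feeds s to the automaton from state 0 and returns the state reached.
   Stops early once the state is no longer covered by the table (5). */
int run_prefix(const char *s)
{
    int state = 0;
    for (; *s != '\0' && state < NSTATES; s++)
        state = table[state][column_of(*s)];
    return state;
}
```

For example, run_prefix("<a hre") walks 0, 1, 2, 2, 3, 4 and ends in state 5, while run_prefix("<b>") falls back to state 0. Extending the array with the remaining rows would cover the whole pattern.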

Then we can write the program.

#include <stdio.h>
#include <string.h>

int main(void)
{
    static char url[102400];
    int state = 0;
    int start = 0;  /* index of the '<' that opened the current candidate */

    printf("input url:\n");
    /* gets() is unsafe and was removed in C11, so read one line with fgets() */
    if (fgets(url, sizeof url, stdin) == NULL)
        return 1;
    url[strcspn(url, "\n")] = '\0';  /* strip the trailing newline */
    printf("input successfully, url is:\n%s\n", url);

    int len_url = (int)strlen(url);
    printf("length is: %d\n", len_url);

    for (int i = 0; i < len_url; i++) {
        char c = url[i];
        switch (state) {
        case 0:   /* looking for '<' */
            if (c == '<') { state = 1; start = i; }
            break;
        case 1:   /* saw '<', the next character must be 'a' */
            state = (c == 'a') ? 2 : 0;
            break;
        case 2: case 3: case 4: case 5: case 6:
            /* inside <a ...>: states 2..6 expect 'h','r','e','f','=' in turn */
            if (c == "href="[state - 2]) state++;
            else if (c == '>') state = 0;
            else state = 2;
            break;
        case 7:   /* saw href=, waiting for the opening quote (spaces allowed) */
            if (c == '"') state = 10;
            else if (c != ' ') state = 0;
            break;
        case 10:  /* inside the quotes, before the first '/' */
            if (c == '/') state = 8;
            else if (c == '"' || c == '>' || c == '#') state = 0;
            break;
        case 8:   /* inside the quotes, after a '/' */
            if (c == '>') state = 0;
            else if (c == '"') {
                /* accept: print everything from '<' through the closing quote */
                printf("%.*s\n\n", i - start + 1, url + start);
                state = 9;
            }
            break;
        case 9:   /* accepting state: consume one character, then start over */
            state = 0;
            break;
        }
    }
    return 0;
}











