Python Price Scraping Notes (4)

Source: Internet | Published by: 公司网络管理通知 | Edited by: 程序博客网 | Time: 2024/06/06 02:05

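It has been a long time since the last update. This one is more or less the final boss. That may not be quite the right way to put it, since every site posed a different kind of problem, but let's go with it. The last task: a certain "Egg".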

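What makes Egg unusual is that it renders its prices as images. A certain "Dong" and a few other sites used to do the same but have since given it up. Back then Dong at least drew a red line across the image as interference. Egg does nothing of the sort: the glyphs are crisp, uniformly sized, and evenly spaced, which makes these the easiest digit images to recognize, bar none.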

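First open a product page, right-click the price image, and look at its address: http://www.newegg.cn/Common/PriceImage.aspx?PId=bfw2kS%2fniJ70ZJY%2bM%2bMAbg%3d%3d&newstyle=true . Notice that the PId has no obvious relation to the A04-038-55F at the end of the page URL http://www.newegg.cn/Product/A04-038-55F.htm.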
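To look a little closer at that PId, here is a small Python 3 sketch that undoes the percent-encoding. The result looks like Base64 (an assumption based on its alphabet and the trailing `==`, not something the site documents). Decoding it yields 16 opaque bytes, which fits the observation that it has no readable relation to the product code.

```python
import base64
from urllib.parse import unquote

# PId copied from the price-image URL in the text
pid_encoded = "bfw2kS%2fniJ70ZJY%2bM%2bMAbg%3d%3d"

# Step 1: undo the URL percent-encoding (%2f -> /, %2b -> +, %3d -> =)
pid = unquote(pid_encoded)

# Step 2: ASSUMPTION: the decoded string is Base64; decode it to raw bytes
raw = base64.b64decode(pid)

print(pid)       # the percent-decoded PId
print(len(raw))  # number of raw bytes behind it
```

Whatever those bytes are (a hash, an encrypted id), they cannot be derived from the product code alone, so the image URL has to be read from the product page.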

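As in the previous post, open HttpFox and trace the request flow. A couple of findings:

1. Cookies are exchanged.

2. The URL of the price image can be obtained from the product page.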

This time I did not feel like using urllib2, so I tried requests instead. The reason: it really is convenient!

response = requests.get('http://www.newegg.cn/Product/A04-038-55F.htm')

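Print the returned content and search for the three characters "新蛋价" (the site's label for its own price). A little further on there is this fragment: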

<input type="hidden" id="omHiddenPrice" value="15190.00" />

That is the price we wanted to scrape. GAME OVER!
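Pulling that value out of the page text takes only a regular expression. The following is a minimal Python 3 sketch. The helper name `extract_hidden_price` is hypothetical, and the pattern assumes the `id` attribute appears before `value`, as it does in the fragment shown.

```python
import re

def extract_hidden_price(html):
    """Return the value of the omHiddenPrice input as a string, or None."""
    match = re.search(
        r'<input[^>]*id="omHiddenPrice"[^>]*value="([^"]+)"', html)
    return match.group(1) if match else None

# The fragment from the product page, used here in place of a live request
sample = '<input type="hidden" id="omHiddenPrice" value="15190.00" />'
price = extract_hidden_price(sample)
print(price)
```

In real use the argument would be the text of the requests response rather than a hard-coded string.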

...

..

.

Is that really the end? Honestly, it could be. But I wanted to learn some image recognition, so I kept digging.

Image processing in Python almost inevitably means PIL (Python Imaging Library). It is a very capable library: rotation, image enhancement, filters, it has them all.

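Back to the task. The first step is to slice the image correctly. Download the image and drag a selection around a digit. The Paint tool in Windows 7 makes it easy to read off the approximate size of the selection.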

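After slicing a few times it becomes clear that each digit cell is 20x30 pixels. Once sliced, the image is converted to grayscale, here with the ITU-R 601 luma formula (the one PIL's convert('L') applies). The contrast is then boosted and the image is binarized, which makes the later comparisons easier. There are many ways to recognize a digit: you can compare histograms, or you can test the pixels at fixed positions. This post uses the latter.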
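To make the grayscale and binarization step concrete, here is a pure-Python sketch that needs no imaging library. It applies the ITU-R 601 luma weights (299, 587, 114 per thousand) and then a plain threshold. The threshold of 128 and the tiny sample image are assumptions for illustration. The author's code instead boosts the contrast with PIL and converts to 1-bit mode, which gives a similar black-and-white result on clean images.

```python
def to_gray(rgb):
    """ITU-R 601 luma: L = (299*R + 587*G + 114*B) / 1000."""
    r, g, b = rgb
    return (r * 299 + g * 587 + b * 114) // 1000

def binarize(pixels, threshold=128):
    """Map every RGB pixel to 255 (white) or 0 (black)."""
    return [[255 if to_gray(p) >= threshold else 0 for p in row]
            for row in pixels]

# HYPOTHETICAL 2x2 image: white, dark red, near-black, light gray
tiny = [
    [(255, 255, 255), (200, 30, 30)],
    [(20, 20, 20), (240, 240, 240)],
]
binary = binarize(tiny)
print(binary)
```

After this step every pixel is either 0 or 255, so comparing a sampled pixel against a stored value is an exact equality test.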

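Next, the code needs a few product URLs to learn from, and between them they must cover every digit from 0 to 9. My approach is to store the pixel values at four coordinates, together with the digit they belong to, as JSON in a file, and to look them up when a real price has to be parsed. Here is the code: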

# -*- coding: utf-8 -*-
# Python 2 code (urllib.urlretrieve, print statement)
import re
import urllib

import requests
from PIL import Image, ImageEnhance


def get_image(url, image_path):
    # Fetch the product page and pull the price-image URL out of the markup
    response = requests.get(url)
    pattern = re.compile(r'<span class="clf76"><em>¥ </em><strong class="fs30"><img src=\'(.+)\' /></strong></span>')
    price_url = re.findall(pattern, response.text)[0]
    urllib.urlretrieve(price_url, image_path)


def process_image(image_path):
    # Grayscale -> extreme contrast -> 1-bit black and white
    im = Image.open(image_path)
    im_buf = im.convert('L')
    enhancer = ImageEnhance.Contrast(im_buf)
    im_buf = enhancer.enhance(100)
    im_buf = im_buf.convert('1')
    return im_buf


def get_pixel(price_image):
    # Crop box of the first digit; each later digit is shifted by
    # right - left = 19 pixels
    left = 1
    upper = 0
    right = 20
    lower = 30
    hist_list = []
    width = price_image.size[0]
    digital_num = width // 19 - 3  # number of digit cells to read

    for i in range(digital_num):
        im_sub = price_image.crop((left + i * (right - left), upper, right + i * (right - left), lower))
        # Sample four fixed pixels; together they identify the digit
        pixel_list = []
        pixel_list.append(im_sub.getpixel((4, 25)))
        pixel_list.append(im_sub.getpixel((8, 20)))
        pixel_list.append(im_sub.getpixel((16, 25)))
        pixel_list.append(im_sub.getpixel((15, 26)))
        hist_list.append(pixel_list)
    return hist_list

import json
import os


def ne_parse(url):
    path = 'C:\\Users\\wind\\Desktop\\price.png'
    get_image(url, path)
    hist_list = get_pixel(process_image(path))
    os.remove(path)  # delete the downloaded image once it has been read
    return hist_list


if __name__ == '__main__':
    url = 'http://www.newegg.cn/Product/A04-038-55F.htm'
    #study()
    # hist_record.txt maps each digit to its four sampled pixel values
    with open('hist_record.txt', 'r') as hist_record:
        dd = json.load(hist_record)
    price = ''
    for item in ne_parse(url):
        # Find the recorded digit whose four pixel values match this cell
        for info in dd:
            if item[:4] == dd[info][:4]:
                price += info
                break
    print price
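The record file that the script loads is not shown in the post. As a rough Python 3 sketch of the idea, the block below writes a few digit signatures to a JSON file and looks one up again. The helper names and the pixel values in `records` are hypothetical placeholders, not the author's actual data.

```python
import json
import os
import tempfile

def save_records(records, path):
    """Write the digit -> four-pixel signature mapping as JSON."""
    with open(path, "w") as f:
        json.dump(records, f)

def load_records(path):
    with open(path) as f:
        return json.load(f)

def lookup(signature, records):
    """Return the digit whose stored signature equals the sampled one."""
    for digit, known in records.items():
        if known == signature:
            return digit
    return None  # unknown glyph, for example a digit not yet learned

# HYPOTHETICAL signatures: four binarized pixel values (0 or 255) per digit
records = {"0": [0, 255, 0, 255], "1": [255, 255, 0, 0], "7": [255, 0, 255, 0]}

path = os.path.join(tempfile.mkdtemp(), "hist_record.txt")
save_records(records, path)
loaded = load_records(path)
digit = lookup([255, 255, 0, 0], loaded)
print(digit)
```

Because JSON object keys are always strings, the digits come back as strings, which is convenient when the price is built up by concatenation.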

In get_pixel, the line

pixel_list.append(im_sub.getpixel((4, 25)))

uses (4, 25) as the coordinate of a sample point. Why these particular points? They are all empirical values.
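The points in the post were picked by hand. As an alternative (not what the author did), sample points can also be chosen automatically: given one binarized template per digit, greedily pick the pixel that separates the most still-confusable digit pairs until every digit has a unique signature. The sketch below is Python 3 and uses a hypothetical 3x5 bitmap font as a stand-in for the real 20x30 templates.

```python
from itertools import combinations

# HYPOTHETICAL 3x5 digit bitmaps ('1' = ink), stand-ins for real templates
FONT = {
    "0": ["111", "101", "101", "101", "111"],
    "1": ["010", "110", "010", "010", "111"],
    "2": ["111", "001", "111", "100", "111"],
    "3": ["111", "001", "111", "001", "111"],
    "4": ["101", "101", "111", "001", "001"],
    "5": ["111", "100", "111", "001", "111"],
    "6": ["111", "100", "111", "101", "111"],
    "7": ["111", "001", "001", "001", "001"],
    "8": ["111", "101", "111", "101", "111"],
    "9": ["111", "101", "111", "001", "111"],
}

def pick_sample_points(font):
    """Greedily choose (x, y) points until all glyphs are distinguishable."""
    first = next(iter(font.values()))
    coords = [(x, y) for y in range(len(first)) for x in range(len(first[0]))]
    unresolved = set(combinations(sorted(font), 2))  # pairs not yet separated
    chosen = []
    while unresolved:
        def split_count(point):
            x, y = point
            return sum(font[a][y][x] != font[b][y][x] for a, b in unresolved)
        best = max(coords, key=split_count)
        if split_count(best) == 0:
            raise ValueError("two templates are identical")
        chosen.append(best)
        x, y = best
        # Keep only the pairs that this point fails to tell apart
        unresolved = {(a, b) for a, b in unresolved
                      if font[a][y][x] == font[b][y][x]}
    return chosen

def signature(glyph, points):
    return tuple(glyph[y][x] for x, y in points)

points = pick_sample_points(FONT)
signatures = {d: signature(g, points) for d, g in FONT.items()}
print(points)
```

With ten digits and binary pixels, at least four points are needed, since three points can encode only eight distinct signatures. That matches the four coordinates used in the post.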

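There are plenty of price-comparison sites that already do this kind of work, but I did it anyway, because it teaches a bit of everything. The knowledge is not deep, but it is enough to be useful, like Sherlock Holmes's drawer, which only ever holds what is useful.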

0 0