Python: Multithreading and Message Queue Programming

Source: Internet | Published by: 软件如何设计接口 | Editor: 程序博客网 | Time: 2024/06/01 21:06

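Doing file reads and writes and network requests on multiple threads can improve throughput considerably. In my test the run time dropped from about 1 hour 25 minutes to roughly 30 minutes.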


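I finally got multithreading working, and it feels great.

This one really annoyed me:

Exception in thread Thread-3 (most likely raised during interpreter shutdown):
Exception in thread Thread-1 (most likely raised during interpreter shutdown):

The threads died.

Then there was the one about finishing a task too many times: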

Exception in thread Thread-4:
Traceback (most recent call last):
  File "C:\Python26\lib\threading.py", line 532, in __bootstrap_inner
    self.run()
  File "D:\workspace-python\crawler\…python…\MoreThread\test…4\….py", line 69, in run
    self.queue.task_done()
  File "C:\Python26\lib\Queue.py", line 64, in task_done
    raise ValueError('task_done() called too many times')
ValueError: task_done() called too many times
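The ValueError in this traceback is raised when task_done() is called more times than items were put on the queue. A minimal sketch that reproduces it, written for Python 3 (where the module is named queue rather than Queue); the item "job" is made up for illustration:

```python
import queue

q = queue.Queue()
q.put("job")        # one item in, so exactly one task_done() is allowed

q.get()
q.task_done()       # matches the single put(): fine

try:
    q.task_done()   # a second call has nothing left to account for
except ValueError as exc:
    message = str(exc)

print(message)
```

One plausible way to hit this in a crawler is to call task_done() inside a per-page loop while only one item per category was queued, though the traceback alone does not confirm that.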

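This kind is the most annoying of all, a threading error with no apparent cause:

Exception in thread Thread-8 (most likely raised during interpreter shutdown):

All of these setbacks made multithreaded development frustrating. In the end, though, I overcame them in code. From now on, whenever I need a multithreaded approach, or a Python solution in general, I can get there much faster.

Enough chatter. I spent a day studying threads and queues. I recommend skimming the theory first, since theory leads the code. The material I read was this:

https://www.ibm.com/developerworks/aix/library/au-threadingpython/

The author explains things clearly, and the examples make the ideas easy to grasp.

I got here step by step and hit plenty of pitfalls, but got past all of them. I'll paste the full code first and then go through it.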

#coding:utf-8
'''
Created on 2017-07-11
coding=UTF-8
@author: lishouzhuang
'''
import Queue
import threading
import urllib2
import time
from BeautifulSoup import BeautifulSoup
import ConfigParser
from __builtin__ import str

config = ConfigParser.ConfigParser()
# the mode belongs to open(); readfp()'s second argument is a display name
config.readfp(open("F:\\python\\project\\pythonUrl.properties", "rb"))
allUrlListPath = config.get("step3", "allUrlListPath")
allHtmlPath = config.get("step3", "allHtmlPath")
filename = allUrlListPath
list = []  # note: this name shadows the built-in list type

def readfile(filename):
    with open(filename, 'r') as f:
        for line in f.readlines():
            linestr = line.strip()
            linestrlist = linestr.split("\r")
            # print linestrlist
            list.append(linestrlist[0])
            # print(list)
            # print list.__len__()

readfile(filename)
# print 'end'
# print list
print '-' * 100

queue = Queue.Queue()
out_queue = Queue.Queue()

# This thread takes the URLs read above from queue one at a time and fetches their pages
class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            host = self.queue.get()
            p = 1
            while p <= 42:
                # the category home page looks like http://www.wandoujia.com/category/5029_716
                # (first page of the video category)
                strResult = host.split("/")[-1]
                # print 'now copy this url : ',strResult
                key = host + '/' + str(p)
                key2 = strResult + "_" + str(p)
                result_url = getPage(key)
                p = p + 1
                x = dict(filename=key2, html=result_url)
                print 'putting', x['filename'], 'on the file-writing queue'
                self.out_queue.put(x)
            # one task_done() per get(), after all 42 pages of this host are queued
            try:
                self.queue.task_done()
            except Exception as e:
                print e

class DatamineThread(threading.Thread):
    """Threaded file writer"""
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            # grabs a fetched page from the output queue
            chunk = self.out_queue.get()
            # print chunk
            filename = chunk['filename']
            print 'got url:', filename
            html = chunk['html']
            # print html
            txt = allHtmlPath + filename + '_' + '.html'
            # "with" closes the file, so the data is flushed before task_done()
            with open(txt, "w") as f:
                f.write(html)
            # signals to queue job is done
            try:
                self.out_queue.task_done()
            except Exception as e:
                print e

def getPage(url):
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11'}
        request = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(request)
        return response.read()
    except urllib2.URLError as e:
        # fetch the page again recursively
        print 'you have an error : ', str(e)
        print '---------------------------------------------------'
        print 'error !', 'you need to rerun!! ', url
        # "<urlopen error [Errno 11004] getaddrinfo failed>" shows up often, so a rerun mechanism is needed
        # run it once more here; if that fails too, the recursion keeps retrying
        rerun = getPage(url)  # run it again
        print 'now is rerun ', url
        rerunLength = len(rerun)
        print 'rerun is ok , no continue!!!'
        print rerunLength
        return rerun

start = time.time()

def main():
    # populate queue with data
    print len(list)
    for host in list:
        # print host
        queue.put(host)
    # spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()
    for i in range(5):
        dt = DatamineThread(out_queue)
        dt.setDaemon(True)
        dt.start()
    # wait on the queue until everything has been processed
    queue.join()
    out_queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)

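First, as usual, we load the configuration file and read from it the path of the file that lists all 145 URLs. We also bring in the Queue library, which provides first-in, first-out ordering, exactly what this scenario needs.

As for the ideas behind the queue and this multithreaded design, the page linked at the start explains them clearly. In case anything is still unclear, I come back to them at the end of the article.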

#coding:utf-8
import Queue
import threading
import urllib2
import time
from BeautifulSoup import BeautifulSoup
import ConfigParser
from __builtin__ import str

config = ConfigParser.ConfigParser()
# the mode belongs to open(); readfp()'s second argument is a display name
config.readfp(open("F:\\python\\project\\pythonUrl.properties", "rb"))
allUrlListPath = config.get("step3", "allUrlListPath")  # file listing the category URLs
allHtmlPath = config.get("step3", "allHtmlPath")        # directory prefix for the saved pages
filename = allUrlListPath

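Nothing much to say here: it simply reads the configuration file again. If anything is unclear, see my earlier posts, though I'd suggest searching online, where there is plenty of material.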

list = []

def readfile(filename):
    with open(filename, 'r') as f:
        for line in f.readlines():
            linestr = line.strip()
            linestrlist = linestr.split("\r")
            list.append(linestrlist[0])

readfile(filename)

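This reads every URL from the text file, line by line, and puts them into a list (Python's growable array; in Java terms, think of it as a Collection). Nothing difficult here. Personally I enjoy building something from nothing. The approach you see may not be the one you would choose, and what I arrived at step by step is not necessarily perfect, but it solves the problem. If you have a better solution, tell me; we should exchange ideas more.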
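For comparison, here is a hedged sketch of the same line-by-line read that returns a fresh list instead of appending to a global. It is written for Python 3; the helper name read_urls and the example.com URLs are placeholders of mine, not part of the original script:

```python
import os
import tempfile

def read_urls(path):
    """Return the non-empty lines of the file at `path`, stripped of whitespace."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

# Demonstration with a throwaway file standing in for the real URL list.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("http://example.com/category/1\n\n  http://example.com/category/2  \n")
    tmp_path = tmp.name

urls = read_urls(tmp_path)
os.remove(tmp_path)
print(urls)
```

Returning the list keeps the function reusable and avoids shadowing the built-in name list.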

queue = Queue.Queue()
out_queue = Queue.Queue()

# This thread takes the prepared page URLs from queue one at a time and fetches them
class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue

    def run(self):
        while True:
            hostTuple = self.queue.get()
            val1 = hostTuple['val1']  # http://www.wandoujia.com/category/6008_906
            val2 = hostTuple['val2']  # http://www.wandoujia.com/category/6008_906/1
            strResult = val1.split("/")[-1]  # 6008_906
            intP = val2.split("/")[-1]  # 1
            key = val2
            key2 = strResult + "_" + intP
            result_url = getPage(key)
            x = dict(filename=key2, html=result_url)
            print 'putting', x['filename'], 'on the file-writing queue'
            self.out_queue.put(x)
            try:
                self.queue.task_done()
            except Exception as e:
                print e

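As you can see, I'm explaining the code in order. Two queues have appeared by now. What are they for? One holds all the prepared URLs that need to be visited; the other holds everything waiting to be written out. What do they contain? Dictionaries, which is what Python calls a container of key/value pairs. A dict is much like Java's Map, and like HashMap in particular: keys are unique, while values may repeat. Note that putting an existing key into a HashMap overwrites its value, whereas a HashSet simply refuses a duplicate element (fine, stop, this is Python, not Java). In any case, we will look values up by key shortly, so dictionaries are a good fit.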
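A small sketch of the two properties this relies on: the queue hands items back in insertion order, and assigning to an existing dict key replaces its value. It uses Python 3's queue module, and the example.com URLs are placeholders:

```python
import queue

tasks = queue.Queue()   # FIFO: items come out in the order they went in
tasks.put({"val1": "http://example.com/category/1",
           "val2": "http://example.com/category/1/1"})
tasks.put({"val1": "http://example.com/category/1",
           "val2": "http://example.com/category/1/2"})

first = tasks.get()
second = tasks.get()

# Assigning to an existing key replaces the value instead of adding a second entry.
record = {"filename": "old"}
record["filename"] = "new"

print(first["val2"], second["val2"], record)
```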

This thread fetches the page for every prepared URL. The class subclasses threading.Thread, and its initializer first calls the parent's __init__:

def __init__(self, queue, out_queue):
    threading.Thread.__init__(self)
    self.queue = queue
    self.out_queue = out_queue

Then comes the run method. Let's look at its logic:

def run(self):
    while True:
        hostTuple = self.queue.get()
        val1 = hostTuple['val1']  # http://www.wandoujia.com/category/6008_906
        val2 = hostTuple['val2']  # http://www.wandoujia.com/category/6008_906/1
        strResult = val1.split("/")[-1]  # 6008_906
        intP = val2.split("/")[-1]  # 1
        key = val2
        key2 = strResult + "_" + intP
        result_url = getPage(key)
        x = dict(filename=key2, html=result_url)
        print 'putting', x['filename'], 'on the file-writing queue'
        self.out_queue.put(x)
        try:
            self.queue.task_done()
        except Exception as e:
            print e

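The comments make this fairly clear. The method loops forever, taking a dictionary (named hostTuple, though it is a dict) from the first queue, the one that holds all the URLs. We read its two values, val1 and val2. Using the string split operation, we take the last path segment of val1, which is the category id, and the last segment of val2, which is the page number, and join the two. You may be wondering how this data got into the queue; don't worry, that comes later.

Moving on, this is where I fetch the page. The page and key2 are then packed into a new dictionary, which I put on a second queue called out_queue. As you have probably guessed, it holds what will later be written to the file system.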
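The file-name derivation described here can be isolated into a small helper. This is only an illustrative sketch; build_filename is a name I made up, while the two URLs are the ones shown in the code comments:

```python
def build_filename(val1, val2):
    """Combine the category id from val1 with the page number from val2."""
    category = val1.split("/")[-1]   # last path segment of the category URL
    page = val2.split("/")[-1]       # last path segment of the page URL
    return category + "_" + page

name = build_filename(
    "http://www.wandoujia.com/category/6008_906",
    "http://www.wandoujia.com/category/6008_906/1",
)
print(name)
```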

Now let's look at the thread that writes the files.

class DatamineThread(threading.Thread):
    """Threaded file writer"""
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue

    def run(self):
        while True:
            # grabs a fetched page from the output queue
            chunk = self.out_queue.get()
            # print chunk
            filename = chunk['filename']
            print 'got url:', filename
            html = chunk['html']
            # print html
            txt = allHtmlPath + filename + '_' + '.html'
            # "with" closes the file, so the data is flushed before task_done()
            with open(txt, "w") as f:
                f.write(html)
            # signals to queue job is done
            try:
                self.out_queue.task_done()
            except Exception as e:
                print e

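This thread writes to the file system. After initialization it is bound to the output queue. It takes a dictionary off the queue and reads its two entries: filename becomes the file name, and html is written out as the content. It then marks the queue task as done.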
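A hedged Python 3 sketch of such a writer, assuming two things the original does not do: a STOP sentinel so the thread can exit cleanly, and try/finally so task_done() runs exactly once per get() even if the write fails. The temporary directory stands in for allHtmlPath:

```python
import os
import queue
import tempfile
import threading

STOP = None  # sentinel (an assumption of this sketch) telling the writer to exit

def writer(out_queue, out_dir):
    while True:
        chunk = out_queue.get()
        try:
            if chunk is STOP:
                return
            path = os.path.join(out_dir, chunk["filename"] + "_.html")
            with open(path, "w", encoding="utf-8") as f:
                f.write(chunk["html"])
        finally:
            out_queue.task_done()   # exactly once per get(), even on errors

out_dir = tempfile.mkdtemp()        # stand-in for the configured output path
out_queue = queue.Queue()
t = threading.Thread(target=writer, args=(out_queue, out_dir))
t.start()

out_queue.put({"filename": "6008_906_1", "html": "<html>demo</html>"})
out_queue.put(STOP)
out_queue.join()
t.join()

written = sorted(os.listdir(out_dir))
print(written)
```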

Next, fetching the page with urllib2.

def getPage(url):
    try:
        headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11'}
        request = urllib2.Request(url, headers=headers)
        response = urllib2.urlopen(request)
        return response.read()
    except urllib2.URLError as e:
        # fetch the page again recursively
        print 'you have an error : ', str(e)
        print '---------------------------------------------------'
        print 'error !', 'you need to rerun!! ', url
        # "<urlopen error [Errno 11004] getaddrinfo failed>" shows up often, so a rerun mechanism is needed
        # run it once more here; if that fails too, the recursion keeps retrying
        rerun = getPage(url)  # run it again
        print 'now is rerun ', url
        rerunLength = len(rerun)
        print 'rerun is ok , no continue!!!'
        print rerunLength
        return rerun

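This function is unchanged from my earlier post. Its main job is to retry recursively. I did not put a cap on the recursion depth (laziness, haha); in practice it takes two attempts at most, so the depth hardly matters.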
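If a cap on retries is wanted, a loop is one option. The following is a sketch under my own assumptions, not the author's code: fetch_with_retry and its max_attempts parameter (assumed to be at least 1) are hypothetical, and flaky_fetch is a stand-in for the real download so that the example runs without network access. It uses Python 3's urllib.error.URLError:

```python
import urllib.error

def fetch_with_retry(fetch, url, max_attempts=3):
    """Call fetch(url), retrying on URLError up to max_attempts times in total."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch(url)
        except urllib.error.URLError as exc:
            last_error = exc
            print("attempt %d failed for %s: %s" % (attempt, url, exc))
    raise last_error   # every attempt failed: surface the last error

# A stand-in for the real download that fails twice, then succeeds.
calls = {"count": 0}

def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise urllib.error.URLError("getaddrinfo failed")
    return b"<html>ok</html>"

page = fetch_with_retry(flaky_fetch, "http://example.com/category/1/1")
print(page, calls["count"])
```

Unlike unbounded recursion, this gives up after a fixed number of attempts, so a permanently unreachable host cannot exhaust the call stack.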

Finally, the main function that drives everything.

start = time.time()

def main():
    # populate queue with data
    print len(list)
    for host in list:
        p = 1
        while p <= 42:
            thisUrl = host + '/' + str(p)
            # val1 http://www.wandoujia.com/category/6008_906
            # val2 http://www.wandoujia.com/category/6008_906/1
            x = dict(val1=host, val2=thisUrl)
            queue.put(x)
            p = p + 1
    # spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()
    for i in range(5):
        dt = DatamineThread(out_queue)
        dt.setDaemon(True)
        dt.start()
    # wait on the queue until everything has been processed
    queue.join()
    out_queue.join()

main()
print "Elapsed Time: %s" % (time.time() - start)

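We import time because we'll check the elapsed time shortly. The first part takes the 145 URLs from the list, one per category. Each category has 42 pages to visit, so for every page we build a dictionary and put it on the first queue, incrementing p each time round the loop. That makes 145 × 42 = 6,090 tasks in total.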
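The task list this loop produces can be sketched on its own. The helper name build_tasks and the example.com hosts are placeholders; only the counts of 145 categories and 42 pages come from the article:

```python
def build_tasks(hosts, pages=42):
    """One dict per (category, page) pair, in category order."""
    return [
        {"val1": host, "val2": host + "/" + str(p)}
        for host in hosts
        for p in range(1, pages + 1)
    ]

hosts = ["http://example.com/category/%d" % i for i in range(145)]
tasks = build_tasks(hosts)
print(len(tasks))   # 145 categories * 42 pages each
```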

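Next come the loops that start the threads: five threads of the first kind, each set as a daemon thread, and five of the second kind. That completes the program. Time to test it.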
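To see the whole two-stage pipeline in miniature, here is a hedged Python 3 sketch with five daemon threads per stage. fake_fetch replaces the real download and a dict replaces the file system; both are assumptions made so the example runs anywhere:

```python
import queue
import threading

def fake_fetch(url):
    # Stand-in for the real download; an assumption of this sketch.
    return "<html>%s</html>" % url

def fetch_worker(in_q, out_q):
    while True:
        task = in_q.get()
        try:
            out_q.put({"filename": task["name"], "html": fake_fetch(task["url"])})
        finally:
            in_q.task_done()

def store_worker(out_q, results, lock):
    while True:
        chunk = out_q.get()
        try:
            with lock:
                results[chunk["filename"]] = chunk["html"]
        finally:
            out_q.task_done()

in_q, out_q = queue.Queue(), queue.Queue()
results, lock = {}, threading.Lock()

for _ in range(5):
    threading.Thread(target=fetch_worker, args=(in_q, out_q), daemon=True).start()
for _ in range(5):
    threading.Thread(target=store_worker, args=(out_q, results, lock), daemon=True).start()

for page in range(1, 11):
    in_q.put({"name": "demo_%d" % page, "url": "http://example.com/demo/%d" % page})

in_q.join()    # every fetch finished, so every result is already on out_q
out_q.join()   # every result stored
print(len(results))
```

The order of the two join() calls matters: each fetch worker puts its result on out_q before calling task_done(), so once in_q.join() returns, nothing more will be added to out_q.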


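As I said, it really is fast: from about an hour and a half before to 30 minutes now, a big saving. I haven't experimented with the number of threads yet; it might get faster still, but this is good enough.

Now, a little more on the theory, meaning the API.

Some basic Queue methods:

task_done()

Indicates that a previously enqueued task is complete. It is called by the queue's consumer threads: each get() fetches a task, and a subsequent task_done() tells the queue that processing of that task has finished.

If a join() is currently blocking, it resumes once all items have been processed (that is, once a task_done() call has been received for every item that was put() into the queue). Calling task_done() more times than there were items put on the queue raises ValueError.

join()

Blocks the calling thread until all items in the queue have been processed.

The count of unfinished tasks goes up whenever an item is added to the queue. It goes down whenever a consumer thread calls task_done() (meaning a consumer retrieved the item and finished working on it). When the count drops to zero, join() unblocks.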
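The counting behaviour of put(), task_done(), and join() can be watched in a few lines. This is a minimal Python 3 sketch with a single consumer; doubling each item is just a placeholder for real work:

```python
import queue
import threading

q = queue.Queue()
done = []

def consumer():
    while True:
        item = q.get()
        done.append(item * 2)  # placeholder for real work
        q.task_done()          # decrements the unfinished-task count

threading.Thread(target=consumer, daemon=True).start()

for n in range(3):
    q.put(n)                   # each put() increments the count

q.join()                       # returns only after three task_done() calls
print(done)
```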

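put(item[, block[, timeout]])

Puts item into the queue.

If the optional argument block is True and timeout is None (the default), the call blocks until a free slot is available.
If timeout is a positive number, the call blocks the calling thread for at most timeout seconds and raises the Full exception if no free slot became available in that time.
If block is False, the item is put on the queue if a free slot is immediately available; otherwise Full is raised at once.

The non-blocking variant is put_nowait(item), which is equivalent to put(item, False).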
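A short Python 3 sketch of the blocking rules for put(), using a queue with room for a single item; the 0.1-second timeout is arbitrary:

```python
import queue

q = queue.Queue(maxsize=1)      # room for a single item
q.put("first")                  # succeeds immediately

outcomes = []
try:
    q.put_nowait("second")      # same as q.put("second", False)
except queue.Full:
    outcomes.append("put_nowait: Full")

try:
    q.put("second", True, 0.1)  # block for at most 0.1 s, then give up
except queue.Full:
    outcomes.append("put with timeout: Full")

print(outcomes)
```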

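get([block[, timeout]])

Removes and returns an item from the queue. The block and timeout arguments work as they do for put(), except that the exception raised is Empty.

The non-blocking variant is get_nowait(), which is equivalent to get(False).

empty()

Returns True if the queue is empty, False otherwise.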
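The same idea for get() and empty(), again as a Python 3 sketch:

```python
import queue

q = queue.Queue()
was_empty = q.empty()          # nothing has been put yet

try:
    q.get_nowait()             # same as q.get(False)
except queue.Empty:
    got_empty = True
else:
    got_empty = False

q.put("item")
value = q.get(True, 1)         # an item is waiting, so this returns at once
print(was_empty, got_empty, value)
```

Note that with several threads involved, the answer from empty() can be out of date by the time it is used, so it is safer to rely on get() with a timeout than to check empty() first.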

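import threading

First import the threading module; it is the prerequisite for using threads.

threads = []
t1 = threading.Thread(target=music,args=(u'爱情买卖',))
threads.append(t1)

This creates the threads list and then a thread t1 with threading.Thread(). The target=music argument names the function the thread will call, and the args argument passes that function its parameters. The new thread t1 is then appended to threads.

Thread t2 is created the same way and is also appended to threads.

for t in threads:
    t.setDaemon(True)
    t.start()

Finally, a for loop walks the list (which holds the two threads t1 and t2).

setDaemon()

setDaemon(True) marks a thread as a daemon thread, and it must be called before start(). A non-daemon thread keeps the process alive until it finishes, so worker threads that loop forever, like the ones above, would leave the program hanging indefinitely. Once the child threads have started, the parent thread carries on as well; after it executes its last statement, print "all over %s" %ctime(), it exits without waiting for the children, and the children end along with it.

start()

Starts the thread's activity.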
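A hedged Python 3 sketch of the same pattern. The body of music is invented for illustration, since the article does not show it, and the song titles are placeholders. In Python 3 the daemon flag can be passed to the constructor; setDaemon() has been deprecated since Python 3.10 in favour of the daemon attribute:

```python
import threading
import time

finished = []

def music(title):
    time.sleep(0.1)            # pretend to play the song
    finished.append(title)

threads = [
    threading.Thread(target=music, args=("song one",), daemon=True),
    threading.Thread(target=music, args=("song two",), daemon=True),
]

for t in threads:
    t.start()
for t in threads:
    t.join()                   # without this, the main thread could exit before the daemons finish

print(sorted(finished))
```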
