Python 并行计算

来源：互联网发布：js数组去重排序编辑：程序博客网时间：2024/06/04 17:51

Python 并行计算

一、实验说明

本实验介绍 Python 并行计算能够用到的工具。

1. 环境登录

无需密码自动登录，系统用户名shiyanlou

2. 环境介绍

本课程实验环境使用Spyder。首先打开terminal，然后输入以下命令：

git clone https://github.com/jrjohansson/scientific-python-lecturesspyder -w scientific-python-lectures

关于Spyder的使用可参考文档：https://pythonhosted.org/spyder/

本实验基本在控制台下进行，可关闭其余窗口，只保留控制台。

二、实验内容

multiprocessing

Python 内置进程库，用于并发计算

import multiprocessingimport osimport timeimport numpydef task(args):    print("PID =", os.getpid(), ", args =", args)            return os.getpid(), argstask("test")=> PID = 28995 , args = testpool = multiprocessing.Pool(processes=4)result = pool.map(task, [1,2,3,4,5,6,7,8])=> PID = 29006 , args = 1   PID = 29009 , args = 4   PID = 29007 , args = 2   PID = 29008 , args = 3   PID = 29006 , args = 6   PID = 29009 , args = 5   PID = 29007 , args = 8   PID = 29008 , args = 7result=> [(29006, 1),    (29007, 2),    (29008, 3),    (29009, 4),    (29009, 5),    (29006, 6),    (29008, 7),    (29007, 8)]

multiprocessing 在处理高并行任务时进程间不需要相互通信，除了将初始化数据送到进程池的时候和收集结果的时候。

IPython parallel

IPython 自带一个通用的并行计算环境,它建立在 IPython 引擎与控制器之上。使用它前要先开启 IPython 集群引擎，最简便的方法是使用 ipcluster 命令：

$ ipcluster start -n 4

为了在 Python 程序中使用 IPython 集群，我们首先要创建一个 IPython.parallel.Client 实例：

import numpyimport matplotlib.pyplot as pltfrom IPython.parallel import Clientcli = Client()

使用 ids 属性，我们可以检索集群中 IPython 引擎的 id 列表：

cli.ids=> [0, 1, 2, 3]

每一个引擎都准备好了运行任务，我们可以单独地选择引擎来跑我们的代码：

def getpid():    “”” return the unique ID of the current process “””    import os    return os.getpid()

first try it on the notebook process

getpid()

=> 28995

run it on one of the engines

cli[0].apply_sync(getpid)

=> 30181

run it on ALL of the engines at the same time

cli[:].apply_sync(getpid)

=> [30181, 30182, 30183, 30185]

最简单的调度一个函数到不同引擎上方法是使用装饰函数：

@view.parallel(block=True)

view 应该是一个引擎池。一旦绑定了装饰函数，我们就可以使用 map 方法调度函数了。

示例：

dview = cli[:]@dview.parallel(block=True)def dummy_task(delay):    """ a dummy task that takes 'delay' seconds to finish """    import os, time    t0 = time.time()    pid = os.getpid()    time.sleep(delay)    t1 = time.time()    return [pid, t0, t1]

# generate random delay times for dummy tasksdelay_times = numpy.random.rand(4)

现在我们调用 map 方法：

dummy_task.map(delay_times)=> [[30181, 1395044753.2096598, 1395044753.9150908],    [30182, 1395044753.2084103, 1395044753.4959202],    [30183, 1395044753.2113762, 1395044753.6453338],    [30185, 1395044753.2130392, 1395044754.1905618]]

现在加入更多任务并可视化任务运行的过程：

%matplotlib qtdef visualize_tasks(results):    res = numpy.array(results)    fig, ax = plt.subplots(figsize=(10, res.shape[1]))    yticks = []    yticklabels = []    tmin = min(res[:,1])    for n, pid in enumerate(numpy.unique(res[:,0])):        yticks.append(n)        yticklabels.append("%d" % pid)        for m in numpy.where(res[:,0] == pid)[0]:            ax.add_patch(plt.Rectangle((res[m,1] - tmin, n-0.25),                         res[m,2] - res[m,1], 0.5, color="green", alpha=0.5))    ax.set_ylim(-.5, n+.5)    ax.set_xlim(0, max(res[:,2]) - tmin + 0.)    ax.set_yticks(yticks)    ax.set_yticklabels(yticklabels)    ax.set_ylabel("PID")    ax.set_xlabel("seconds")delay_times = numpy.random.rand(64)result = dummy_task.map(delay_times)visualize_tasks(result)

很漂亮，但注意观察，在最后的时候出现了一个引擎空置另一个引擎却还有多个任务要处理的现象。IPython 并行环境提供了很多可选的 “view”，我们之前用的是 “direct view” ，现在我们试试 “balanced view”：

lbview = cli.load_balanced_view()@lbview.parallel(block=True)def dummy_task_load_balanced(delay):    """ a dummy task that takes 'delay' seconds to finish """    import os, time    t0 = time.time()    pid = os.getpid()    time.sleep(delay)    t1 = time.time()    return [pid, t0, t1]result = dummy_task_load_balanced.map(delay_times)visualize_tasks(result)

MPI

当进程间需要更多的通信的时候，就需要更加复杂的解决方案了，比如MPI 与 OpenMP。MPI基于并行进程库 library/protocol。我们可以通过包 mpi4py 来使用它：http://mpi4py.scipy.org/

让我们先加载它：

from mpi4py import MPI

运行 MPI 程序之前，首先需要运行 mpirun -n N 命令， N 代表进程数量。

注意到 IPython 并行环境同样支持 MPI，但是为了方便入门，我们在下面的例子里只使用 mpi4py 与 mpirun

例1

%%file mpitest.pyfrom mpi4py import MPIcomm = MPI.COMM_WORLDrank = comm.Get_rank()if rank == 0:   data = [1.0, 2.0, 3.0, 4.0]   comm.send(data, dest=1, tag=11)elif rank == 1:   data = comm.recv(source=0, tag=11)print "rank =", rank, ", data =", data!mpirun -n 2 python mpitest.pyrank = 0 , data = [1.0, 2.0, 3.0, 4.0]rank = 1 , data = [1.0, 2.0, 3.0, 4.0]

例2

将 numpy 数组从一个进程传到另一个进程：

%%file mpi-numpy-array.pyfrom mpi4py import MPIimport numpycomm = MPI.COMM_WORLDrank = comm.Get_rank()if rank == 0:   data = numpy.random.rand(10)   comm.Send(data, dest=1, tag=13)elif rank == 1:   data = numpy.empty(10, dtype=numpy.float64)   comm.Recv(data, source=0, tag=13)print "rank =", rank, ", data =", data!mpirun -n 2 python mpi-numpy-array.pyrank = 0 , data = [ 0.71397658  0.37182268  0.25863587  0.08007216  0.50832534  0.80038331  0.90613024  0.99535428  0.11717776  0.48353805]rank = 1 , data = [ 0.71397658  0.37182268  0.25863587  0.08007216  0.50832534  0.80038331  0.90613024  0.99535428  0.11717776  0.48353805]

例3：矩阵－向量乘法

# prepare some random dataN = 16A = numpy.random.rand(N, N)numpy.save("random-matrix.npy", A)x = numpy.random.rand(N)numpy.save("random-vector.npy", x)%%file mpi-matrix-vector.pyfrom mpi4py import MPIimport numpycomm = MPI.COMM_WORLDrank = comm.Get_rank()p = comm.Get_size()def matvec(comm, A, x):    m = A.shape[0] / p    y_part = numpy.dot(A[rank * m:(rank+1)*m], x)    y = numpy.zeros_like(x)    comm.Allgather([y_part,  MPI.DOUBLE], [y, MPI.DOUBLE])    return yA = numpy.load("random-matrix.npy")x = numpy.load("random-vector.npy")y_mpi = matvec(comm, A, x)if rank == 0:    y = numpy.dot(A, x)    print(y_mpi)    print "sum(y - y_mpi) =", (y - y_mpi).sum()!mpirun -n 4 python mpi-matrix-vector.py[ 6.40342716  3.62421625  3.42334637  3.99854639  4.95852419  6.13378754  5.33319708  5.42803442  5.12403754  4.87891654  2.38660728  6.72030412  4.05218475  3.37415974  3.90903001  5.82330226]sum(y - y_mpi) = 0.0

例4：向量元素求和

# prepare some random dataN = 128a = numpy.random.rand(N)numpy.save("random-vector.npy", a)%%file mpi-psum.pyfrom mpi4py import MPIimport numpy as npdef psum(a):    r = MPI.COMM_WORLD.Get_rank()    size = MPI.COMM_WORLD.Get_size()    m = len(a) / size    locsum = np.sum(a[r*m:(r+1)*m])    rcvBuf = np.array(0.0, 'd')    MPI.COMM_WORLD.Allreduce([locsum, MPI.DOUBLE], [rcvBuf, MPI.DOUBLE], op=MPI.SUM)    return rcvBufa = np.load("random-vector.npy")s = psum(a)if MPI.COMM_WORLD.Get_rank() == 0:    print "sum =", s, ", numpy sum =", a.sum()!mpirun -n 4 python mpi-psum.pysum = 64.948311241 , numpy sum = 64.948311241

OpenMP

OpenMP 是一个标准的被广泛使用的基于线程的并行计算 API，只可惜无法直接在Python中使用。因为CPython的实现中用到了 GIL（全局解释器锁）。使得同时运行多个Python线程成了不可能的任务。也因此线程无法在 Python 中做并行计算，除非 Python 只是用来包装编译型语言的代码的。所以只能选择类 Python语法的 Cython 来做 OpenMP 计算了。

import multiprocessingN_core = multiprocessing.cpu_count()print("This system has %d cores" % N_core)This system has 4 cores

下面是一个简单的例子演示 OpenMP 的使用：

%load_ext cythonmagic%%cython -f -c-fopenmp --link-args=-fopenmp -c-gcimport cythoncimport numpyfrom cython.parallel import prange, parallelcimport openmpdef cy_openmp_test():    cdef int n, N    # release GIL so that we can use OpenMP    with nogil, parallel():        N = openmp.omp_get_num_threads()        n = openmp.omp_get_thread_num()        with gil:            print("Number of threads %d: thread number %d" % (N, n))cy_openmp_test()Number of threads 12: thread number 0Number of threads 12: thread number 10Number of threads 12: thread number 8Number of threads 12: thread number 4Number of threads 12: thread number 7Number of threads 12: thread number 3Number of threads 12: thread number 2Number of threads 12: thread number 1Number of threads 12: thread number 11Number of threads 12: thread number 9Number of threads 12: thread number 5Number of threads 12: thread number 6

示例: 矩阵向量乘法

# prepare some random dataN = 4 * N_coreM = numpy.random.rand(N, N)x = numpy.random.rand(N)y = numpy.zeros_like(x)

让我们先看一眼矩阵－向量乘法在 Cython 中的实现：

%%cythoncimport cythoncimport numpyimport numpy@cython.boundscheck(False)@cython.wraparound(False)def cy_matvec(numpy.ndarray[numpy.float64_t, ndim=2] M,               numpy.ndarray[numpy.float64_t, ndim=1] x,               numpy.ndarray[numpy.float64_t, ndim=1] y):    cdef int i, j, n = len(x)    for i from 0 <= i < n:        for j from 0 <= j < n:            y[i] += M[i, j] * x[j]    return y

# check that we get the same resultsy = numpy.zeros_like(x)cy_matvec(M, x, y)numpy.dot(M, x) - yarray([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])%timeit numpy.dot(M, x)100000 loops, best of 3: 2.93 µs per loop%timeit cy_matvec(M, x, y)100000 loops, best of 3: 5.4 µs per loop

Cython 的实现比 numpy.dot 慢，但是不是慢很多，因此只要我们使用 OpenMP，性能就可能赶超 numpy.dot ：

%%cython -f -c-fopenmp --link-args=-fopenmp -c-gcimport cythoncimport numpyfrom cython.parallel import parallelcimport openmp@cython.boundscheck(False)@cython.wraparound(False)def cy_matvec_omp(numpy.ndarray[numpy.float64_t, ndim=2] M,                   numpy.ndarray[numpy.float64_t, ndim=1] x,                   numpy.ndarray[numpy.float64_t, ndim=1] y):    cdef int i, j, n = len(x), N, r, m    # release GIL, so that we can use OpenMP    with nogil, parallel():        N = openmp.omp_get_num_threads()        r = openmp.omp_get_thread_num()        m = n / N        for i from 0 <= i < m:            for j from 0 <= j < n:                y[r * m + i] += M[r * m + i, j] * x[j]    return y

# check that we get the same resultsy = numpy.zeros_like(x)cy_matvec_omp(M, x, y)numpy.dot(M, x) - yarray([ 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])%timeit numpy.dot(M, x)100000 loops, best of 3: 2.95 µs per loop%timeit cy_matvec_omp(M, x, y)1000 loops, best of 3: 209 µs per loop

就目前的问题规模而言，Cython+OpenMP 仍旧是慢，因为开销都花在了 OpenMP 与线程的连接上了。

N_vec  = numpy.arange(25, 2000, 25) * N_coreduration_ref = numpy.zeros(len(N_vec))duration_cy = numpy.zeros(len(N_vec))duration_cy_omp = numpy.zeros(len(N_vec))for idx, N in enumerate(N_vec):    M = numpy.random.rand(N, N)    x = numpy.random.rand(N)    y = numpy.zeros_like(x)    t0 = time.time()    numpy.dot(M, x)    duration_ref[idx] = time.time() - t0    t0 = time.time()    cy_matvec(M, x, y)    duration_cy[idx] = time.time() - t0    t0 = time.time()    cy_matvec_omp(M, x, y)    duration_cy_omp[idx] = time.time() - t0fig, ax = plt.subplots(figsize=(12, 6))ax.loglog(N_vec, duration_ref, label='numpy')ax.loglog(N_vec, duration_cy, label='cython')ax.loglog(N_vec, duration_cy_omp, label='cython+openmp')ax.legend(loc=2)ax.set_yscale("log")ax.set_ylabel("matrix-vector multiplication duration")ax.set_xlabel("matrix size");

对于规模更大的问题 cython+OpenMP 组合的性能超过了 numpy.dot

加速比：

((duration_ref / duration_cy_omp)[-10:]).mean()3.0072232987815148

显然是可以做的更好的，加速比的极限是：

N_core12

OpenCL

OpenCL 是异构计算的一种API，异构计算，举个例子就是使用GPU进行数值计算。 pyopencl 包允许 OpenCL 代码在计算单元上用 Python 编译加载运行。这是很好的一种使用 OpenCL 的方式，因为耗时计算最好由编译过后的代码在计算单元上完成，Python在其中只作为控制语言工作。

%%file opencl-dense-mv.pyimport pyopencl as climport numpyimport time

# problem sizen = 10000# platformplatform_list = cl.get_platforms()platform = platform_list[0]# devicedevice_list = platform.get_devices()device = device_list[0]if False: print("Platform name:" + platform.name) print("Platform version:" + platform.version) print("Device name:" + device.name) print("Device type:" + cl.device_type.to_string(device.type)) print("Device memory: " + str(device.global_mem_size//1024//1024) + ' MB') print("Device max clock speed:" + str(device.max_clock_frequency) + ' MHz') print("Device compute units:" + str(device.max_compute_units))# contextctx = cl.Context([device]) # or we can use cl.create_some_context()# command queuequeue = cl.CommandQueue(ctx)# kernelKERNEL_CODE = """//// Matrix-vector multiplication: r = m * v//#define N %(mat_size)d__kernelvoid dmv_cl(__global float *m, __global float *v, __global float *r){ int i, gid = get_global_id(0); r[gid] = 0; for (i = 0; i < N; i++) { r[gid] += m[gid * N + i] * v[i]; }}"""kernel_params = {"mat_size": n}program = cl.Program(ctx, KERNEL_CODE % kernel_params).build()# dataA = numpy.random.rand(n, n)x = numpy.random.rand(n, 1)# host buffersh_y = numpy.empty(numpy.shape(x)).astype(numpy.float32)h_A = numpy.real(A).astype(numpy.float32)h_x = numpy.real(x).astype(numpy.float32)# device buffersmf = cl.mem_flagsd_A_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=h_A)d_x_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=h_x)d_y_buf = cl.Buffer(ctx, mf.WRITE_ONLY, size=h_y.nbytes)# execute OpenCL codet0 = time.time()event = program.dmv_cl(queue, h_y.shape, None, d_A_buf, d_x_buf, d_y_buf)event.wait()cl.enqueue_copy(queue, h_y, d_y_buf)t1 = time.time()print "opencl elapsed time =", (t1-t0)# Same calculation with numpyt0 = time.time()y = numpy.dot(h_A, h_x)t1 = time.time()print "numpy elapsed time =", (t1-t0)# see if the results are the sameprint "max deviation =", numpy.abs(y-h_y).max()Overwriting opencl-dense-mv.py!python opencl-dense-mv.py/usr/local/lib/python2.7/dist-packages/pyopencl-2012.1-py2.7-linux-x86_64.egg/pyopencl/__init__.py:36: CompilerWarning: Non-empty compiler output encountered. Set the environment variable PYOPENCL_COMPILER_OUTPUT=1 to see more. "to see more.", CompilerWarning)opencl elapsed time = 0.0188570022583numpy elapsed time = 0.0755031108856max deviation = 0.0136719

License

该作品在知识共享许可协议3.0 下许可授权。

0 0

Python 并行计算

Python 并行计算

一、实验说明

1. 环境登录

2. 环境介绍

二、实验内容

multiprocessing

IPython parallel

first try it on the notebook process

run it on one of the engines

run it on ALL of the engines at the same time

延伸阅读

MPI

例1

例2

例3：矩阵－向量乘法

例4：向量元素求和

延伸阅读

OpenMP

示例: 矩阵向量乘法

延伸阅读

OpenCL

延伸阅读

License