Trying Out PGStrom
Source: Internet | Editor: 程序博客网 | Time: 2024/05/22 07:52
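The Postgres 2015 China User Conference will be held on November 20-21 at the 北京丽亭华苑酒店 in Beijing. The speaker lineup is strong. China's leading PostgreSQL experts will all attend, joined by invited database experts from Europe, Russia, Japan, the United States, and other countries and regions, including:
- SUZUKI Koichi (铃木市一), founder of the Postgres-XC project
- Mason Sharp, founder of the Postgres-XL project
- Tatsuo Ishii (石井达夫), author of pgpool
- Kaigai Kohei (海外浩平), author of PG-Strom
- Yao Yandong (姚延栋), R&D director at Greenplum
- Zhou Zhengzhong (周正中, "德哥"), one of the founders of the PostgreSQL China User Group
- Wang Yang (汪洋), manager of the database technology department at Ping An Technology (平安科技)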
- ……
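PGStrom is a custom scan provider plugin that uses the GPU for parallel computation. Its architecture is as follows:
[Figure: PGStrom architecture]
Judging from the documentation on the wiki, the performance gain is substantial. The more tables a JOIN involves, the more pronounced the improvement.
You need to install the CUDA 7.0 driver and toolkit.
See https://developer.nvidia.com/cuda-downloads
I ran into one problem during installation: libcuda.so was not in the directory that the Makefile passes to -L.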
gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -g -O2 -fpic src/gpuinfo.c -Wall -DPGSTROM_DEBUG=1 -O0 -DCMD_GPUINGO_PATH=\"/app_data/digoal/pgsql9.5/bin/gpuinfo\" -I /usr/local/cuda/include -L /usr/local/cuda/lib64 -lcuda -o src/gpuinfo
/usr/bin/ld: cannot find -lcuda
collect2: ld returned 1 exit status
make: *** [src/gpuinfo] Error 1
A small change fixes this: copy the stub libcuda.so into the lib64 directory that the Makefile points at.
digoal-> cp /usr/local/cuda-7.0/lib64/stubs/libcuda.so /usr/local/cuda-7.0/lib64/
digoal-> ll /usr/local
total 92K
drwxr-xr-x 2 root root 4.0K Jun 9 11:35 bin
drwxr-xr-x 139 root root 4.0K Apr 14 18:40 clonescripts
drwxr-xr-x 4 root root 4.0K Apr 14 19:05 csf
lrwxrwxrwx 1 root root 19 Aug 14 16:42 cuda -> /usr/local/cuda-7.0
drwxr-xr-x 17 root root 4.0K Apr 15 16:37 cuda-6.5
drwxr-xr-x 17 root root 4.0K Aug 14 16:42 cuda-7.0
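The file under lib64/stubs is a link-time stub, so before copying it around it can be worth checking which directories actually contain a libcuda.so. The following is a minimal sketch of such a check. The function name `find_libcuda` and the candidate directories in the example call are assumptions for illustration and are not part of PGStrom or its Makefile.

```shell
#!/bin/sh
# find_libcuda DIR...: print the first directory that contains libcuda.so.
# Returns non-zero if none of the given directories has it.
find_libcuda() {
    for dir in "$@"; do
        if [ -e "$dir/libcuda.so" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

# Example call (illustrative paths; adjust to the local install):
# find_libcuda /usr/lib64 /usr/local/cuda/lib64 /usr/local/cuda/lib64/stubs
```

If the only hit is the stubs directory, passing that directory to the linker with an additional -L option is an alternative to copying the file.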
The installation used PostgreSQL 9.5 alpha 2 and the latest PGStrom code.
Once installed, gpuinfo shows information about the GPUs in the machine:
digoal-> gpuinfo
CUDA Runtime version: 7.0.0
NVIDIA Driver version: 346.59
Number of devices: 2
--------
Device Identifier: 0
Device Name: Tesla K40m
Global memory size: 11519MB
Maximum number of threads per block: 1024
Maximum block dimension X: 1024
Maximum block dimension Y: 1024
Maximum block dimension Z: 64
Maximum grid dimension X: 2147483647
Maximum grid dimension Y: 65535
Maximum grid dimension Z: 65535
Maximum shared memory available per block in bytes: 49152KB
Memory available on device for __constant__ variables: 65536bytes
Warp size in threads: 32
Maximum number of 32-bit registers available per block: 65536
Typical clock frequency in kilohertz: 745000KHZ
Number of multiprocessors on device: 15
Specifies whether there is a run time limit on kernels: 0
Device is integrated with host memory: false
Device can map host memory into CUDA address space: true
Compute mode (See CUcomputemode for details): default
Device can possibly execute multiple kernels concurrently: true
Device has ECC support enabled: true
PCI bus ID of the device: 2
PCI device ID of the device: 0
Device is using TCC driver model: false
Peak memory clock frequency in kilohertz: 3004000KHZ
Global memory bus width in bits: 384
Size of L2 cache in bytes: 1572864bytes
Maximum resident threads per multiprocessor: 2048
Number of asynchronous engines: 2
Device shares a unified address space with the host: true
Major compute capability version number: 3
Minor compute capability version number: 5
Device supports stream priorities: true
Device supports caching globals in L1: true
Device supports caching locals in L1: true
Maximum shared memory available per multiprocessor: 49152bytes
Maximum number of 32bit registers per multiprocessor: 65536
Device can allocate managed memory on this system: true
Device is on a multi-GPU board: false
Unique id for a group of devices on the same multi-GPU board: 0
--------
Device Identifier: 1
Device Name: Tesla K40m
Global memory size: 11519MB
Maximum number of threads per block: 1024
Maximum block dimension X: 1024
Maximum block dimension Y: 1024
Maximum block dimension Z: 64
Maximum grid dimension X: 2147483647
Maximum grid dimension Y: 65535
Maximum grid dimension Z: 65535
Maximum shared memory available per block in bytes: 49152KB
Memory available on device for __constant__ variables: 65536bytes
Warp size in threads: 32
Maximum number of 32-bit registers available per block: 65536
Typical clock frequency in kilohertz: 745000KHZ
Number of multiprocessors on device: 15
Specifies whether there is a run time limit on kernels: 0
Device is integrated with host memory: false
Device can map host memory into CUDA address space: true
Compute mode (See CUcomputemode for details): default
Device can possibly execute multiple kernels concurrently: true
Device has ECC support enabled: true
PCI bus ID of the device: 3
PCI device ID of the device: 0
Device is using TCC driver model: false
Peak memory clock frequency in kilohertz: 3004000KHZ
Global memory bus width in bits: 384
Size of L2 cache in bytes: 1572864bytes
Maximum resident threads per multiprocessor: 2048
Number of asynchronous engines: 2
Device shares a unified address space with the host: true
Major compute capability version number: 3
Minor compute capability version number: 5
Device supports stream priorities: true
Device supports caching globals in L1: true
Device supports caching locals in L1: true
Maximum shared memory available per multiprocessor: 49152bytes
Maximum number of 32bit registers per multiprocessor: 65536
Device can allocate managed memory on this system: true
Device is on a multi-GPU board: false
Unique id for a group of devices on the same multi-GPU board: 1
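The multiprocessor count and compute capability in this output are enough to derive the CUDA core count. Devices of compute capability 3.x (Kepler) have 192 CUDA cores per multiprocessor, so 15 multiprocessors give 15 × 192 = 2880 cores per K40m, which should agree with the per-GPU core count PGStrom logs at server startup. A minimal sketch of that arithmetic follows. The function name `cuda_cores_kepler` is hypothetical, and the constant 192 is only valid for compute capability 3.x.

```shell
#!/bin/sh
# cuda_cores_kepler: read gpuinfo-style text on stdin and print, per device,
# multiprocessors * 192 (cores per multiprocessor on compute capability 3.x).
# Assumption: every device in the input is a compute capability 3.x GPU.
cuda_cores_kepler() {
    awk -F': *' '
        /^Device Identifier:/ { dev = $2 }
        /^Number of multiprocessors on device:/ {
            printf "GPU%s: %d CUDA cores\n", dev, $2 * 192
        }'
}

# Example call:
# gpuinfo | cuda_cores_kepler
```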
Loading PGStrom
# vi $PGDATA/postgresql.conf
shared_preload_libraries = '$libdir/pg_strom'
digoal-> pg_ctl restart -m fast
waiting for server to shut down.... done
server stopped
server starting
LOG: CUDA Runtime version: 7.0.0
LOG: NVIDIA driver version: 346.59
LOG: GPU0 Tesla K40m (2880 CUDA cores, 745MHz), L2 1536KB, RAM 11519MB (384bits, 3004MHz), capability 3.5
LOG: GPU1 Tesla K40m (2880 CUDA cores, 745MHz), L2 1536KB, RAM 11519MB (384bits, 3004MHz), capability 3.5
LOG: NVRTC - CUDA Runtime Compilation vertion 7.0
LOG: redirecting log output to logging collector process
HINT: Future log output will appear in directory "pg_log".
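The postgresql.conf edit can also be scripted. The following is only a sketch, under the assumption that no other library is already listed in shared_preload_libraries. The helper name `enable_pg_strom` is hypothetical. It appends the same setting shown above and refuses to touch a file that already has an active shared_preload_libraries line.

```shell
#!/bin/sh
# enable_pg_strom CONF: append the pg_strom preload setting to CONF unless an
# uncommented shared_preload_libraries line is already present.
enable_pg_strom() {
    conf=$1
    if grep -Eq '^[[:space:]]*shared_preload_libraries[[:space:]]*=' "$conf"; then
        echo "shared_preload_libraries already set in $conf; edit it by hand" >&2
        return 1
    fi
    # \$libdir is written literally; PostgreSQL expands it at load time.
    printf "shared_preload_libraries = '\$libdir/pg_strom'\n" >> "$conf"
}

# Example call, followed by the restart used in this post:
# enable_pg_strom "$PGDATA/postgresql.conf" && pg_ctl restart -m fast
```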
Trying it out:
postgres=# create extension pg_strom;
CREATE EXTENSION
postgres=# create table t1(c1 int,c2 int);
CREATE TABLE
postgres=# create table t2(c1 int,c2 int);
CREATE TABLE
postgres=# create table t3(c1 int,c2 int);
CREATE TABLE
postgres=# insert into t1 select generate_series(1,10000000),1;
INSERT 0 10000000
postgres=# insert into t2 select generate_series(1,10000000),1;
INSERT 0 10000000
postgres=# insert into t3 select generate_series(1,10000000),1;
INSERT 0 10000000
postgres=# explain (analyze,verbose,costs,buffers,timing) select count(*) from t1;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=169247.71..169247.72 rows=1 width=0) (actual time=686.566..686.567 rows=1 loops=1)
  Output: pgstrom.count((pgstrom.nrows()))
  Buffers: shared hit=44275
  -> Custom Scan (GpuPreAgg) (cost=1000.00..145747.99 rows=22 width=4) (actual time=679.224..686.552 rows=28 loops=1)
       Output: pgstrom.nrows()
       Bulkload: On (density: 100.00%)
       Reduction: NoGroup
       Features: format: tuple-slot, bulkload: unsupported
       Buffers: shared hit=44275
       -> Custom Scan (BulkScan) on public.t1 (cost=0.00..144247.77 rows=9999977 width=0) (actual time=13.184..354.634 rows=10000000 loops=1)
            Output: c1, c2
            Features: format: heap-tuple, bulkload: supported
            Buffers: shared hit=44275
Planning time: 0.117 ms
Execution time: 845.541 ms
(15 rows)
postgres=# explain (analyze,verbose,costs,buffers,timing) select count(*) from t1,t2,t3 where t1.c1=t2.c1 and t2.c1=t3.c1;
WARNING: 01000: failed on cuCtxSynchronize: CUDA_ERROR_ASSERT - device-side assert triggered
LOCATION: pgstrom_release_gpucontext, cuda_control.c:974
WARNING: 01000: failed on cuCtxSynchronize: CUDA_ERROR_ASSERT - device-side assert triggered
LOCATION: pgstrom_release_gpucontext, cuda_control.c:974
WARNING: 01000: AbortTransaction while in ABORT state
LOCATION: AbortTransaction, xact.c:2471
ERROR: XX000: failed on cuEventElapsedTime: CUDA_ERROR_ASSERT - device-side assert triggered
LOCATION: gpujoin_task_complete, gpujoin.c:3404
ERROR: XX000: failed on cuMemFree: CUDA_ERROR_ASSERT - device-side assert triggered
LOCATION: gpuMemFreeAll, cuda_control.c:713
postgres=# explain (analyze,verbose,costs,buffers,timing) select count(*) from t1 natural join t2;
WARNING: 01000: failed on cuStreamDestroy: CUDA_ERROR_ASSERT - device-side assert triggered
LOCATION: pgstrom_cleanup_gputask_cuda_resources, cuda_control.c:1718
WARNING: 01000: failed on cuStreamDestroy: CUDA_ERROR_ASSERT - device-side assert triggered
LOCATION: pgstrom_cleanup_gputask_cuda_resources, cuda_control.c:1718
WARNING: 01000: failed on cuCtxSynchronize: CUDA_ERROR_ASSERT - device-side assert triggered
LOCATION: pgstrom_release_gpucontext, cuda_control.c:974
WARNING: 01000: failed on cuCtxSynchronize: CUDA_ERROR_ASSERT - device-side assert triggered
LOCATION: pgstrom_release_gpucontext, cuda_control.c:974
WARNING: 01000: AbortTransaction while in ABORT state
LOCATION: AbortTransaction, xact.c:2471
ERROR: XX000: failed on cuMemFree: CUDA_ERROR_ASSERT - device-side assert triggered
LOCATION: __gpuMemFree, cuda_control.c:645
ERROR: XX000: failed on cuMemFree: CUDA_ERROR_ASSERT - device-side assert triggered
LOCATION: gpuMemFreeAll, cuda_control.c:701
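The JOIN queries ran into the errors above. I have not studied CUDA, so for now here is just the PGStrom code that raises two of these messages: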
/*
 * pgstrom_cleanup_gputask_cuda_resources
 *
 * it clears a common cuda resources; assigned on cb_task_process
 */
void
pgstrom_cleanup_gputask_cuda_resources(GpuTask *gtask)
{
    CUresult rc;

    if (gtask->cuda_stream)
    {
        rc = cuStreamDestroy(gtask->cuda_stream);
        if (rc != CUDA_SUCCESS)
            elog(WARNING, "failed on cuStreamDestroy: %s", errorText(rc));
    }
    gtask->cuda_index = 0;
    gtask->cuda_context = NULL;
    gtask->cuda_device = 0UL;
    gtask->cuda_stream = NULL;
    gtask->cuda_module = NULL;
}

void
__gpuMemFree(GpuContext *gcontext, int cuda_index, CUdeviceptr chunk_addr)
{
    GpuMemHead *gm_head;
    GpuMemBlock *gm_block;
    GpuMemChunk *gm_chunk;
    GpuMemChunk *gm_prev;
    GpuMemChunk *gm_next;
    dlist_node *dnode;
    dlist_iter iter;
    CUresult rc;
    int index;
    ......
    rc = cuMemFree(gm_block->block_addr);
    if (rc != CUDA_SUCCESS)
        elog(ERROR, "failed on cuMemFree: %s", errorText(rc));
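One detail helps when reading these messages. According to the CUDA driver API documentation, once CUDA_ERROR_ASSERT has been raised the CUDA context can no longer be used, so later calls on it keep failing. The failures in cuStreamDestroy, cuCtxSynchronize and cuMemFree are therefore most likely follow-on effects of a single device-side assert in a GPU kernel, not separate bugs in the cleanup code quoted here. A minimal sketch that tallies the failing calls from a saved log makes this visible; the function name `summarize_cuda_failures` is hypothetical.

```shell
#!/bin/sh
# summarize_cuda_failures: read server or psql log text on stdin and print
# "<count> <call>: <error>" for each distinct "failed on <call>: <error>" pair.
summarize_cuda_failures() {
    grep -o 'failed on cu[A-Za-z]*: CUDA_ERROR_[A-Z_]*' |
        sort | uniq -c |
        awk '{ print $1, $4, $5 }'
}

# Example call (the log file name is illustrative):
# summarize_cuda_failures < psql_session.log
```

If every line of the summary carries the same error code, as it does for the two JOIN attempts in this post, the place to look is the kernel that triggered the assert, not the individual cleanup calls.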
[References]
1. https://wiki.postgresql.org/wiki/PGStrom