Functional GPU Algorithm
1. Compact steps:
(1) PREDICATE: run a predicate on every input element;
(2) build a scan-in array the same size as the input: 1 where the predicate is true, 0 otherwise;
(3) SCAN: run an exclusive sum scan over the scan-in array; its output gives the scatter addresses for the compacted array;
(4) SCATTER: for every input element whose predicate is true, scatter it into the output array at the address from the scan output.
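The three phases can be sketched serially (a minimal Python sketch of the logic; the sample input and predicate are made up for illustration, and on the GPU each loop body would run as one thread):

```python
def compact(xs, predicate):
    # PREDICATE: 1 where the predicate holds, 0 elsewhere
    flags = [1 if predicate(x) else 0 for x in xs]
    # SCAN: an exclusive sum scan gives each surviving element its output address
    addresses, running = [], 0
    for f in flags:
        addresses.append(running)
        running += f
    # SCATTER: write each surviving element to its scanned address
    out = [None] * running
    for x, f, a in zip(xs, flags, addresses):
        if f:
            out[a] = x
    return out

print(compact([3, 1, 4, 1, 5, 9, 2, 6], lambda x: x % 2 == 0))  # [4, 2, 6]
```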
2. Allocation strategy:
Allocating the maximum possible space for the intermediate array has drawbacks: it wastes memory, and the large intermediate array then has to be scanned.
The right approach: build a list of allocation requests over the input elements, then read back a location at which to write your output.
The 1 after IN means that element wants to write one output to address 0; the following 0 means it writes nothing; the next 1 means write one output to address 1; the 2 means write two outputs to addresses 2 and 3, and so on.
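The same exclusive sum scan drives this allocation; a serial Python sketch (the request counts below are illustrative, not from the missing figure):

```python
# How many outputs each input element wants to write:
requests = [1, 0, 1, 2, 0, 3]

# Exclusive sum scan -> the base address where each element's outputs go,
# and the total gives the exact output size to allocate.
base, total = [], 0
for r in requests:
    base.append(total)
    total += r

print(base, total)  # [0, 1, 1, 2, 4, 4] 7
```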
· Scan is the core of the fastest GPU sorts.
3. Segmented scan
Sometimes an application needs many small scans, and it is not worth launching a kernel for each one. Instead, pack them all into one large array and run a segmented scan.
The figure shows a segmented exclusive sum scan. A second array marks the segments: 1 for a segment head, 0 otherwise.
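A serial Python sketch of a segmented exclusive sum scan with head flags (example data made up, since the figure is not reproduced here):

```python
def segmented_exclusive_scan(values, heads):
    # heads[i] == 1 marks the first element of a segment; the running
    # sum restarts there, so each segment gets its own exclusive scan.
    out, running = [], 0
    for v, h in zip(values, heads):
        if h:
            running = 0
        out.append(running)
        running += v
    return out

print(segmented_exclusive_scan([1, 2, 3, 4, 5], [1, 0, 1, 0, 0]))  # [0, 1, 0, 3, 7]
```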
4. Sparse matrix/dense vector multiplication (SpMV)
We want an efficient way to strip the zeros out of a sparse matrix (saving space and computation) and multiply the remaining elements by the vector.
Compressed sparse row representation: three vectors together describe the sparse matrix
VALUE: all of the nonzero elements
COLUMN: the column each element sits in
ROWPTR: an exclusive sum scan of the per-row element counts (i.e., the position of the first element of each row)
(1) Take the value vector and the row-pointer vector -> together they build a segmented representation of the value vector.
(2) Gather by column index -> build the matching list of entries from the vector being multiplied (as shown in the figure below).
(3) Multiply that list elementwise with the segmented value vector.
(4) Segmented exclusive sum scan.
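A serial Python sketch of SpMV over the three CSR vectors (the 3×3 example matrix is made up; the per-row loop replaces the segmented scan/reduce that a GPU would use):

```python
def csr_spmv(value, column, rowptr, x):
    # rowptr[i]:rowptr[i+1] delimits the nonzeros of row i, so each row's
    # dot product is a gather (by COLUMN), a multiply, and a sum.
    y = []
    for i in range(len(rowptr) - 1):
        y.append(sum(value[k] * x[column[k]] for k in range(rowptr[i], rowptr[i + 1])))
    return y

# [1 0 2]        VALUE  = [1, 2, 3, 4, 5]
# [3 4 0]   ->   COLUMN = [0, 2, 0, 1, 2]
# [0 0 5]        ROWPTR = [0, 2, 4, 5]
print(csr_spmv([1, 2, 3, 4, 5], [0, 2, 0, 1, 2], [0, 2, 4, 5], [1, 1, 1]))  # [3, 7, 5]
```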
5. Sorting
What makes a parallel algorithm effective: it keeps the hardware busy, it limits branching, and it can use coalesced memory access.
[Odd-Even Sort (Bubble Sort)] steps:
(1) mark even positions red and odd positions black, and pair up adjacent elements;
(2) check whether each pair is in order; if not, swap the elements within the pair;
(3) repeat (1) and (2) until the whole sequence is sorted.
Complexity: STEP is O(n) (in the worst case an element must travel from the far left to the far right); WORK is O(n²) (O(n) steps, each doing O(n) comparisons).
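A serial Python sketch of odd-even transposition sort (on the GPU every pair in the inner loop would be compared by its own thread in one step):

```python
def odd_even_sort(a):
    a = list(a)
    n = len(a)
    for step in range(n):                 # n steps suffice to sort
        start = step % 2                  # alternate even and odd pairings
        for i in range(start, n - 1, 2):  # each pair compares independently
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_sort([5, 2, 4, 1, 3]))  # [1, 2, 3, 4, 5]
```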
[Merge Sort]
Many small merges can run in parallel; the only primitive needed is merging two sorted lists into one.
Complexity: STEP is O(log n), WORK is O(n log n).
Serial merge: the input is two sorted sequences and the output is one merged sequence. Compare the heads of the two sequences, emit the smaller element, and advance that sequence by one. This serial approach is not very efficient.
Parallel merge: one thread per element; each element independently finds the position it should occupy in the merged list, and all elements run in parallel. Each element takes log n steps, so the STEP complexity is O(log n) and the total WORK is O(n log n).
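A Python sketch of the parallel merge idea: each element's final index is its own index plus the count of smaller elements in the other list, found by binary search (the per-element body is what each GPU thread would run independently):

```python
import bisect

def parallel_merge(a, b):
    # Each element finds its own output slot; no element depends on another.
    out = [None] * (len(a) + len(b))
    for i, x in enumerate(a):
        out[i + bisect.bisect_left(b, x)] = x
    for j, y in enumerate(b):
        # Right-biased search breaks ties so equal elements from a come first.
        out[j + bisect.bisect_right(a, y)] = y
    return out

print(parallel_merge([1, 3, 5], [2, 3, 6]))  # [1, 2, 3, 3, 5, 6]
```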
Merge sort's final stage leaves one very large merge, which leaves most of the SMs idle. The fix: break the huge merge into small pieces and let each SM process one piece in parallel.
As the figure shows, merge sort runs in three stages. In the first stage the number of problems far exceeds the number of SMs; the second stage solves each merge with one thread block, where the number of problems is comparable to the number of SMs; the third stage throws all thread blocks at a single problem, splitting it into many subproblems to keep the SMs busy.
Third stage: take every 256th element (for example) of each list being merged as a splitter. Sort the splitters of the two lists into one splitter list, then send the elements falling between each pair of adjacent splitters to the same SM to be merged. This guarantees that no SM receives too large a task.
[Sorting Networks]
Oblivious algorithm: performs the same steps no matter what the input is. Well suited to GPUs.
The figure shows a bitonic sorting network. For a random sequence:
(1) the two boxes in the first column sort the front half and the back half into two sorted sequences;
(2) the boxes in the second column compare the two sorted sequences pairwise, splitting them into a sequence of larger values and a sequence of smaller values;
(3) the third column sorts each of those two sequences.
Complexity: STEP is O(log² n).
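A serial Python sketch of the standard iterative bitonic sort: the compare-exchange schedule is fixed in advance (oblivious), so every inner-loop comparison at a given `(k, j)` could run as one GPU thread in one step. The input length must be a power of two:

```python
def bitonic_sort(a):
    a = list(a)
    n = len(a)
    k = 2
    while k <= n:         # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:      # compare-exchange distance within each merge
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap whenever the pair violates its block's direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

print(bitonic_sort([7, 2, 8, 1, 5, 6, 3, 4]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```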
[Radix Sort]
Radix sort is not a comparison sort; it works digit by digit, which makes it a very good fit for the GPU.
As the figure shows, write the numbers in binary and start from the least significant bit: move the elements whose lowest bit is 0 to the front (in C: (i & 1) == 0) and those whose bit is 1 to the back, preserving their relative order. Then do the same for the second-lowest bit, and so on through all of the bits, after which the sequence is sorted.
Complexity: O(kn), where k is the number of binary digits and n is the number of items to sort.
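A serial Python sketch of LSD binary radix sort: each pass is exactly the stable compact-style split described above (zeros to the front, ones to the back). The sample values are arbitrary:

```python
def radix_sort_bits(a, nbits=32):
    for bit in range(nbits):
        # Stable split on the current bit; order within each group is kept.
        zeros = [x for x in a if (x >> bit) & 1 == 0]
        ones  = [x for x in a if (x >> bit) & 1 == 1]
        a = zeros + ones
    return a

print(radix_sort_bits([170, 45, 75, 90, 2, 24, 802, 66], nbits=10))
```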
[Quicksort]
Choose the first number as the pivot, and partition all numbers into three groups: less than, equal to, and greater than the pivot. Then choose the first number of each group as a new pivot and recurse.
Because older GPUs did not support this kind of recursion, the algorithm needed a special mapping onto the GPU: segments.
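The three-way partition itself is simple; a recursive Python sketch of the logic (on a GPU the recursion would be flattened into segmented passes, as noted above):

```python
def quicksort3(a):
    if len(a) <= 1:
        return list(a)
    pivot = a[0]  # first element is the pivot
    less    = [x for x in a if x < pivot]
    equal   = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    # Recurse on the strict sides; equal elements are already placed.
    return quicksort3(less) + equal + quicksort3(greater)

print(quicksort3([3, 6, 2, 3, 8, 1]))  # [1, 2, 3, 3, 6, 8]
```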
[Key-Value Sort]
All of the sorts above order keys alone. To sort key-value pairs, the values must move together with the keys, but there is no need to drag each object's entire data structure through the sort; sorting a pointer to it is enough.
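A small Python sketch of the pointer idea: sort cheap indices by key instead of shuffling the records themselves (the records here are hypothetical score/pixel pairs in the spirit of the assignment below):

```python
records = [{"score": 0.9, "pixel": 7},
           {"score": 0.1, "pixel": 3},
           {"score": 0.5, "pixel": 5}]

# Only the indices move during the sort; the records stay in place.
order = sorted(range(len(records)), key=lambda i: records[i]["score"])
print([records[i]["pixel"] for i in order])  # [3, 5, 7]
```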
6. Assignment: removing red eye from a photo
Steps:
(1) Stencil. Compute a score for every pixel estimating how likely it is to belong to a red eye (normalized cross-correlation); this is naturally expressed as a stencil operation.
(2) Sort. Sort the pixels by score; the high-scoring pixels are the ones likely to belong to the red eye.
(3) Map. Reduce the redness of the high-scoring pixels.
#include "reference_calc.cpp"
#include "utils.h"
#include <float.h>
#include <math.h>
#include <stdio.h>
/* Red Eye Removal
   ===============

   For this assignment we are implementing red eye removal. This is
   accomplished by first creating a score for every pixel that tells us how
   likely it is to be a red eye pixel. We have already done this for you - you
   are receiving the scores and need to sort them in ascending order so that we
   know which pixels to alter to remove the red eye.

   Note: ascending order == smallest to largest

   Each score is associated with a position; when you sort the scores, you must
   also move the positions accordingly.

   Implementing Parallel Radix Sort with CUDA
   ==========================================

   The basic idea is to construct a histogram on each pass of how many of each
   "digit" there are. Then we scan this histogram so that we know where to put
   the output of each digit. For example, the first 1 must come after all the
   0s, so we have to know how many 0s there are to be able to start moving 1s
   into the correct position.

   1) Histogram of the number of occurrences of each digit
   2) Exclusive Prefix Sum of Histogram
   3) Determine relative offset of each digit
      For example [0 0 1 1 0 0 1]
               -> [0 1 0 1 2 3 2]
   4) Combine the results of steps 2 & 3 to determine the final
      output location for each element and move it there

   LSB Radix sort is an out-of-place sort and you will need to ping-pong values
   between the input and output buffers we have provided. Make sure the final
   sorted results end up in the output buffer! Hint: You may need to do a copy
   at the end.
*/


const int MAX_THREADS_PER_BLOCK = 512;


/*---------------------------------------------------------------------------------*/


///////////////////////////////////////////////////////
//--------------------- KERNELS ---------------------//
///////////////////////////////////////////////////////
__global__ void split_array(unsigned int* d_inputVals, unsigned int* d_splitVals,
                            const size_t numElems, unsigned int mask,
                            unsigned int ibit)
{
    int array_idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (array_idx >= numElems) return;

    // Split based on whether the inputVals digit is 1 or 0:
    d_splitVals[array_idx] = !(d_inputVals[array_idx] & mask);
}


__global__ void blelloch_scan_single_block(unsigned int* d_in_array,
                                           const size_t numBins,
                                           unsigned normalization=0)
/*
  Computes the Blelloch exclusive scan for a cumulative distribution function of
  a histogram, one block at a time.

  \Params:
    * d_in_array - input array of histogram values in each bin. Gets converted
      to a cdf by the end of the function.
    * numBins - number of bins in the histogram (must be <= 2*MAX_THREADS_PER_BLOCK)
    * normalization - constant value to add to all bins
      (when doing a full exclusive sum scan over multiple blocks).
*/
{
    int thid = threadIdx.x;

    // Shared memory holds unsigned counts; a float buffer here would lose
    // exactness for values above 2^24.
    extern __shared__ unsigned int temp_array[];

    // Make sure that we do not read from an undefined part of the array if it
    // is smaller than the number of threads that we have defined. In that
    // case, the final values of the input array are extended with zeros.
    if (thid < numBins) temp_array[thid] = d_in_array[thid];
    else temp_array[thid] = 0;
    if ((thid + numBins/2) < numBins)
        temp_array[thid + numBins/2] = d_in_array[thid + numBins/2];
    else temp_array[thid + numBins/2] = 0;

    __syncthreads();

    // Part 1: Up sweep (reduction).
    // Iterate log_2(numBins) times; each element adds the value 'stride'
    // elements away to its own value.
    int stride = 1;
    for (int d = numBins>>1; d > 0; d >>= 1) {
        if (thid < d) {
            int neighbor = stride*(2*thid+1) - 1;
            int index    = stride*(2*thid+2) - 1;

            temp_array[index] += temp_array[neighbor];
        }
        stride *= 2;
        __syncthreads();
    }
    // Now set the last element to the identity:
    if (thid == 0) temp_array[numBins-1] = 0;

    // Part 2: Down sweep.
    // Iterate log(n) times. Each thread adds the value stride elements away to
    // its own value, and sets the value stride elements away to its own
    // previous value.
    for (int d = 1; d < numBins; d *= 2) {
        stride >>= 1;
        __syncthreads();

        if (thid < d) {
            int neighbor = stride*(2*thid+1) - 1;
            int index    = stride*(2*thid+2) - 1;

            unsigned int t = temp_array[neighbor];
            temp_array[neighbor] = temp_array[index];
            temp_array[index] += t;
        }
    }

    __syncthreads();

    if (thid < numBins) d_in_array[thid] = temp_array[thid] + normalization;
    if ((thid + numBins/2) < numBins)
        d_in_array[thid + numBins/2] = temp_array[thid + numBins/2] + normalization;
}


__global__ void compute_outputPos(const unsigned int* d_inputVals,
                                  unsigned int* d_outputVals,
                                  unsigned int* d_outputPos, unsigned int* d_tVals,
                                  const unsigned int* d_splitVals,
                                  const unsigned int* d_cdf, const unsigned int totalFalses,
                                  const unsigned int numElems)
{
    int thid = threadIdx.x;
    int global_id = blockIdx.x*blockDim.x + thid;
    if (global_id >= numElems) return;

    d_tVals[global_id] = global_id - d_cdf[global_id] + totalFalses;

    unsigned int scatter = (!(d_splitVals[global_id]) ?
                            d_tVals[global_id] : d_cdf[global_id]);
    d_outputPos[global_id] = scatter;
}


__global__ void do_scatter(unsigned int* d_outputVals, const unsigned int* d_inputVals,
                           unsigned int* d_outputPos,
                           unsigned int* d_inputPos,
                           unsigned int* d_scatterAddr,
                           const unsigned int numElems)
{
    int global_id = blockIdx.x*blockDim.x + threadIdx.x;
    if (global_id >= numElems) return;

    d_outputVals[d_outputPos[global_id]] = d_inputVals[global_id];
    d_scatterAddr[d_outputPos[global_id]] = d_inputPos[global_id];
}


///////////////////////////////////////////////////////////
//--------------------- END KERNELS ---------------------//
///////////////////////////////////////////////////////////


void full_blelloch_exclusive_scan(unsigned int* d_binScan, const size_t totalNumElems)
/*
  NOTE: blelloch_scan_single_block() does an exclusive sum scan over
  an array (balanced tree) of size 2*MAX_THREADS_PER_BLOCK, by
  performing the up and down sweep of the scan in shared memory (which
  is limited in size).

  In order to scan over an entire array of size >
  2*MAX_THREADS_PER_BLOCK, we employ the following procedure:

    1) Compute the total number of blocks of size 2*MAX_THREADS_PER_BLOCK.
    2) Loop over each block and compute a partial array of
       2*MAX_THREADS_PER_BLOCK bins.
    3) Give this partial array to blelloch_scan_single_block() and let
       it return the sum scan.
    4) Now one has a full array of partial sum scans; take the
       last element of block j-1 and add it to each element of block j.

  \Params:
    * d_binScan - starts out as the "histogram" (in this case the
      split array) that we will perform an exclusive scan over.
    * totalNumElems - total number of elements in the d_binScan array to
      perform an exclusive scan over.
*/
{
    int nthreads = MAX_THREADS_PER_BLOCK;
    int nblocksTotal = (totalNumElems/2 - 1) / nthreads + 1;
    int partialBins = 2*nthreads;
    int smSize = partialBins*sizeof(unsigned);

    // Need a balanced (zero-padded) d_binScan array so that on the final
    // block the correct values are given to d_partialBinScan:
    //   1. define a balanced bin scan,
    //   2. set all values to zero,
    //   3. copy all of binScan into binScanBalanced.
    unsigned int* d_binScanBalanced;
    unsigned int balanced_size = nblocksTotal*partialBins*sizeof(unsigned);
    checkCudaErrors(cudaMalloc((void**)&d_binScanBalanced, balanced_size));
    checkCudaErrors(cudaMemset(d_binScanBalanced, 0, balanced_size));
    checkCudaErrors(cudaMemcpy(d_binScanBalanced, d_binScan,
                               totalNumElems*sizeof(unsigned),
                               cudaMemcpyDeviceToDevice));

    unsigned int* d_partialBinScan;
    checkCudaErrors(cudaMalloc((void**)&d_partialBinScan, partialBins*sizeof(unsigned)));

    unsigned int* normalization = (unsigned*)malloc(sizeof(unsigned));
    unsigned int* lastVal = (unsigned*)malloc(sizeof(unsigned));
    for (unsigned iblock = 0; iblock < nblocksTotal; iblock++) {
        unsigned offset = iblock*partialBins;

        // Copy this binScan partition into partialBinScan:
        checkCudaErrors(cudaMemcpy(d_partialBinScan, (d_binScanBalanced + offset),
                                   smSize, cudaMemcpyDeviceToDevice));

        if (iblock > 0) {
            // Normalization = final value in the previous cdf block + last value
            // in the original array at that position:
            checkCudaErrors(cudaMemcpy(normalization, (d_binScanBalanced + (offset-1)),
                                       sizeof(unsigned), cudaMemcpyDeviceToHost));
            checkCudaErrors(cudaMemcpy(lastVal, (d_binScan + (offset-1)),
                                       sizeof(unsigned), cudaMemcpyDeviceToHost));
            *normalization += (*lastVal);
        } else *normalization = 0;

        blelloch_scan_single_block<<<1, nthreads, smSize>>>(d_partialBinScan,
                                                            partialBins,
                                                            *normalization);
        cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());

        // Copy partialBinScan back into binScanBalanced:
        checkCudaErrors(cudaMemcpy((d_binScanBalanced+offset), d_partialBinScan, smSize,
                                   cudaMemcpyDeviceToDevice));
    }

    // binScanBalanced now needs to be copied back into d_binScan:
    checkCudaErrors(cudaMemcpy(d_binScan, d_binScanBalanced, totalNumElems*sizeof(unsigned),
                               cudaMemcpyDeviceToDevice));

    free(normalization);
    free(lastVal);
    checkCudaErrors(cudaFree(d_binScanBalanced));
    checkCudaErrors(cudaFree(d_partialBinScan));
}


void compute_scatter_addresses(const unsigned int* d_inputVals,
                               unsigned int* d_outputVals,
                               unsigned int* d_inputPos,
                               unsigned int* d_outputPos,
                               unsigned int* d_scatterAddr,
                               const unsigned int* const d_splitVals,
                               const unsigned int* const d_cdf,
                               const unsigned totalFalses,
                               const size_t numElems)
/*
  Modifies d_outputVals and d_outputPos.
*/
{
    unsigned int* d_tVals;
    checkCudaErrors(cudaMalloc((void**)&d_tVals, numElems*sizeof(unsigned)));

    int nthreads = MAX_THREADS_PER_BLOCK;
    int nblocks = (numElems - 1) / nthreads + 1;
    compute_outputPos<<<nblocks, nthreads>>>(d_inputVals, d_outputVals, d_outputPos,
                                             d_tVals, d_splitVals, d_cdf, totalFalses,
                                             numElems);
    // Debugging aid - remove in production:
    cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());

    do_scatter<<<nblocks, nthreads>>>(d_outputVals, d_inputVals, d_outputPos,
                                      d_inputPos, d_scatterAddr, numElems);
    // Debugging aid - remove in production:
    cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());

    checkCudaErrors(cudaFree(d_tVals));
}


void your_sort(unsigned int* const d_inputVals,
               unsigned int* const d_inputPos,
               unsigned int* const d_outputVals,
               unsigned int* const d_outputPos,
               const size_t numElems)
{
    //-----Set up-----
    const int numBits = 1;
    unsigned int* d_splitVals;
    checkCudaErrors(cudaMalloc((void**)&d_splitVals, numElems*sizeof(unsigned)));
    unsigned int* d_cdf;
    checkCudaErrors(cudaMalloc((void**)&d_cdf, numElems*sizeof(unsigned)));

    // d_scatterAddr keeps track of the scattered original addresses on every pass:
    unsigned int* d_scatterAddr;
    checkCudaErrors(cudaMalloc((void**)&d_scatterAddr, numElems*sizeof(unsigned)));
    checkCudaErrors(cudaMemcpy(d_scatterAddr, d_inputPos, numElems*sizeof(unsigned),
                               cudaMemcpyDeviceToDevice));

    // Need a global device array for the Blelloch scan:
    const int nBlellochBins = 1 << unsigned(log((long double)numElems)/log((long double)2) + 0.5);
    unsigned int* d_blelloch;
    checkCudaErrors(cudaMalloc((void**)&d_blelloch, nBlellochBins*sizeof(unsigned)));
    //printf(" numElems: %lu, numBlellochBins: %d \n", numElems, nBlellochBins);

    unsigned int* d_inVals = d_inputVals;
    unsigned int* d_inPos = d_inputPos;
    unsigned int* d_outVals = d_outputVals;
    unsigned int* d_outPos = d_outputPos;

    // Host-side staging buffers (freed at the end):
    unsigned int* h_splitVals = (unsigned*)malloc(numElems*sizeof(unsigned));
    unsigned int* h_cdf = (unsigned*)malloc(numElems*sizeof(unsigned));
    unsigned int* h_inVals = (unsigned*)malloc(numElems*sizeof(unsigned));
    unsigned int* h_outVals = (unsigned*)malloc(numElems*sizeof(unsigned));
    unsigned int* h_inPos = (unsigned*)malloc(numElems*sizeof(unsigned));
    unsigned int* h_outPos = (unsigned*)malloc(numElems*sizeof(unsigned));

    // Parallel radix sort - for each pass (each bit):
    //   1) Split values based on the current bit
    //   2) Scan the values of the split array
    //   3) Compute the scatter output position
    //   4) Scatter output values using inputVals and outputPos
    for (unsigned ibit = 0; ibit < 8 * sizeof(unsigned); ibit += numBits) {

        checkCudaErrors(cudaMemset(d_splitVals, 0, numElems*sizeof(unsigned)));
        checkCudaErrors(cudaMemset(d_cdf, 0, numElems*sizeof(unsigned)));
        checkCudaErrors(cudaMemset(d_blelloch, 0, nBlellochBins*sizeof(unsigned)));

        // Step 1: Split values on True if the given bit is 0.
        // NOTE: mask = [1 2 4 8 ... 2147483648]
        //              [2^0, 2^1, ... 2^31]
        unsigned int mask = 1 << ibit;
        int nthreads = MAX_THREADS_PER_BLOCK;
        int nblocks = (numElems - 1)/nthreads + 1;
        split_array<<<nblocks, nthreads>>>(d_inVals, d_splitVals, numElems,
                                           mask, ibit);
        // Debugging aid - remove in production:
        cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());

        checkCudaErrors(cudaMemcpy(d_cdf, d_splitVals, numElems*sizeof(unsigned),
                                   cudaMemcpyDeviceToDevice));

        // Step 2: Scan the values of the split array
        // using a Blelloch exclusive scan:
        full_blelloch_exclusive_scan(d_cdf, numElems);

        // Step 3: Compute scatter addresses.
        // Get totalFalses = last split flag + last cdf value:
        unsigned totalFalses = 0;
        checkCudaErrors(cudaMemcpy(h_splitVals, d_splitVals + (numElems-1), sizeof(unsigned),
                                   cudaMemcpyDeviceToHost));
        checkCudaErrors(cudaMemcpy(h_cdf, d_cdf + (numElems-1), sizeof(unsigned),
                                   cudaMemcpyDeviceToHost));
        totalFalses = h_splitVals[0] + h_cdf[0];
        compute_scatter_addresses(d_inVals, d_outVals, d_inPos, d_outPos, d_scatterAddr,
                                  d_splitVals, d_cdf, totalFalses, numElems);

        // Swap pointers for the next pass:
        std::swap(d_inVals, d_outVals);
        std::swap(d_inPos, d_scatterAddr);
    }

    // After an even number of passes the swapped pointers point back at the
    // input buffers, so copy the sorted results into the output buffers:
    checkCudaErrors(cudaMemcpy(d_outputVals, d_inputVals, numElems*sizeof(unsigned),
                               cudaMemcpyDeviceToDevice));
    checkCudaErrors(cudaMemcpy(d_outputPos, d_inputPos, numElems*sizeof(unsigned),
                               cudaMemcpyDeviceToDevice));

    // Put the scatter addresses (-> inPos) into d_outputPos:
    checkCudaErrors(cudaMemcpy(d_outputPos, d_inPos, numElems*sizeof(unsigned),
                               cudaMemcpyDeviceToDevice));

    checkCudaErrors(cudaFree(d_splitVals));
    checkCudaErrors(cudaFree(d_cdf));
    checkCudaErrors(cudaFree(d_blelloch));

    free(h_splitVals);
    free(h_cdf);
    free(h_inVals);
    free(h_outVals);
    free(h_inPos);
    free(h_outPos);
}