Functional GPU Algorithm
1. Compact steps:
(1) PREDICATE: run a predicate on every input element;
(2) build a scan-in array the same size as the input: 1 where the predicate is true, 0 otherwise;
(3) SCAN: run an exclusive sum scan over the scan-in array; its output gives the scatter addresses for the compacted array;
(4) SCATTER: for every input element whose predicate is true, scatter it into the output array at the address from the scan output.
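The three phases can be sketched serially (a minimal Python sketch of the logic; the sample input and predicate are made up for illustration, and on the GPU each loop body would run as one thread):

```python
def compact(xs, predicate):
    # PREDICATE: 1 where the predicate holds, 0 elsewhere
    flags = [1 if predicate(x) else 0 for x in xs]
    # SCAN: an exclusive sum scan gives each surviving element its output address
    addresses, running = [], 0
    for f in flags:
        addresses.append(running)
        running += f
    # SCATTER: write each surviving element to its scanned address
    out = [None] * running
    for x, f, a in zip(xs, flags, addresses):
        if f:
            out[a] = x
    return out

print(compact([3, 1, 4, 1, 5, 9, 2, 6], lambda x: x % 2 == 0))  # [4, 2, 6]
```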
2. Allocation strategy:
Allocating the maximum possible space for the intermediate array has drawbacks: it wastes memory, and the large intermediate array then has to be scanned.
The right approach: build a list of allocation requests over the input elements, then read back a location at which to write your output.
The 1 after IN means that element wants to write one output to address 0; the following 0 means it writes nothing; the next 1 means write one output to address 1; the 2 means write two outputs to addresses 2 and 3, and so on.
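The same exclusive sum scan drives this allocation; a serial Python sketch (the request counts below are illustrative, not from the missing figure):

```python
# How many outputs each input element wants to write:
requests = [1, 0, 1, 2, 0, 3]

# Exclusive sum scan -> the base address where each element's outputs go,
# and the total gives the exact output size to allocate.
base, total = [], 0
for r in requests:
    base.append(total)
    total += r

print(base, total)  # [0, 1, 1, 2, 4, 4] 7
```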
· Scan is the core of the fastest GPU sorts.
3. Segmented scan
Sometimes an application needs many small scans, and it is not worth launching a kernel for each one. Instead, pack them all into one large array and run a segmented scan.
The figure shows a segmented exclusive sum scan. A second array marks the segments: 1 for a segment head, 0 otherwise.
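A serial Python sketch of a segmented exclusive sum scan with head flags (example data made up, since the figure is not reproduced here):

```python
def segmented_exclusive_scan(values, heads):
    # heads[i] == 1 marks the first element of a segment; the running
    # sum restarts there, so each segment gets its own exclusive scan.
    out, running = [], 0
    for v, h in zip(values, heads):
        if h:
            running = 0
        out.append(running)
        running += v
    return out

print(segmented_exclusive_scan([1, 2, 3, 4, 5], [1, 0, 1, 0, 0]))  # [0, 1, 0, 3, 7]
```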
4. Sparse matrix/dense vector multiplication (SpMV)
We want an efficient way to strip the zeros out of a sparse matrix (saving space and computation) and multiply the remaining elements by the vector.
Compressed sparse row representation: three vectors together describe the sparse matrix
VALUE: all of the nonzero elements
COLUMN: the column each element sits in
ROWPTR: an exclusive sum scan of the per-row element counts (i.e., the position of the first element of each row)
(1) Take the value vector and the row-pointer vector -> together they build a segmented representation of the value vector.
(2) Gather by column index -> build the matching list of entries from the vector being multiplied (as shown in the figure below).
(3) Multiply that list elementwise with the segmented value vector.
(4) Segmented exclusive sum scan.
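A serial Python sketch of SpMV over the three CSR vectors (the 3×3 example matrix is made up; the per-row loop replaces the segmented scan/reduce that a GPU would use):

```python
def csr_spmv(value, column, rowptr, x):
    # rowptr[i]:rowptr[i+1] delimits the nonzeros of row i, so each row's
    # dot product is a gather (by COLUMN), a multiply, and a sum.
    y = []
    for i in range(len(rowptr) - 1):
        y.append(sum(value[k] * x[column[k]] for k in range(rowptr[i], rowptr[i + 1])))
    return y

# [1 0 2]        VALUE  = [1, 2, 3, 4, 5]
# [3 4 0]   ->   COLUMN = [0, 2, 0, 1, 2]
# [0 0 5]        ROWPTR = [0, 2, 4, 5]
print(csr_spmv([1, 2, 3, 4, 5], [0, 2, 0, 1, 2], [0, 2, 4, 5], [1, 1, 1]))  # [3, 7, 5]
```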
5. Sorting
What makes a parallel algorithm effective: it keeps the hardware busy, it limits branching, and it can use coalesced memory access.
[Odd-Even Sort (Bubble Sort)] steps:
(1) mark even positions red and odd positions black, and pair up adjacent elements;
(2) check whether each pair is in order; if not, swap the elements within the pair;
(3) repeat (1) and (2) until the whole sequence is sorted.
Complexity: STEP is O(n) (in the worst case an element must travel from the far left to the far right); WORK is O(n²) (O(n) steps, each doing O(n) comparisons).
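A serial Python sketch of odd-even transposition sort (on the GPU every pair in the inner loop would be compared by its own thread in one step):

```python
def odd_even_sort(a):
    a = list(a)
    n = len(a)
    for step in range(n):                 # n steps suffice to sort
        start = step % 2                  # alternate even and odd pairings
        for i in range(start, n - 1, 2):  # each pair compares independently
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_sort([5, 2, 4, 1, 3]))  # [1, 2, 3, 4, 5]
```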
[Merge Sort]
Many small merges can run in parallel; the only primitive needed is merging two sorted lists into one.
Complexity: STEP is O(log n), WORK is O(n log n).
Serial merge: the input is two sorted sequences and the output is one merged sequence. Compare the heads of the two sequences, emit the smaller element, and advance that sequence by one. This serial approach is not very efficient.
Parallel merge: one thread per element; each element independently finds the position it should occupy in the merged list, and all elements run in parallel. Each element takes log n steps, so the STEP complexity is O(log n) and the total WORK is O(n log n).
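A Python sketch of the parallel merge idea: each element's final index is its own index plus the count of smaller elements in the other list, found by binary search (the per-element body is what each GPU thread would run independently):

```python
import bisect

def parallel_merge(a, b):
    # Each element finds its own output slot; no element depends on another.
    out = [None] * (len(a) + len(b))
    for i, x in enumerate(a):
        out[i + bisect.bisect_left(b, x)] = x
    for j, y in enumerate(b):
        # Right-biased search breaks ties so equal elements from a come first.
        out[j + bisect.bisect_right(a, y)] = y
    return out

print(parallel_merge([1, 3, 5], [2, 3, 6]))  # [1, 2, 3, 3, 5, 6]
```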
Merge sort's final stage leaves one very large merge, which leaves most of the SMs idle. The fix: break the huge merge into small pieces and let each SM process one piece in parallel.
As the figure shows, merge sort runs in three stages. In the first stage the number of problems far exceeds the number of SMs; the second stage solves each merge with one thread block, where the number of problems is comparable to the number of SMs; the third stage throws all thread blocks at a single problem, splitting it into many subproblems to keep the SMs busy.
Third stage: take every 256th element (for example) of each list being merged as a splitter. Sort the splitters of the two lists into one splitter list, then send the elements falling between each pair of adjacent splitters to the same SM to be merged. This guarantees that no SM receives too large a task.
[Sorting Networks]
Oblivious algorithm: performs the same steps no matter what the input is. Well suited to GPUs.
The figure shows a bitonic sorting network. For a random sequence:
(1) the two boxes in the first column sort the front half and the back half into two sorted sequences;
(2) the boxes in the second column compare the two sorted sequences pairwise, splitting them into a sequence of larger values and a sequence of smaller values;
(3) the third column sorts each of those two sequences.
Complexity: STEP is O(log² n).
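A serial Python sketch of the standard iterative bitonic sort: the compare-exchange schedule is fixed in advance (oblivious), so every inner-loop comparison at a given `(k, j)` could run as one GPU thread in one step. The input length must be a power of two:

```python
def bitonic_sort(a):
    a = list(a)
    n = len(a)
    k = 2
    while k <= n:         # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:      # compare-exchange distance within each merge
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap whenever the pair violates its block's direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a

print(bitonic_sort([7, 2, 8, 1, 5, 6, 3, 4]))  # [1, 2, 3, 4, 5, 6, 7, 8]
```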
[Radix Sort]
Radix sort is not a comparison sort; it works digit by digit, which makes it a very good fit for the GPU.
As the figure shows, write the numbers in binary and start from the least significant bit: move the elements whose lowest bit is 0 to the front (in C: (i & 1) == 0) and those whose bit is 1 to the back, preserving their relative order. Then do the same for the second-lowest bit, and so on through all of the bits, after which the sequence is sorted.
Complexity: O(kn), where k is the number of binary digits and n is the number of items to sort.
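A serial Python sketch of LSD binary radix sort: each pass is exactly the stable compact-style split described above (zeros to the front, ones to the back). The sample values are arbitrary:

```python
def radix_sort_bits(a, nbits=32):
    for bit in range(nbits):
        # Stable split on the current bit; order within each group is kept.
        zeros = [x for x in a if (x >> bit) & 1 == 0]
        ones  = [x for x in a if (x >> bit) & 1 == 1]
        a = zeros + ones
    return a

print(radix_sort_bits([170, 45, 75, 90, 2, 24, 802, 66], nbits=10))
```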
[Quicksort]
Choose the first number as the pivot, and partition all numbers into three groups: less than, equal to, and greater than the pivot. Then choose the first number of each group as a new pivot and recurse.
Because older GPUs did not support this kind of recursion, the algorithm needed a special mapping onto the GPU: segments.
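The three-way partition itself is simple; a recursive Python sketch of the logic (on a GPU the recursion would be flattened into segmented passes, as noted above):

```python
def quicksort3(a):
    if len(a) <= 1:
        return list(a)
    pivot = a[0]  # first element is the pivot
    less    = [x for x in a if x < pivot]
    equal   = [x for x in a if x == pivot]
    greater = [x for x in a if x > pivot]
    # Recurse on the strict sides; equal elements are already placed.
    return quicksort3(less) + equal + quicksort3(greater)

print(quicksort3([3, 6, 2, 3, 8, 1]))  # [1, 2, 3, 3, 6, 8]
```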
[Key-Value Sort]
All of the sorts above order keys alone. To sort key-value pairs, the values must move together with the keys, but there is no need to drag each object's entire data structure through the sort; sorting a pointer to it is enough.
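A small Python sketch of the pointer idea: sort cheap indices by key instead of shuffling the records themselves (the records here are hypothetical score/pixel pairs in the spirit of the assignment below):

```python
records = [{"score": 0.9, "pixel": 7},
           {"score": 0.1, "pixel": 3},
           {"score": 0.5, "pixel": 5}]

# Only the indices move during the sort; the records stay in place.
order = sorted(range(len(records)), key=lambda i: records[i]["score"])
print([records[i]["pixel"] for i in order])  # [3, 5, 7]
```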
6. Assignment: removing red eye from a photo
Steps:
(1) Stencil. Compute a score for every pixel estimating how likely it is to belong to a red eye (normalized cross-correlation); this is naturally expressed as a stencil operation.
(2) Sort. Sort the pixels by score; the high-scoring pixels are the ones likely to belong to the red eye.
(3) Map. Reduce the redness of the high-scoring pixels.
#include "reference_calc.cpp"
#include "utils.h"
#include <float.h>
#include <math.h>
#include <stdio.h>
/* Red Eye Removal
   ===============

   For this assignment we are implementing red eye removal. This is
   accomplished by first creating a score for every pixel that tells us how
   likely it is to be a red eye pixel. We have already done this for you - you
   are receiving the scores and need to sort them in ascending order so that we
   know which pixels to alter to remove the red eye.

   Note: ascending order == smallest to largest

   Each score is associated with a position; when you sort the scores, you must
   also move the positions accordingly.

   Implementing Parallel Radix Sort with CUDA
   ==========================================

   The basic idea is to construct a histogram on each pass of how many of each
   "digit" there are. Then we scan this histogram so that we know where to put
   the output of each digit. For example, the first 1 must come after all the
   0s, so we have to know how many 0s there are to be able to start moving 1s
   into the correct position.

   1) Histogram of the number of occurrences of each digit
   2) Exclusive Prefix Sum of Histogram
   3) Determine relative offset of each digit
      For example [0 0 1 1 0 0 1]
               -> [0 1 0 1 2 3 2]
   4) Combine the results of steps 2 & 3 to determine the final
      output location for each element and move it there

   LSB Radix sort is an out-of-place sort and you will need to ping-pong values
   between the input and output buffers we have provided. Make sure the final
   sorted results end up in the output buffer! Hint: You may need to do a copy
   at the end.
*/


const int MAX_THREADS_PER_BLOCK = 512;


/*---------------------------------------------------------------------------------*/


///////////////////////////////////////////////////////
//--------------------- KERNELS ---------------------//
///////////////////////////////////////////////////////
__global__ void split_array(unsigned int* d_inputVals, unsigned int* d_splitVals,
                            const size_t numElems, unsigned int mask,
                            unsigned int ibit)
{
    int array_idx = blockIdx.x*blockDim.x + threadIdx.x;
    if (array_idx >= numElems) return;

    // Split based on whether the inputVals digit is 1 or 0:
    d_splitVals[array_idx] = !(d_inputVals[array_idx] & mask);
}


__global__ void blelloch_scan_single_block(unsigned int* d_in_array,
                                           const size_t numBins,
                                           unsigned normalization=0)
/*
  Computes the Blelloch exclusive scan for a cumulative distribution function of
  a histogram, one block at a time.

  \Params:
    * d_in_array - input array of histogram values in each bin. Gets converted
      to a cdf by the end of the function.
    * numBins - number of bins in the histogram (must be <= 2*MAX_THREADS_PER_BLOCK)
    * normalization - constant value to add to all bins
      (when doing a full exclusive sum scan over multiple blocks).
*/
{
    int thid = threadIdx.x;

    // Shared memory holds unsigned counts; a float buffer here would lose
    // exactness for values above 2^24.
    extern __shared__ unsigned int temp_array[];

    // Make sure that we do not read from an undefined part of the array if it
    // is smaller than the number of threads that we have defined. In that
    // case, the final values of the input array are extended with zeros.
    if (thid < numBins) temp_array[thid] = d_in_array[thid];
    else temp_array[thid] = 0;
    if ((thid + numBins/2) < numBins)
        temp_array[thid + numBins/2] = d_in_array[thid + numBins/2];
    else temp_array[thid + numBins/2] = 0;

    __syncthreads();

    // Part 1: Up sweep (reduction).
    // Iterate log_2(numBins) times; each element adds the value 'stride'
    // elements away to its own value.
    int stride = 1;
    for (int d = numBins>>1; d > 0; d >>= 1) {
        if (thid < d) {
            int neighbor = stride*(2*thid+1) - 1;
            int index    = stride*(2*thid+2) - 1;

            temp_array[index] += temp_array[neighbor];
        }
        stride *= 2;
        __syncthreads();
    }
    // Now set the last element to the identity:
    if (thid == 0) temp_array[numBins-1] = 0;

    // Part 2: Down sweep.
    // Iterate log(n) times. Each thread adds the value stride elements away to
    // its own value, and sets the value stride elements away to its own
    // previous value.
    for (int d = 1; d < numBins; d *= 2) {
        stride >>= 1;
        __syncthreads();

        if (thid < d) {
            int neighbor = stride*(2*thid+1) - 1;
            int index    = stride*(2*thid+2) - 1;

            unsigned int t = temp_array[neighbor];
            temp_array[neighbor] = temp_array[index];
            temp_array[index] += t;
        }
    }

    __syncthreads();

    if (thid < numBins) d_in_array[thid] = temp_array[thid] + normalization;
    if ((thid + numBins/2) < numBins)
        d_in_array[thid + numBins/2] = temp_array[thid + numBins/2] + normalization;
}


__global__ void compute_outputPos(const unsigned int* d_inputVals,
                                  unsigned int* d_outputVals,
                                  unsigned int* d_outputPos, unsigned int* d_tVals,
                                  const unsigned int* d_splitVals,
                                  const unsigned int* d_cdf, const unsigned int totalFalses,
                                  const unsigned int numElems)
{
    int thid = threadIdx.x;
    int global_id = blockIdx.x*blockDim.x + thid;
    if (global_id >= numElems) return;

    d_tVals[global_id] = global_id - d_cdf[global_id] + totalFalses;

    unsigned int scatter = (!(d_splitVals[global_id]) ?
                            d_tVals[global_id] : d_cdf[global_id]);
    d_outputPos[global_id] = scatter;
}


__global__ void do_scatter(unsigned int* d_outputVals, const unsigned int* d_inputVals,
                           unsigned int* d_outputPos,
                           unsigned int* d_inputPos,
                           unsigned int* d_scatterAddr,
                           const unsigned int numElems)
{
    int global_id = blockIdx.x*blockDim.x + threadIdx.x;
    if (global_id >= numElems) return;

    d_outputVals[d_outputPos[global_id]] = d_inputVals[global_id];
    d_scatterAddr[d_outputPos[global_id]] = d_inputPos[global_id];
}


///////////////////////////////////////////////////////////
//--------------------- END KERNELS ---------------------//
///////////////////////////////////////////////////////////


void full_blelloch_exclusive_scan(unsigned int* d_binScan, const size_t totalNumElems)
/*
  NOTE: blelloch_scan_single_block() does an exclusive sum scan over
  an array (balanced tree) of size 2*MAX_THREADS_PER_BLOCK, by
  performing the up and down sweep of the scan in shared memory (which
  is limited in size).

  In order to scan over an entire array of size >
  2*MAX_THREADS_PER_BLOCK, we employ the following procedure:

    1) Compute the total number of blocks of size 2*MAX_THREADS_PER_BLOCK.
    2) Loop over each block and compute a partial array of
       2*MAX_THREADS_PER_BLOCK bins.
    3) Give this partial array to blelloch_scan_single_block() and let
       it return the sum scan.
    4) Now one has a full array of partial sum scans; take the
       last element of block j-1 and add it to each element of block j.

  \Params:
    * d_binScan - starts out as the "histogram" (in this case the
      split array) that we will perform an exclusive scan over.
    * totalNumElems - total number of elements in the d_binScan array to
      perform an exclusive scan over.
*/
{
    int nthreads = MAX_THREADS_PER_BLOCK;
    int nblocksTotal = (totalNumElems/2 - 1) / nthreads + 1;
    int partialBins = 2*nthreads;
    int smSize = partialBins*sizeof(unsigned);

    // Need a balanced (zero-padded) d_binScan array so that on the final
    // block the correct values are given to d_partialBinScan:
    //   1. define a balanced bin scan,
    //   2. set all values to zero,
    //   3. copy all of binScan into binScanBalanced.
    unsigned int* d_binScanBalanced;
    unsigned int balanced_size = nblocksTotal*partialBins*sizeof(unsigned);
    checkCudaErrors(cudaMalloc((void**)&d_binScanBalanced, balanced_size));
    checkCudaErrors(cudaMemset(d_binScanBalanced, 0, balanced_size));
    checkCudaErrors(cudaMemcpy(d_binScanBalanced, d_binScan,
                               totalNumElems*sizeof(unsigned),
                               cudaMemcpyDeviceToDevice));

    unsigned int* d_partialBinScan;
    checkCudaErrors(cudaMalloc((void**)&d_partialBinScan, partialBins*sizeof(unsigned)));

    unsigned int* normalization = (unsigned*)malloc(sizeof(unsigned));
    unsigned int* lastVal = (unsigned*)malloc(sizeof(unsigned));
    for (unsigned iblock = 0; iblock < nblocksTotal; iblock++) {
        unsigned offset = iblock*partialBins;

        // Copy this binScan partition into partialBinScan:
        checkCudaErrors(cudaMemcpy(d_partialBinScan, (d_binScanBalanced + offset),
                                   smSize, cudaMemcpyDeviceToDevice));

        if (iblock > 0) {
            // Normalization = final value in the previous cdf block + last value
            // in the original array at that position:
            checkCudaErrors(cudaMemcpy(normalization, (d_binScanBalanced + (offset-1)),
                                       sizeof(unsigned), cudaMemcpyDeviceToHost));
            checkCudaErrors(cudaMemcpy(lastVal, (d_binScan + (offset-1)),
                                       sizeof(unsigned), cudaMemcpyDeviceToHost));
            *normalization += (*lastVal);
        } else *normalization = 0;

        blelloch_scan_single_block<<<1, nthreads, smSize>>>(d_partialBinScan,
                                                            partialBins,
                                                            *normalization);
        cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());

        // Copy partialBinScan back into binScanBalanced:
        checkCudaErrors(cudaMemcpy((d_binScanBalanced+offset), d_partialBinScan, smSize,
                                   cudaMemcpyDeviceToDevice));
    }

    // binScanBalanced now needs to be copied back into d_binScan:
    checkCudaErrors(cudaMemcpy(d_binScan, d_binScanBalanced, totalNumElems*sizeof(unsigned),
                               cudaMemcpyDeviceToDevice));

    free(normalization);
    free(lastVal);
    checkCudaErrors(cudaFree(d_binScanBalanced));
    checkCudaErrors(cudaFree(d_partialBinScan));
}


void compute_scatter_addresses(const unsigned int* d_inputVals,
                               unsigned int* d_outputVals,
                               unsigned int* d_inputPos,
                               unsigned int* d_outputPos,
                               unsigned int* d_scatterAddr,
                               const unsigned int* const d_splitVals,
                               const unsigned int* const d_cdf,
                               const unsigned totalFalses,
                               const size_t numElems)
/*
  Modifies d_outputVals and d_outputPos.
*/
{
    unsigned int* d_tVals;
    checkCudaErrors(cudaMalloc((void**)&d_tVals, numElems*sizeof(unsigned)));

    int nthreads = MAX_THREADS_PER_BLOCK;
    int nblocks = (numElems - 1) / nthreads + 1;
    compute_outputPos<<<nblocks, nthreads>>>(d_inputVals, d_outputVals, d_outputPos,
                                             d_tVals, d_splitVals, d_cdf, totalFalses,
                                             numElems);
    // Debugging aid - remove in production:
    cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());

    do_scatter<<<nblocks, nthreads>>>(d_outputVals, d_inputVals, d_outputPos,
                                      d_inputPos, d_scatterAddr, numElems);
    // Debugging aid - remove in production:
    cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());

    checkCudaErrors(cudaFree(d_tVals));
}


void your_sort(unsigned int* const d_inputVals,
               unsigned int* const d_inputPos,
               unsigned int* const d_outputVals,
               unsigned int* const d_outputPos,
               const size_t numElems)
{
    //-----Set up-----
    const int numBits = 1;
    unsigned int* d_splitVals;
    checkCudaErrors(cudaMalloc((void**)&d_splitVals, numElems*sizeof(unsigned)));
    unsigned int* d_cdf;
    checkCudaErrors(cudaMalloc((void**)&d_cdf, numElems*sizeof(unsigned)));

    // d_scatterAddr keeps track of the scattered original addresses on every pass:
    unsigned int* d_scatterAddr;
    checkCudaErrors(cudaMalloc((void**)&d_scatterAddr, numElems*sizeof(unsigned)));
    checkCudaErrors(cudaMemcpy(d_scatterAddr, d_inputPos, numElems*sizeof(unsigned),
                               cudaMemcpyDeviceToDevice));

    // Need a global device array for the Blelloch scan:
    const int nBlellochBins = 1 << unsigned(log((long double)numElems)/log((long double)2) + 0.5);
    unsigned int* d_blelloch;
    checkCudaErrors(cudaMalloc((void**)&d_blelloch, nBlellochBins*sizeof(unsigned)));
    //printf(" numElems: %lu, numBlellochBins: %d \n", numElems, nBlellochBins);

    unsigned int* d_inVals = d_inputVals;
    unsigned int* d_inPos = d_inputPos;
    unsigned int* d_outVals = d_outputVals;
    unsigned int* d_outPos = d_outputPos;

    // Host-side staging buffers (freed at the end):
    unsigned int* h_splitVals = (unsigned*)malloc(numElems*sizeof(unsigned));
    unsigned int* h_cdf = (unsigned*)malloc(numElems*sizeof(unsigned));
    unsigned int* h_inVals = (unsigned*)malloc(numElems*sizeof(unsigned));
    unsigned int* h_outVals = (unsigned*)malloc(numElems*sizeof(unsigned));
    unsigned int* h_inPos = (unsigned*)malloc(numElems*sizeof(unsigned));
    unsigned int* h_outPos = (unsigned*)malloc(numElems*sizeof(unsigned));

    // Parallel radix sort - for each pass (each bit):
    //   1) Split values based on the current bit
    //   2) Scan the values of the split array
    //   3) Compute the scatter output position
    //   4) Scatter output values using inputVals and outputPos
    for (unsigned ibit = 0; ibit < 8 * sizeof(unsigned); ibit += numBits) {

        checkCudaErrors(cudaMemset(d_splitVals, 0, numElems*sizeof(unsigned)));
        checkCudaErrors(cudaMemset(d_cdf, 0, numElems*sizeof(unsigned)));
        checkCudaErrors(cudaMemset(d_blelloch, 0, nBlellochBins*sizeof(unsigned)));

        // Step 1: Split values on True if the given bit is 0.
        // NOTE: mask = [1 2 4 8 ... 2147483648]
        //              [2^0, 2^1, ... 2^31]
        unsigned int mask = 1 << ibit;
        int nthreads = MAX_THREADS_PER_BLOCK;
        int nblocks = (numElems - 1)/nthreads + 1;
        split_array<<<nblocks, nthreads>>>(d_inVals, d_splitVals, numElems,
                                           mask, ibit);
        // Debugging aid - remove in production:
        cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());

        checkCudaErrors(cudaMemcpy(d_cdf, d_splitVals, numElems*sizeof(unsigned),
                                   cudaMemcpyDeviceToDevice));

        // Step 2: Scan the values of the split array
        // using a Blelloch exclusive scan:
        full_blelloch_exclusive_scan(d_cdf, numElems);

        // Step 3: Compute scatter addresses.
        // Get totalFalses = last split flag + last cdf value:
        unsigned totalFalses = 0;
        checkCudaErrors(cudaMemcpy(h_splitVals, d_splitVals + (numElems-1), sizeof(unsigned),
                                   cudaMemcpyDeviceToHost));
        checkCudaErrors(cudaMemcpy(h_cdf, d_cdf + (numElems-1), sizeof(unsigned),
                                   cudaMemcpyDeviceToHost));
        totalFalses = h_splitVals[0] + h_cdf[0];
        compute_scatter_addresses(d_inVals, d_outVals, d_inPos, d_outPos, d_scatterAddr,
                                  d_splitVals, d_cdf, totalFalses, numElems);

        // Swap pointers for the next pass:
        std::swap(d_inVals, d_outVals);
        std::swap(d_inPos, d_scatterAddr);
    }

    // After an even number of passes the swapped pointers point back at the
    // input buffers, so copy the sorted results into the output buffers:
    checkCudaErrors(cudaMemcpy(d_outputVals, d_inputVals, numElems*sizeof(unsigned),
                               cudaMemcpyDeviceToDevice));
    checkCudaErrors(cudaMemcpy(d_outputPos, d_inputPos, numElems*sizeof(unsigned),
                               cudaMemcpyDeviceToDevice));

    // Put the scatter addresses (-> inPos) into d_outputPos:
    checkCudaErrors(cudaMemcpy(d_outputPos, d_inPos, numElems*sizeof(unsigned),
                               cudaMemcpyDeviceToDevice));

    checkCudaErrors(cudaFree(d_splitVals));
    checkCudaErrors(cudaFree(d_cdf));
    checkCudaErrors(cudaFree(d_blelloch));

    free(h_splitVals);
    free(h_cdf);
    free(h_inVals);
    free(h_outVals);
    free(h_inPos);
    free(h_outPos);
}