In image processing, when performing histogram equalization, the histogram is treated as a probability distribution (pdf) and accumulated into a cumulative distribution function (cdf). For example:

pdf = { 1, 2, 3, 4, 5, 6, 7, 8 }
cdf = { 1, 3, 6, 10, 15, 21, 28, 36 }

The corresponding code is:

cdf[0] = pdf[0];
for (int i = 1; i < N; i++) {
    cdf[i] = cdf[i - 1] + pdf[i];
}
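Once the cdf is available, histogram equalization normalizes it into a lookup table that remaps pixel intensities. The snippet below is only a minimal sketch of that step, not part of the original program; the function name buildEqualizationLUT, cdf_min (the smallest non-zero cdf value) and totalPixels (the number of pixels) are illustrative assumptions:

// Sketch: build an 8-bit equalization lookup table from the cdf.
// cdf_min and totalPixels are assumed inputs, not from the original code.
void buildEqualizationLUT(const int cdf[256], int cdf_min, int totalPixels,
                          unsigned char lut[256])
{
    for (int v = 0; v < 256; v++) {
        long long scaled = (long long)(cdf[v] - cdf_min) * 255;     // avoid int overflow
        lut[v] = (unsigned char)(scaled / (totalPixels - cdf_min)); // map into [0, 255]
    }
}

The scan that produces the cdf is the part worth parallelizing, and that is the topic of the rest of this post.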
There are two kinds of SCAN algorithms. The inclusive form computes:

output[i] = output[i-1] operator input[i]

while the exclusive form computes:

output[i] = output[i-1] operator input[i-1]

The difference is whether the running result at position i includes the current input element or only the elements before it.
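As a concrete serial illustration, here is a minimal sketch using + as the operator (the function names are only illustrative):

// Serial reference versions of the two scan forms, with + as the operator.
void inclusiveScan(const int *in, int *out, int n)
{
    out[0] = in[0];                      // result includes the current element
    for (int i = 1; i < n; i++)
        out[i] = out[i - 1] + in[i];
}

void exclusiveScan(const int *in, int *out, int n)
{
    out[0] = 0;                          // identity element of +
    for (int i = 1; i < n; i++)
        out[i] = out[i - 1] + in[i - 1]; // result stops at the previous element
}

// For in = {1, 2, 3, 4, 5, 6, 7, 8}:
//   inclusive scan -> {1, 3, 6, 10, 15, 21, 28, 36}
//   exclusive scan -> {0, 1, 3, 6, 10, 15, 21, 28}

The pdf/cdf example above is therefore an inclusive scan.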
Because SCAN has a data dependency on the previous result, at first glance it looks hard to parallelize. However, if the operator is associative, the result does not depend on the order of evaluation, and with some rearrangement of the computation the scan can be parallelized. [1][2] introduce two such algorithms, Hillis Steele and Blelloch; the figure below shows the computation steps of the Hillis Steele algorithm:
The following code performs the whole scan inside a single block. Since a CUDA block currently has at most 1024 threads, the array a[] can hold at most 1024 elements. The implementation first copies a[] into shared memory and then operates only on that shared-memory copy, so the original contents of a[] are left unchanged.
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#define BlockSize 1
#define ThreadSize 19
#define ArraySize (BlockSize*ThreadSize)
__device__ void __syncthreads();
__global__ void scanHillisSteele(int *b, int *a)
{
__shared__ int x[ThreadSize];
int id = blockIdx.x *
blockDim.x + threadIdx.x;
x[threadIdx.x]
= a[id];
__syncthreads();
//wait copy compelete
for (int d = 1;
d<blockDim.x; d <<= 1)
{
if (threadIdx.x
>= d) {
x[threadIdx.x]
+= x[threadIdx.x - d];
} //keep
__syncthreads();
}
b[threadIdx.x] =
x[threadIdx.x];
}
int main()
{
    int host_a[ArraySize];
    int host_b[ArraySize];
    int *dev_a = 0;
    int *dev_b = 0;
    float elapsedTime;

    // setup performance meter from CUDA ----------
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaSetDevice(0);
    cudaMalloc((void**)&dev_a, ArraySize * sizeof(int));
    for (int i = 0; i < ArraySize; i++)
        host_a[i] = i + 1;
    cudaMemcpy(dev_a, host_a, ArraySize * sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc((void**)&dev_b, ArraySize * sizeof(int));
    //cudaMemset(dev_b, 0, ArraySize * sizeof(int));

    // Run scanHillisSteele
    cudaEventRecord(start, 0);                         // keep start time
    scanHillisSteele<<<BlockSize, ThreadSize>>>(dev_b, dev_a); // calculate
    cudaEventRecord(stop, 0);                          // keep stop time
    cudaEventSynchronize(stop);                        // wait for the stop event
    cudaEventElapsedTime(&elapsedTime, start, stop);
    cudaMemcpy(host_b, dev_b, ArraySize * sizeof(int), cudaMemcpyDeviceToHost);

    // Print result
    printf("pdf:\n");
    for (int i = 0; i < ArraySize; i++) {
        printf("%4d ", host_a[i]);
    }
    printf("\n");
    printf("cdf:\n");
    for (int i = 0; i < ArraySize; i++) {
        printf("%4d ", host_b[i]);
    }
    printf("\nt=%f\n\n", elapsedTime);
    //cudaDeviceSynchronize();
    getchar();
    cudaFree(dev_a);
    cudaFree(dev_b);
    return 0;
}
The execution result is shown below:
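Besides Hillis Steele, the references [1][2] also describe the Blelloch (work-efficient) scan, which is not implemented above. The kernel below is only a minimal single-block sketch of that approach, not part of the original program: the name scanBlelloch is illustrative, n is assumed to be a power of two no larger than blockDim.x, and the shared-memory size must be passed at launch (e.g. scanBlelloch<<<1, n, n * sizeof(int)>>>(dev_b, dev_a, n)). Unlike the inclusive Hillis Steele kernel, it produces an exclusive scan.

// Minimal sketch of a single-block Blelloch (work-efficient) exclusive scan.
// Assumes n is a power of two and n <= blockDim.x; names and layout are illustrative.
__global__ void scanBlelloch(int *b, int *a, int n)
{
    extern __shared__ int x[];
    int tid = threadIdx.x;
    if (tid < n) x[tid] = a[tid];
    __syncthreads();

    // Up-sweep (reduce): build partial sums in place.
    for (int d = 1; d < n; d <<= 1) {
        int idx = (tid + 1) * d * 2 - 1;
        if (idx < n) x[idx] += x[idx - d];
        __syncthreads();
    }

    // Clear the last element, then down-sweep to distribute the partial sums.
    if (tid == 0) x[n - 1] = 0;
    __syncthreads();
    for (int d = n >> 1; d >= 1; d >>= 1) {
        int idx = (tid + 1) * d * 2 - 1;
        if (idx < n) {
            int t = x[idx - d];
            x[idx - d] = x[idx];
            x[idx] += t;
        }
        __syncthreads();
    }

    if (tid < n) b[tid] = x[tid];    // exclusive scan result
}

Whereas Hillis Steele performs O(n log n) additions, the up-sweep/down-sweep pattern performs O(n) work, which is why it is called work-efficient.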
References:
[1] "Parallel Prefix Sum (Scan) with CUDA", GPU Gems 3, Chapter 39, https://developer.nvidia.com/gpugems/GPUGems3/gpugems3_ch39.html
[2] Adam O'Donovan, "Parallel Prefix Sum on the GPU (Scan)", University of Maryland Institute for Advanced Computer Studies, http://users.umiacs.umd.edu/~ramani/cmsc828e_gpusci/ScanTalk.pdf