
Tuesday, February 5, 2019

CUDA B.5. Memory Fence Functions

B.5. Memory Fence Functions
The CUDA programming model assumes a weakly-ordered memory model: the order in which a thread writes data to shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which another thread observes those writes. For example:

__device__ volatile int X = 1, Y = 2;
__device__ void writeXY()
{
    X = 10;
    Y = 20;
}
__device__ void readXY()
{
    int B = Y;
    int A = X;
}

If thread 1 executes writeXY() and thread 2 executes readXY(), any of the following outcomes is possible for thread 2:

A = 1, B = 2
A = 10, B = 2
A = 10, B = 20
A = 1, B = 20 (the write to Y is observed before the write to X)

To rule out the last, out-of-order outcome, we can call one of the following memory fence functions:

1. void __threadfence_block()
2. void __threadfence()
3. void __threadfence_system()

void __threadfence_block()
1. All writes to all memory made by the calling thread before the call to __threadfence_block() are observed by all threads in the calling thread's block as occurring before all writes to all memory that the calling thread makes after the call.
2. All reads from all memory made by the calling thread before the call to __threadfence_block() are ordered before all reads from all memory that the calling thread makes after the call.
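
A minimal sketch of how this can be used inside a block (the kernel publishInBlock, the payload value 42, and the choice of thread 32 as the consumer are made up for illustration; at least two warps per block are assumed):

// Hypothetical intra-block publish pattern. __syncthreads() is usually the
// simpler tool within a block; this sketch only illustrates what
// __threadfence_block() orders. The consumer sits in a different warp than
// the producer so the spin-wait cannot starve thread 0.
__global__ void publishInBlock(int* out)
{
    __shared__ volatile int data;
    __shared__ volatile int ready;

    if (threadIdx.x == 0)
        ready = 0;
    __syncthreads();

    if (threadIdx.x == 0) {
        data = 42;               // 1. write the payload
        __threadfence_block();   // 2. the payload is observed before the flag
        ready = 1;               // 3. publish the flag
    } else if (threadIdx.x == 32) {
        while (ready == 0)       // spin until thread 0 has published
            ;
        __threadfence_block();   // order the flag read before the data read
        out[blockIdx.x] = data;  // guaranteed to read 42, never a stale value
    }
}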

void __threadfence()
1. Everything that __threadfence_block() guarantees for the threads of the calling thread's block, __threadfence() guarantees as well.
2. In addition, no write to any memory made by the calling thread after the call to __threadfence() is observed by any thread on the device as occurring before any write to any memory made by the calling thread before the call.
Note: for this ordering guarantee to hold, the observing threads must genuinely read memory rather than a cached copy of it, which is why the example above declares X and Y as volatile.

void __threadfence_system()
1. Everything that __threadfence_block() guarantees for the threads of the calling thread's block, __threadfence_system() guarantees as well.
2. In addition, all writes to all memory made by the calling thread before the call to __threadfence_system() are observed by all threads on the device, by host threads, and by all threads on peer devices as occurring before all writes to all memory that the calling thread makes after the call.
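
A typical use is signaling a host thread. A rough sketch (the names produceForHost, hostData, and hostReady are made up; both pointers are assumed to point to mapped page-locked host memory allocated with cudaHostAlloc() and the cudaHostAllocMapped flag):

// Hypothetical sketch: one device thread publishes a result to the host.
// __threadfence_system() guarantees that the host observes the data write
// before the flag write. If many threads produced the data, a completion
// scheme such as the atomicInc() pattern in the sum() example below would
// also be needed before raising the flag.
__global__ void produceForHost(volatile float* hostData, volatile int* hostReady)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        hostData[0] = 3.14f;       // write the payload to mapped host memory
        __threadfence_system();    // make it visible to the host first
        *hostReady = 1;            // then raise the flag the CPU is polling
    }
}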

In the example above, inserting a fence function call between X = 10; and Y = 20; and between int B = Y; and int A = X; guarantees that whenever thread 2 observes B == 20 it also observes A == 10, so the out-of-order outcome A = 1, B = 20 can no longer occur. If thread 1 and thread 2 belong to the same block, __threadfence_block() is sufficient. If they run on the same device but in different blocks, __threadfence() must be used. If they do not run on the same device (for example, thread 2 is a host thread or runs on a peer device), __threadfence_system() must be used.
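
For instance, placing __threadfence() in the earlier example (appropriate when thread 1 and thread 2 are on the same device but in different blocks) looks like this; with these two fences, if thread 2 sees B == 20 it must also see A == 10:

__device__ volatile int X = 1, Y = 2;
__device__ void writeXY()
{
    X = 10;
    __threadfence();   // the write to X is observed before the write to Y
    Y = 20;
}
__device__ void readXY()
{
    int B = Y;
    __threadfence();   // the read of Y is performed before the read of X
    int A = X;
}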



In the example below, if __threadfence() were not called between result[blockIdx.x] = partialSum; and unsigned int value = atomicInc(&count, gridDim.x);, the increment of count could become visible before the partial sum is written to result[]. The block that finds itself to be the last one could then start float totalSum = calculateTotalSum(result) while some of the other blocks' partial sums are still stale.


__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;
__global__ void sum(const float* array, unsigned int N,
                    volatile float* result)
{
    // Each block sums a subset of the input array.
    float partialSum = calculatePartialSum(array, N);
    if (threadIdx.x == 0) {

        // Thread 0 of each block stores the partial sum
        // to global memory. The compiler will use
        // a store operation that bypasses the L1 cache
        // since the "result" variable is declared as
        // volatile. This ensures that the threads of
        // the last block will read the correct partial
        // sums computed by all other blocks.
        result[blockIdx.x] = partialSum;
        // Thread 0 makes sure that the incrementation
        // of the "count" variable is only performed after
        // the partial sum has been written to global memory.
        __threadfence();
        // Thread 0 signals that it is done.
        unsigned int value = atomicInc(&count, gridDim.x);
        // Thread 0 determines if its block is the last
        // block to be done.
        isLastBlockDone = (value == (gridDim.x - 1));
    }
    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone.
    __syncthreads();
    if (isLastBlockDone) {
        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1]
        float totalSum = calculateTotalSum(result);
        if (threadIdx.x == 0) {
            // Thread 0 of last block stores the total sum
            // to global memory and resets the count
            // variable, so that the next kernel call
            // works properly.
            result[0] = totalSum;
            count = 0;
        }
    }
}
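
The guide does not define calculatePartialSum() and calculateTotalSum(). The sketch below shows one possible implementation (in a real file it would appear before sum()); the helper blockReduceSum() and the BLOCK_SIZE constant are made up here, and blockDim.x is assumed to be a power of two no larger than BLOCK_SIZE:

#define BLOCK_SIZE 256

// Shared-memory tree reduction of one value per thread; every thread of the
// block returns the block-wide sum.
__device__ float blockReduceSum(float x)
{
    __shared__ float sdata[BLOCK_SIZE];
    sdata[threadIdx.x] = x;
    __syncthreads();
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    return sdata[0];
}

// Each block reduces a grid-strided subset of the input array.
__device__ float calculatePartialSum(const float* array, unsigned int N)
{
    float x = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x; i < N;
         i += blockDim.x * gridDim.x)
        x += array[i];
    return blockReduceSum(x);
}

// The last block sums the per-block partial sums stored in result[0 .. gridDim.x-1].
__device__ float calculateTotalSum(volatile float* result)
{
    float x = 0.0f;
    for (unsigned int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
        x += result[i];
    return blockReduceSum(x);
}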



B.6. Synchronization Functions
void __syncthreads()
1. Waits until all threads in the thread block have reached this point and all global and shared memory accesses made by those threads before __syncthreads() are visible to all threads in the block.
2. __syncthreads() is used to coordinate communication between the threads of the same block. When some threads of a block access the same addresses in shared or global memory, there are potential read-after-write, write-after-read, or write-after-write hazards for some of these memory accesses; these data hazards can be avoided by synchronizing the threads in between the accesses.
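
As a small illustration of avoiding such a hazard (the kernel reverseTile is made up; N is assumed to be a multiple of blockDim.x, so no bounds check is shown), each block reverses its tile of the input array in shared memory, and the __syncthreads() between the write phase and the read phase removes the read-after-write hazard:

__global__ void reverseTile(const int* in, int* out)
{
    extern __shared__ int tile[];   // one int per thread, sized at launch time

    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                       // write phase
    __syncthreads();                                 // the whole tile is now written
    out[i] = tile[blockDim.x - 1 - threadIdx.x];     // read an element written by another thread
}

A launch such as reverseTile<<<N / 256, 256, 256 * sizeof(int)>>>(in, out); passes the tile size as dynamic shared memory.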




