The CUDA programming model uses a weakly-ordered memory model. That is, the order in which a thread writes to shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which another thread observes those writes. For example:
__device__ volatile int X = 1, Y = 2;

__device__ void writeXY()
{
    X = 10;
    Y = 20;
}

__device__ void readXY()
{
    int A = X;
    int B = Y;
}
If one thread executes writeXY() while another thread executes readXY(), any of the following outcomes is possible:
A = 1, B = 2
A = 10, B = 2
A = 10, B = 20
To constrain the ordering, we can call one of the following functions:
1. void __threadfence_block()
2. void __threadfence()
3. void __threadfence_system()
void __threadfence_block()
1. All writes to memory made by the calling thread before the call to __threadfence_block() are guaranteed to be observed by every thread in the calling thread's block as occurring before any write to memory the calling thread makes after the call.
2. All reads from memory made by the calling thread before the call to __threadfence_block() are guaranteed to be ordered before any read from memory the calling thread makes after the call.
A minimal sketch of the write-ordering guarantee follows below.
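This sketch is not from the guide; the names data, ready, publish, and tryConsume are made up for illustration. It shows the usual pattern: write a payload, fence, then write a flag, so that any thread in the same block that sees the flag also sees the payload.

__device__ volatile int data;   // payload; volatile so readers do not use a stale cached copy
__device__ volatile int ready;  // flag, assumed initialized to 0

__device__ void publish(int v)            // executed by one thread of the block
{
    data = v;                             // (1) write the payload
    __threadfence_block();                // (1) is observed before (2) by all threads in this block
    ready = 1;                            // (2) write the flag
}

__device__ int tryConsume(int fallback)   // executed by other threads of the same block
{
    // A thread of the same block that observes ready == 1 is guaranteed, by the
    // fence above, to also observe the payload written in (1).
    return ready ? data : fallback;
}

Note that threads in other blocks get no such guarantee from __threadfence_block(); that is what __threadfence() below provides.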
void __threadfence()
1. __threadfence() provides, for all threads in the calling thread's block, everything that __threadfence_block() provides.
2. In addition, it guarantees that no write to memory made by the calling thread after the call to __threadfence() is observed by any thread on the device as occurring before any write to memory the calling thread made before the call.
Note: for this ordering to hold, the observing threads must really observe memory and not a cached copy of it, which is why the volatile qualifier is used; a short sketch of this point follows below.
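A small sketch of that note, with a made-up variable name flagPlain: when a global variable is not declared volatile, reading it through a volatile-qualified pointer forces a real memory access; with a plain load the compiler may keep the value in a register and never see the update.

__device__ int flagPlain = 0;   // deliberately NOT declared volatile

__device__ void waitForFlag()
{
    // Casting through a volatile pointer forces a fresh load from memory on
    // every iteration; otherwise the compiler could hoist the load out of
    // the loop and spin on a stale register value forever.
    while (*(volatile int*)&flagPlain == 0)
        ;
}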
void __threadfence_system()
1. __threadfence_system() provides, for all threads in the calling thread's block, everything that __threadfence_block() provides.
2. In addition, it guarantees that all writes to memory made by the calling thread before the call to __threadfence_system() are observed by all threads on the device, by host threads, and by all threads on peer devices as occurring before any write to memory the calling thread makes after the call. A sketch using mapped page-locked host memory follows below.
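A minimal sketch, not from the guide: the kernel name publishToHost is made up, and out and done are assumed to point into page-locked host memory allocated with cudaHostAlloc(..., cudaHostAllocMapped) so that the host can poll them while the kernel runs.

__global__ void publishToHost(volatile int* out, volatile int* done, int v)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *out = v;                 // (1) write the payload into mapped host memory
        __threadfence_system();   // (1) is observed before (2) by host threads and peer devices
        *done = 1;                // (2) write the flag the host is polling
    }
}

A host thread that spins on done and then reads out is therefore guaranteed to see the value written in (1).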
In the first example above, inserting a fence function call between X = 10; and Y = 20; and between int A = X; and int B = Y; would ensure that, for thread 2, A will always equal 10 if B equals 20 (see the sketch below). If thread 1 and thread 2 belong to the same block, __threadfence_block() is enough. If thread 1 and thread 2 are on the same device but not in the same block, __threadfence() must be used. If thread 1 and thread 2 are not on the same device, __threadfence_system() must be used.
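The version with the fences inserted would look like this (using __threadfence() here; substitute __threadfence_block() or __threadfence_system() as just described):

__device__ volatile int X = 1, Y = 2;

__device__ void writeXY()
{
    X = 10;
    __threadfence();
    Y = 20;
}

__device__ void readXY()
{
    int A = X;
    __threadfence();
    int B = Y;
}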
In the example below, if __threadfence() were not executed between result[blockIdx.x] = partialSum; and unsigned int value = atomicInc(&count, gridDim.x);, the increment of count could become visible before the corresponding partial sum does. The last block could then race ahead to float totalSum = calculateTotalSum(result); and sum partial results that have not all been written yet.
__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;

__global__ void sum(const float* array, unsigned int N,
                    volatile float* result)
{
    // Each block sums a subset of the input array.
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        // Thread 0 of each block stores the partial sum
        // to global memory. The compiler will use
        // a store operation that bypasses the L1 cache
        // since the "result" variable is declared as
        // volatile. This ensures that the threads of
        // the last block will read the correct partial
        // sums computed by all other blocks.
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure that the incrementation
        // of the "count" variable is only performed after
        // the partial sum has been written to global memory.
        __threadfence();

        // Thread 0 signals that it is done.
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 determines if its block is the last
        // block to be done.
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone.
    __syncthreads();

    if (isLastBlockDone) {
        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1].
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0) {
            // Thread 0 of the last block stores the total sum
            // to global memory and resets the count
            // variable, so that the next kernel call
            // works properly.
            result[0] = totalSum;
            count = 0;
        }
    }
}
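calculatePartialSum() and calculateTotalSum() are not defined in the guide. A hypothetical implementation, assuming blockDim.x is a power of two (e.g. 256) and N <= gridDim.x * blockDim.x, could look like this:

__device__ float calculatePartialSum(const float* array, unsigned int N)
{
    __shared__ float sdata[256];   // assumes blockDim.x == 256
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (i < N) ? array[i] : 0.0f;
    __syncthreads();

    // Standard shared-memory tree reduction.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    return sdata[0];   // every thread returns the block's partial sum
}

__device__ float calculateTotalSum(volatile float* result)
{
    // Sum the per-block partial sums; in the kernel above only the last
    // block (and finally only its thread 0) actually uses this value.
    float total = 0.0f;
    for (unsigned int b = 0; b < gridDim.x; ++b)
        total += result[b];
    return total;
}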
B.6. Synchronization Functions
void __syncthreads()
1. Waits until all threads in the thread block have reached this point, and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.
2. __syncthreads() is used to coordinate the threads of the same block: when threads of a block access the same shared or global memory addresses, there are potential read-after-write, write-after-read, or write-after-write hazards (a short sketch follows after the quoted guide text below).
__syncthreads() is used to coordinate communication between the threads of the same block. When some threads within a block access the same addresses in shared or global memory, there are potential read-after-write, write-after-read, or write-after-write hazards for some of these memory accesses. These data hazards can be avoided by synchronizing threads in-between these accesses.
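A minimal sketch of such a hazard (the kernel name shiftLeft and the tile size are made up; it assumes blockDim.x == 256 and n a multiple of blockDim.x): each thread writes one element of a shared tile and then reads a neighbour's element, which is a read-after-write hazard unless the block synchronizes in between.

__global__ void shiftLeft(const int* in, int* out, int n)
{
    __shared__ int tile[256];                    // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];               // write my element

    __syncthreads();                             // without this, the read below could
                                                 // run before the neighbouring write

    if (i < n)
        out[i] = tile[(threadIdx.x + 1) % blockDim.x];   // read a neighbour's element
}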