The CUDA programming model uses a weakly-ordered memory model. That is, the order in which a thread writes to shared memory, global memory, page-locked host memory, or the memory of a peer device is not necessarily the order in which another thread observes those writes. For example:
__device__ volatile int X = 1, Y = 2;

__device__ void writeXY()
{
    X = 10;
    Y = 20;
}

__device__ void readXY()
{
    int A = X;
    int B = Y;
}
If one thread executes writeXY() while another thread executes readXY(), any of the following outcomes is possible:
A = 1, B = 2
A = 10, B = 2
A = 10, B = 20
To constrain the ordering, we can call one of the following functions:
1. void __threadfence_block()
2. void __threadfence()
3. void __threadfence_system()
void __threadfence_block()
1. All writes to memory made by the calling thread before the call to __threadfence_block() are guaranteed to be observed by every thread in the calling thread's block as occurring before any write to memory the calling thread makes after the call.
2. All reads from memory made by the calling thread before the call to __threadfence_block() are guaranteed to be ordered before any read from memory the calling thread makes after the call.
A minimal sketch of the write-ordering guarantee follows below.
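This sketch is not from the guide; the names data, ready, publish, and tryConsume are made up for illustration. It shows the usual pattern: write a payload, fence, then write a flag, so that any thread in the same block that sees the flag also sees the payload.

__device__ volatile int data;   // payload; volatile so readers do not use a stale cached copy
__device__ volatile int ready;  // flag, assumed initialized to 0

__device__ void publish(int v)            // executed by one thread of the block
{
    data = v;                             // (1) write the payload
    __threadfence_block();                // (1) is observed before (2) by all threads in this block
    ready = 1;                            // (2) write the flag
}

__device__ int tryConsume(int fallback)   // executed by other threads of the same block
{
    // A thread of the same block that observes ready == 1 is guaranteed, by the
    // fence above, to also observe the payload written in (1).
    return ready ? data : fallback;
}

Note that threads in other blocks get no such guarantee from __threadfence_block(); that is what __threadfence() below provides.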
void __threadfence()
1. __threadfence() provides, for all threads in the calling thread's block, everything that __threadfence_block() provides.
2. In addition, it guarantees that no write to memory made by the calling thread after the call to __threadfence() is observed by any thread on the device as occurring before any write to memory the calling thread made before the call.
Note: for this ordering to hold, the observing threads must really observe memory and not a cached copy of it, which is why the volatile qualifier is used; a short sketch of this point follows below.
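A small sketch of that note, with a made-up variable name flagPlain: when a global variable is not declared volatile, reading it through a volatile-qualified pointer forces a real memory access; with a plain load the compiler may keep the value in a register and never see the update.

__device__ int flagPlain = 0;   // deliberately NOT declared volatile

__device__ void waitForFlag()
{
    // Casting through a volatile pointer forces a fresh load from memory on
    // every iteration; otherwise the compiler could hoist the load out of
    // the loop and spin on a stale register value forever.
    while (*(volatile int*)&flagPlain == 0)
        ;
}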
void __threadfence_system()
1. __threadfence_system() provides, for all threads in the calling thread's block, everything that __threadfence_block() provides.
2. In addition, it guarantees that all writes to memory made by the calling thread before the call to __threadfence_system() are observed by all threads on the device, by host threads, and by all threads on peer devices as occurring before any write to memory the calling thread makes after the call. A sketch using mapped page-locked host memory follows below.
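A minimal sketch, not from the guide: the kernel name publishToHost is made up, and out and done are assumed to point into page-locked host memory allocated with cudaHostAlloc(..., cudaHostAllocMapped) so that the host can poll them while the kernel runs.

__global__ void publishToHost(volatile int* out, volatile int* done, int v)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        *out = v;                 // (1) write the payload into mapped host memory
        __threadfence_system();   // (1) is observed before (2) by host threads and peer devices
        *done = 1;                // (2) write the flag the host is polling
    }
}

A host thread that spins on done and then reads out is therefore guaranteed to see the value written in (1).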
In the first example above, inserting a fence function call between X = 10; and Y = 20; and between int A = X; and int B = Y; would ensure that, for thread 2, A will always equal 10 if B equals 20 (see the sketch below). If thread 1 and thread 2 belong to the same block, __threadfence_block() is enough. If thread 1 and thread 2 are on the same device but not in the same block, __threadfence() must be used. If thread 1 and thread 2 are not on the same device, __threadfence_system() must be used.
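The version with the fences inserted would look like this (using __threadfence() here; substitute __threadfence_block() or __threadfence_system() as just described):

__device__ volatile int X = 1, Y = 2;

__device__ void writeXY()
{
    X = 10;
    __threadfence();
    Y = 20;
}

__device__ void readXY()
{
    int A = X;
    __threadfence();
    int B = Y;
}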
In the example below, if __threadfence() were not executed between result[blockIdx.x] = partialSum; and unsigned int value = atomicInc(&count, gridDim.x);, the increment of count could become visible before the corresponding partial sum does. The last block could then race ahead to float totalSum = calculateTotalSum(result); and sum partial results that have not all been written yet.
__device__ unsigned int count = 0;
__shared__ bool isLastBlockDone;

__global__ void sum(const float* array, unsigned int N,
                    volatile float* result)
{
    // Each block sums a subset of the input array.
    float partialSum = calculatePartialSum(array, N);

    if (threadIdx.x == 0) {
        // Thread 0 of each block stores the partial sum
        // to global memory. The compiler will use
        // a store operation that bypasses the L1 cache
        // since the "result" variable is declared as
        // volatile. This ensures that the threads of
        // the last block will read the correct partial
        // sums computed by all other blocks.
        result[blockIdx.x] = partialSum;

        // Thread 0 makes sure that the incrementation
        // of the "count" variable is only performed after
        // the partial sum has been written to global memory.
        __threadfence();

        // Thread 0 signals that it is done.
        unsigned int value = atomicInc(&count, gridDim.x);

        // Thread 0 determines if its block is the last
        // block to be done.
        isLastBlockDone = (value == (gridDim.x - 1));
    }

    // Synchronize to make sure that each thread reads
    // the correct value of isLastBlockDone.
    __syncthreads();

    if (isLastBlockDone) {
        // The last block sums the partial sums
        // stored in result[0 .. gridDim.x-1].
        float totalSum = calculateTotalSum(result);

        if (threadIdx.x == 0) {
            // Thread 0 of the last block stores the total sum
            // to global memory and resets the count
            // variable, so that the next kernel call
            // works properly.
            result[0] = totalSum;
            count = 0;
        }
    }
}
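calculatePartialSum() and calculateTotalSum() are not defined in the guide. A hypothetical implementation, assuming blockDim.x is a power of two (e.g. 256) and N <= gridDim.x * blockDim.x, could look like this:

__device__ float calculatePartialSum(const float* array, unsigned int N)
{
    __shared__ float sdata[256];   // assumes blockDim.x == 256
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[threadIdx.x] = (i < N) ? array[i] : 0.0f;
    __syncthreads();

    // Standard shared-memory tree reduction.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }
    return sdata[0];   // every thread returns the block's partial sum
}

__device__ float calculateTotalSum(volatile float* result)
{
    // Sum the per-block partial sums; in the kernel above only the last
    // block (and finally only its thread 0) actually uses this value.
    float total = 0.0f;
    for (unsigned int b = 0; b < gridDim.x; ++b)
        total += result[b];
    return total;
}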
B.6. Synchronization Functions
void __syncthreads()
1. Waits until all threads in the thread block have reached this point, and all global and shared memory accesses made by these threads prior to __syncthreads() are visible to all threads in the block.
2. __syncthreads() is used to coordinate the threads of the same block: when threads of a block access the same shared or global memory addresses, there are potential read-after-write, write-after-read, or write-after-write hazards (a short sketch follows after the quoted guide text below).
__syncthreads() is used to coordinate communication between the threads of the same block. When some threads within a block access the same addresses in shared or global memory, there are potential read-after-write, write-after-read, or write-after-write hazards for some of these memory accesses. These data hazards can be avoided by synchronizing threads in-between these accesses.
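A minimal sketch of such a hazard (the kernel name shiftLeft and the tile size are made up; it assumes blockDim.x == 256 and n a multiple of blockDim.x): each thread writes one element of a shared tile and then reads a neighbour's element, which is a read-after-write hazard unless the block synchronizes in between.

__global__ void shiftLeft(const int* in, int* out, int n)
{
    __shared__ int tile[256];                    // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];               // write my element

    __syncthreads();                             // without this, the read below could
                                                 // run before the neighbouring write

    if (i < n)
        out[i] = tile[(threadIdx.x + 1) % blockDim.x];   // read a neighbour's element
}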