Chiustin: CUDA

Sunday, February 3, 2019

CUDA

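// Kernel definition: each thread adds one element of A and B.
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    // Guard against threads that fall outside the matrix.
    if (i < N && j < N)
        C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation: round the grid size up so it still covers
    // the matrix when N is not a multiple of the block size.
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    ...
}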
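The listing above elides the host setup. A minimal end-to-end sketch follows, under a few assumptions that are not in the original: `N` is fixed at 1000 (deliberately not a multiple of 16), the matrices are stored as flat row-major arrays, and `runMatAdd` is a hypothetical helper name used only for illustration.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int N = 1000;  // assumed size; not a multiple of 16 on purpose

// Variant of the kernel above that uses flat row-major storage.
__global__ void MatAdd(const float *A, const float *B, float *C)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        C[i * N + j] = A[i * N + j] + B[i * N + j];
}

// Hypothetical helper: returns true if every element of C equals 1 + 2.
bool runMatAdd()
{
    const size_t bytes = sizeof(float) * N * N;
    std::vector<float> hA(N * N, 1.0f), hB(N * N, 2.0f), hC(N * N, 0.0f);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    // Ceiling division so the grid covers all N x N elements.
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(dA, dB, dC);

    // This blocking copy runs after the kernel on the default stream.
    cudaError_t err = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    if (err != cudaSuccess) return false;
    for (float v : hC)
        if (v != 3.0f) return false;
    return true;
}
```

The `if (i < N && j < N)` guard is what makes the rounded-up grid safe: the extra threads in the last row and column of blocks simply do nothing.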

https://blog.csdn.net/u012033124/article/details/70792877

https://yq.aliyun.com/articles/444192?spm=a2c4e.11153940.blogcont437440.16.2f4aaf64F2T67U

cudaMalloc() cudaMallocPitch() cudaMalloc3D()
https://yq.aliyun.com/articles/444192?spm=a2c4e.11153940.blogcont444205.14.42e71425hoeObL

shared memory
https://yq.aliyun.com/articles/444205?spm=a2c4e.11153940.blogcont444192.14.188634e15E3RtT
https://blog.csdn.net/full_speed_turbo/article/details/73733290
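As a quick reminder of what the shared memory articles cover, here is a small sketch of a per-block sum that stages data in `__shared__` memory. The names `blockSum`, `sumOnDevice` and the block size `BLOCK` are hypothetical and chosen for this example only; the tree reduction assumes the block size is a power of two.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK = 256;  // assumed block size; must be a power of two

// Each block loads up to BLOCK elements into shared memory, reduces them
// with a tree reduction, and writes one partial sum.
__global__ void blockSum(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK];
    int tid = threadIdx.x;
    int idx = blockIdx.x * blockDim.x + tid;

    tile[tid] = (idx < n) ? in[idx] : 0.0f;
    __syncthreads();  // all loads must finish before any thread reads tile

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();  // called by every thread, outside the if
    }
    if (tid == 0) out[blockIdx.x] = tile[0];
}

// Hypothetical helper: sums a non-empty host vector using the kernel above.
float sumOnDevice(const std::vector<float> &h)
{
    int n = static_cast<int>(h.size());
    int blocks = (n + BLOCK - 1) / BLOCK;

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc((void **)&dIn, n * sizeof(float));
    cudaMalloc((void **)&dOut, blocks * sizeof(float));
    cudaMemcpy(dIn, h.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, BLOCK>>>(dIn, dOut, n);

    std::vector<float> partial(blocks, 0.0f);
    cudaMemcpy(partial.data(), dOut, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);

    // Finish the reduction over the per-block partial sums on the host.
    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```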

cudaHostAlloc() and cudaFreeHost()
https://yq.aliyun.com/articles/444231?spm=a2c4e.11153940.blogcont444205.15.5aa614255iM6tp
https://yq.aliyun.com/articles/448549?spm=a2c4e.11153940.blogcont444231.15.d1fc5034TYekrN

stream
3.2.5.5. Streams
https://yq.aliyun.com/articles/448557?spm=a2c4e.11153940.blogcont448554.14.28344c6dV7YOto
https://yq.aliyun.com/articles/448561?spm=a2c4e.11153940.blogcont448557.17.2c52d49dVV9Scq
https://yq.aliyun.com/articles/448566?spm=a2c4e.11153940.blogcont448561.15.4e907b5ezol2m0
https://yq.aliyun.com/articles/484099?spm=a2c4e.11153940.blogcont484097.17.76b66293aEsr2M
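Pinned host memory from `cudaHostAlloc()` and streams are usually used together, since `cudaMemcpyAsync` only overlaps with other work when the host buffer is page-locked. A minimal sketch, assuming a single stream and a hypothetical `scaleKernel` / `pinnedStreamRoundTrip` pair that does not appear in the linked articles:

```cuda
#include <cuda_runtime.h>

// Multiplies every element by factor.
__global__ void scaleKernel(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

// Hypothetical helper: round-trips n floats through the GPU on one stream.
bool pinnedStreamRoundTrip(int n)
{
    const size_t bytes = n * sizeof(float);
    float *hPinned = nullptr, *dBuf = nullptr;
    cudaStream_t stream;

    // Page-locked host allocation; release it with cudaFreeHost, not free.
    if (cudaHostAlloc((void **)&hPinned, bytes, cudaHostAllocDefault) != cudaSuccess)
        return false;
    if (cudaMalloc((void **)&dBuf, bytes) != cudaSuccess) {
        cudaFreeHost(hPinned);
        return false;
    }
    cudaStreamCreate(&stream);
    for (int i = 0; i < n; ++i) hPinned[i] = 1.0f;

    // Copy in, compute, copy out: queued in order on the same stream.
    cudaMemcpyAsync(dBuf, hPinned, bytes, cudaMemcpyHostToDevice, stream);
    scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(dBuf, n, 2.0f);
    cudaMemcpyAsync(hPinned, dBuf, bytes, cudaMemcpyDeviceToHost, stream);

    // The host must wait for the stream before reading the results.
    bool ok = (cudaStreamSynchronize(stream) == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (hPinned[i] == 2.0f);

    cudaStreamDestroy(stream);
    cudaFree(dBuf);
    cudaFreeHost(hPinned);
    return ok;
}
```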

multipleGPU
https://yq.aliyun.com/articles/448571?spm=a2c4e.11153940.blogcont448566.17.1e614788cIHHoR
https://yq.aliyun.com/articles/448576?spm=a2c4e.11153940.blogcont448571.13.30d852afZP1di4

Texture
https://yq.aliyun.com/articles/448580?spm=a2c4e.11153940.blogcont448576.16.4ebf63040D2e77
https://yq.aliyun.com/articles/448584?spm=a2c4e.11153940.blogcont448580.14.7f87cd4bHTF7ZU
https://yq.aliyun.com/articles/460312?spm=a2c4e.11153940.blogcont460310.12.473b2f49JS6MDy
https://yq.aliyun.com/articles/471830?spm=a2c4e.11153940.blogcont471829.16.112938b3O8iK8A
https://yq.aliyun.com/articles/471831?spm=a2c4e.11153940.blogcont471830.14.26e13209b9Exln
https://yq.aliyun.com/articles/471833?spm=a2c4e.11153940.blogcont471831.15.11a22111JCojTg

Surface Memory
https://yq.aliyun.com/articles/460313?spm=a2c4e.11153940.blogcont460312.16.70e27216o0ECHA
https://yq.aliyun.com/articles/460355?spm=a2c4e.11153940.blogcont460313.15.59016e33CVlIhQ
https://yq.aliyun.com/articles/471835?spm=a2c4e.11153940.blogcont471833.16.44f347d0aJFIGt
https://yq.aliyun.com/articles/471840?spm=a2c4e.11153940.blogcont471835.15.51882fadx8OSyx

OpenGL interoperability
https://yq.aliyun.com/articles/460359?spm=a2c4e.11153940.blogcont460355.15.4e464750Cq0KhG

Direct3D interoperability
https://yq.aliyun.com/articles/460363?spm=a2c4e.11153940.blogcont460359.14.6a2b73b6QeWWcV

__align__
https://yq.aliyun.com/articles/463136?spm=a2c4e.11153940.blogcont463135.17.42c31b192cve2G

__fdividef(x,y) rsqrtf() sinf(x) cosf(x) tanf(x) sincosf(x, sptr, cptr) 3.141592653589793f
https://yq.aliyun.com/articles/467251?spm=a2c4e.11153940.blogcont463137.14.6e6d1e7cASdFfX
https://docs.nvidia.com/cuda/cuda-math-api/modules.html#modules
http://www.cplusplus.com/doc/tutorial/constants/

extern __shared__ __restrict__
https://yq.aliyun.com/articles/467260?spm=a2c4e.11153940.blogcont467254.18.768e5c37fWyzGi
https://www.itread01.com/content/1544596582.html

__threadfence_block() __threadfence() __threadfence_system()
https://yq.aliyun.com/articles/467266?spm=a2c4e.11153940.blogcont467260.18.75ad33c2Kq3B2f

__syncthreads() __syncthreads_count(int predicate) __syncthreads_and(int predicate)
__syncthreads_or(int predicate) __syncwarp(unsigned mask=0xffffffff)
https://yq.aliyun.com/articles/471829?spm=a2c4e.11153940.blogcont469057.15.76b12503xox7AI

__ldg atomicAdd() atomicAdd_system() atomicAdd_block()
https://yq.aliyun.com/articles/474404?spm=a2c4e.11153940.blogcont471840.17.5ce65ba0DOLjFO
https://yq.aliyun.com/articles/474407?spm=a2c4e.11153940.blogcont474404.14.588e720dS642ER
https://yq.aliyun.com/articles/474408?spm=a2c4e.11153940.blogcont474407.12.39865848yNv7FY

Warp Match Warp Shuffle
https://yq.aliyun.com/articles/474409?spm=a2c4e.11153940.blogcont474408.13.368d798bHn9bQs

bcast scan4() warpReduce()
https://yq.aliyun.com/articles/474410?spm=a2c4e.11153940.blogcont474409.14.78a84e35xe8b0w
https://yq.aliyun.com/articles/474412?spm=a2c4e.11153940.blogcont474410.17.263f7661imqauw
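A warp-level sum in the spirit of `warpReduce()` can be sketched with `__shfl_down_sync` (CUDA 9 or later). This is an illustrative variant, not the exact listing from the linked articles; `warpReduceSum`, `warpSumKernel` and `warpSumOnDevice` are hypothetical names, and the kernel assumes it is launched with exactly one full warp of 32 threads.

```cuda
#include <cuda_runtime.h>

// Sums val across the 32 lanes of a warp; lane 0 ends up with the total.
__inline__ __device__ int warpReduceSum(int val)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

// Assumes a launch of exactly one block of 32 threads.
__global__ void warpSumKernel(const int *in, int *out)
{
    int v = warpReduceSum(in[threadIdx.x]);
    if (threadIdx.x == 0) *out = v;
}

// Hypothetical helper: sums the integers 1..32 on the device.
int warpSumOnDevice()
{
    int h[32];
    for (int i = 0; i < 32; ++i) h[i] = i + 1;

    int *dIn = nullptr, *dOut = nullptr;
    int result = -1;
    cudaMalloc((void **)&dIn, sizeof(h));
    cudaMalloc((void **)&dOut, sizeof(int));
    cudaMemcpy(dIn, h, sizeof(h), cudaMemcpyHostToDevice);

    warpSumKernel<<<1, 32>>>(dIn, dOut);

    cudaMemcpy(&result, dOut, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;  // 1 + 2 + ... + 32 = 528 on success
}
```

No shared memory or `__syncthreads()` is needed here because the shuffle exchanges register values directly between lanes of the same warp.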

__prof_trigger
https://yq.aliyun.com/articles/474414?spm=a2c4e.11153940.blogcont474412.17.6faaeef6JkKvoN

printf()
https://yq.aliyun.com/articles/474416?spm=a2c4e.11153940.blogcont474414.18.28806b5ajPLbeK

void* malloc void* memcpy void* memset
https://yq.aliyun.com/articles/479275?spm=a2c4e.11153940.blogcont474416.16.1e0a54a3qnJWkr
https://yq.aliyun.com/articles/479277?spm=a2c4e.11153940.blogcont479275.19.b33a14aeUyX6om

__launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor)
https://yq.aliyun.com/articles/479279?spm=a2c4e.11153940.blogcont479277.16.2d63e72fUMEoBY

cooperative_groups
https://yq.aliyun.com/articles/479280?spm=a2c4e.11153940.blogcont479279.16.94025904Av0cxE

thread_block
https://yq.aliyun.com/articles/479281?spm=a2c4e.11153940.blogcont479280.20.3ed33005DzECgg

cg::coalesced_group
https://yq.aliyun.com/articles/484085?spm=a2c4e.11153940.blogcont479291.17.302b35e8gEFQge

grid_group
https://yq.aliyun.com/articles/484087?spm=a2c4e.11153940.blogcont484085.18.71af55c5q63JYF

CUDA Dynamic Parallelism
D.1. Introduction
https://yq.aliyun.com/articles/484090?spm=a2c4e.11153940.blogcont484087.15.2fce1461iUJifm
D.2.1.1. Parent and Child Grids
https://yq.aliyun.com/articles/484092?spm=a2c4e.11153940.blogcont484090.19.5e44102aBBlwGm
D.2.1.5. Ordering and Concurrency
https://yq.aliyun.com/articles/484094?spm=a2c4e.11153940.blogcont484092.17.bc5f299a8BGZxY
D.3.1.1. Device-Side Kernel Launch
https://yq.aliyun.com/articles/484097?spm=a2c4e.11153940.blogcont484094.16.69f149cbZtauoH
D.3.1.6.3. Shared Memory Variable Declarations
https://yq.aliyun.com/articles/484101?spm=a2c4e.11153940.blogcont484099.16.a2945f3a5o0NFK
D.3.1.2. Streams
https://yq.aliyun.com/articles/484099?spm=a2c4e.11153940.blogcont484097.17.76b66293aEsr2M
D.3.1.6.1. Device and Constant Memory
https://yq.aliyun.com/articles/484101?spm=a2c4e.11153940.blogcont484099.16.4e265f3aV8oPZu
D.3.1.7. API Errors and Launch Failures
https://yq.aliyun.com/articles/484103?spm=a2c4e.11153940.blogcont484101.14.767b2d03UcvCUw
D.3.3.1. Including Device Runtime API in CUDA Code
https://yq.aliyun.com/articles/484106?spm=a2c4e.11153940.blogcont484103.18.3215bfd6Lsbfkg
D.4.3. Implementation Restrictions and Limitations
https://yq.aliyun.com/articles/486308?spm=a2c4e.11153940.blogcont484106.18.1ba91458umVT2z
E. Mathematical Functions
https://yq.aliyun.com/articles/486309?spm=a2c4e.11153940.blogcont486308.18.66bc5a8b3QuRyV
F. C/C++ Language Support
https://yq.aliyun.com/articles/486310?spm=a2c4e.11153940.blogcont486309.14.a34674f4T716E2
F.3.3. Qualifiers
https://yq.aliyun.com/articles/486311?spm=a2c4e.11153940.blogcont486310.15.5a566a59McSA6W
F.3.9.3. Function Parameters F.3.9.4. Static Variables within Function F.3.9.5. Function Pointers F.3.9.6. Function Recursion F.3.9.7. Friend Functions F.3.9.8. Operator Function F.3.10. Classes
https://yq.aliyun.com/articles/486313?spm=a2c4e.11153940.blogcont486312.17.4e2d7106OAStzs
F.3.11. Templates F.3.12. Trigraphs and Digraphs F.3.13. Const-qualified variables F.3.14. Long Double F.3.15. Deprecation Annotation
https://yq.aliyun.com/articles/486317?spm=a2c4e.11153940.blogcont486313.16.286b5ee5OW5dgN

Grid and block allocation
https://www.itread01.com/content/1541990407.html

cudaMallocPitch and cudaMemcpy2D
https://stackoverflow.com/questions/35771430/cudamallocpitch-and-cudamemcpy2d
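The point that tends to cause confusion with these two calls is that the pitch is a row stride in bytes, so rows must be addressed through a `char *` offset. A small sketch under that reading; `addOnePitched` and `pitchedRoundTrip` are hypothetical names for this example and are not taken from the linked question.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Adds 1 to every element of a pitched 2D array.
__global__ void addOnePitched(float *base, size_t pitch, int width, int height)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        // pitch is in bytes, so step through rows via a char pointer.
        float *rowPtr = (float *)((char *)base + row * pitch);
        rowPtr[col] += 1.0f;
    }
}

// Hypothetical helper: copies a width x height matrix to a pitched
// allocation, increments it, and copies it back.
bool pitchedRoundTrip(int width, int height)
{
    std::vector<float> h(static_cast<size_t>(width) * height, 41.0f);
    const size_t rowBytes = width * sizeof(float);

    float *d = nullptr;
    size_t pitch = 0;
    if (cudaMallocPitch((void **)&d, &pitch, rowBytes, height) != cudaSuccess)
        return false;

    // Host rows are tightly packed (stride rowBytes); device rows use pitch.
    cudaMemcpy2D(d, pitch, h.data(), rowBytes, rowBytes, height,
                 cudaMemcpyHostToDevice);

    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    addOnePitched<<<grid, block>>>(d, pitch, width, height);

    cudaError_t err = cudaMemcpy2D(h.data(), rowBytes, d, pitch, rowBytes,
                                   height, cudaMemcpyDeviceToHost);
    cudaFree(d);

    if (err != cudaSuccess) return false;
    for (float v : h)
        if (v != 42.0f) return false;
    return true;
}
```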

Avoiding strict aliasing
void foo(const float* __restrict__ a, const float* __restrict__ b, float* __restrict__ c)
https://www.oschina.net/question/234345_52682
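To show where such a signature might be used, here is a sketch of an element-wise add kernel with `__restrict__` on every pointer. The qualifier is a promise from the programmer that the buffers do not overlap, which can let the compiler reorder or cache loads; it is not checked. The names `addRestrict` and `runAddRestrict` and the extra `n` parameter are assumptions for this example and differ from `foo` above.

```cuda
#include <cuda_runtime.h>
#include <vector>

// a, b and c are promised not to alias one another.
__global__ void addRestrict(const float *__restrict__ a,
                            const float *__restrict__ b,
                            float *__restrict__ c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical helper: uses three separate allocations so the
// no-aliasing promise actually holds.
bool runAddRestrict(int n)
{
    const size_t bytes = n * sizeof(float);
    std::vector<float> hA(n, 1.5f), hB(n, 2.5f), hC(n, 0.0f);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    addRestrict<<<(n + 255) / 256, 256>>>(dA, dB, dC, n);

    cudaError_t err = cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);

    if (err != cudaSuccess) return false;
    for (float v : hC)
        if (v != 4.0f) return false;
    return true;
}
```

Passing the same buffer as both an input and the output would break the promise and make the result undefined.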
