qshino's Diary

Notes on PowerShell and miscellaneous things

CUDA

Execution environment

host

Functions and threads on the CPU side.

device

Functions and threads on the GPU side.

kernel

A function that runs on the device (GPU) but is launched from the host; it is executed in parallel by many GPU threads.

https://developer.nvidia.com/blog/easy-introduction-cuda-c-and-c/#:~:text=In%20CUDA%2C%20the%20host%20refers,many%20GPU%20threads%20in%20parallel.
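A minimal sketch illustrating these three terms (the function and variable names here are illustrative, not from the post):

```cuda
#include <cstdio>

// kernel: runs on the device (GPU) but is launched from the host (CPU).
// Each of many GPU threads executes one instance of this function.
__global__ void scale(int n, float factor, float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // this thread's global index
    if (i < n) data[i] *= factor;
}

// host: ordinary CPU code that allocates device memory and launches the kernel.
int main() {
    const int n = 256;
    float *d_data = nullptr;
    cudaMalloc((void **)&d_data, n * sizeof(float));
    scale<<<(n + 127) / 128, 128>>>(n, 2.0f, d_data); // launch on the device
    cudaDeviceSynchronize();                          // wait for the GPU threads
    cudaFree(d_data);
    return 0;
}
```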

Memory allocation

cudaHostAlloc()

Allocates host memory that is accessible from the device. Freed with cudaFreeHost().

  • cudaHostAllocDefault
  • cudaHostAllocMapped
  • cudaHostAllocPortable
  • cudaHostAllocWriteCombined

Default has the value 0 and specifies nothing. Mapped maps the allocation into the CUDA address space; the address as seen from the device can later be obtained with cudaHostGetDevicePointer(). Portable makes the memory accessible (pinned) for all CUDA contexts, not just the allocating one. WriteCombined allocates memory with the write-combined (WC) attribute.

Mapped, Portable, and WriteCombined can be combined with one another.
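A sketch of how these flags combine (the buffer size and variable names are illustrative):

```cuda
// Combine cudaHostAllocMapped | cudaHostAllocPortable, then obtain the
// device-visible address of the same memory with cudaHostGetDevicePointer().
const size_t n = 1024;
float *h_buf = nullptr, *d_view = nullptr;

cudaSetDeviceFlags(cudaDeviceMapHost);   // required for cudaHostAllocMapped to take effect
cudaHostAlloc((void **)&h_buf, n * sizeof(float),
              cudaHostAllocMapped | cudaHostAllocPortable);
cudaHostGetDevicePointer((void **)&d_view, h_buf, 0); // device view of h_buf

// ... pass d_view to a kernel; the GPU reads/writes the host memory directly ...

cudaFreeHost(h_buf);                     // note: cudaFreeHost(), not cudaFree()
```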

cudaError_t cudaHostAlloc(void **ptr, size_t size, unsigned int flags)
Allocates size bytes of host memory that is page-locked and accessible to the device. The driver tracks the virtual memory ranges allocated with this function and automatically accelerates calls to functions such as cudaMemcpy(). Since the memory can be accessed directly by the device, it can be read or written with much higher bandwidth than pageable memory obtained with functions such as malloc(). Allocating excessive amounts of pinned memory may degrade system performance, since it reduces the amount of memory available to the system for paging. As a result, this function is best used sparingly to allocate staging areas for data exchange between host and device.

The flags parameter enables different options to be specified that affect the allocation, as follows.

cudaHostAllocDefault: This flag's value is defined to be 0 and causes cudaHostAlloc() to emulate cudaMallocHost().
cudaHostAllocPortable: The memory returned by this call will be considered as pinned memory by all CUDA contexts, not just the one that performed the allocation.
cudaHostAllocMapped: Maps the allocation into the CUDA address space. The device pointer to the memory may be obtained by calling cudaHostGetDevicePointer().
cudaHostAllocWriteCombined: Allocates the memory as write-combined (WC). WC memory can be transferred across the PCI Express bus more quickly on some system configurations, but cannot be read efficiently by most CPUs. WC memory is a good option for buffers that will be written by the CPU and read by the device via mapped pinned memory or host->device transfers.
All of these flags are orthogonal to one another: a developer may allocate memory that is portable, mapped and/or write-combined with no restrictions.

cudaSetDeviceFlags() must have been called with the cudaDeviceMapHost flag in order for the cudaHostAllocMapped flag to have any effect.

The cudaHostAllocMapped flag may be specified on CUDA contexts for devices that do not support mapped pinned memory. The failure is deferred to cudaHostGetDevicePointer() because the memory may be mapped into other CUDA contexts via the cudaHostAllocPortable flag.

Memory allocated by this function must be freed with cudaFreeHost().

Parameters:
ptr     - Pointer to the allocated page-locked host memory
size    - Requested allocation size in bytes
flags   - Requested properties of allocated memory
Returns:
cudaSuccess, cudaErrorMemoryAllocation
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaSetDeviceFlags, cudaMallocHost, cudaFreeHost

cudaMalloc()


cudaError_t cudaMalloc(void **devPtr, size_t size)
Allocates size bytes of linear memory on the device and returns in *devPtr a pointer to the allocated memory. The allocated memory is suitably aligned for any kind of variable. The memory is not cleared. cudaMalloc() returns cudaErrorMemoryAllocation in case of failure.

Parameters:
devPtr  - Pointer to allocated device memory
size    - Requested allocation size in bytes
Returns:
cudaSuccess, cudaErrorMemoryAllocation
See also:
cudaMallocPitch, cudaFree, cudaMallocArray, cudaFreeArray, cudaMalloc3D, cudaMalloc3DArray, cudaMallocHost, cudaFreeHost, cudaHostAlloc
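A minimal usage sketch with error checking (the size and names are illustrative); note that since cudaMalloc() does not clear the memory, it is zeroed explicitly here:

```cuda
#include <cstdio>

int main() {
    const size_t n = 1024;
    float *d_a = nullptr;

    cudaError_t err = cudaMalloc((void **)&d_a, n * sizeof(float));
    if (err != cudaSuccess) {   // cudaErrorMemoryAllocation on failure
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    cudaMemset(d_a, 0, n * sizeof(float)); // the memory is NOT cleared by cudaMalloc
    cudaFree(d_a);
    return 0;
}
```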

cudaMemcpyAsync()

cudaError_t cudaMemcpyAsync(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind, cudaStream_t stream)
Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst, where kind is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice, and specifies the direction of the copy. The memory areas may not overlap. Calling cudaMemcpyAsync() with dst and src pointers that do not match the direction of the copy results in an undefined behavior.

cudaMemcpyAsync() is asynchronous with respect to the host, so the call may return before the copy is complete. It only works on page-locked host memory and returns an error if a pointer to pageable memory is passed as input. The copy can optionally be associated to a stream by passing a non-zero stream argument. If kind is cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost and the stream is non-zero, the copy may overlap with operations in other streams.

IMPORTANT NOTE: Copies with kind == cudaMemcpyDeviceToDevice are asynchronous with respect to the host, but never overlap with kernel execution.

Parameters:
dst     - Destination memory address
src     - Source memory address
count   - Size in bytes to copy
kind    - Type of transfer
stream  - Stream identifier
Returns:
cudaSuccess, cudaErrorInvalidValue, cudaErrorInvalidDevicePointer, cudaErrorInvalidMemcpyDirection
Note:
Note that this function may also return error codes from previous, asynchronous launches.
See also:
cudaMemcpy, cudaMemcpy2D, cudaMemcpyToArray, cudaMemcpy2DToArray, cudaMemcpyFromArray, cudaMemcpy2DFromArray, cudaMemcpyArrayToArray, cudaMemcpy2DArrayToArray, cudaMemcpyToSymbol, cudaMemcpyFromSymbol, cudaMemcpy2DAsync, cudaMemcpyToArrayAsync, cudaMemcpy2DToArrayAsync, cudaMemcpyFromArrayAsync, cudaMemcpy2DFromArrayAsync, cudaMemcpyToSymbolAsync, cudaMemcpyFromSymbolAsync
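A sketch tying the pieces above together: a host-to-device copy issued asynchronously on a non-zero stream, using page-locked host memory from cudaHostAlloc() (buffer size and names are illustrative):

```cuda
// Async H2D copy: requires page-locked host memory and, for overlap with
// other streams, a non-zero stream.
const size_t n = 1 << 20;
float *h_src = nullptr, *d_dst = nullptr;
cudaStream_t stream;

cudaStreamCreate(&stream);
cudaHostAlloc((void **)&h_src, n * sizeof(float), cudaHostAllocDefault); // page-locked
cudaMalloc((void **)&d_dst, n * sizeof(float));

cudaMemcpyAsync(d_dst, h_src, n * sizeof(float),
                cudaMemcpyHostToDevice, stream);   // may return before the copy completes

// ... the host can do other work here while the copy is in flight ...

cudaStreamSynchronize(stream);                     // wait for the copy to finish
cudaFree(d_dst);
cudaFreeHost(h_src);
cudaStreamDestroy(stream);
```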

Source

http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/online/groupCUDARTMEMORY_g217d441a73d9304c6f0ccc22ec307dba.html#g217d441a73d9304c6f0ccc22ec307dba