CUDA: How many concurrent threads in total?

Accepted answer
Score: 67

The GTX 580 can have 16 * 48 concurrent warps (32 threads each) running at a time. That is 16 multiprocessors (SMs) * 48 resident warps per SM * 32 threads per warp = 24,576 threads.
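
The arithmetic above can be written out as a small sketch. The function name and its parameters are illustrative only and are not part of any CUDA API; the figures are the GTX 580 values quoted in this answer.

```python
def max_resident_threads(sm_count, warps_per_sm, threads_per_warp=32):
    """Upper bound on the number of threads resident on the GPU at once."""
    return sm_count * warps_per_sm * threads_per_warp

# GTX 580 figures from the answer: 16 SMs, 48 resident warps per SM.
gtx580_resident = max_resident_threads(sm_count=16, warps_per_sm=48)
print(gtx580_resident)  # 24576
```

On a real device the same inputs should be obtainable from the `multiProcessorCount`, `maxThreadsPerMultiProcessor` and `warpSize` fields that `cudaGetDeviceProperties` fills in, rather than hard-coded.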

Don't confuse concurrency and throughput. The number above is the maximum number of threads whose resources can be stored on-chip simultaneously -- the number that can be resident. In CUDA terms we also call this maximum occupancy. The hardware switches between warps constantly to help cover or "hide" the (large) latency of memory accesses as well as the (small) latency of arithmetic pipelines.

While each SM can have 48 resident warps, it can only issue instructions from a small number of warps at each clock cycle (on average between 1 and 2 for GTX 580, but it depends on the program instruction mix).
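
To make the gap between resident and issuing warps concrete, here is a rough sketch using the figures from this answer. The helper name is hypothetical, and the range of 1 to 2 warps per clock is the average quoted above, not a guaranteed rate.

```python
def threads_issued_per_clock(sm_count, warps_issued_per_sm, threads_per_warp=32):
    """Threads' worth of instructions issued across the whole GPU in one clock."""
    return sm_count * warps_issued_per_sm * threads_per_warp

# Resident threads on GTX 580: 16 SMs * 48 warps per SM * 32 threads per warp.
resident = 16 * 48 * 32

for warps_issued in (1, 2):  # average issue range quoted for GTX 580
    issued = threads_issued_per_clock(16, warps_issued)
    # Ratio of resident threads to threads issuing in any given clock.
    print(warps_issued, issued, resident // issued)
```

With these numbers, only 512 to 1024 of the 24,576 resident threads issue an instruction in a given clock; the other resident warps are what the scheduler switches to while earlier instructions wait on memory or the arithmetic pipeline.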

So you are probably better off comparing throughput, which is determined by the available execution units and how the hardware is capable of performing multi-issue. On GTX 580, there are 512 FMA execution units, but also integer units, special function units, memory instruction units, etc., which can be dual-issued to (i.e. issue independent instructions from 2 warps simultaneously) in various combinations.

Taking into account all of the above is too difficult, though, so most people compare on two metrics:

  1. Peak GFLOP/s (which for GTX 580 is 512 FMA units * 2 flops per FMA * 1544e6 cycles/second = 1581.1 GFLOP/s (single precision))
  2. Measured throughput on the application you are interested in.
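
The peak figure in item 1 can be reproduced with a short sketch. The function name is illustrative; the unit count and clock rate are the GTX 580 values given in the list.

```python
def peak_gflops(fma_units, clock_hz, flops_per_fma=2):
    """Theoretical peak in GFLOP/s: FMA units * flops per FMA * clock rate."""
    return fma_units * flops_per_fma * clock_hz / 1e9

# GTX 580 figures from the answer: 512 FMA units at 1544 MHz.
gtx580_peak = peak_gflops(fma_units=512, clock_hz=1544e6)
print(round(gtx580_peak, 1))  # 1581.1
```

This is an upper bound that assumes every FMA unit retires one fused multiply-add per cycle, which real code rarely sustains.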

The most important comparison is always measured wall-clock time on a real application.

Score: 9

There are certain traps that you can fall into by doing that comparison to 2 or 4-core CPUs:

  • The number of concurrent threads does not match the number of threads that actually run in parallel. Of course you can launch 24576 threads concurrently on GTX 580, but the optimal value is in most cases lower.

  • A 2 or 4-core CPU can have arbitrarily many concurrent threads! As with a GPU, beyond some point adding more threads won't help, and it may even slow things down.

  • A "CUDA core" is a single scalar processing unit, while a CPU core is usually a bigger thing, containing for example a 4-wide SIMD unit. To compare apples-to-apples, you should multiply the number of advertised CPU cores by 4 to match what NVIDIA calls a core.

  • Many CPUs support hyperthreading, which allows a single core to process 2 threads concurrently in a lightweight way. Because of that, an operating system may actually see twice as many "logical cores" as there are hardware cores.

To sum it up: For a fair comparison, your 4-core CPU can actually run 32 "scalar threads" concurrently, because of SIMD and hyperthreading.
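
The figure of 32 follows from this answer's own way of counting, sketched below. The function and parameter names are made up for illustration, and the 4-wide SIMD and 2-way hyperthreading values are the assumptions stated in the bullets above, not properties of every CPU.

```python
def cpu_scalar_threads(cores, simd_width, threads_per_core):
    """Scalar lanes a CPU is credited with, counted the way a 'CUDA core' is."""
    return cores * simd_width * threads_per_core

# Figures from the answer: 4 cores, 4-wide SIMD, 2 hardware threads per core.
print(cpu_scalar_threads(cores=4, simd_width=4, threads_per_core=2))  # 32
```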

Score: 0

I realize this is a bit late but I figured I'd help out anyway. From page 10 of the CUDA Fermi architecture whitepaper:

Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently.

To me this means that each SM can have 2*32=64 threads running concurrently. I don't know if that means that the GPU can have a total of 16*64=1024 threads running concurrently.
