GPU Rendering Final Summary

  • After spending months reading all the notes I researched below, I decided to put together a big table summarizing the important points on how to choose a GPU combo for rendering.
Key Point | Description
GPU rendering doesn't use SLI | GPU rendering is more like computing: each device does its own work and doesn't need to be synced with the others (ref: difference of sli and GPU computing)
GPU rendering is OK with any PCIE speed | Unlike SLI, which requires at least an x8 PCIE slot, GPU computing can run in a slot of any speed. Slot speed only affects uploading data assets into VRAM, and that time is minimal compared to the time spent on the complex GPU computing itself (ref: 4x slot for render, test with motherboard x4 slot, GPU computing doesn't swap vram data as often as game content, old motherboard with slow but many slots for mining, article on GPU mining at slow but many slots)
CPU max PCIE lanes limit the motherboard's x8 speed slot count | This may limit the max GPU device count of an SLI setup, but for GPU computing x4 is also fine, and GPU mining rigs even use x1 speed slots (ref: Z170A w. x8/x8 but with lots of low PCIE slots, CPU means lanes for max SLI)
Dual CPU setup doubles the max lane count | A dual CPU setup gives more PCIE lanes, enough even for 5+5 GPU devices at x8
Motherboard PCIE slot count determines the max GPU device count | This counts both PCIE Gen 3 and PCIE Gen 2 slots, and slot size doesn't matter much if you use a PCIE size converter
Motherboard PCIE slot configuration options determine the slot speed setup | Higher priced motherboards tend to offer higher slot speed configurations for a stack of GPU devices
Extra PLX chipset lets the motherboard create extra PCIE lanes by multiplexing underused lanes | Once the data assets are uploaded to VRAM, the GPU doesn't need its lanes while computing, so the PLX chipset can allocate the x16 lanes of that slot to another GPU. This lifts the CPU max lane count limitation, with possible added latency (ref: PLX tech, Z170-WS at x8/x8/x8/x8, Z170X-Gaming G1 at x8/x8/x8/x8 w. PLX8747, z170 vs x99 sli, video on x99 vs z170 4-way sli, PLX used for not just max GPU count but also max x16 full speed slot, x99 with PLX)
Distance between motherboard GPU slots can limit GPU choices | The gap between 2 GPU slots limits the max “thickness” of the GPU device (so-called single-slot GPUs like low-profile Quadro cards, or normal double-slot gaming GPUs). The GPU cooler also needs to be “blower type”, like on reference cards, if the gap is too small for “Open Fan type” cooling, and an even tighter space like single-slot distance may require custom water cooling. (reference: GPU thickness, GPU cooler type, GPU cooler type and gap distance, talk of cooler design, big GPU slot gap case, Z10PED8_WS 7 slot but gap for 4 GPU case, video on 7 GPU single slot mod, article on 7 GPU mod on X99-E WS)
PCIE riser can extend a tight PCIE slot to outside the case for better cooling | It is like a PCIE extension cable, but it requires a case that can hang and hold the extended GPUs (ref: holding lots of gpu, slot size convertor cable)
Power supply must be able to feed the GPU devices | Calculate the power usage with a power calculator, for example 3 GPUs = 850W, 2 GPUs = 650W
External GPU devices go through motherboard chipset lanes instead of CPU lanes | An external GPU device using Thunderbolt or USB 3.1 will not affect CPU lane usage (?)
GPU RAM size determines the max scene size or data size | But nowadays GPUs are quite good at memory management, and cards with 8GB VRAM are now quite normal (ref: each GPU can only access its own memory for cycle case)
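
To put a rough number on the "any PCIE speed is OK" point in the table, here is a minimal sketch that estimates how long uploading a scene into VRAM takes at different slot widths. The per-lane figures are approximate theoretical peaks (about 0.5 GB/s for Gen 2 and 0.985 GB/s for Gen 3), the 4 GB scene size is a made-up example, and `upload_seconds` is just an illustrative helper name.

```python
# Approximate peak bandwidth per PCIe lane in GB/s, after encoding overhead.
# Real transfers are somewhat slower; treat these as optimistic estimates.
GBPS_PER_LANE = {2: 0.5, 3: 0.985}


def upload_seconds(scene_gb, lanes, gen=3):
    """Estimate the time to copy scene data into VRAM over a PCIe link."""
    return scene_gb / (GBPS_PER_LANE[gen] * lanes)


if __name__ == "__main__":
    scene_gb = 4.0  # hypothetical scene size, chosen only for illustration
    for lanes in (1, 4, 8, 16):
        seconds = upload_seconds(scene_gb, lanes)
        print(f"x{lanes:<2} Gen3: {seconds:.2f} s")
```

By this estimate even an x1 Gen 3 slot uploads the example scene in about four seconds, which is small next to a render that runs for minutes.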

GPU Practice in Real World

  • As GPU technology improves over the years, concepts about GPUs change as well; things that used to be true may be false by now. That is why I write them down together, as a guide and as proof of the info.

GPU for 3D working and rendering

  • 3D working is mainly about 3D viewport performance and sometimes physics simulation
    • all about real-time interaction performance
  • 3D rendering is mainly about rendering output image or video from the 3D scene.
    • all about calculation performance
  • What each component of the graphics card does
    • main GPU: rasterization (drawing pixels)
    • CUDA core: raytracing

3D viewport performance

It is all about moving around in the 3D viewport of 3D programs.

  • performance differences can occur depending on the combination of 3D software and GPU type/brand
    • case 1: GTX780 vs K5000 GPU: the slightly lower-spec K5000 runs the Maya 2014 viewport about 20 times better (video ref)
      • another similar one: the half-spec Q4000 (8min) beats the GTX670 (17min) in the Maya 2012 SPECapc test (link)
      • GTX Titan (14 min); K2000 (9min); Q4000 (8min); Q2000 (9min35); Q600 (10min); FX3800 (10min); FX1800 (11min) in Maya 2012 SPECapc test. (Titan video ref; K2000 video ref; Q4000 n Q2000)
      • Titan, K600-K500, W5000-W9000 in Maya 2014 SPECapc table (link); the AMD W5000 shows the best performance for the money
      • K620 > K420 = K600 > Q410
    • case 2: GTX770 vs Q4000 GPU: the GTX770 wins in 3DS Max 2013 and is 4 times better (Max uses a DirectX viewport); (video ref)
      • another similar one: the GTX780 beats the Q2000 in the 3DS Max 2012 viewport (video ref)

Things are changing: with the advent of Viewport 2.0 in Maya, viewport performance can vary a lot.

  • a gaming card can perform faster with Viewport 2.0 than a Quadro card with the classic viewport
  • plus, Viewport 2.0 supports a lot more shading features.

note: CUDA = shading unit
ref: http://versus.com/en/nvidia-geforce-gtx-980-vs-nvidia-quadro-k6000
ref: http://en.wikipedia.org/wiki/List_of_Nvidia_graphics_processing_units#Quadro_Kxxx_Series
ref: http://www.techpowerup.com/gpudb/2426/quadro-k6000.html

GTX | chip-SM | CUDA | TMU | ROP | core MHz | tex fill | pix fill | mem | data rate Gb/s | bus bit | bandwidth | DX-OGL | port | power | SP TFLOPS | DP TFLOPS
Titan X | GM200-24 | 3072 | 192 | 96 | 1050 | 176 GT/s | 106 GP/s | 12G | 7 | 384 | 336 GB/s | 12-4.5 | e3-DH3P | 250/6 8p6p | 6.6 | 0.2
Titan Z | GK110d-30 | 5760 | 498 | 96 | 800 | 338 GT/s | 67.6 GP/s | 12G | 7 | 384 | 672 GB/s | 11-4.5 | e3-2DHP | 250/6 8p6p | 8.1 | 2.7
Titan B | GK110-15 | 2880 | 249 | 48 | 950 | 213 GT/s | 42 GP/s | 6G | 7 | 384 | 336 GB/s | 11-4.4 | e3-2DHP | 250/6 8p6p | 5.1 | 1.7
Titan | GK110-14 | 2688 | 224 | 48 | 850 | 187 GT/s | 40 GP/s | 6G | 6 | 384 | 288 GB/s | 11-4.5 | e3-2DHP | 250/6 8p6p | 4.5 | 1.4
980 | GM204-16 | 2048 | 128 | 64 | 1200 | 144 GT/s | 72 GP/s | 4G | 7 | 256 | 224 GB/s | 12-4.4 | e3-DH3P | 165/5 2p6 | 4.6 | 0.1
970 | GM204-13 | 1664 | 104 | 56 | 1100 | 109 GT/s | 58 GP/s | 4G | 7 | 256 | 224 GB/s | 12-4.4 | e3-DH3P | 145/5 2p6 | 3.5 | 0.1
960 | GM206-8 | 1024 | 64 | 32 | 1100 | 72 GT/s | 36 GP/s | 2G | 7 | 128 | 112 GB/s | 12-4.4 | e3-DH3P | 120/4 1p6 | 2.3 | 0.07
780 Ti | GK110-15 | 2880 | 240 | 48 | 900 | 210 GT/s | 42 GP/s | 3G | 7 | 384 | 336 GB/s | 11-4.4 | e3-2DHP | 250/6 8p6p | 5 | 0.2
780 | GK110-12 | 2304 | 192 | 48 | 900 | 160 GT/s | 41 GP/s | 3G | 6 | 384 | 288 GB/s | 11-4.3 | e3-2DHP | 250/6 8p6p | 3.9 | 0.1
770 | GK104-8 | 1536 | 128 | 32 | 1050 | 134 GT/s | 33 GP/s | 2G | 7 | 256 | 224 GB/s | 11-4.3 | e3-2DHP | 230/6 8p6p | 3.2 | 0.1
760 Ti | GK104-7 | 1344 | 112 | 32 | 950 | 102 GT/s | 29 GP/s | 2G | 6 | 256 | 192 GB/s | 11-4.2 | e3-2DHP | 170/5 2p6 | 2.4 | 0.1
760 | GK104-6 | 1152 | 96 | 32 | 990 | 94 GT/s | 31 GP/s | 2G | 6 | 256 | 192 GB/s | 11-4.3 | e3-2DHP | 170/5 2p6 | 2.2 | 0.09
750 Ti | GM107-5 | 640 | 40 | 16 | 1050 | 41 GT/s | 16 GP/s | 2G | 6 | 128 | 86 GB/s | 11-4.4 | e3-2DH | 60/300W | 1.3 | 0.04
750 | GM107-4 | 512 | 32 | 16 | 1050 | 32 GT/s | 16 GP/s | 1G | 5 | 128 | 80 GB/s | 11-4.4 | e3-2DH | 55/300W | 1.2 | 0.03
690 | GK104d-16 | 3072 | 256 | 64 | 1000 | 234 GT/s | 58 GP/s | 4G | 6 | 512 | 384 GB/s | 11-4.2 | e3-3DP | 300/650W 2p8 | 5.6 |
680 | GK104-8 | 1536 | 128 | 32 | 1059 | 128 GT/s | 32 GP/s | 2G | 6 | 256 | 192 GB/s | 11-4.2 | e3-2DHP | 195/550W 2p6 | 3 |
670 | GK104-7 | 1344 | 112 | 32 | 950 | 102 GT/s | 29 GP/s | 2G | 6 | 256 | 192 GB/s | 11-4.2 | e3-2DHP | 170/5 2p6 | 2.4 |
660 Ti | GK104-7 | 1344 | 112 | 24 | 950 | 102 GT/s | 22 GP/s | 2G | 6 | 192 | 144 GB/s | 11-4.3 | e3-2DHP | 150/450W 2p6 | 2.4 |
660 | GK106-5 | 960 | 80 | 24 | 1000 | 78 GT/s | 23 GP/s | 2G | 6 | 192 | 144 GB/s | 11-4.3 | e3-2DHP | 140/450W 1p6 | 1.8 |
650 Ti | GK106-4 | 768 | 64 | 16 | 928 | 59 GT/s | 14 GP/s | 1G | 5.4 | 128 | 86 GB/s | 11-4.3 | e3-2DH | 110/400W 1p6 | 1.4 |
650 | GK107-2 | 384 | 32 | 16 | 1058 | 34 GT/s | 17 GP/s | 1G | 5 | 128 | 80 GB/s | 11-4.3 | e3-2DH | 64/4 1p6 | 0.8 |
590 | GF110d-32 | 1024 | 128 | 96 | 607d | 77 GT/s | 58 GP/s | 3G | 1.7 | 384 | 327 GB/s | 11-4.2 | e2-3D | 365/700W 2p8 | 2.4 |
580 | GF110-16 | 512 | 64 | 48 | 772d | 49 GT/s | 37 GP/s | 1.5G | 2 | 384 | 192 GB/s | 11-4.2 | e2-2DH | 244/6 8p6p | 1.5 |
570 | GF110-15 | 480 | 60 | 40 | 732d | 44 GT/s | 29 GP/s | 1.2G | 1.9 | 320 | 152 GB/s | 11-4.2 | e2-2DH | 219/550W 2p6 | 1.4 |
560 Ti | GF114-8 | 384 | 64 | 32 | 822d | 52 GT/s | 26 GP/s | 1G | 4 | 256 | 128 GB/s | 11-4.1 | e2-2DH | 170/5 2p6 | 1.2 |
550 Ti | GF116-4 | 192 | 32 | 24 | 900d | 29 GT/s | 21 GP/s | 1G | 4 | 192 | 98 GB/s | 11-4.2 | e2-2DH | 116/4 1p6 | 0.7 |
4x
Qdr | chip-SM | CUDA | TMU | ROP | core MHz | tex fill | pix fill | mem | clock Gb/s | interface width | bandwidth | DX-GL | port | power | SP TFLOPS | DP TFLOPS
K6000 | GK110-15 | 2880 | 240 | 48 | 902 | 216 GT/s | 54 GP/s | 12G | 6 | 384 | 288 GB/s | 11-4.5 | e3-2D2P | 225/2p6 | 5.2 | 1.7
K5000 | GK104-8 | 1536 | 128 | 32 | 706 | 90 GT/s | 22 GP/s | 4G | 5.4 | 256 | 173 GB/s | 11-4.4 | e2-2D2P | 122/1p6 | 2.1 | 0.09
K4000 | GK106-4 | 768 | 64 | 24 | 810 | 52 GT/s | 19 GP/s | 3G | 5.6 | 192 | 134 GB/s | 11-4.4 | e2-D2P | 80/1p6 | 1.2 |
K2000 | GK107-2 | 384 | 32 | 16 | 954 | 30 GT/s | 15 GP/s | 2G | 4 | 128 | 64 GB/s | 11-4.4 | e2-D2P | 51W | 0.7 |
K600 | GK107-1 | 192 | 16 | 16 | 876 | 14 GT/s | 14 GP/s | 1G3 | 1.7 | 128 | 29 GB/s | 11-4.3 | e2-DP | 41W | 0.3 |
K410 | GK107-1 | 192 | 16 | 8 | 706 | 11 GT/s | 5.6 GP/s | .5G3 | 1.8 | 64 | 14 GB/s | 11-4.4 | e3-DP | 38W | 0.2 |
K5200 | GK110B-12 | 2304 | 192 | 32 | 650 | 124 GT/s | 20 GP/s | 8G | 6 | 256 | 192 GB/s | 11-4.5 | e3- | 150W | 2.9 | 0.9
K4200 | GK104-7 | 1344 | 112 | 32 | 780 | 87 GT/s | 25 GP/s | 4G | 5.4 | 256 | 172 GB/s | 11-4.5 | e2- | 105W | 2.1 | 0.08
K2200 | GM107-5 | 640 | 40 | 16 | 1000 | 40 GT/s | 16 GP/s | 4G | 5 | 128 | 80 GB/s | 11-4.5 | e2- | 68W | 1.2 |
K620 | GM107-3 | 384 | 24 | 16 | 1000 | 24 GT/s | 16 GP/s | 2G | 1.8 | 128 | 29 GB/s | 11-4.5 | e2- | 45W | 0.7 |
K420 | GK107-1 | 192 | 16 | 16 | 780 | 12 GT/s | 12 GP/s | 1G | 1.8 | 128 | 14 GB/s | 11-4.5 | e2- | 41W | 0.3 |
HD4600 | | 80(20) | 4 | 2 | 1200 | 10 GT/s | 5.4 GP/s | 1.7G | 0.8 | 128 | 25 GB/s | 11-4.3 | | 45W | 0.43 |
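
Several columns in the two tables above are related: peak memory bandwidth is the memory data rate times the bus width, converted from bits to bytes. Here is a minimal sketch of that relation, checked against a few rows. `memory_bandwidth_gbs` and its `gpu_count` parameter (for dual-GPU cards like the Titan Z) are my own illustrative names, and some rows may not match exactly because the table values are rounded.

```python
def memory_bandwidth_gbs(data_rate_gbps, bus_width_bits, gpu_count=1):
    """Peak memory bandwidth in GB/s from per-pin data rate and bus width."""
    # Dividing by 8 converts bits to bytes; gpu_count covers dual-GPU boards.
    return data_rate_gbps * bus_width_bits / 8 * gpu_count


# (data rate in Gb/s, bus width in bits), copied from the tables above
rows = {
    "Titan X": (7, 384),
    "980": (7, 256),
    "780": (6, 384),
    "K6000": (6, 384),
}

for name, (rate, width) in rows.items():
    print(f"{name}: {memory_bandwidth_gbs(rate, width):.0f} GB/s")
```
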
  • Software Display library
    • Houdini 14 (2015): OpenGL 3 with 3.3 driver (supported card list: link)
      • Houdini simulation acceleration: OpenCL (Tesla, Qx000+, GTX500+, FirePro, R7000+, CPU CL lib); VRAM 3GB+; no multi-GPU support;
    • Blender 2.73a (2015): still recommends GTX cards for GPU rendering
    • Mari 2.6 (2015): OpenGL 4.0 (ref)
    • 3DS MAX 2015: viewport display driver options: Nitrous Direct3D, Nitrous Software, Legacy OpenGL (ref)
      • Nvidia IRay render (2015): uses CUDA-based GPUs (GTX750-980, K620, K2200)
      • V-Ray RT render (2015): can use CPU, GPU OpenCL, GPU Nvidia CUDA (ref)
    • Maya 2015 (2015): “Window” > Settings/Preferences:
      • The Rendering Engine defaults to OpenGL
      • another option is DirectX 11 with the default viewport set to Viewport 2.0, which is better optimized for GeForce cards

3D Gaming

  • K2200-K6000 vs GTX 750Ti-780Ti (ref)

3D GPU rendering

It is all about software using the GPU for final image rendering.

  • GPU rendering demands more calculation power (shader cores and clocks) than GPU viewport display drawing; thus the more computation power, the faster the GPU renders;
  • while GPU viewport display drawing is more about driver optimization and library support.
  • case study
  • case 1: Blender 2.7 Cycles render requires a certain Nvidia CUDA version, and performs better with it than with AMD OpenCL (2014.05 ref)
    • Choose CPU or GPU under User Preferences > System tab > Compute Device(s). Next, select CPU or GPU rendering in the Render properties.
    • Open Shading Language (OSL) is only supported on CPU.
    • Smoke/Fire rendering is not supported on GPU.
    • 8k, 4k, 2k and 1k image textures take up 256MB, 64MB, 16MB and 4MB of memory.
      • which means 100 2k textures = 1.6GB
  • case 2: Vray RT GPU unbiased render
    • speed comparison between GPUs: GTX 780Ti > K6000 > GTX Titan > GTX 780 » GTX 770/680 > K5000 > GTX 670 > GTX 760 » GTX 660 » K4000.
  • case 3: FurryBall GPU rendering benchmark: http://furryball.aaa-studio.eu/products/benchmarks.html
    1. GTX980 rasterize/raytrace: 3s and 37s
    2. 780Ti rasterize/raytrace: 4.6s and 27s
    3. 780 rasterize/raytrace: 5.9s and 30s
    4. GTX970 rasterize/raytrace: 3.3s and 43s
    5. GTX770 rasterize/raytrace: 6.3s and 66s
    6. GTX690 rasterize/raytrace: 7.7s and 71s
    7. GTX960 rasterize/raytrace: 5.8s and 77s
    8. K620 rasterize/raytrace: 6.1s and 80s
    9. 660Ti rasterize/raytrace: 8.5s and 81s
    10. K4200 rasterize/raytrace: 11s and 86s
    11. K2200 rasterize/raytrace: 12s and 99s
    12. K5000 rasterize/raytrace: 14s and 115s
    13. K4000 rasterize/raytrace: 21s and 128s
    14. Q4000 rasterize/raytrace: 26s and 142s
  • case 4: Arion GPU render benchmark: http://www.randomcontrol.com/arionbench
    1. 780 Ti - 2396
    2. GTX690 - 2395
  • case 5: Octane GPU render benchmark: http://render.otoy.com/octanebench/results.php
    1. Titan black: 107
    2. GTX980 : 100
    3. Titan, 780: 88
    4. GTX970 : 83
    5. K5200 : 67
    6. GTX690 : 52
    7. GTX660Ti : 43
    8. K4200 : 41
  • case 6: renderers using the GPU that are officially supported by Nvidia: http://www.nvidia.com/object/gpu-ray-tracing.html
  • case 7: GFXBench 5.0 result: http://gfxbench.com/result.jsp
  • case 8: Octane GPU render (doubling the card count halves the time): https://www.pugetsystems.com/labs/articles/Octane-Render-GPU-Performance-Comparison-790/
  • case 9: Premiere and Media Encoder, CPU vs GPU rendering: https://www.youtube.com/watch?v=g7cQK8jFPzo
  • case 10: low CPU high GPU combo vs high CPU medium GPU combo for game rendering: https://www.youtube.com/watch?v=TScpVAGNdcI
    • depends on type of work whether GPU or CPU can finish its task faster per frame
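
The texture memory figures in case 1 (256MB for an 8k texture, 1.6GB for 100 2k textures) can be reproduced with simple arithmetic. This is a minimal sketch that assumes square, uncompressed textures with 4 channels of 8 bits each and no mipmaps; those assumptions are mine, picked because they reproduce the numbers in the notes, and `texture_bytes` is an illustrative helper name.

```python
MIB = 1024 * 1024


def texture_bytes(width, height=None, channels=4, bytes_per_channel=1):
    """Memory of one uncompressed texture; square when height is omitted."""
    height = width if height is None else height
    return width * height * channels * bytes_per_channel


for name, size in (("8k", 8192), ("4k", 4096), ("2k", 2048), ("1k", 1024)):
    print(f"{name}: {texture_bytes(size) // MIB} MB")

# 100 textures at 2k resolution, the example from the notes
total_mb = 100 * texture_bytes(2048) / MIB
print(f"100 x 2k: {total_mb:.0f} MB")
```

Since each GPU can only use its own VRAM, a total like this has to fit on every card in the render rig, not on the sum of them.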

3D GPU rendering vs CPU Rendering

  • with the right software support, a cheaper GPU can render faster than a more expensive CPU; the GPU wins on [hardware cost/render speed]
  • for simple render processes the GPU runs faster, while for complex render calculations the CPU runs faster, due to the calculation limits characteristic of each platform; see the Arnold render talk: https://youtu.be/35morxCJOIQ?t=39m41s

SLI enabled vs disabled multiple GPUs setup

Multiple GPUs setup requirement for PCIE lanes

  • some GPUs require an allocation of at least x8 PCIE lanes
  • some CPUs only provide a max of x16 PCIE Gen3 lanes, which also allows a setup of two x8 PCIE Gen3 slots
  • some motherboards only offer the x16 PCIE Gen3 lanes option and don't allow a two x8 PCIE lanes setup;
    • this is the so-called PCIe Slot Configuration, such as x8 x8, while some boards offer x16 x4 instead
  • Supported CPUs for x8 x8 multiple GPU setups:
    • from i5-4690 to i7-7700k, they all have max x16 PCIE lanes and support the x16 or x8 x8 configuration
    • CPUs for the x99 chipset support max 28 (6800k, 5820k) or 40 lanes, so (x16, x16) (x16, x8, x8) (x8x8x8x8) are possible
  • Supported motherboards for x8 x8 multiple GPU setups:
    • x8 x8 support is mostly found on medium and higher chipsets and on medium and higher tier motherboards
    • for Intel, those motherboards have a Z series or X series chipset
    • for tier level, those motherboards are in the medium to higher offer range
      • such as ASRock Extreme level or OC Formula or Taichi or Zx, Professional, Killer
      • such as ASUS Maximus level or Sabertooth, A, E, Deluxe, Pro, WS
      • such as MSI Gaming series
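
As a quick check of the lane counts above, here is a minimal sketch that computes the upper bound on GPUs a CPU can feed at a given slot width. It only divides lanes; it ignores which slot configurations the motherboard actually wires up and any PLX chip, so treat the result as a ceiling. `max_gpus` and its parameters are illustrative names of my own.

```python
def max_gpus(cpu_lanes, lanes_per_gpu=8, cpu_count=1):
    """Upper bound on GPUs fed directly from CPU lanes at a given width."""
    return (cpu_lanes // lanes_per_gpu) * cpu_count


# Lane counts mentioned in the notes: 16 (i5-4690 to i7-7700k), 28 and 40 (x99)
for lanes in (16, 28, 40):
    print(f"{lanes} lanes: {max_gpus(lanes)} GPUs at x8, "
          f"{max_gpus(lanes, lanes_per_gpu=16)} at x16")

# Dual 40-lane CPUs, as in the 5+5 GPU case from the summary table
print("dual 40 lanes:", max_gpus(40, cpu_count=2), "GPUs at x8")
```
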

Software Factors

The reason behind performance variation based on software and GPU combination:

  • different software uses different drivers for display
  • different GPU types/brands are designed for a specific display library (like OpenGL, DirectX)

Thus, if the software supports the GPU's best display library, it gets a boost; otherwise it just lags behind.

OpenGL vs DirectX support in GPU

How OpenGL calls work

  1. When a program uses hardware-accelerated OpenGL mode, an OpenGL function call is made and passed to the driver
  2. if the driver detects that acceleration is active and the specific operation has direct hardware support, the function is passed directly to the GPU
  3. otherwise the command is processed and executed through standard software calls and algorithms by the CPU
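
The three steps above can be sketched as a toy model. This is not a real OpenGL binding: the `Driver` class, its fields and the operation names are all hypothetical, and it only illustrates the decision of where a call ends up being executed.

```python
from dataclasses import dataclass, field


@dataclass
class Driver:
    """Toy model of a graphics driver; every name here is made up."""
    acceleration_enabled: bool = True
    hardware_ops: set = field(default_factory=set)

    def dispatch(self, op):
        """Return which processor ends up executing the OpenGL call."""
        if self.acceleration_enabled and op in self.hardware_ops:
            return "GPU"  # step 2: passed directly to the GPU
        return "CPU"      # step 3: software fallback on the CPU


# Hypothetical operation names, used only for illustration
driver = Driver(hardware_ops={"draw_triangles", "texture_lookup"})
for op in ("draw_triangles", "accumulation_buffer"):
    print(op, "->", driver.dispatch(op))
```

The fallback branch is why a driver that lacks hardware support for an operation can make a fast card feel slow in one program and fine in another.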
  • these APIs can run on the CPU as well
    • Microsoft Windows and Apple OS X come with a software-based OpenGL driver.
      • However, these drivers rely heavily on the CPU to perform the rendering calculations of OpenGL (often not efficiently).
  • specialized in graphics calculation and with OpenGL compatibility, a GPU significantly enhances OpenGL performance, upward of 3000 percent; this is called “Hardware Acceleration”. (ref: http://help.sketchup.com/en/article/114278)
    • in addition, you also need “Hardware Acceleration” enabled in the GPU driver settings
    • however, not all 3D drivers in the gaming graphics card market are 100% OpenGL compatible (even if they say so)
    • Most 3D drivers for games are not tested with 3D programs; incompatibility problems can occur and need a fix from the card manufacturer.
    • Disable “Hardware Acceleration” if problems occur in 3D rendering of models or the driver is not 100% OpenGL compatible
    • Graphics card drivers are proprietary and are maintained solely by the manufacturer, and that affects the quality of OpenGL performance
    • note:
      • OpenGL was originally designed to work with professional applications on special hardware, and then moved to cross-platform; professional graphics cards target professional applications, thus they are made with drivers that fully support OpenGL acceleration.
      • DirectX is made by Microsoft to work with Windows game platforms, thus it stays Windows only, and gaming graphics cards of course target Windows DirectX acceleration drivers.
  • based on that reason,
    • if software is made to be used on Linux, Mac and Windows, it will use the OpenGL library to draw 3D.
      • like Adobe (win, mac), Maya, Houdini, Softimage, Cinema 4D, Blender
    • if software is made to run only on Windows, it most likely uses the DirectX library to draw 3D.
      • like 3DS Max, (Maya 2015 now supports DirectX in the win version)

OpenCL for simulation with GPU

  • some simulations use Nvidia's CUDA for GPU computation, while some cross-platform and cross-hardware software uses OpenCL for GPU computation;
    • for example, Houdini uses OpenCL for GPU-based simulation (VRAM limits the resolution of simulations, like Pyro)
  • Maya nDynamics still uses the CPU only, while the Maya Bullet physics engine is accelerated by the GPU.

Info Ref