# GPU Parallelization and Scaling
Since the dynamic model is highly parallelizable (a system of ODEs), I decided to restructure the code to use GPU kernels to parallelize the computations. Since JavaScript/WebGL isn't optimal for general-purpose computation, I ported the code to Julia, which has GPU kernel libraries for both CUDA and OpenCL. Using CuArrays it is very simple (simpler than C++) to turn arrays of any type (even structs) into CUDA arrays and kernels. [1](https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/), [2](https://nextjournal.com/sdanisch/julia-gpu-programming) and [3](http://mikeinnes.github.io/2017/08/24/cudanative.html) are the tutorials I followed to learn GPU programming.
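As a flavor of what that porting looks like, here is a minimal, hypothetical example (not taken from the project code) of moving host arrays to the GPU and letting broadcasting generate the kernel, written against the current CUDA.jl package that superseded CuArrays:

```julia
using CUDA

# Host arrays move to the device with a single constructor call.
x = rand(Float64, 10_000)
y = rand(Float64, 10_000)
xd, yd = CuArray(x), CuArray(y)

# Broadcasted expressions are fused and compiled into a GPU kernel automatically.
zd = xd .* yd .+ 2.0

println(sum(zd))   # reductions also execute on the device
```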
What is neat about CuArrays and CUDAnative is that you kind of get derivatives for free when they are combined with forward-mode differentiation libraries. That will be very handy in my next steps, when I attempt to do search and optimization.
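A sketch of that idea, assuming ForwardDiff supplies the dual numbers and, again, using the current CUDA.jl API; the elementwise function and array here are placeholders, not the simulation code:

```julia
using CUDA, ForwardDiff

# Seed each element with a unit derivative, then broadcast an ordinary function;
# the dual-number arithmetic runs inside the generated GPU kernel.
x  = CuArray(rand(Float64, 1_000))
xd = ForwardDiff.Dual.(x, 1.0)
yd = sin.(xd) .* xd                     # placeholder elementwise function

vals   = ForwardDiff.value.(yd)         # f(x)
derivs = ForwardDiff.partials.(yd, 1)   # f'(x), computed alongside the values
```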
The CuArrays library also has support for parallelizing across multiple nodes with multiple GPUs, which I am excited to try next.
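For reference, enumerating and selecting among the GPUs on a single node looks roughly like this in CUDA.jl (device names will of course differ per machine):

```julia
using CUDA

# List the GPUs visible on this node and make one of them the active device.
for (i, dev) in enumerate(CUDA.devices())
    println("GPU ", i, ": ", CUDA.name(dev))
end
CUDA.device!(0)   # select the first device (ordinals are zero-indexed)
```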
The pseudo-code workflow is as follows:
```
each timeStep
    foreach edge
        fetch currentPosition from nodes
        calculate and update strain and stresses
        calculate and update internal forces/moments
    foreach node
        fetch corresponding internal forces/moments from edges
        fetch external forces/moments
        integrate currentPosition from forces
```
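A rough Julia sketch of how the inner edge loop maps onto a CUDA kernel is shown below; the flattened data layout, the array names, and the spring-only force model are my own assumptions for illustration, not the project's actual kernels:

```julia
using CUDA

# Hypothetical layout: edge e connects nodes src[e] and dst[e]; node positions
# are flattened as (x, y, z) triples in `pos`. Only an axial spring force is
# computed here, no moments.
function update_edges!(strain, force, src, dst, pos, rest_len, stiffness)
    e = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if e <= length(strain)
        i, j = src[e], dst[e]
        dx = pos[3j-2] - pos[3i-2]
        dy = pos[3j-1] - pos[3i-1]
        dz = pos[3j]   - pos[3i]
        len = sqrt(dx * dx + dy * dy + dz * dz)
        strain[e] = (len - rest_len[e]) / rest_len[e]
        force[e]  = stiffness[e] * strain[e]
    end
    return nothing
end

n_nodes, n_edges = 300, 960               # sizes of the 4x4x4 lattice below
pos       = CUDA.rand(Float64, 3n_nodes)
src       = CuArray(rand(1:n_nodes, n_edges))
dst       = CuArray(rand(1:n_nodes, n_edges))
rest_len  = CUDA.fill(1.0, n_edges)
stiffness = CUDA.fill(1.0e3, n_edges)
strain    = CUDA.zeros(Float64, n_edges)
force     = CUDA.zeros(Float64, n_edges)

threads = 256
blocks  = cld(n_edges, threads)
@cuda threads=threads blocks=blocks update_edges!(strain, force, src, dst, pos, rest_len, stiffness)
```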
Fetching the current node positions in the updateEdges kernel, as well as fetching the corresponding internal forces/moments from the edges in the updateNodes kernel, is the step that takes most of the time, since it relies on allowing scalar access and operations on GPU arrays, which is relatively slow.
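For context, this is the scalar-indexing trap the GPU array packages warn about; the snippet below (current CUDA.jl API, illustrative array) shows how scalar access is disabled and what the fast path looks like instead:

```julia
using CUDA

a = CUDA.rand(Float64, 1024)

# Element-by-element access from the host forces a device-to-host transfer per
# element, so the package lets you forbid it outright while optimizing.
CUDA.allowscalar(false)
# a[1]               # would now raise an error instead of silently being slow

# The fast path keeps whole-array operations (broadcasts, reductions, kernels)
# on the device.
total = sum(a)
```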
## Benchmark

### Older Benchmarks
| Lattice Size | #nodes | #links | #steps | Total CPU (sec) | GPU Init (sec) | Total GPU (sec) | Total GPU V100 (sec) |
|---|---|---|---|---|---|---|---|
| 4 | 300 | 960 | 10 | 3.014437901 | 12.72 | 0.0124470 | 0.0028345 |
| 4 | 300 | 960 | 100 | 7.700779099 | 14.28 | 0.0571308 | 0.0271445 |
| 4 | 300 | 960 | 200 | 12.9343908 | 12.66 | 0.1028513 | 0.0535324 |
| 4 | 300 | 960 | 400 | 23.8421247 | 12.48 | 0.1864829 | 0.1064613 |
| 4 | 300 | 960 | 1000 | 56.1430382 | 13.21 | 0.4725757 | 0.2702506 |
| 5 | 540 | 1,800 | 200 | 22.9971574 | 13.29 | 0.1021396 | 0.0537932 |
| 6 | 882 | 3,024 | 200 | 38.838537 | 12.59 | 0.1044742 | 0.0546350 |
| 7 | 1,344 | 4,704 | 200 | 60.9359617 | 13.11 | 0.1043413 | 0.0552334 |
| 8 | 1,944 | 6,912 | 200 | 87.5866625 | 13.58 | 0.1611617 | 0.0560469 |
| 9 | 2,700 | 9,720 | 200 | 128.7116549 | 12.35 | 0.1674361 | 0.0568261 |
| 10 | 3,630 | 13,200 | 200 | 173.5449189 | 14.09 | 0.2076308 | 0.0621499 |
| 15 | 11,520 | 43,200 | 200 | | 15.56 | 0.4721120 | 0.0975230 |
The first set of benchmarks was done on my desktop (GeForce GTX 1070 Ti GPU). The graphs show that even though the serial CPU code is faster for small problems, for larger problems or when more timesteps are needed the GPU version can offer a speedup of 865x (10×10×10 lattice with 200 timesteps).
If we look at the last row of the table and ignore the kernel initialization (a constant ~13 seconds), it took the GPU ~1 second to update 979,102 edges 200 times, which is roughly 19,580,400 elements per second.
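For anyone reproducing these timings, the pattern below (illustrative arrays, not the simulator) is how the one-off compilation cost can be separated from the steady-state GPU time in Julia; `CUDA.@sync` makes the timer wait for the asynchronous kernels to finish:

```julia
using CUDA

x = CUDA.rand(Float64, 1_000_000)
y = CUDA.rand(Float64, 1_000_000)

# First execution includes kernel compilation: this corresponds to "GPU Init".
@time CUDA.@sync y .+= 2.0 .* x

# Steady state: time 200 updates, synchronizing so the GPU work is counted.
@time CUDA.@sync for _ in 1:200
    y .+= 2.0 .* x
end
```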
The next step would be to try it on the V100s to see the maximum performance. I suspect much higher performance, since I am using Float64 CUDA arrays and my personal desktop GPU's double-precision throughput is only about 1/32 of its single-precision throughput, which makes it roughly 32x slower. So I suspect I can reach a minimum of ~640 million elements per second.
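Testing that hypothesis is mostly a matter of the element type used when the device arrays are allocated; the names and sizes here are just illustrative:

```julia
using CUDA

n_nodes = 300
pos64 = CUDA.zeros(Float64, 3n_nodes)   # what the benchmarks above used
pos32 = CUDA.zeros(Float32, 3n_nodes)   # single precision: the fast path on
                                        # consumer GPUs such as the GTX 1070 Ti
```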
I also have to update the visualization tunnel to get live visualization, as well as finish benchmarking to plot the convergence of the dynamic simulation for different damping parameters.