# GPU Parallelization and Scaling
Since the dynamic model is highly parallelizable (a system of ODEs), I decided to restructure the code to use GPU kernels to parallelize the computations. Since JavaScript/WebGL isn't optimal for general-purpose computation, I ported the code to Julia, which has GPU kernel libraries for both CUDA and OpenCL. Using CuArrays it is very simple (simpler than C++) to turn arrays of any type (even structs) into CUDA arrays and kernels. [1](https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/), [2](https://nextjournal.com/sdanisch/julia-gpu-programming) and [3](http://mikeinnes.github.io/2017/08/24/cudanative.html) are the tutorials I followed to learn GPU programming.
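As a flavor of what that porting looks like, here is a minimal, hypothetical example (not taken from the project code) of moving host arrays to the GPU and letting broadcasting generate the kernel, written against the current CUDA.jl package that superseded CuArrays:

```julia
using CUDA

# Host arrays move to the device with a single constructor call.
x = rand(Float64, 10_000)
y = rand(Float64, 10_000)
xd, yd = CuArray(x), CuArray(y)

# Broadcasted expressions are fused and compiled into a GPU kernel automatically.
zd = xd .* yd .+ 2.0

println(sum(zd))   # reductions also execute on the device
```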
What is neat about CuArrays and CUDAnative is that you kind of get derivatives for free when they are combined with forward-mode differentiation libraries. That will be very handy in my next steps, when I attempt to do search and optimization.
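A sketch of that idea, assuming ForwardDiff supplies the dual numbers and, again, using the current CUDA.jl API; the elementwise function and array here are placeholders, not the simulation code:

```julia
using CUDA, ForwardDiff

# Seed each element with a unit derivative, then broadcast an ordinary function;
# the dual-number arithmetic runs inside the generated GPU kernel.
x  = CuArray(rand(Float64, 1_000))
xd = ForwardDiff.Dual.(x, 1.0)
yd = sin.(xd) .* xd                     # placeholder elementwise function

vals   = ForwardDiff.value.(yd)         # f(x)
derivs = ForwardDiff.partials.(yd, 1)   # f'(x), computed alongside the values
```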
The CuArrays library also has support for parallelizing across multiple nodes with multiple GPUs, which I am excited to try next.
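For reference, enumerating and selecting among the GPUs on a single node looks roughly like this in CUDA.jl (device names will of course differ per machine):

```julia
using CUDA

# List the GPUs visible on this node and make one of them the active device.
for (i, dev) in enumerate(CUDA.devices())
    println("GPU ", i, ": ", CUDA.name(dev))
end
CUDA.device!(0)   # select the first device (ordinals are zero-indexed)
```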
The pseudo-code workflow is as follows:
```
each timeStep
    foreach edge
        fetch currentPosition from nodes
        calculate and update strain and stresses
        calculate and update internal forces/moments
    foreach node
        fetch corresponding internal forces/moments from edges
        fetch external forces/moments
        integrate currentPosition from forces
```
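A rough Julia sketch of how the inner edge loop maps onto a CUDA kernel is shown below; the flattened data layout, the array names, and the spring-only force model are my own assumptions for illustration, not the project's actual kernels:

```julia
using CUDA

# Hypothetical layout: edge e connects nodes src[e] and dst[e]; node positions
# are flattened as (x, y, z) triples in `pos`. Only an axial spring force is
# computed here, no moments.
function update_edges!(strain, force, src, dst, pos, rest_len, stiffness)
    e = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if e <= length(strain)
        i, j = src[e], dst[e]
        dx = pos[3j-2] - pos[3i-2]
        dy = pos[3j-1] - pos[3i-1]
        dz = pos[3j]   - pos[3i]
        len = sqrt(dx * dx + dy * dy + dz * dz)
        strain[e] = (len - rest_len[e]) / rest_len[e]
        force[e]  = stiffness[e] * strain[e]
    end
    return nothing
end

n_nodes, n_edges = 300, 960               # sizes of the 4x4x4 lattice below
pos       = CUDA.rand(Float64, 3n_nodes)
src       = CuArray(rand(1:n_nodes, n_edges))
dst       = CuArray(rand(1:n_nodes, n_edges))
rest_len  = CUDA.fill(1.0, n_edges)
stiffness = CUDA.fill(1.0e3, n_edges)
strain    = CUDA.zeros(Float64, n_edges)
force     = CUDA.zeros(Float64, n_edges)

threads = 256
blocks  = cld(n_edges, threads)
@cuda threads=threads blocks=blocks update_edges!(strain, force, src, dst, pos, rest_len, stiffness)
```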
Fetching the current node positions in the updateEdges kernel, as well as fetching the corresponding internal forces/moments from the edges in the updateNodes kernel, is the step that takes most of the time, since it relies on allowing scalar access and operations on GPU arrays, which is relatively slow.
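For context, this is the scalar-indexing trap the GPU array packages warn about; the snippet below (current CUDA.jl API, illustrative array) shows how scalar access is disabled and what the fast path looks like instead:

```julia
using CUDA

a = CUDA.rand(Float64, 1024)

# Element-by-element access from the host forces a device-to-host transfer per
# element, so the package lets you forbid it outright while optimizing.
CUDA.allowscalar(false)
# a[1]               # would now raise an error instead of silently being slow

# The fast path keeps whole-array operations (broadcasts, reductions, kernels)
# on the device.
total = sum(a)
```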
## Benchmark

### Older Benchmarks
| Lattice Size | #nodes | #links | #steps | Total CPU (sec) | GPU Init (sec) | Total GPU (sec) | Total GPU V100 (sec) |
|---|---|---|---|---|---|---|---|
| 4 | 300 | 960 | 10 | 3.014437901 | 12.72 | 0.0124470 | 0.0028345 |
| 4 | 300 | 960 | 100 | 7.700779099 | 14.28 | 0.0571308 | 0.0271445 |
| 4 | 300 | 960 | 200 | 12.9343908 | 12.66 | 0.1028513 | 0.0535324 |
| 4 | 300 | 960 | 400 | 23.8421247 | 12.48 | 0.1864829 | 0.1064613 |
| 4 | 300 | 960 | 1000 | 56.1430382 | 13.21 | 0.4725757 | 0.2702506 |
| 5 | 540 | 1,800 | 200 | 22.9971574 | 13.29 | 0.1021396 | 0.0537932 |
| 6 | 882 | 3,024 | 200 | 38.838537 | 12.59 | 0.1044742 | 0.0546350 |
| 7 | 1,344 | 4,704 | 200 | 60.9359617 | 13.11 | 0.1043413 | 0.0552334 |
| 8 | 1,944 | 6,912 | 200 | 87.5866625 | 13.58 | 0.1611617 | 0.0560469 |
| 9 | 2,700 | 9,720 | 200 | 128.7116549 | 12.35 | 0.1674361 | 0.0568261 |
| 10 | 3,630 | 13,200 | 200 | 173.5449189 | 14.09 | 0.2076308 | 0.0621499 |
| 15 | 11,520 | 43,200 | 200 | | 15.56 | 0.4721120 | 0.0975230 |
The first set of benchmarks was done on my desktop (GeForce GTX 1070 Ti GPU). The graphs show that even though the serial CPU code is faster for small problems, for larger problems or when more timesteps are needed the GPU version can offer a speedup of 865x (10×10×10 lattice with 200 timesteps).
If we look at the last row of the table and ignore the kernel initialization (a constant ~13 seconds), it took the GPU ~1 second to update 979,102 edges 200 times, which is roughly 19,580,400 elements per second.
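For anyone reproducing these timings, the pattern below (illustrative arrays, not the simulator) is how the one-off compilation cost can be separated from the steady-state GPU time in Julia; `CUDA.@sync` makes the timer wait for the asynchronous kernels to finish:

```julia
using CUDA

x = CUDA.rand(Float64, 1_000_000)
y = CUDA.rand(Float64, 1_000_000)

# First execution includes kernel compilation: this corresponds to "GPU Init".
@time CUDA.@sync y .+= 2.0 .* x

# Steady state: time 200 updates, synchronizing so the GPU work is counted.
@time CUDA.@sync for _ in 1:200
    y .+= 2.0 .* x
end
```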
The next step would be to try it on the V100s to see the maximum performance. I suspect much higher performance, since I am using Float64 CUDA arrays and my personal desktop GPU's double-precision throughput is only about 1/32 of its single-precision throughput, which makes it roughly 32x slower. So I suspect I can reach a minimum of ~640 million elements per second.
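Testing that hypothesis is mostly a matter of the element type used when the device arrays are allocated; the names and sizes here are just illustrative:

```julia
using CUDA

n_nodes = 300
pos64 = CUDA.zeros(Float64, 3n_nodes)   # what the benchmarks above used
pos32 = CUDA.zeros(Float32, 3n_nodes)   # single precision: the fast path on
                                        # consumer GPUs such as the GTX 1070 Ti
```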
I also have to update the visualization tunnel to get live visualization, as well as finish benchmarking to plot the convergence of the dynamic simulation for different damping parameters.