# GPU Parallelization and Scaling

Since the dynamic model is highly parallelizable (a system of ODEs), I decided to restructure the code to use GPU kernels to parallelize the computations. Because JavaScript/WebGL isn't optimal for general-purpose computation, I ported the code to Julia, which has GPU kernel libraries for both CUDA and OpenCL. Using CuArrays it is very simple (simpler than in C++) to turn arrays of any type (even structs) into CUDA arrays and kernels. [1](https://juliagpu.gitlab.io/CUDA.jl/tutorials/introduction/), [2](https://nextjournal.com/sdanisch/julia-gpu-programming) and [3](http://mikeinnes.github.io/2017/08/24/cudanative.html) were the tutorials I followed to learn GPU programming.
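
To give a sense of how little ceremony this involves, here is a minimal, hypothetical sketch (the arrays and the Euler-style update are made up for illustration, and the package is loaded under its current name, CUDA.jl, which bundles the former CuArrays/CUDAnative functionality):

```julia
using CUDA   # successor package bundling CuArrays + CUDAnative

n = 10_000
positions  = rand(Float64, n)         # ordinary CPU arrays
velocities = rand(Float64, n)

d_pos = CuArray(positions)            # one call moves the data to GPU memory
d_vel = CuArray(velocities)

dt = 1e-3
d_pos .+= dt .* d_vel                 # the broadcast compiles to a single GPU kernel

positions = Array(d_pos)              # copy back to the host when needed
```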

What is neat about CuArrays and CUDAnative is that you kind of get derivatives for free when they are combined with forward-differentiation libraries. That will be very handy in my next steps, when I attempt to do search and optimization.
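
As a rough illustration of that idea (a sketch only: I am assuming ForwardDiff.jl as the forward-mode library, and the toy force law stands in for the real constitutive model), dual numbers are plain isbits structs, so they broadcast through GPU kernels much like floats:

```julia
using CUDA, ForwardDiff

# Toy force law standing in for the real edge model.
f(x) = -10.0f0 * x + 0.5f0 * x^3

x = CUDA.rand(Float32, 1024)               # displacements on the GPU
forces = f.(x)                             # forward pass runs as a GPU kernel
stiffness = ForwardDiff.derivative.(f, x)  # df/dx element-wise, also on the GPU
```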

The CuArrays libraries also have support for parallelizing across multiple nodes with multiple GPUs, which I am excited to try next.
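
The pattern I have in mind for that looks roughly like the following sketch (the chunking and the placeholder work are hypothetical, and a real multi-node run would hand `addprocs` a cluster manager instead of spawning local workers):

```julia
using Distributed, CUDA

addprocs(length(CUDA.devices()))       # one Julia worker per local GPU
@everywhere using CUDA

# Hypothetical split of the workload into one chunk per worker.
chunks = [rand(Float32, 1_000) for _ in workers()]

results = pmap(chunks) do chunk
    CUDA.device!(myid() - 2)           # worker ids start at 2, device ordinals at 0
    Array(CuArray(chunk) .* 2.0f0)     # placeholder for the real per-chunk update
end
```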

The pseudo-code workflow is as follows:

```
each timeStep
  foreach edge
    fetch currentPosition from nodes
    calculate and update strain and stresses
    calculate and update internal forces/moments
  foreach node
    fetch corresponding internal forces/moments from edges
    fetch external forces/moments
    integrate currentPosition from forces
```

Fetching the current node positions in the updateEdges kernel, as well as fetching the corresponding internal forces/moments from the edges in the updateNodes kernel, is the step that takes most of the time, since it relies on scalar access and operations on GPU arrays, which is relatively slow.
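
To make the workflow concrete, here is a heavily simplified Julia sketch of the two kernels. The names (`update_edges!`, `update_nodes!`), the one-scalar-DOF-per-node layout and the toy spring law are illustrative assumptions, not the actual data structures of the simulation; the point is that the gather/scatter happens inside device code, whereas indexing a CuArray element by element from the host falls back to the slow scalar path mentioned above.

```julia
using CUDA

# Hypothetical flat state: one scalar DOF per node, two endpoints per edge.
nnodes, nedges = 300, 960
pos   = CUDA.rand(Float32, nnodes)               # node positions
vel   = CUDA.zeros(Float32, nnodes)              # node velocities
fnode = CUDA.zeros(Float32, nnodes)              # accumulated internal forces per node
fext  = CUDA.zeros(Float32, nnodes)              # external forces
n1    = CuArray(Int32.(rand(1:nnodes, nedges)))  # edge endpoint indices
n2    = CuArray(Int32.(rand(1:nnodes, nedges)))
rest  = CUDA.rand(Float32, nedges)               # rest lengths
force = CUDA.zeros(Float32, nedges)              # internal force per edge
k, m, dt = 10.0f0, 1.0f0, 1.0f-3

# "foreach edge": fetch node positions, update strain/force, scatter onto nodes.
function update_edges!(force, fnode, pos, n1, n2, rest, k)
    e = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if e <= length(force)
        i, j = n1[e], n2[e]
        strain = pos[j] - pos[i] - rest[e]
        f = k * strain
        force[e] = f
        CUDA.@atomic fnode[i] += f       # atomic because several edges share a node
        CUDA.@atomic fnode[j] -= f
    end
    return nothing
end

# "foreach node": combine internal and external forces and integrate.
function update_nodes!(pos, vel, fnode, fext, m, dt)
    i = (blockIdx().x - 1) * blockDim().x + threadIdx().x
    if i <= length(pos)
        a = (fnode[i] + fext[i]) / m
        vel[i] += dt * a                 # explicit Euler integration
        pos[i] += dt * vel[i]
        fnode[i] = 0.0f0                 # reset the accumulator for the next step
    end
    return nothing
end

threads = 256
for _ in 1:200                           # each timeStep
    @cuda threads=threads blocks=cld(nedges, threads) update_edges!(force, fnode, pos, n1, n2, rest, k)
    @cuda threads=threads blocks=cld(nnodes, threads) update_nodes!(pos, vel, fnode, fext, m, dt)
end
CUDA.synchronize()
```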

## Benchmark

### Older Benchmarks

| Lattice Size | #nodes | #links | #steps | Total CPU (sec) | GPU Init (sec) | Total GPU (sec) | Total GPU V100 (sec) |
|---|---|---|---|---|---|---|---|
| 4 | 300 | 960 | 10 | 3.014437901 | 12.72 | 0.0124470 | 0.0028345 |
| 4 | 300 | 960 | 100 | 7.700779099 | 14.28 | 0.0571308 | 0.0271445 |
| 4 | 300 | 960 | 200 | 12.9343908 | 12.66 | 0.1028513 | 0.0535324 |
| 4 | 300 | 960 | 400 | 23.8421247 | 12.48 | 0.1864829 | 0.1064613 |
| 4 | 300 | 960 | 1000 | 56.1430382 | 13.21 | 0.4725757 | 0.2702506 |
| 5 | 540 | 1,800 | 200 | 22.9971574 | 13.29 | 0.1021396 | 0.0537932 |
| 6 | 882 | 3,024 | 200 | 38.838537 | 12.59 | 0.1044742 | 0.0546350 |
| 7 | 1,344 | 4,704 | 200 | 60.9359617 | 13.11 | 0.1043413 | 0.0552334 |
| 8 | 1,944 | 6,912 | 200 | 87.5866625 | 13.58 | 0.1611617 | 0.0560469 |
| 9 | 2,700 | 9,720 | 200 | 128.7116549 | 12.35 | 0.1674361 | 0.0568261 |
| 10 | 3,630 | 13,200 | 200 | 173.5449189 | 14.09 | 0.2076308 | 0.0621499 |
| 15 | 11,520 | 43,200 | 200 | — | 15.56 | 0.4721120 | 0.0975230 |


The first set of benchmarks was done on my desktop (GeForce GTX 1070 Ti GPU). The results show that even though the serial CPU code is faster for small problems, for larger ones or when more timesteps are needed the GPU version can offer a speedup of 865x (10 × 10 × 10 lattice with 200 timesteps).

If we look at the last row of the table, ignoring the kernel initialization (a roughly constant ~13 seconds), it took the GPU ~1 second to update 979,102 edges 200 times, which is roughly 19,580,400 element updates per second.
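
For reference, this is roughly how such a throughput figure can be measured; the step function here is a trivial stand-in for the real edge/node kernels, and `CUDA.@sync` is needed because kernel launches are asynchronous:

```julia
using CUDA

nedges, nsteps = 43_200, 200
force = CUDA.zeros(Float32, nedges)

step!(force) = (force .+= 1.0f0)      # placeholder for one full simulation step

step!(force)                          # warm-up call so compilation isn't timed
t = @elapsed CUDA.@sync begin
    for _ in 1:nsteps
        step!(force)
    end
end

println("element updates per second ≈ ", nedges * nsteps / t)
```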

The next step would be to try it on the V100s to see the maximum performance. I suspect it will be much higher, since I am using Float64 CUDA arrays and I believe my personal desktop's GPU runs double precision roughly 32x slower than single precision. So I suspect I can reach a minimum of ~640 million elements per second.
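
If double precision really is the limiter, the element type is easy to parameterize in Julia; a sketch (whether `Float32` is accurate enough for the real constitutive model is an open question):

```julia
using CUDA

T = Float32                        # or Float64 on hardware with fast FP64 (e.g. a V100)
pos   = CUDA.zeros(T, 3_630)       # sizes mirror the 10x10x10 lattice row above
force = CUDA.zeros(T, 13_200)
dt    = T(1e-3)                    # keep scalar literals in the same precision
```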

I also have to update the visualization tunnel to try to get live visualization, as well as finish benchmarking to plot the convergence of the dynamic simulation for different damping parameters.