XSEDE14
Thursday, July 17 • 11:00am - 11:30am
Brownian Dynamics Simulations with Hydrodynamic Interactions on GPUs


Brownian dynamics (BD) is used in multiple fields, including biology, biochemistry, chemical engineering and materials science. The goal of this work is to develop a user-friendly Brownian dynamics code for GPUs that can be used across these fields. The challenge is the incorporation of long-range interactions, such as hydrodynamic interactions, for large systems; this is addressed by using a particle-mesh Ewald algorithm. Long-timescale simulations are enabled by using GPUs to parallelize the computations. Our code runs on a hybrid of GPUs and multicore CPUs, a commonly accessible platform for large-scale, long-timescale BD simulations.
We have previously developed a BD code using fast algorithms for Intel Xeon Phi [1]. In this work, GPU-specific optimizations are used. In addition to GPU acceleration, our work extends the original algorithm in two ways: 1) the Ewald parameter is carefully selected to control the load balance between the CPU and the GPU as well as to reduce computation time, and 2) the trade-off between speed and accuracy associated with different B-spline interpolation orders is examined. We also propose an interface so that our GPU-accelerated particle-mesh Ewald algorithm can be used with different kernels (electrostatics, Stokes, etc.), giving it applicability well beyond BD. Our code can apply SPME to a single vector of forces or to multiple vectors of forces; the latter is useful for computing Brownian displacements over multiple time steps. In each case, both the best implementation and the best choice of the Ewald parameter differ.
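As an illustration, the host-side sketch below shows one way such a kernel-agnostic interface could be organized: the SPME machinery is separated from the influence-function "kernel", and both a single force vector and a batch of force vectors are supported. The type and function names (SpmeKernel, spme_apply, spme_apply_batch) are hypothetical assumptions, not the authors' actual API.

// Hypothetical sketch of a kernel-agnostic SPME interface (not the authors' API).
#include <cuda_runtime.h>   // double2
#include <cstddef>

// A "kernel" supplies the reciprocal-space influence function, so the same
// SPME machinery can serve electrostatics, Stokes/RPY hydrodynamics, etc.
struct SpmeKernel {
    // Apply the influence function in place to the Fourier-transformed grid.
    virtual void apply_influence(double2 *grid_hat,    // complex grid values (device)
                                 const double *kvecs,  // wave vectors (device)
                                 std::size_t nk,       // number of reciprocal grid points
                                 double ewald_alpha) const = 0;
    virtual ~SpmeKernel() {}
};

struct SpmeContext;  // grid size, B-spline order, cuFFT plans, device buffers

// Single right-hand side: one vector of forces in, one result vector out
// (velocities for a hydrodynamic kernel, potentials for electrostatics).
void spme_apply(SpmeContext &ctx, const SpmeKernel &kernel,
                const double *pos, const double *forces, double *out,
                std::size_t n_particles);

// Multiple right-hand sides: a batch of force vectors, as needed when
// computing Brownian displacements over multiple time steps.
void spme_apply_batch(SpmeContext &ctx, const SpmeKernel &kernel,
                      const double *pos, const double *forces, double *out,
                      std::size_t n_particles, std::size_t n_rhs);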
We use the SPME variant of particle-mesh Ewald. By avoiding the formation of a large, dense hydrodynamic matrix, we can simulate systems with as many as 500,000 particles on GPUs with 8 GB of memory; traditional implementations of BD (without SPME) are limited to approximately 10,000 particles on a 32 GB machine. In this work, the SPME operation is accelerated on the GPU and consists of six parts (illustrative sketches of the spreading and reciprocal-space steps follow the list):
1) Construction of the interpolation matrix, P: Depending on the size of the problem and available memory, we either compute P on the device and store it explicitly, or we multiply with it without storing it explicitly. There are speed and memory tradeoffs associated with the two approaches.
2) Spreading of the forces: The forces at each particle position are spread onto a regular grid using P. The matrix P is large and sparse. Depending on whether P is stored explicitly or not, we either use an optimized hand-written CUDA kernel to perform the sparse matrix-vector multiplication, or compute with the P matrix as it is being formed. CUSPARSE was evaluated for use in the former case; however, the performance cost of the necessary data format transformations outweighed the potential benefits.
3) Forward FFT: A forward FFT of the grid of interpolated forces is done. CUFFT is used for this step.
4) Application of the influence function: The application of the influence function consists of a large number of tiny dense matrix-vector products. A hand-written CUDA kernel is used for this step. Because of the small size of the matrices involved, batched matrix multiplication from CUBLAS does not perform as well as the customized kernel.
5) Inverse FFT: An inverse FFT of the result of step 4 is done. Again, CUFFT is used for this step.
6) Interpolation: P is used to interpolate the result of step 5 back onto the particle positions. As in step 2, we either apply the explicitly stored P or compute with P on the fly. Hand-written CUDA is used for this step.
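To make step 2 (and, by symmetry, step 6) concrete, the CUDA sketch below spreads particle forces onto the mesh while computing the B-spline weights of P on the fly, i.e. without storing P explicitly. The interpolation order, data layout, and cubic periodic box are illustrative assumptions, not details taken from this work.

#include <cuda_runtime.h>

#define P_ORDER 4   // B-spline interpolation order (a tunable accuracy/speed trade-off)

// Cardinal B-spline weights of order P_ORDER at fractional offset u in [0,1),
// built with the standard recursion used in smooth PME.
__device__ void bspline_weights(double u, double w[P_ORDER]) {
    w[0] = 1.0 - u;
    w[1] = u;
    for (int p = 3; p <= P_ORDER; ++p) {
        double div = 1.0 / (p - 1);
        w[p - 1] = div * u * w[p - 2];
        for (int j = 1; j < p - 1; ++j)
            w[p - 1 - j] = div * ((u + j) * w[p - 2 - j] + (p - j - u) * w[p - 1 - j]);
        w[0] = div * (1.0 - u) * w[0];
    }
}

// One thread per particle: spread its force onto a K x K x K mesh covering a
// cubic periodic box, with three force components interleaved per mesh point.
// Positions are assumed to lie in [0, box). atomicAdd on double requires
// compute capability >= 6.0; older GPUs need a compare-and-swap fallback.
__global__ void spread_forces(const double3 *pos, const double3 *force,
                              double *grid, int n, int K, double box) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    double h = box / K;   // mesh spacing
    double sx = pos[i].x / h, sy = pos[i].y / h, sz = pos[i].z / h;
    int ix = (int)floor(sx), iy = (int)floor(sy), iz = (int)floor(sz);

    double wx[P_ORDER], wy[P_ORDER], wz[P_ORDER];
    bspline_weights(sx - ix, wx);
    bspline_weights(sy - iy, wy);
    bspline_weights(sz - iz, wz);

    for (int a = 0; a < P_ORDER; ++a)
        for (int b = 0; b < P_ORDER; ++b)
            for (int c = 0; c < P_ORDER; ++c) {
                double w = wx[a] * wy[b] * wz[c];
                int gx = (ix + a) % K, gy = (iy + b) % K, gz = (iz + c) % K;
                size_t m = 3 * ((size_t)gx * K * K + (size_t)gy * K + gz);
                atomicAdd(&grid[m + 0], w * force[i].x);
                atomicAdd(&grid[m + 1], w * force[i].y);
                atomicAdd(&grid[m + 2], w * force[i].z);
            }
}

Step 6 transposes this pattern: each thread gathers the P_ORDER^3 mesh values surrounding its particle with the same weights, which requires no atomics.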
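Steps 3-5 can likewise be sketched as a short host-side driver around CUFFT plus a hand-written influence-function kernel. For brevity the grid here is a single scalar component (in practice the three Cartesian components are transformed with batched plans), and the influence function is a schematic placeholder: the actual hydrodynamic (Rotne-Prager-Yamakawa) expression and the B-spline correction factors are not reproduced here.

#include <cuda_runtime.h>
#include <cufft.h>

// Placeholder reciprocal-space factor; the real influence function depends on
// the kernel (electrostatics vs. Stokes/RPY) and includes B-spline corrections.
__device__ double influence(double k2, double alpha) {
    return exp(-k2 / (4.0 * alpha * alpha)) / k2;
}

// Step 4: one thread per reciprocal-space grid point of the R2C output.
__global__ void apply_influence(cufftDoubleComplex *ghat, int K,
                                double box, double alpha) {
    int Kh = K / 2 + 1;                 // last dimension is halved by the R2C FFT
    int nk = K * K * Kh;
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= nk) return;

    // Recover (kx, ky, kz) indices and map them to signed wave numbers.
    int kz = idx % Kh;
    int ky = (idx / Kh) % K;
    int kx = idx / (Kh * K);
    if (kx > K / 2) kx -= K;
    if (ky > K / 2) ky -= K;

    const double two_pi = 6.283185307179586;
    double scale = two_pi / box;
    double k2 = scale * scale * (double)(kx * kx + ky * ky + kz * kz);
    double g = (k2 > 0.0) ? influence(k2, alpha) : 0.0;   // drop the k = 0 mode

    ghat[idx].x *= g;
    ghat[idx].y *= g;
}

// Steps 3-5: forward FFT, influence function, inverse FFT. The plans are
// assumed to have been created with cufftPlan3d(..., CUFFT_D2Z) and
// cufftPlan3d(..., CUFFT_Z2D).
void spme_reciprocal(double *grid, cufftDoubleComplex *ghat,
                     cufftHandle plan_fwd, cufftHandle plan_inv,
                     int K, double box, double alpha) {
    cufftExecD2Z(plan_fwd, grid, ghat);                               // step 3 (CUFFT)
    int nk = K * K * (K / 2 + 1);
    apply_influence<<<(nk + 255) / 256, 256>>>(ghat, K, box, alpha);  // step 4
    cufftExecZ2D(plan_inv, ghat, grid);                               // step 5 (CUFFT)
}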
A performance model and estimates of the error are used to aid in the selection of the Ewald parameter. Steps 2, 4, and 6 above are memory bound and depend upon the memory bandwidth of the device. Steps 3 and 5 above are compute bound and depend upon the peak performance of FFT on the device.
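A toy version of such a model is sketched below: for a candidate Ewald parameter, it estimates the CPU real-space cost from the number of neighbor pairs, the memory-bound GPU steps from the grid traffic and device bandwidth, and the FFT steps from an effective FFT rate. All prefactors and machine numbers are placeholders, not values measured in this work.

#include <cmath>

struct Machine {
    double gpu_bandwidth;   // achievable device memory bandwidth, bytes/s
    double fft_rate;        // effective FFT throughput on the device, flop/s
    double cpu_pair_rate;   // real-space pair interactions per second on the CPUs
};

// Estimated wall time of one SPME application at Ewald parameter alpha.
// The error bounds that tie the real-space cutoff rc and mesh size K to alpha
// are represented by illustrative prefactors only.
double estimate_time(double alpha, int n, double box, const Machine &m) {
    double rc = 3.5 / alpha;                          // real-space cutoff
    int K = (int)std::ceil(2.0 * alpha * box);        // mesh points per dimension
    double density = n / (box * box * box);
    const double pi = 3.141592653589793;

    // CPU: real-space sum over neighbors within rc.
    double t_real = n * density * (4.0 / 3.0) * pi * rc * rc * rc / m.cpu_pair_rate;

    // GPU, memory-bound steps (spreading, influence function, interpolation):
    // a rough estimate of grid traffic, ~4 passes over 3 double components.
    double t_mem = 4.0 * 3.0 * 8.0 * (double)K * K * K / m.gpu_bandwidth;

    // GPU, compute-bound steps (forward and inverse FFT), ~5 N log2 N flops each.
    double nk = (double)K * K * K;
    double t_fft = 2.0 * 5.0 * nk * std::log2(nk) / m.fft_rate;

    // The real-space (CPU) and reciprocal-space (GPU) parts can overlap, so the
    // slower side dominates; alpha is chosen to balance the two.
    return std::fmax(t_real, t_mem + t_fft);
}

Scanning alpha over a range and keeping the minimizer of this estimate gives a starting point that can then be refined against measured timings.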
For a given BD algorithm, its accuracy can be evaluated by comparing macroscopic quantities (such as translational diffusion coefficients) obtained by simulation with theoretical values, values obtained from experiments, or simply values from a known, separately validated simulation. Our previous work [1] has already established that this approach can produce diffusion coefficients that are in good agreement with theoretical values.
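As a reminder of what that comparison involves, the host-side sketch below estimates the translational diffusion coefficient from the mean-squared displacement of a trajectory (in 3D, MSD(t) is approximately 6 D t), which can then be compared with a theoretical or experimental reference value. The trajectory layout and function name are assumptions.

#include <cstddef>
#include <vector>

// traj[t] holds the (unwrapped) positions of all particles at frame t as
// x0,y0,z0,x1,...; dt is the time between frames. Fits MSD(t) to 6*D*t
// through the origin and returns the estimated diffusion coefficient D.
double diffusion_coefficient(const std::vector<std::vector<double>> &traj, double dt) {
    std::size_t nframes = traj.size();
    std::size_t n = traj[0].size() / 3;     // assumes at least one frame
    double num = 0.0, den = 0.0;
    for (std::size_t t = 1; t < nframes; ++t) {
        double msd = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            for (int d = 0; d < 3; ++d) {
                double dx = traj[t][3 * i + d] - traj[0][3 * i + d];
                msd += dx * dx;
            }
        msd /= (double)n;                   // average over particles
        double time = t * dt;
        num += msd * time;                  // least-squares slope through origin
        den += time * time;
    }
    return num / (6.0 * den);               // slope / 6 in three dimensions
}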
This work was done on the Keeneland Full Scale (KFS) system [2], an XSEDE resource. KFS is a 264-node cluster based on HP SL250 servers. Each node has 32 GB of host memory, two Intel Sandy Bridge CPUs, three NVIDIA M2090 GPUs, and a Mellanox FDR InfiniBand interconnect. The total peak double-precision performance is 615 TF. Each node has temporary storage available for the duration of a job, and Keeneland shares disk storage with the National Institute for Computational Sciences (NICS).


Thursday July 17, 2014 11:00am - 11:30am EDT
A706 & A707
