CUDA programming
(8192 bodies/ 2polytropes interacting gravitationally)
Click the pictures to jump to my first polytropic collision (please see Dr. Barnes website for SPH collisions)
For my latest efforts in the realm of full SPH (including viscosity, pressure, heat, etc, please see the new page here).
|
Cuda SDK, Toolkit and documentation [opens in a new window]
A simple loop unroll: a traditional for-loop unrolled to be performed in parallel
Commented code from the GPU Gems N-body CUDA article --my hybrid of the SDK and GPU gems versions of the N-body program
As mentioned on the main page, these graphics cards enable anyone to create magnificent, real-time simulations that were mere pipe-dreams for graduate students and researchers in engineering and computational physics even 5 or 6 years ago. In fact, even Barnes suggests at his site that he borrowed a peppy workstation to generate his polytrope sims. It seems to me that with a couple of NVIDIA series 8 (and subsequent) cards, the average joe can write programs that handle 40k or 50k bodies.
In order to get started here, you'll need a Geforce 8 series (or subsequent) NVIDIA graphics card. Also, the SDK and toolkit is required and contains a lot of sample projects in C++ (Visual Studio) (see above).
After these are gotten, it's best to find a simple tutorial that explains detail about how to get an initial simple project working. In essence, the architecture allows you to unroll loops in which the same calculation is being performed on different data and perform those computations simultaneously. In order to get a basic understanding of passing data from the host (the CPU) and the device (the card) doing the threaded computation and then passing a result set back, a simple tutorial is in order. I found just such a tutorial at this site Parallel Panorama.
Unrolling a standard loop and parallelizing it
Here is a simple unroll for a program I created just for testing based on an NVIDIA tutorial. There is a lot to know about creating an efficient parallel program, including shared/global memory and so forth, but in general, this is how a loop is unrolled.
Also, this is my current version based on my own understanding at this time and it may be subject to correction. It works correctly when I run it, and I think I understand what’s happening, but all of this is fairly new in general, and certainly new to me.
In fact, the book I just received from Amazon Smoothed Particle Hydrodynamics by G.R. Liu and M.B. Liu boasts as being the first book on SPH (printed in 03), so a lot of really great technology and software is converging right here before our eyes.
|
|
There are many things to know in all of this, but
of primary importance is exactly what we’re trying to accomplish in unrolling
a traditional loop, and how to do it. I have attempted to explain this below.
d_ptr is a ptr to device data N = 512 is the number of elements in the array we
wish to parallelize; that is, we will perform the same operation on each
element of an array of size N. a will point to an array of
elements of float with length N. traditional loop to fill a with some data, in this case 1.0. We will then add a constant
to each element in parallel. We need to copy the contents of a to the Nvidia card, and here we
figure out how much memory to allocate size_t the use size_t as per C stdlib cudaMalloc is the analog to malloc, allocating
enough memory to hold the contents of
a cudaMemcpy(target, source, size, copy_what_where).
We copy a to d_ptr, from Host(CPU)
to Device(GPU) p = number of threads q = number of blocks, each of p threads For my card, 256 threads is the maximum. We can
then break out the entire operation into 2 blocks of 256 threads. dim3 threads(p) is a 3 dimensional array, but for
this, make a simple (256,1,1). We are using only a one dimensional block. dim3 blocks(N/p) allocates as many blocks with 256
threads as needed to handle all of our N length array, in this case 512;
there may also be up to 3 dimensions of blocks, and the block(N/p) implies a
grid (N/p,1,1). The function add1 is defined below, but it’s the
parallelized function. It basically executes for each element in the array (a) simultaneously. The operators
<<< >>> indicate such, it is the function name <<<
blocks,threads,sharedmemsize>>>(parameter list). sharedmemsize can
be omitted. We are not using any shared memory in this program (shared memory
is quickly accessed by all the threads in a block). After add1 is done, d_ptr contains an updated set
of data for a (adding a constant, b, in this case). cudaMemcpy(a, d_ptr, size, cudaMemcpyDeviceToHost)
copies the information from the device back to the host. Then we print it, free the memory, and call it a
day: our original array a has been
passed to the NVIDIA card, had each element manipulated in parallel
(effectively simultaneously), and been passed back to the calling host
program. |
|
|
Function add1 accepts a pointer to a, a float b,
and max N. Ordinarily, we would not want to pass b in this way. b can be
copied to __constant__ memory from the host and can be shared by all of the
threads, instead of being copied as it is here for each thread, but for
simplicity, I just passed it as a parameter. int idx = blockDim.x * blockIdx.x + threadIdx.x is
a way to map the current thread in the current block back to the actual array
element. And this addressing makes sense. You have N / p blocks of p threads
each. blockDim.x = number of threads *
blockIdx.x = current block + threadIdx.x = the actual thread we’re processing
for this copy of the function. We then perform an operation on this element,
indexed by idx. |
Calling an unmanaged DLL from C#
The standard CUDA SDK package is for C++ projects. Although I know C and C++ (and you should too), I prefer C# or Visual Basic.net (VB.net)
Since I’m a managed programmer who is used to menus, buttons, windows, etc, I do the absolute minimum amount of unmanaged C++ possible and strive to use Visual Basic.net or C#.
Of course, VB.net can’t handle pointers, so we have to actually use C# to call the unmanaged dll in order to create managed arrays that VB can handle.
The examples you get with the package contain a template project that you can modify to suit your needs in order to get a project working. Create a version as a static DLL> In my case, I have managed code I wrote previously that I want to use, and I just want to pass off the computations to my graphics card. This means, I'll create a barebones C++ unmanaged DLL and simply call that from managed code to process my data in much the same way as one calls a shader file to process a set of vertices. C# (which runs under .net and is managed) features unsafe blocks allow unmanaged pointer code. For managed arrays being passed to the unsafe blocks, C# requires the arrays be fixed. This is all shown below.
In this simple program, I had a program that generates a collection of bodies into various configurations. When the system is run, the bodies are calculated pair wise or as a unit etc. I rerouted the computation of each body from the CPU to the GPU using CUDA and the results are pretty startling. In this simple example, each body is just given a constant velocity, so the positions change in relation to dt. I basically just wanted to see how it would work.
|
__global__ void test2(vector4 *p, vector3 *v, float dt, int N) |
__global__ indicates that this will be called by the host to be run on the device. In addition, __device__ functions are called on the device and run on the device. __host__ functions are called by the host and run on the host. Thus, this function will be called from the host to run on the NVIDIA card. The parameters are pointers to positions array and velocities array of each body, pairwise matching, dt is the delta time and N is the number of points in the arrays p and v. The next code is simply the threaded code--it consists of an unrolled loop (it would be a loop from 0 to N, p[i] + v[i]*dt). This doesn't look like much, but the CUDA <<< operator that made this call in the function shows below gives more detail.
|
|
//test2(vector4 *p, vector3 *v, float dt, int N) |
vector4 and vector3 are structs in the usual, minimal
type, x,y,z,w and x,y,z respectively, defined in a separate .c file.
struct vector3 {
float x;
float y;
float z; };
struct vector4 {
float x;
float y;
float z;
float w; }; This is an extern "C" function that calls the above threaded code. The extern "C" just allows one to hook into it from C++, which is what the DLL will be written in. testVec_CU takes the same arguments: p, v, dt, and N. p_d and v_d are created on the device and the source data p and v respectively, are copied to the device memory. Later, they are copied back to the host pointer and returned to the caller. The the n_blocks and block_size int values define the way the calculations will be threaded. I'm using 256 and a total point count will be a multiple of 16 since the number of particles is specifically unimportant whereas making it coincide with an optimal thread/block size count is. test2 <<< n_blocks, block_size (p,_d,v_d,dt,N) calls the above function with n_blocks, block_size and is passed the the device variables p_d, v_d. After the process has been completed the p_d and v_d contents are copied back to the host arrays and returned to the program. |
|
extern "C" __declspec(dllexport)
void testVec_CPP(vector4 * p, vector3 * v, float dt, int N) { //this is the name of the code
in the template.cu program, matches the extern hook at the top of this file testVec_CU(p,v,dt,N); }
|
CPP dll code. This is the entry point in the dll for accessing this function. So, to perform the above actions, testVec_CPP will be called and passed the appropriate parameters.
|
|
using System; using System.Collections.Generic; using System.Text; using
System.Runtime.InteropServices; namespace cudaCSS { public static class Cuda { //blittable unsafe structs vector3 and vector4 as defined in the
external dll. I also added constructors for the //managed counterparts here public unsafe struct vector3 { public float x; public float y; public float z; public vector3(float ix, float iy,
float iz) { x
= ix; y
= iy; z
= iz; } } public unsafe struct vector4 { public float x; public float y; public float z; public float w; public vector4(float ix, float iy,
float iz, float
iw) { x
= ix; y
= iy; z
= iz; w
= iw; } } //defined a static class vectorDLL whose job it is to grab the
dll and // and call its C++ functions public static class vectorDLL { //dllimport, from library InteropServices defines entry points
in the DLL (cuda1.dll). //these match the calls in the dll; the dll is expecting
pointers to vector4, vector3 and a few //primitive variables dt and N [DllImport(@"..\..\..\..\cuda1a\Debug\cuda1.dll")] public unsafe static extern void
testVec_CPP( vector4* p, vector3* v, float
dt, int N); //this is a C# function which requires the unsafe/fixed be done
elsewhere. It directly access //the unsafe static extern testVec_CPP. It is not used in the
program, only in testing public unsafe static void testVec_CS(vector4
* p, vector3* v, float
dt, int N) { testVec_CPP(p,
v, dt, N); } //this is the main point here. This can be called from Visual
Basic.net since it accepts //managed objects, an array of vector4, and array of vector3 and
the primitives public static void
testVec_CS_M(vector4[] p, vector3[] v, float
dt, int N) { //however the unsafe static extern testVec_CPP won't accept these
managed data //and so the incoming arrays vector4 and vector3 need some work
done on them before the //call: unsafe will allow pointer C style (very close) code to
be implemented //also note, you need to make a switch in the compilation for
this, but it will give you a warning //it's under the project properties as a checkbox unsafe { //unsafe is all that is needed to create new pointer variables
and manipulate them //but in this case, we want to pass the managed objects in, so
we need to //fix them. They are references and their places in memory will
move around; we fix them //on the stack and then manipulate the pointer reference as
needed, then return it to the heap fixed (vector4* pp = p) { fixed (vector3* vv = v) { //with pp pointing to the positions and vv pointing to the
velocities created (pointers) // we can now call the unsafe C extern to pass to the CUDA dll. testVec_CPP(pp,
vv, dt, N); //this commented part is for testing; notice it uses the same
-> as in C++ for referencing //object members from the pointer /*for (int k = 0; k < N; k++) { Console.WriteLine("here {0}: {1},{2},{3} ", k,
(pp+k)->x, (pp+k)->y, (pp+k)->z); } */ } } } } } } }
|
Finally, calling this from C#. I define blittable structs for easy passing back and forth, if vector4 and vector3.
Now, these are managed objects, but they're blittable, and via unsafe blocks, they can be pointed to and passed to the unmanaged dll.
|
Conclusion, even with 8192 objects, the program simply flies in moving each object with a constant velocity. When it is run serially on the CPU, the program just basically sits there.
This program is a work in progress and will be updated --
Kurt Bingham
-----------------------------
Using only gravity and my program for outputting polytrope data files, I was able to create some preliminary collisions using around 8k particles (shown in real-time below)
However, in order to do high quality collisions, one must use SPH (smoothed particle hydrodynamics), and I'm currently reading up on the subject to work out the details.
I have added modifications to my program based on Oxley’s thesis (below) but have gotten unexpected results at times and am still working out the details. No gas pressure is used in the one below, and in my current version of the program. However, the initial data is stable. The difference is of course, that no pressure is computed once the simulation is running. Instead, gravity is softened to the point that the polytropes remain stable for long enough that the stars (or earths/moons) collide. Finding this stability for various polytropes/G values/smoothing lengths etc is where a lot of time is spent. It is exploration, and I have to admit, it’s a lot of fun. I never thought hardware would come available for this kind of thing so soon, and it’s been a pleasant surprise.
SPH algorithms have been done in various (programming) languages, mainly for workstations or highly parallel machines in the past. I ordered a book on the subject at Amazon Smoothed Particle Hydrodynamics: A Mesh Free Particle Method, but I have been using Stephen Oxley's Phd thesis in the mean time. It is extraordinarily thorough and useful, and I have done everything to date with help of this text.
For the
sim below I used my polytrope program to create the data files.
The data
files contain information on the star from the center outward (see the links on
the first page to output in such files ASP), including each shell mass, density
etc. I then divided up a region of space using a bounding square or box and
(depending on the model dimensions) and broke each polytrope into 4096 mass
bodies. The program also uses various tweaks to keep each system from exploding
gravitationally from the outset as mentioned above, and some of the tweaking is in how the polytropes
are generated (using very small ‘h’ in the diffy Q solver to make the
concentric spheres tight, for example)
The video
below shows just this (click the pictures)
Kurt Bingham 6/10/2008
The basics of parallelizing N-body interactions
6/27/2008 KB
You will need to review and have handy the GPU gems article as well as the original Nvidia SDK sample N-body, because not everything is included here. The is just the engine.
Here's the GPU gems article in pdf format.
They both contain different sets of code. The code that came with the SDK is optimized and contains a lot of additional, confusing code, and I had to get rid of a lot of it just to see what was really going on. I ended up with hybrid of the SDK version and the GPU gems article version. I added a little change to the G force computation to prevent singular points at Rij = 0.
The basic of CUDA parallelization in practice, is that a set of incoming arrays (of length N) is passed to the device, and each element of the arrays k-wise {(0,0,0..), (1,1,1...) etc} is processed in parallel with all of the others.
| #include "stdafx.h" #include <stdio.h> #include <stdlib.h> #include <cuda.h> #include <cutil.h> #include "funcs.c" #include "template_kernel2.cu" extern "C" void setDeviceSoftening(float softening) { float softeningSq = softening; CUDA_SAFE_CALL(cudaMemcpyToSymbol("softeningSquared", &softeningSq, sizeof(float), 0, cudaMemcpyHostToDevice)); } extern "C" void setMinGravDistance(float MinGravDistance) { CUDA_SAFE_CALL(cudaMemcpyToSymbol("MinGravDistance", &MinGravDistance,sizeof(float),0,cudaMemcpyHostToDevice)); } extern "C" void setGravFactor(float GravFactor) { CUDA_SAFE_CALL(cudaMemcpyToSymbol("GravFactor", &GravFactor,sizeof(float),0,cudaMemcpyHostToDevice)); } extern "C" void setdamping(float damping) { CUDA_SAFE_CALL(cudaMemcpyToSymbol("damping", &damping,sizeof(float),0,cudaMemcpyHostToDevice)); } void seth(float h) { CUDA_SAFE_CALL(cudaMemcpyToSymbol("h",&h,sizeof(float),0,cudaMemcpyHostToDevice)); } //this is the main function that will call the NVIDIA card to process in parallel //the functions above
//assign __constant__ variables in the device code //__syncthreads() forces all of the above to be done for all threads in the block before
//moving to the next step |
Creating an SPH simulation from the ground up[and some other stuff on MSFT's Framework 3.5 Parallel programming extension]
Smoothed Particle Hydrodynamics by Liu&Liu is an excellent book. Included is their Fortran code, which will simulate each of the topics in the book (which covers everything from a simple shock tube to explosions and everything in between).
While it is a fair amount of work to understand completely, everything that's discussed, anyone with a background in multi-variable calculus, differential equations, and a few semesters of calculus based physics is capable of understanding the theory, although, I must say, the mathematical discussions of theory are quite deep. In the end, most folks will probably be satisfied with the vast array of smoothing functions and search algorithms they supply, and they discuss all of them in great detail.
They spare no detail in going through countless historical examples of smoothing functions, from the basic cubic splines to quintic functions. The first few chapters cover, from the ground up, different particle systems as it builds the case for meshfree particle simulations.
Their Fortran code can be run in a variety of modes, and I am currently porting their code to .net.
I had originally intended to first, code everything in .net, most likely Visual Basic, and then do the arduous task of parallelizing the code using CUDA. I still will parallelize it this way, but this will be a lot of work because, as slick as CUDA is, it's very difficult to use in .net. This is because I am using a DLL, so debugging is nonexistent. In addition to this problem, there is the fact that, unless I am running it in simulation mode, I can't debug anything. In fact, I can even printf while it's running on the device (for obvious reasons).
If I want to see data as it's being calculated on the card in parallel, I have to pass it to the calling C++ code (using CUDA memory copy functions) back and forth--which still will only show the final array results, and no the individual thread results because everything is fanned out on the card into threads and returned into the original arrays. Not easy. If I just wrote the whole program in C++, it would be easier (than using a C++), but I really do not want to lose .net functionality.
However, MSFT just debuted their extension to the 3.5 framework that includes very easy to implement CPU parallelization. This is, on the one hand, awesome (I've just used it), but since I only have a Core 2 Duo currently, I won't be able to really do the great computing on my CPU that I can on the Nvidia card, so I can't just switch over to .net and skip CUDA altogether. However, I may be able to do so when Intel or AMD produces a CPU with serious parallel capability--which will be a couple more years.
Here's a quick copy and past a C# program I just wrote that uses the new System.Threading.Parallel library available from MSFT as of June 8, 2008.
The execution times are 15s(parallel) and 24s(serial)
|
using System;using System.Collections.Generic;using System.Linq;using System.Text;using System.Threading;
namespace parallel1{ class Program{ public static int N = 502542656; static void Main(string[] args){ long ticks = DateTime.Now.Ticks; Parallel.For(1, N, delegate(int i){ ModTask(i); }); Console.WriteLine((DateTime.Now.Ticks - ticks) / 10000000); Console.ReadLine();ticks = DateTime.Now.Ticks; for (int k = 1; k <= N; k++){ ModTask(k); } Console.WriteLine((DateTime.Now.Ticks-ticks)/10000000); Console.ReadLine();} static void ModTask(int k) { double x = k * Math.Sin(k);} } }
|
Kurt Bingham July 08
-----------------------------------------------------------------------------------
Replicating the 1d shock tube data with a .net program and simple DirectX graphics engine
I was able to replicate the results on pg 95 of the Liu&Liu book, which means several portions of code have been successfully adapted to .net
Each video below being with the current state, which is two ideal gases with different pressures separated by a thin later. In each video, the x-axis is the position of each particle while the y-axis shows other properties, such as internal energy, velocity, density etc of each particle.

videos of results
rho | int energy | vx
Still working out various details, but well on the way toward bigger, 3d fluid simulations.
July08 KB