One of the most exciting new technologies that I've worked with in the past year is definitely CUDA. CUDA is a technology that allows general-purpose code – typically written in C or C++ – to be compiled for and run on GPU devices. What that means is the graphics card in your high-end laptop or desktop can now run general-purpose code alongside the graphics code it's already running.
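To make that concrete, here's a minimal sketch of what general-purpose CUDA code looks like: a kernel that adds two arrays, with the data copied to the device, processed there, and copied back. The array size and launch configuration are just illustrative choices; this needs nvcc and a CUDA-capable GPU to build and run.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each GPU thread computes one element of the sum -- the "hello world"
// of general-purpose code running on the graphics card.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;               // ~1M elements, chosen arbitrarily
    const size_t bytes = n * sizeof(float);

    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // Allocate device memory and copy the inputs across.
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hC[0]);        // 1.0 + 2.0 = 3.0

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```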
The potential for CUDA is amazing: it allows for massively parallel processing on the hundreds of cores that are available on modern GPUs. The laptop I'm writing this post on, for instance, has an NVIDIA 360M card with 96 processing cores, compared to the 4 cores in the i7 chip on the motherboard. The GPU cores are not truly general purpose – they can't do everything that a modern CPU core can – and they work best on jobs that can be split into many independent pieces across the cores. Math and physics simulations work extremely well. While traditional CUDA is based in C, there's also a library called Thrust that allows C++ programmers to get in the mix too. Thrust provides very easy ways of transferring data from main memory to device memory, as well as some awesome classes for things like map-reduce.
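As a sketch of how little code Thrust needs for the transfer-and-reduce pattern: copying a `host_vector` into a `device_vector` moves the data to the GPU, and a single `transform_reduce` call does the map-reduce. The squaring functor here is just an arbitrary example of a "map" step.

```cuda
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/functional.h>
#include <cstdio>

// The "map" step: square each element. Marking it __host__ __device__
// lets Thrust run it on either side.
struct square {
    __host__ __device__ float operator()(float x) const { return x * x; }
};

int main() {
    // Fill a host-side vector with sample data.
    thrust::host_vector<float> h(1000, 2.0f);

    // Assigning to a device_vector copies main memory -> device memory.
    thrust::device_vector<float> d = h;

    // Map (square) then reduce (sum), entirely on the GPU.
    float sum = thrust::transform_reduce(d.begin(), d.end(),
                                         square(), 0.0f,
                                         thrust::plus<float>());
    printf("sum of squares = %f\n", sum);  // 1000 * 4.0 = 4000.0
    return 0;
}
```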
There are a couple of uses that I would personally like to explore with CUDA. The first is read-only database querying. As a sample or research project, I would love to create a dialect or subset of SQL that allowed me to process simple traditional database queries on a CUDA-capable device. While there are a number of companies doing this sort of work – and this is probably something someone could buy instead of build – I think it would be a great chance to learn by doing. Imagine a table approximately 1GB in size with each row about 128 bytes; that works out to somewhere around 8M records. This works best if the individual records are numeric – in other words, large volumes of text aren't a perfect fit. In that case, each of the 96 GPU cores would have to process only ~90K records, whereas each of the 4 CPU cores would have to process ~2M records. While the table could be indexed in a traditional database system if the query patterns are known in advance, it is certainly exciting to think about how a large volume of work can be spread across that many cores using CUDA.
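A simple WHERE clause maps onto a kernel quite naturally: one thread per row, each evaluating the predicate independently. The `Record` layout and the price predicate below are entirely hypothetical – just a stand-in for a 128-byte numeric row – and a real system would compact the matches on-device rather than flag them, but it shows the shape of the idea.

```cuda
#include <cuda_runtime.h>

// Hypothetical 128-byte row layout, standing in for one record of the table.
struct Record {
    int   id;
    float price;
    char  padding[120];  // pad the struct out to ~128 bytes
};

// Roughly "SELECT ... FROM t WHERE price > threshold": each thread tests
// one row and writes a match flag. With 8M rows spread over 96 cores,
// each core's share is only ~90K predicate evaluations.
__global__ void filterRows(const Record *rows, int n,
                           float threshold, int *matches) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) matches[i] = (rows[i].price > threshold) ? 1 : 0;
}
```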
Why use CUDA for something like this? Why use it for filtering large sets of data? Well, let's say the dataset is a lot more than 1GB, and that its native format is some form of binary structure. Loading 8M rows of data into a database takes a non-trivial amount of time, and if the dataset is constantly being updated, that's a tax you'll have to pay on every update. A program written to leverage CUDA could likely query the binary data directly, without that tax. Also, this machine is just a laptop; you could relatively inexpensively put together a machine with literally thousands of cores. Imagine you had 2,000 cores with the same 8M rows – that's only ~4K rows per core to filter. Now that could be much faster.
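One nice property of writing the filter as a grid-stride loop is that the same kernel scales from 96 cores to thousands without change: each thread just walks the data in steps of the total thread count. Again, the price array and threshold are hypothetical illustration.

```cuda
#include <cuda_runtime.h>

// Grid-stride version of the row filter: each thread handles every
// (gridDim.x * blockDim.x)-th element, so the work divides evenly over
// however many cores the device happens to have -- 96 or 2,000.
__global__ void filterPrices(const float *prices, long long n,
                             float threshold, int *matches) {
    long long stride = (long long)gridDim.x * blockDim.x;
    for (long long i = (long long)blockIdx.x * blockDim.x + threadIdx.x;
         i < n; i += stride) {
        matches[i] = (prices[i] > threshold) ? 1 : 0;
    }
}
```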
This is a technology that I'm interested in and am learning more about. I'm sure I've covered no new ground with this post, and experts will probably be bored. But perhaps there's someone out there who wasn't aware of CUDA or is just getting into it? How are you finding it? What have you found works, and what doesn't?