Code optimization techniques for GPUs



Books on parallel programming theory often talk about weird beasts like the PRAM model, a hypothetical machine that would provide the programmer with a number of processors proportional to the size of the problem at hand. Modern general-purpose computers afford only a few processing units; four is currently a reasonable number. This limitation makes the development of highly parallel applications quite difficult for the average computer user. However, the low cost and the increasing programmability of graphics processing units, popularly known as GPUs, are helping to overcome this difficulty. Presently, the application developer can have access, for a few dollars, to hardware boasting hundreds of processing elements. This brave new world that is now open to many programmers brings, alongside incredible possibilities, also difficulties and challenges. Perhaps, for the first time since the popularization of computers, it makes sense to open the compiler books at the final chapters, which talk about very unusual concepts, such as polyhedral loops, iteration spaces and Fourier-Motzkin elimination, to name just a few islands in this enchanted ocean.

This material consists of a three-day course that covers, in a very condensed way, some code generation and optimization techniques that a compiler would use to produce efficient code for graphics processing units. Through these techniques, the compiler writer tries to free the application developer from the intricacies and subtleties of GPU programming, giving him more freedom to focus on algorithms instead of micro-optimizations. We will discuss a little bit of what GPUs are, which applications should target them, how the compiler sees a GPU program and how the compiler can transform this program to get more out of this very powerful hardware.

I took much of what is on this page from publicly available books, slides and websites. Hence, nothing is fairer than to say that this material can be used free of charge by whoever happens to stumble upon it. Drop me an e-mail if you want the tex sources, if you have any comments whatsoever, or if you just want to say hello.

Download it This book (in Portuguese) contains all the material that I teach in this short course. It is divided into three chapters: GPU fundamentals, memory optimization techniques and divergence optimization techniques. A shorter version of this book will be a chapter in the proceedings of the "Jornada de Atualização em Informática" (JAI).
Download it I am using many examples in the textbook and in the slides. All these examples are available in this zip file. I have run them on Mac OS 10.5.8, on a machine featuring an NVIDIA GeForce 9400M GPU.
Download it In the first chapter of this course I talk about the history of GPUs and the C for CUDA programming model. I show how to translate some simple C programs to CUDA, relying on the concept of iteration space to help in the understanding.
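To give a taste of the kind of translation this first chapter works through, below is a minimal sketch using the classic vector-addition example (the names vecAdd and vecAddKernel are mine, not from the course material). Each iteration of the sequential loop is one point of the iteration space, and the CUDA version assigns one thread to each point.

```cuda
// Sequential C: the loop walks the iteration space [0, N), and every
// iteration is independent of the others.
void vecAdd(const float *a, const float *b, float *c, int N) {
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}

// C for CUDA: the loop disappears. Each thread recovers its own point
// of the iteration space from its block and thread indices.
__global__ void vecAddKernel(const float *a, const float *b,
                             float *c, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N)            // guard: the grid may overshoot N
        c[i] = a[i] + b[i];
}

// Launch enough blocks of 256 threads to cover the iteration space:
// vecAddKernel<<<(N + 255) / 256, 256>>>(d_a, d_b, d_c, N);
```

The guard `if (i < N)` is needed because the number of threads launched is rounded up to a multiple of the block size.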
Download it In the second chapter I discuss many different types of memory optimization techniques that we can use to speed up CUDA programs, such as coalesced access, asynchronous data transfers and loop tiling. These are normally the largest sources of performance improvement that we can apply to GPU kernels.
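As a sketch of how coalescing and tiling work together, here is the textbook-style tiled matrix multiplication kernel (my own simplified version, assuming square matrices whose size N is a multiple of the tile width). Each block stages a tile of each input matrix into shared memory with coalesced loads, then reuses those tiles many times.

```cuda
#define TILE 16

// C = A * B, with A, B, C stored row-major, N a multiple of TILE.
__global__ void matMulTiled(const float *A, const float *B,
                            float *C, int N) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE; t++) {
        // Coalesced loads: threads with consecutive threadIdx.x read
        // consecutive global addresses in the same transaction.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();   // tile fully loaded before anyone reads it

        // Each element brought from global memory is reused TILE times
        // from fast shared memory.
        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();   // done with this tile before overwriting it
    }
    C[row * N + col] = sum;
}
```

Without the shared-memory tiles, every thread would read a full row of A and a full column of B straight from global memory, and the column reads of B would not coalesce.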
Download it In this final chapter I discuss divergence analysis and optimizations. Because GPUs execute threads in a SIMD fashion, branches and other irregularities in the program's control flow can be a source of inefficiency. There are ways to find out which parts of the program may diverge and which may not, and we explore these algorithms.
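A tiny illustration of what a divergence analysis must distinguish (the kernel below is a hypothetical example of mine, not taken from the course): a branch whose condition is uniform across the warp costs nothing extra, while a branch whose condition depends on the thread index forces the warp to execute both paths serially.

```cuda
__global__ void branches(float *v, int c) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    if (c > 0)            // uniform: c has the same value for every
        v[tid] *= 2.0f;   // thread, so the warp never splits here

    if (tid % 2 == 0)     // divergent: depends on threadIdx.x, so
        v[tid] += 1.0f;   // threads of the same warp disagree and the
    else                  // hardware runs the two sides one after the
        v[tid] -= 1.0f;   // other, with half the lanes masked off each time
}
```

A static divergence analysis marks variables that depend (directly or transitively) on threadIdx as divergent, and conditions built only from uniform values as safe; this is the kind of classification the chapter's algorithms compute.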