Parallelizing Ray Tracer in a Weekend

14 Feb 2019

I was having a lot of fun implementing Peter Shirley’s ray tracer over the past weekend. In this post I will explain how to make it go faster and visualize the progress.

Tracing rays

As I said, over the weekend I went through the book(let) Ray Tracing in a Weekend written by Peter Shirley, and tried to implement it. It contains almost all the code written in C++, but I took the liberty of improving a thing here and there. The result is pretty impressive as you can see on the image below.

The final result. This scene contains three large spheres with different
materials—lambertian, dielectric and metal. They are surrounded by a lot of little spheres with various colors and materials.
The final result. This scene contains three large spheres with different materials—lambertian, dielectric and metal. They are surrounded by a lot of little spheres with various colors and materials.

The book has a good flow—it leads you from generating an image, over implementing ray tracer components (rays, geometry, materials, etc.) to rendering a complete scene. And all that in a fascinatingly short time!

The key word here is simplicity. The author took a lot of care to use the bare minimum needed to accomplish the final result. There are no dependencies except the C++ standard library and drand48(). The image output format is PPM which is pretty handy for quick image generation, although it is rather bulky because of all the text1.

The Good, the Bad and the Ugly

Up until now, everything is great. We can trace rays, make our own simple scenes with spheres and view beautiful renderings. But there is one thing I didn’t tell you. All this works splendidly for small images, but is painfully slow for anything you want to view without using a magnifier. The author used default size of 100x200 pixels (height x width), an example of which you can see below.

Default image size.
Default image size.

If you wonder why he chose such a small image size, the answer is pretty simple—ray tracing takes time—a lot of it. What we are doing here is simulating light as it happily goes through the space. We can’t beat its majesty nature at this task. What we can do is try to use our computing power more efficiently. Right now, only one processing core is used while most computers have a handful of them at our disposal.

Power and speed

As Jeremy Clarkson likes to say—power and speed solve many problems. I mostly try not to believe him, but in this case he is right. Raw power can speed up the ray tracer quite a bit without touching any component, or being too smart about things.

To accomplish our goal of using multiple cores, we will be using features introduced in C++11: std::thread and std::mutex. Threads provide a simple way to execute things in parallel, while mutexes2 are convenient at stopping those threads from colliding with each other.


Basic idea is to assign tiles of the image to threads. I chose tiles of uniform size N, for example 16x16 or 32x32. You can imagine it like rendering a lot of smaller images which are then stitched together to form the final result. Let’s call those tiles tasks. That way every thread gets tasks to solve, or process.

To achieve maximum CPU utilization, we will create as many threads as viable, by querying std::thread::hardware_concurrency() to get the number of supported concurrent threads. If we create more threads, we probably won’t see much benefit because not all threads could be used at the same time. Some would be idle and waiting.

Now that we have our threads, we have to orchestrate them. Keep in mind that usually there are a lot more tasks than threads (which is what we want!). Threads that finish their assigned work therefore need to ask for new one. That’s where mutexes come to play.

We will have a function called something like get_next_task() that every thread calls after they finish their job. Because any worker can access it at any moment (critical section), we have to allow only one of them at a time. std::mutex does exactly that. When a function gets entered, the mutex gets locked by the current thread. If any other worker tries to get his next task, he will have to wait it out. After the worker currently owning the mutex gets his new task, he unlocks the mutex and the next one can enter. Rinse and repeat.

Finally, if there are no more tasks to get assigned, the thread is done. After all threads finish, we will have our rendering. Please keep in mind that this is one way to do it. There are certainly others, but I found this one simple, both to understand and implement.

Visualizing the progress

Now that we understand how to parallelize the ray tracer, we can make it even more attractive. Currently we only see the image after all the rendering is done. That’s no fun. I’d like to see the image as it’s being rendered.

To achieve that, I used a nice library called SFML (Simple and Fast Multimedia Library). I made it open a window and show a textured sprite in it. The sprite takes the whole window. The texture contains the current state of the image being rendered. You can see the result on the GIF below. Nice, isn’t it? I even implemented an inward snake pattern of task assignment!

Visualizing the progress of rendering. Tiles are 32x32 pixels.
Visualizing the progress of rendering. Tiles are 32x32 pixels.

Now that we introduced a new library, I threw away the PPM format and used PNG which I like a lot more. Rendered image, which is actually a texture, gets converted to a PNG format and saved.

I also encourage you to take a look at this. It uses tiles of size 256, just for fun. Notice how the tiles with less visible spheres finish quicker. That’s also the problem with using big tiles. Threads that finish first don’t have any more tasks to do, while other threads are busy rendering large chunks of the image. Some threads are idle, while other are work. We want all threads crunching most of the time. Smaller tiles allow better balancing of work.


I hope you learned something today. The code is available in my repository. Specifically, take a look at this file. That is where the magic happens. If you use Linux, it should be pretty easy to run it using the available Makefile. Make sure your compiler supports C++11 and that SFML is available.

If you want to learn more about ray tracing and parallelization of said, I encourage you to take a look at Physically Based Rendering book.

If you noticed an error, or just want to talk, send me an email.

Thank you for reading!


  1. PPM actually supports binary encoding, but for the sake of simplicity, ASCII encoding is used. All the data, including numbers, is stored as characters. 

  2. Mutex is short for mutual exclusion. Is the plural mutexes or mutices? The answer. :-)