Why you should never use NVIDIA OptiX: bugs

In my most recent project, I implemented a GPU path and bidirectional path tracer. I had already used OptiX for my bachelor thesis, and I thought it would be a good idea to use it for this as well. After all, it was created for ray tracing, so it should be predestined for this kind of GPU application.

One of the biggest advantages for me was the reliable and fast acceleration structures, which are crucial for ray tracing. But OptiX ended up causing more pain than benefit due to several bugs. And the overall speed was not that great in the end, probably due to a lack of optimisation in my code, which in turn was at least partly caused by the lack of profilers (which would be available for plain CUDA).

But the big issue was really the bugs, which actively prevented me from implementing things. Some of the bugs are totally ridiculous and don't get fixed.

I already encountered the first bug while coding for the bachelor thesis two years ago. Unfortunately I ignored it and decided on OptiX again. OptiX provides a special printf function, rtPrintf, which takes into account the massively parallel architecture of the graphics card. For example, it is possible to limit the printing to just one pixel. But, ehm:

// the following line works:
rtPrintf("asdf\n");

// on the contrary the following two don't
rtPrintf("asdf\n");
rtPrintf("asdf\n");

The latter crashes with

OptiX Error: Invalid value (Details: Function “RTresult _rtContextLaunch2D(RTcontext, unsigned int, RTsize, RTsize)” caught exception: Error in rtPrintf format string: “”, [7995632])

The difference really is only in calling rtPrintf once or twice with the same text. Sometimes it even crashed with different texts, and sometimes it worked with the same text. I reduced the code to a minimal example in order to eliminate possible stack corruption, but to no avail. This problem was reported on the NVIDIA forum in June 2013. As a workaround, proposed in the forum, it is possible to use stdio's printf guarded by some 'if' conditions.
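A minimal sketch of what that workaround could look like (the guard on the launch index mimics the one-pixel filtering that rtPrintf offers; the helper name is made up for illustration):

#include <optix_world.h>
#include <stdio.h>

rtDeclareVariable(uint2, launch_index, rtLaunchIndex, );

// print only from a single thread/pixel, so the output is not flooded
// by every launch index executing the same code in parallel
static __device__ void debugPrint(float value)
{
    if (launch_index.x == 0 && launch_index.y == 0)
        printf("value = %f\n", value);
}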

The second one is also connected to rtPrintf:

RT_PROGRAM void pathtrace_camera() {
   // two light sub-path vertices, initially marked as not filled in
   BiDirSubPathVertex lightVertices[2];
   lightVertices[0].existing = false;
   lightVertices[1].existing = false;

   // trace the light sub-path
   for(unsigned int i=0; i<2; i++) {
      sampleLightPath();
      if(!(lightVertices[i].existing)) break;
   }
   // rtPrintf("something\n");   // <- the crash disappears when this line is uncommented
   sampleEye();

   output_buffer[launch_index] = make_float4(1.f, 1.f, 1.f, 1.f);
}

This code would crash on two tested operating systems (Windows 7 and Linux) and on two different computers (my own workstation and one from the uni).

OptiX Error: Unknown error (Details: Function “RTresult _rtContextLaunch2D(RTcontext, unsigned int, RTsize, RTsize)” caught exception: Encountered a CUDA error: Kernel launch returned (700): Launch failed, [6619200])

It runs fine, though, when the rtPrintf is not commented out. I reported it here. Again I made a minimal example; in the end the source was only two files of roughly 30 and 60 lines, but it crashed constantly and reliably depending on whether the rtPrintf was commented out or not.

So, later on, whenever the program crashed, the first attempt to resolve the issue was to randomly spread rtPrintfs through the code. This was also one of the workarounds for another problem, presented in the next paragraphs.

But before I come to it, I quickly have to explain the compilation process. There are two steps: first, during "compile time", the C++ code is turned into binaries and the OptiX source into intermediate .ptx files. Those contain a sort of GPU assembly, which is then compiled at runtime by the NVIDIA GPU driver into actual binaries executed on the device. This second step is triggered by a C++ function call, usually context->compile().
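To make the two steps concrete, here is a rough sketch of the host side (the .ptx file name, program name and buffer setup are made up for illustration; a real renderer sets up much more):

#include <optixu/optixpp_namespace.h>

int main()
{
    optix::Context context = optix::Context::create();
    context->setEntryPointCount(1);

    // "pathtracer.ptx" was produced at compile time by nvcc from the OptiX source
    optix::Program rayGen =
        context->createProgramFromPTXFile("pathtracer.ptx", "pathtrace_camera");
    context->setRayGenerationProgram(0, rayGen);

    // buffer referenced as output_buffer in the ray generation program
    optix::Buffer output = context->createBuffer(RT_BUFFER_OUTPUT, RT_FORMAT_FLOAT4, 800, 600);
    context["output_buffer"]->set(output);

    // this is the call that triggers the driver-side compilation into device binaries
    context->compile();
    context->launch(0, 800, 600);
    return 0;
}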

Now, the problem is that these context->compile()s don't always return. Sometimes they run until the host memory is full. Once it helped to spread calls to the mentioned rtPrintf through the code; another time the resolution was an extra RT_CALLABLE_PROGRAM construct; a report with additional information is here.
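For reference, such a callable program is just a device function marked with the RT_CALLABLE_PROGRAM macro; the dummy below is only meant to show the construct (name and body are made up):

RT_CALLABLE_PROGRAM float3 dummyShade(float3 normal)
{
    // map a normal from [-1, 1] to a colour in [0, 1]
    return make_float3(0.5f * normal.x + 0.5f,
                       0.5f * normal.y + 0.5f,
                       0.5f * normal.z + 0.5f);
}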

Generally, those problems are painful and demotivating, especially taking into account that it seems like NVIDIA doesn't care; at least they didn't answer any of my reports. But the reason for this could also be that they are already phasing out OptiX due to little success. Anyway, it was certainly the last time I used OptiX, and I don't recommend that anybody start with it.

Real Time 3d Mandelbulb

Render of a 3d Mandelbulb

It's actually already almost one year since I finished the work on an Erasmus project at the Universitat Politècnica de València, Spain. The goal was to speed up the computation of distance-estimated 3d fractals so that they could be computed in real time, making it possible to explore them in an interactive fly-through program.

Fractals are graphics produced by recursive mathematical formulas. Since the graphics are purely based on maths, one can zoom into them infinitely without losing quality, well, as long as the floating-point precision is sufficient. One classical 2d fractal is the Mandelbrot set. Some years ago a formula for a 3d fractal was found that resembles the Mandelbrot, and the resulting structure was called the Mandelbulb.

Although there already exist programs that do the fractal computation on the graphics card, none of the ones I found was able to do it in acceptable quality and in real time. Some of them produced nicer images, but at the cost of long rendering times; others were sort of interactive, but the quality was bad.

I first implemented a quick proof of concept using NVIDIA OptiX, based on their Julia example, but OptiX quickly showed its limitations, especially when I started to work on a method to speed things up. So I switched to NVIDIA CUDA and was satisfied with that decision (and anyway, as I will blog soon, I recommend staying away from OptiX as far as possible : ).

 

I'll explain in a few sentences how the rendering works; details for the traditional method can be found on the page of Mikael Hvidtfeldt Christensen, and for both the traditional and the improved one in my report linked below.

Figure explaining ray marching

There is a function which returns the maximum lower bound for the distance to the fractal, called the distance estimator (DE). When walking along a ray, for instance one shot from the camera, it is safe to advance by this distance, because we have the guarantee that the fractal won't intersect the ray within this interval. After one step, the distance is estimated again, and we keep marching as long as the estimated distance stays above a certain threshold.
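As an illustration, here is a small CUDA sketch of this plain marching loop, together with the well-known power-8 Mandelbulb distance estimator (the iteration counts and the hit threshold are arbitrary choices for the sketch, not the values used in the project):

#include <helper_math.h>   // float3 operators and length(), from the CUDA samples

// distance estimator for the power-8 Mandelbulb: iterate z -> z^8 + c in
// spherical coordinates while tracking the running derivative dr
__device__ float mandelbulbDE(float3 pos)
{
    float3 z = pos;
    float dr = 1.0f;
    float r  = 0.0f;
    for (int i = 0; i < 10; ++i) {
        r = length(z);
        if (r > 2.0f) break;
        float theta = acosf(z.z / r) * 8.0f;
        float phi   = atan2f(z.y, z.x) * 8.0f;
        dr = 8.0f * powf(r, 7.0f) * dr + 1.0f;
        float zr = powf(r, 8.0f);
        z = zr * make_float3(sinf(theta) * cosf(phi),
                             sinf(theta) * sinf(phi),
                             cosf(theta)) + pos;
    }
    return 0.5f * logf(r) * r / dr;   // lower bound of the distance to the fractal
}

// plain distance-estimated ray marching: always advance by the estimated
// distance and stop once it falls below the hit threshold
__device__ float march(float3 origin, float3 direction)
{
    float t = 0.0f;
    for (int i = 0; i < 300; ++i) {
        float d = mandelbulbDE(origin + t * direction);
        if (d < 1e-4f) break;   // close enough to the surface, count it as a hit
        t += d;                 // safe step: the fractal cannot be closer than d
    }
    return t;   // distance along the ray to the (approximate) hit point
}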

My idea to speed up the rendering was to decrease the amount of computation by letting neighbouring rays “ride” on the previous rays.

Principle of the new algorithm

This works in two passes: first, "primary" rays are shot several pixels apart from each other, recording the DE values. Second, their neighbours, called secondary rays in the figure, are shot. These can use the information calculated in the first pass to jump almost straight to the end of the stepping. So, that's it, in short. Don't confuse the terms primary and secondary rays here with the camera and bounce rays of ray tracing.
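Very roughly, and with the names and the exact safety margin being my own guesses rather than the scheme from the report, the second pass could look like the following, reusing the march() loop from the sketch above:

// a secondary ray starts close to where its neighbouring primary ray stopped,
// pulled back by the recorded distance estimate so the lower-bound guarantee
// of the DE still holds for the slightly different ray
__device__ float marchSecondary(float3 origin, float3 direction,
                                float primaryT, float primaryDE)
{
    float t = fmaxf(primaryT - primaryDE, 0.0f);   // jump near the recorded end
    return t + march(origin + t * direction, direction);
}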

Benchmarks of the new algorithm compared to the old one from the report

Comparing the traditional ray marching algorithm with the new, faster one shows a speed-up of up to 100% and allows interactive exploration in almost all views, as long as shadows are disabled. It would probably be possible to implement this speed-up for shadow rays as well, but I had no time to do it. Unfortunately there are some artefacts due to the ray "riding"; details can be found in the report. If you are interested in the source code, please write me a message.

Edit: The reason I don't publish the repository is that there is still some (c) NVIDIA code inside (setting up CUDA, the render window, etc.). If there is somebody willing to replace / redo that code with something under the GPLv3, check the build, and maybe update it to the newest SDK, that would be most welcome :)

Edit 2: OK, the GPL parts of the source code are published now. I'm aware that it does not compile; however, I haven't got the time to replace the NVIDIA parts taken from the examples. Maybe it'll help some of you anyway :)