I’m working on a little ray tracing CUDA project right now and found out that GLM also works in that environment. But I soon ran into performance issues and went looking for the culprit. While GLM certainly isn’t the only cause, it still slowed things down.

I decided to run some quick performance tests, and here are the results:

- GLM’s matrix multiplication is 4 times slower than my custom one-liner that uses CUDA’s native vector types (see the sketch after this list).
- ~~GLM’s dot and cross products of vectors are roughly 30% faster than the implementations found in the CUDA samples (helper_math.h).~~
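To give an idea of what that comparison looks like, here is a minimal sketch of a matrix-vector multiply built on CUDA’s native float4 type. The function name and column-major layout are my assumptions for illustration; the actual benchmark code is in the linked repository.

```
// Hypothetical sketch (not the repository's exact code): a 4x4 matrix stored
// as four float4 columns, multiplied with a float4 vector using CUDA's native
// vector types.
#include <cuda_runtime.h>

__device__ float4 mulMat4Vec4(float4 c0, float4 c1, float4 c2, float4 c3, float4 v)
{
    // Column-major: result = c0*v.x + c1*v.y + c2*v.z + c3*v.w
    return make_float4(c0.x * v.x + c1.x * v.y + c2.x * v.z + c3.x * v.w,
                       c0.y * v.x + c1.y * v.y + c2.y * v.z + c3.y * v.w,
                       c0.z * v.x + c1.z * v.y + c2.z * v.z + c3.z * v.w,
                       c0.w * v.x + c1.w * v.y + c2.w * v.z + c3.w * v.w);
}
```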

I used a GeForce GTX 550 Ti, CUDA 6.5 and GLM 0.9.5.4 on Linux for the test.

Hopefully the people behind GLM can fix it quickly, as it is a really neat library. I submitted a bug report.

In case you want to see or even edit the source code of the tests, it is on Bitbucket; feel free to commit any changes : )

Edit:

~~We were able to fix the performance problem via GLM’s bug tracker. After aligning glm::vec4 properly, the matrix multiplication is now almost on par, and the other functions (dot and cross) got even faster.~~
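For context, here is a minimal sketch of what aligning a glm::vec4 can look like on the CUDA side, by wrapping it in a 16-byte-aligned struct. This is only an illustration of the idea; the actual fix went into GLM itself via its bug tracker.

```
// Illustrative only: forcing 16-byte alignment for a glm::vec4 wrapper so the
// compiler can use single 128-bit loads/stores, as it does for float4.
// The real fix aligned the type inside GLM; this wrapper just shows the idea.
#include <glm/glm.hpp>

struct __align__(16) AlignedVec4
{
    glm::vec4 v;
};
```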

```
time for cuda glm (matrix): 233 milliseconds
time for cuda helper math (matrix): 226 milliseconds
time for cuda glm (dot): 185 milliseconds
time for cuda helper math (dot): 302 milliseconds
time for cuda glm (cross): 46 milliseconds
time for cuda helper math (cross): 162 milliseconds
```

Edit2:

The matrix multiplication fix did improve performance, but in the real world (well, my code) this was only part of the issue. Additionally, the tests I made triggered some kind of optimisation that was only possible in GLM’s code. I discovered it when I saw that the results of those test computations were either infinity or 0. I changed the values of the matrices/vectors so that the results stay in a range of [10e-2, 1000] and voilà, the performance is almost the same.
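As an illustration of that change (with a made-up helper name, not the repository’s actual code), the benchmark inputs can be drawn from a moderate interval so chained multiplications neither overflow to infinity nor collapse to zero:

```
// Illustrative only: fill benchmark inputs with values from a moderate range so
// repeated multiplications keep their results roughly within [10e-2, 1000]
// instead of degenerating to infinity or 0, which skewed the original test.
#include <cstdlib>

static float randomInRange(float lo, float hi)
{
    return lo + (hi - lo) * (static_cast<float>(std::rand()) / static_cast<float>(RAND_MAX));
}
```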

Because GLM was still running slower in my code, I conducted some additional tests and tried to improve GLM’s performance myself. The performance didn’t improve enough, so I finally submitted another bug report.
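The timings below are per-kernel measurements. As a reference, here is a minimal sketch of how such timings are typically taken with CUDA events; this is an assumed setup with a placeholder kernel, not the repository’s actual harness.

```
#include <cstdio>
#include <cuda_runtime.h>

// Assumed timing harness: measure one kernel's runtime with CUDA events.
// 'benchmarkKernel' is a placeholder for any of the test kernels.
__global__ void benchmarkKernel(float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = out[i] * 2.0f;  // dummy work
}

int main()
{
    const int n = 19532 * 256;          // blocks * threads, as in test 1
    float* d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    benchmarkKernel<<<19532, 256>>>(d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("time for kernel: %.0f milliseconds\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return 0;
}
```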

## Test results of my fastest GLM version

```
#test 1 (synthetic)

CUDA kernel launch with 19532 blocks of 256 threads

time for cuda glm (matrix):.........546 milliseconds
time for cuda helper math (matrix):.660 milliseconds
time for cuda glm (dot):............471 milliseconds
time for cuda helper math (dot):....491 milliseconds
time for cuda glm (cross):..........246 milliseconds
time for cuda helper math (cross):..246 milliseconds

#test 2a (real life)

time for glm:.......................468 milliseconds
time for cuda:......................417 milliseconds

#test 2b (2a, but removed early exit from a loop)

time for glm:.......................373 milliseconds
time for cuda:......................370 milliseconds
```