Why Do I Need to Convert Quaternions to 4x4 Matrices When Uploading Them to the Shaders?

I have read several tutorials about skeletal animation in OpenGL. They all seem single-minded about using quaternions for rotation and 3D vectors for translation, not matrices.

But when they get to the vertex skinning step, they combine each quaternion and its 3D vector into a 4x4 matrix and upload the matrices so the shaders can do the rest of the calculations. A 4x4 matrix has 16 elements, while a quaternion plus a 3D vector has only 7. So why convert these to 4x4 matrices before uploading?
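To make the 16-versus-7 comparison concrete, here is a minimal Python sketch of the kind of conversion those tutorials perform (their actual code is not reproduced here). It assumes a unit quaternion stored as (w, x, y, z) and OpenGL's column-major matrix order; the function name `quat_trans_to_mat4` is hypothetical.

```python
def quat_trans_to_mat4(q, t):
    """Return a column-major 4x4 matrix (16 floats, OpenGL order) for a
    rotation by unit quaternion q = (w, x, y, z) followed by translation t."""
    w, x, y, z = q
    tx, ty, tz = t
    # Standard rotation matrix of a unit quaternion, element by element.
    r00 = 1 - 2 * (y * y + z * z)
    r01 = 2 * (x * y - w * z)
    r02 = 2 * (x * z + w * y)
    r10 = 2 * (x * y + w * z)
    r11 = 1 - 2 * (x * x + z * z)
    r12 = 2 * (y * z - w * x)
    r20 = 2 * (x * z - w * y)
    r21 = 2 * (y * z + w * x)
    r22 = 1 - 2 * (x * x + y * y)
    # Column-major: each group of four floats is one column.
    # The translation ends up in the last column.
    return [
        r00, r10, r20, 0.0,
        r01, r11, r21, 0.0,
        r02, r12, r22, 0.0,
        tx, ty, tz, 1.0,
    ]
```

The 7 input values expand into 16 floats, four of which (the three zeros and the trailing 1.0) are constants that carry no pose information.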

Knightly asked 29/3, 2013 at 13:20

Because combining everything into a single matrix and then multiplying by it is faster than multiplying by several values in different forms? (Not to mention that the math is simpler, and shader languages have built-in matrix multiplication instructions.) – Seamaid 29/3, 2013 at 13:30

Because with only two 4×4 matrices, one for each bone the vertex is assigned and weighted to, you have to do only two 4-vector × 4×4-matrix multiplications and a weighted sum.
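The per-vertex work described here (linear blend skinning with two bone matrices) can be sketched in Python as follows. This is an illustration, not shader code: matrices are row-major nested lists, the function names are hypothetical, and the two weights are assumed to sum to 1. In GLSL the same thing would be two `mat4 * vec4` products and a weighted sum.

```python
def mat4_mul_vec4(m, v):
    """Multiply a 4x4 matrix (list of 4 rows) by a 4-vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def skin_vertex(position, bone_a, bone_b, weight_a):
    """Linear blend skinning with two bone matrices: two matrix-vector
    products followed by a weighted sum, as a vertex shader would do it."""
    p = [position[0], position[1], position[2], 1.0]  # homogeneous point, w = 1
    pa = mat4_mul_vec4(bone_a, p)
    pb = mat4_mul_vec4(bone_b, p)
    weight_b = 1.0 - weight_a  # assumes the two weights sum to 1
    return [weight_a * a + weight_b * b for a, b in zip(pa, pb)]


identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
shift_x = [[1, 0, 0, 2], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(skin_vertex((1.0, 0.0, 0.0), identity, shift_x, 0.5))  # [2.0, 0.0, 0.0, 1.0]
```

A vertex halfway between an unmoved bone and a bone shifted by 2 along x ends up shifted by 1.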

In contrast, if you submitted each bone as a separate quaternion and translation, you would have to do the equivalent of two 3-vector × 3×3-matrix multiplications, plus four 3-vector additions and a weighted sum. Whether you first convert the quaternion into a rotation matrix and then do a 3-vector × 3×3-matrix multiplication, or rotate the 3-vector directly with the quaternion, the computational effort is about the same. And after that you still have to postmultiply by the modelview matrix.
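Both routes mentioned above can be sketched side by side in Python. This assumes a unit quaternion stored as (w, x, y, z); the direct route uses the well-known expansion of q v q* as v + 2w(u × v) + 2u × (u × v), where u is the vector part of q. Function names are illustrative only.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotate_by_quat(q, v):
    """Rotate 3-vector v directly by unit quaternion q = (w, x, y, z),
    using v' = v + 2w(u x v) + 2u x (u x v) with u = (x, y, z)."""
    w, u = q[0], q[1:]
    uv = cross(u, v)
    uuv = cross(u, uv)
    return tuple(v[i] + 2.0 * (w * uv[i] + uuv[i]) for i in range(3))


def rotate_by_matrix(q, v):
    """Convert q to a 3x3 rotation matrix first, then multiply."""
    w, x, y, z = q
    m = ((1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)),
         (2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)),
         (2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)))
    return tuple(sum(m[r][c] * v[c] for c in range(3)) for r in range(3))
```

The direct route costs two cross products and a few scaled additions; the matrix route costs the conversion plus a 3×3 multiply. Neither is dramatically cheaper, which is consistent with the estimate above.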

It's perfectly possible to use a 4-element vector uniform as a quaternion, but then you have to chain a lot of computations in the vertex shader: first rotate the vertex by the two quaternions, then translate it, and then multiply it by the modelview matrix. By simply uploading two transformation matrices, which are then weighted in the shader, you save a lot of computation on the GPU. Doing the quaternion-to-matrix computation on the CPU performs it only once per bone, whereas doing it in the shader performs it for every single vertex. GPUs are great when you have to do a lot of identical computations on varying input data. But they suck if you only have to calculate a handful of values that are then reused over large amounts of data. CPUs, however, love this kind of task.
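To illustrate the once-per-bone versus once-per-vertex argument, here is a hypothetical CPU-side sketch in Python: the bone matrices are built once from (quaternion, translation) pairs and then reused for every vertex, with the inner loop standing in for the per-vertex shader work. The data layout and function names are assumptions, and the counter tracks only quaternion-to-matrix conversions; it is not a benchmark.

```python
def quat_trans_to_rows(q, t):
    """Build a row-major 4x4 matrix from unit quaternion q = (w, x, y, z)
    and translation t. This is the per-bone work done once on the CPU."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y), t[0]],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x), t[1]],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y), t[2]],
        [0.0, 0.0, 0.0, 1.0],
    ]


def skin_mesh(bone_poses, vertices):
    """bone_poses: list of (q, t); vertices: list of (position, [(bone, weight), ...]).
    Returns skinned positions and the number of quaternion-to-matrix conversions."""
    matrices = [quat_trans_to_rows(q, t) for q, t in bone_poses]  # once per bone
    conversions = len(matrices)
    out = []
    for pos, influences in vertices:
        p = (pos[0], pos[1], pos[2], 1.0)
        acc = [0.0, 0.0, 0.0]
        for bone, weight in influences:
            m = matrices[bone]  # reused, never recomputed per vertex
            for r in range(3):
                acc[r] += weight * sum(m[r][c] * p[c] for c in range(4))
        out.append(tuple(acc))
    return out, conversions
```

With 100 vertices, two bones, and two influences per vertex, this performs 2 conversions; converting inside the vertex loop instead would perform 200.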

The nice thing about homogeneous transformations represented by 4×4 matrices is that a single matrix can contain a whole transformation chain. If you keep rotations and translations separate, you have to perform the whole chain of operations in order. With only one rotation and one translation, that takes fewer operations than a single 4×4 matrix transform. Add just one more transformation and you've reached the break-even point.
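A small Python sketch of why one 4×4 matrix can hold a whole chain: multiplying the matrices together once yields a single matrix that gives the same result as applying each step in order. Matrices are row-major nested lists and all helper names are illustrative.

```python
def mat_mul(a, b):
    """Product a * b of two 4x4 row-major matrices (b is applied first, then a)."""
    return [[sum(a[r][k] * b[k][c] for k in range(4)) for c in range(4)] for r in range(4)]


def apply(m, p):
    """Transform a 3D point p by the 4x4 matrix m (homogeneous w = 1)."""
    v = (p[0], p[1], p[2], 1.0)
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(3))


def translation(tx, ty, tz):
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]


def rotation_z_90():
    """Rotation by 90 degrees about the z axis: (x, y) -> (-y, x)."""
    return [[0, -1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]


def compose(chain):
    """Collapse [M1, M2, ..., Mn] (M1 applied first) into the single matrix Mn ... M1."""
    result = translation(0, 0, 0)  # identity
    for m in chain:
        result = mat_mul(m, result)
    return result


chain = [rotation_z_90(), translation(1, 0, 0), rotation_z_90(), translation(0, 0, 5)]
print(apply(compose(chain), (1.0, 2.0, 3.0)))  # (-1.0, -1.0, 8.0)
```

The cost of collapsing the chain is paid once; afterwards each point needs a single matrix transform no matter how long the chain was.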

The transformation matrices, even in a skeletal pose applied to a mesh, are identical for all vertices. Say the mesh has 100 vertices around a pair of bones (a small number, by the way); then you'd have to do the computations outlined above for each and every vertex, wasting precious GPU cycles. And for what? To determine some 32 scalar values (eight 4-vectors, i.e. two 4×4 matrices). Now compare: 100 4-vectors (if you only consider vertex position) versus only 8. That is the order of magnitude of the calculation overhead imposed by processing quaternion poses in the shader. Compute the matrices once on the CPU and hand them to the GPU precalculated, to be shared among the primitives. If you code it right, the whole calculation of a single matrix column will fit nicely into the CPU's pipeline, letting it vastly outperform any attempt at parallelizing it. Parallelization doesn't come for free!

Swirsky answered 29/3, 2013 at 13:30

Wrong. It's not slower. Not even on PS3 and 360. – Blues 29/3, 2013 at 13:39

GPU performance is very architecture-dependent, so I don't think you can make broad predictions like the one in your first paragraph. Scalar architectures will likely require more instructions for the 4×4 multiplications than for the 3×3s. On vector-based machines, the 3×3 multiplications will likely use the same 4-element vector instructions, but a 3×3 matrix multiply would still take only three instructions, for example. Further, additions are usually faster than multiplications. – Acumen 29/3, 2013 at 13:51

@Swirsky Can you also edit your first paragraph? It's missing several words, I think. – Acumen 29/3, 2013 at 13:58

I also want to ask: why do we need the inverse of the bind poses? – Knightly 29/3, 2013 at 14:8

@deniz: The inverse poses are required for illumination calculations; or to be more specific, the inverse transpose. – Swirsky 29/3, 2013 at 15:42

On modern GPUs there is no restriction on the data format you upload to constant buffers.
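As an illustration of the point about data formats, here is a Python sketch comparing the upload size of bone data stored as 4×4 matrices versus a quaternion plus translation packed into two vec4s (8 floats, 7 of them used). It assumes 32-bit floats and a std140-style layout in which array elements are padded to 16-byte slots, hence the padding float; the packing order and function names are assumptions, not any particular engine's format.

```python
import struct


def pack_matrix_bones(matrices):
    """Pack 4x4 bone matrices (16 floats each) as little-endian float32."""
    flat = [value for m in matrices for value in m]
    return struct.pack("<%df" % len(flat), *flat)


def pack_quat_trans_bones(bones):
    """Pack (quaternion, translation) bones as two vec4s each:
    (x, y, z, w) for the rotation and (tx, ty, tz, 0) for the translation.
    The fourth translation component is padding to fill a 16-byte slot."""
    flat = []
    for (w, x, y, z), (tx, ty, tz) in bones:
        flat += [x, y, z, w, tx, ty, tz, 0.0]
    return struct.pack("<%df" % len(flat), *flat)
```

For 60 bones (an arbitrary example count) this gives 3840 bytes as matrices versus 1920 bytes as quaternion/translation pairs.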

Of course you need to write your vertex shader differently in order to use quaternions for skinning instead of matrices. In fact, we are using dual quaternion skinning in our engine.
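Since dual quaternion skinning is mentioned but not shown, here is a minimal Python sketch of the standard technique (dual quaternion linear blending, as described by Kavan et al.). It is not the answerer's engine code. It assumes unit rotation quaternions stored as (w, x, y, z), weights that sum to 1, and bones that rotate first and then translate; all function names are hypothetical. Each bone needs 8 floats (two quaternions) instead of 16.

```python
import math


def qmul(a, b):
    """Hamilton product of quaternions stored as (w, x, y, z)."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)


def qconj(q):
    return (q[0], -q[1], -q[2], -q[3])


def to_dual_quat(rotation, translation):
    """Unit dual quaternion (real, dual) for 'rotate, then translate'."""
    t = (0.0,) + tuple(translation)
    dual = tuple(0.5 * c for c in qmul(t, rotation))
    return rotation, dual


def blend_and_transform(dqs, weights, point):
    """Dual quaternion skinning: blend, normalize, then transform the point."""
    pivot = dqs[0][0]
    real = [0.0] * 4
    dual = [0.0] * 4
    for (r, d), weight in zip(dqs, weights):
        # q and -q encode the same rotation; flip to the pivot's hemisphere.
        if sum(p * c for p, c in zip(pivot, r)) < 0.0:
            weight = -weight
        for i in range(4):
            real[i] += weight * r[i]
            dual[i] += weight * d[i]
    norm = math.sqrt(sum(c * c for c in real))
    real = tuple(c / norm for c in real)
    dual = tuple(c / norm for c in dual)
    # Translation is the vector part of 2 * dual * conj(real).
    t = qmul(dual, qconj(real))
    translation = (2.0 * t[1], 2.0 * t[2], 2.0 * t[3])
    # Rotate the point with real * p * conj(real), then translate.
    p = qmul(qmul(real, (0.0,) + tuple(point)), qconj(real))
    return tuple(p[i + 1] + translation[i] for i in range(3))
```

A single bone with weight 1 reproduces the plain rotate-then-translate. Blending two bones that differ only by q versus -q gives the same result, thanks to the sign flip.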

Note that older fixed-function hardware skinning did indeed work only with matrices, but that was a long time ago.

Blues answered 29/3, 2013 at 13:26
