Because with having only two 4×4 matrices, one for each bone a vertex is assigned and weighted to, you have to do only two 4-vector 4×4-matrix multiplications and a weighted sum.
In contrast to this, if you'd submit as a separate quaternion and translation you'd have to do the equvalent of two 3-vector 3×3-matrix multiplications plus four 3-vector 3-vector additions and a weighted sum. Either you first convert your quaternion into a rotation matrix first, then to 3-vector 3×3-matrix multiplication, or you do direct 3-vector quaternion multiplication, the computational effort is about the same. And after that you have to postmultiply with the modelview matrix.
It's perfectly possible to use a 4-element vector uniform as a quaternion, but then you have to chain a lot of computations in the vertex shader: First rotate the vertex by the two quaternions, then translate it and then multiply it with the modelview matrix. By simply uploading two transformation matrix which are weighted in the shader, you save a lot of computations on the GPU. Doing the quaternion-matrix multiplication on the CPU performs the calculation only one time per bone, whereas doing it in the shader performs it for each single vertex. GPUs are great if you have to to a lot of identical computations with varying input date. But they suck if you have to calculate only a handfull of values, which are reused over large amounts of data. CPUs however love this kind of task.
The nice thing about homgenous transformations represented by 4×4 matrices is, that a single matrix can contain a whole transformation chain. If you separate rotations and translations, you have to perform the whole chain of operations in order. With only one rotation and translation it's less operations than a single 4×4 matrix transform. Add one single transformation and you've reached the break even.
The transformation matrices, even in a skeletal pose applied to a mesh, are identical for all vertices. Say the mesh has 100 vertices around a pair of bones (this is a small number, BTW), then you'd have to to the computations outlined above for each any every vertex, wasting precious GPU computation cycles. And for what? To determine some 32 scalar values (or 8 4-vectors). Now compare this: 100 4-vectors (if you only consider vertex position) vs. only 8. This is the order of magnitude of calculation overhead imposed by processing quaternion poses in the shader. Compute it once on the CPU and give it the GPU precalculated to share among the primitives. If you code it right, the whole calculation of a single matrix column will nicely fit into the CPUs pipeline, making is vastly outperform every attempt at parallelizing it. Parallelization doesn't come for free!