To find a plausible explanation we have to go back to BLAS history
There we can learn that Level 1 was designed in the 70's, well before Level 2, 3 (
Level 2 was 1987, and Level 3 was 1989).
Concerning Level 1 history, in the 1979 paper Basic Linear Algebra Subprograms for Fortran Usage by CL Lawson et al. we can read, page 3
The criterion for including an operation in the package was that it
should involve just one level of looping and occur in the usual
algorithms of numerical linear algebra, such as Gaussian elimination
or the various elimination methods using orthogonal transformations.
This paper is based on an initial specification, 1973 A Proposal for Standard Linear Algebra Subprograms by Hanson et al. In this document, again you can read:
For example, it has been found [Krogh (1)] that the use of assembly
coded modules in a double precision program for solving linear
equations based on Householder transfZormations with column scaling
and column interchanges reduced the execution time on a Univac 1108 by
15% to 30% relative to the time required when carefully written
Fortran modules were used.
and after
The operations which we feel belong in Class I according to the above
stated criteria are: (1) the dot product (inner product) of two
vectors, elementary vector operation, y := ax + y where x and y are
n-vectors and a is a scalar, and (3) the Givens 2 x 2 orthogonal
transformation applied to a 2 x n submatrix.
We can see that the main concern was to implement algorithms (linear solvers...) using linear substitution, Givens rotations or Householder transformations. In this context the most important operations as explained in the cited references, are axpy, scaling, dot, norm, etc. The goal was not to provide a complete set of vector operations like addition, subtraction etc... but only to concentrate the effort on a small set of procedures.
_mm_add_pd()
,_mm_sub_pd()
are available to add or substrat vectors. See software.intel.com/sites/landingpage/IntrinsicsGuide/… Furthermore, by looking at OpenBLAS/kernel/x86_64/daxpy_microk_sandy-2.c , it seems that an usual blas daxpy boils down to applyingvmulpd
(scaling x) and thenvaddpd
(adding), plus the outer loop unrolling. How about trying to call_mm_add_pd()
or_mm_sub_pd()
and see how it performs? – Acinaciform