How can I efficiently calculate a quadratic form in Julia?

I would like to calculate a quadratic form: x' Q y in Julia.
What would be the most efficient way to calculate this for the cases:

  1. No assumption.
  2. Q is symmetric.
  3. x and y are the same (x = y).
  4. Both Q is symmetric and x = y.

I know Julia has dot(), but I wonder whether it is faster than a BLAS call.
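
For concreteness, here are the four cases written as calls to the LinearAlgebra stdlib (a sketch of what I mean; Q, x, and y stand in for my actual data):

using LinearAlgebra

Q = rand(5, 5); Q = (Q + Q') / 2;   # a small symmetric example matrix
x = rand(5); y = rand(5);

dot(x, Q, y)              # 1. no assumption
dot(x, Symmetric(Q), y)   # 2. Q is symmetric
dot(x, Q, x)              # 3. x = y
dot(x, Symmetric(Q), x)   # 4. Q symmetric and x = y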

Holley answered 9/1, 2022 at 21:54 Comment(1)
If you want to know more about quadratic forms in Julia : forem.julialang.org/vinodv/…Tarsia

Julia's LinearAlgebra stdlib has native implementations of 3-argument dot, and also a version that is specialized for symmetric/hermitian matrices. You can view the source here and here.

You can confirm that they do not allocate using BenchmarkTools.@btime or BenchmarkTools.@ballocated (remember to interpolate variables using $). The symmetry of the matrix is exploited, but looking at the source, I don't see how x == y could enable any serious speedup, except perhaps saving a few array lookups.
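
For example, with the X and y defined in the benchmarks below, a minimal allocation check looks like this (the 0 matches the 0 allocations reported by @btime):

1.7.0> @ballocated dot($y, $X, $y)
0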

Edit: To compare the execution speed of the BLAS version and the native one, you can do

1.7.0> using BenchmarkTools, LinearAlgebra

1.7.0> X = rand(100,100); y=rand(100);

1.7.0> @btime $y' * $X * $y
  42.800 μs (1 allocation: 896 bytes)
1213.5489200642382

1.7.0> @btime dot($y, $X, $y)
  1.540 μs (0 allocations: 0 bytes)
1213.548920064238

This is a big win for the native version. For bigger matrices, the picture changes, though:

1.7.0> X = rand(10000,10000); y=rand(10000);

1.7.0> @btime $y' * $X * $y
  33.790 ms (2 allocations: 78.17 KiB)
1.2507105095988091e7

1.7.0> @btime dot($y, $X, $y)
  44.849 ms (0 allocations: 0 bytes)
1.2507105095988117e7

This is possibly because BLAS uses threads, while dot is not multithreaded. There are also some floating-point differences in the results.
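
One way to test the threading hypothesis is to pin BLAS to a single thread and rerun the large benchmark (a sketch only; timings are machine-dependent, so none are shown):

1.7.0> BLAS.get_num_threads()   # current BLAS thread count

1.7.0> BLAS.set_num_threads(1)  # force single-threaded BLAS

1.7.0> @btime $y' * $X * $y;    # expect this to move toward dot's timing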

Biomass answered 10/1, 2022 at 14:27 Comment(4)
Could you check a version of dot with @tturbo instead of @simd? I'd be interested in seeing how well LoopVectorization does here.Tamatave
Just replacing @simd with @turbo produces a slight speedup, while @tturbo seems identical with @turbo. The macros are placed on the inner loop, though. I presume that threading the outer loop would be more beneficial.Biomass
Removing the iszero check (to allow @tturbo to work on the outer loop) gives me a result that is as fast as BLAS for big sizes, and 3x faster than dot for small sizes.Tamatave
@OscarSmith, Care to share the exact code you used?Holley
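
For reference, here is a sketch of the kind of modification discussed in these comments (not the exact code, which wasn't shared; dot_turbo is a made-up name). It mirrors the stdlib's three-argument dot loop, with the iszero short-circuit dropped so that @tturbo can apply:

using LoopVectorization

# Hypothetical variant of the stdlib's 3-argument dot. Assumes
# length(x) == size(A, 1) and length(y) == size(A, 2), and drops
# the stdlib's iszero(x[i]) short-circuit so @tturbo can thread
# and vectorize the loops.
function dot_turbo(x::AbstractVector, A::AbstractMatrix, y::AbstractVector)
    s = zero(promote_type(eltype(x), eltype(A), eltype(y)))
    @tturbo for j in axes(A, 2), i in axes(A, 1)
        s += x[i] * A[i, j] * y[j]
    end
    return s
end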

The existing answers are both good. However, a few additional points:

  • While Julia will indeed by default simply call BLAS here, as Oscar notes, some BLASes are faster than others. In particular, MKL will often be somewhat faster than OpenBLAS on modern x86 hardware. Fortunately, in Julia it is unusually easy to choose your BLAS backend manually: simply typing using MKL at the REPL on Julia 1.7 or later will switch you to the MKL backend via the MKL.jl package (a quick way to confirm the switch is sketched after the image below).

  • While it is not used by default, there now exist a couple of pure-Julia linear algebra packages that can match or even beat traditional Fortran-based BLAS. In particular, Octavian.jl:

Octavian.jl benchmarks
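
To confirm which backend is active after the switch, you can query libblastrampoline (a quick sketch; requires MKL.jl to be installed, and output depends on your setup):

julia> using MKL

julia> using LinearAlgebra

julia> BLAS.get_config()   # lists the loaded BLAS/LAPACK libraries; should show MKL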

Riki answered 9/1, 2022 at 22:59 Comment(2)
The point of the question was to have a direct method to calculate the scalar which would be faster than a BLAS call.Holley
Ah, that seems to have been already answered by Bogumił's comments about dot. The comments about BLASes are relevant to general linear algebra, which is what you get for y'*X*y and, looking at the source, for some of the steps in calls to three-arg dot, except in the case of dot(x::AbstractVector, A::AbstractMatrix, y::AbstractVector), which has a non-allocating pure-Julia implementation in Julia 1.4 and onwards.Riki

If your matrix is symmetric use the Symmetric wrapper to improve performance (a different method is called then):

julia> using LinearAlgebra, BenchmarkTools

julia> a = rand(10000); b = rand(10000);

julia> x = rand(10000, 10000); x = (x + x') / 2;

julia> y = Symmetric(x);

julia> @btime dot($a, $x, $b);
  47.000 ms (0 allocations: 0 bytes)

julia> @btime dot($a, $y, $b);
  27.392 ms (0 allocations: 0 bytes)
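
To see that a different method is indeed dispatched for the wrapped matrix, you can ask the REPL (a quick check; @which is provided by InteractiveUtils, which the REPL loads automatically):

julia> @which dot(a, x, b)   # the generic AbstractMatrix method

julia> @which dot(a, y, b)   # the Symmetric-specialized method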

If x is the same as y see https://discourse.julialang.org/t/most-efficient-way-to-compute-a-quadratic-matrix-form/66606 for a discussion of options (but in general it seems dot is still fast then).
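
For the case of a symmetric matrix with x = y, one classical trick is to touch each off-diagonal entry only once and double its contribution. A plain-Julia sketch (xQx is a made-up name; this only illustrates the idea, not what dot does internally):

# Quadratic form x'Qx for a real symmetric Q, visiting only the
# diagonal and the strict upper triangle. Assumes Q is square and
# symmetric with length(x) == size(Q, 1).
function xQx(Q::AbstractMatrix, x::AbstractVector)
    n = length(x)
    d = zero(eltype(Q))
    s = zero(eltype(Q))
    for j in 1:n
        d += Q[j, j] * x[j]^2
        for i in 1:j-1                  # strict upper triangle, column-major order
            s += Q[i, j] * x[i] * x[j]
        end
    end
    return d + 2 * s
end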

Ironstone answered 9/1, 2022 at 22:28 Comment(11)
Indeed it will be more optimized than a general call to BLAS. But is there an implementation to directly compute the scalar?Holley
dot produces a scalar.Gadhelic
But not directly. It first computes a temporary vector, since it calls BLAS, as @OscarSmith said.Holley
@Eric Johnson Nope, unlike y'*X*y, dot(y,X,y) appears to not allocate at all. You can check this with BenchmarkTools.@btime.Biomass
@DNF, Are you aware of a BLAS function to handle the above? I thought using BLAS means it will do Matrix x Vector and then an inner product operation.Holley
No, I'm not aware of that, but this is not using BLAS anyway, it appears to be fully Julia native.Biomass
Then @OscarSmith's answer is wrong?Holley
@DNF, I think you should post this information as an answer for me to mark.Holley
@OscarSmith normally has deeper knowledge than I about the codebase, so I am a bit hesitant, but @less dot(a, X, b) shows a native implementation, and @btime shows zero allocations. I will look into it a bit more, then post an answer.Biomass
I probably should have taken my advice and looked at the source before saying what the implementation was. Sorry for the misinformation :)Tamatave
The fact that dot is not allocating was exactly the point of my answer above, where I showed the @btime results.Gadhelic

You can write an optimized loop using Tullio.jl, doing this all in one sweep. But I think it's not going to beat BLAS significantly:

Chained multiplication (a' * x * b) is also very slow, because Julia doesn't know there's a better algorithm than materializing the intermediate vector.

julia> # a, b, x are the same as in Bogumił's answer

julia> @btime dot($a, $x, $b);
  82.305 ms (0 allocations: 0 bytes)

julia> using Tullio

julia> f(a, x, b) = @tullio r := a[i] * x[i,j] * b[j]
f (generic function with 1 method)

julia> @btime f($a, $x, $b);
  80.430 ms (1 allocation: 16 bytes)

Adding LoopVectorization.jl may be worth it:

julia> using LoopVectorization

julia> f3(a, x, b) = @tullio r := a[i] * x[i,j] * b[j]
f3 (generic function with 1 method)

julia> @btime f3($a, $x, $b);
  73.239 ms (1 allocation: 16 bytes)

But I don't know how to go about the symmetric case.

julia> @btime dot($a, $(Symmetric(x)), $b);
  42.896 ms (0 allocations: 0 bytes)

although there might be linear algebra tricks to cut that down intelligently with Tullio.jl.

In this kind of problem, benchmarking trade-offs is everything.

Absently answered 10/1, 2022 at 10:9 Comment(2)
This is really great. Maybe supporting symmetric matrices should be a feature added to Tullio.jl.Holley
IIUC, Tullio always devectorizes to nested loops, so this wouldn't really make sense. What you could do is formulate the optimized implementation of whatever Symmetric does as an einsum expression and use Tullio on that.Absently
