I just had a look at a simpler example. The table is generated at compile time. The time is probably spent in the lambdas generated in std::__detail::__variant::__gen_vtable_impl<...>. For some reason, these lambdas, which basically just call the visitor, do not omit the check for the actual type of the variant.
For the following function, the compiler creates code for four different versions of the visiting lambda, each inlined into a lambda created deep down in std::visit, and stores the pointers to these lambdas in a static array:
#include <variant>

double test(std::variant<int, double> v1, std::variant<int, double> v2) {
    return std::visit([](auto a, auto b) -> double {
        return a + b;
    }, v1, v2);
}
This is the code generated for test itself:
(...) ; load variant tags and check for bad variant
lea rax, [rcx+rax*2] ; compute index in array
mov rdx, rsi
mov rsi, rdi
lea rdi, [rsp+15]
; index into vtable with rax
call [QWORD PTR std::__detail::__variant::(... bla lambda bla ...)::S_vtable[0+rax*8]]
This is generated for the <double, double> visitor:
std::__detail::__variant::__gen_vtable_impl<std::__detail::__variant::_Multi_array<double (*)(test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&, std::variant<int, double>&, test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&)>, std::tuple<test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&, test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&>, std::integer_sequence<unsigned long, 1ul, 1ul> >::__visit_invoke(test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}, test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&, test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&):
; whew, that is a long name :-)
; redundant checks are performed whether we are accessing variants of the correct type:
cmp BYTE PTR [rdx+8], 1
jne .L15
cmp BYTE PTR [rsi+8], 1
jne .L15
; the actual computation:
movsd xmm0, QWORD PTR [rsi]
addsd xmm0, QWORD PTR [rdx]
ret
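To make the two listings above easier to relate to source code, here is a rough hand-written analogue of the table dispatch. It is only a sketch and not libstdc++'s actual implementation: the names add_entry, table and add_via_table are made up for illustration, and the slot formula i * 2 + j is an assumption modelled on the lea in the listing. The point is that each table entry re-checks the active alternative (via std::get), even though the caller already picked the entry based on the same indices.

```cpp
#include <cstddef>
#include <variant>

using V = std::variant<int, double>;

// Hypothetical table entry: one instantiation per (index of a, index of b).
template <std::size_t I, std::size_t J>
double add_entry(const V& a, const V& b) {
    // std::get checks the active alternative again and throws
    // std::bad_variant_access on mismatch. This plays the role of the
    // cmp/jne pairs in the <double, double> listing.
    return std::get<I>(a) + std::get<J>(b);
}

using Entry = double (*)(const V&, const V&);

// 2 x 2 = 4 entries, flattened so that slot = i * 2 + j (assumed layout).
constexpr Entry table[4] = {
    &add_entry<0, 0>, &add_entry<0, 1>,
    &add_entry<1, 0>, &add_entry<1, 1>,
};

double add_via_table(const V& a, const V& b) {
    // "load variant tags and check for bad variant"
    if (a.valueless_by_exception() || b.valueless_by_exception())
        throw std::bad_variant_access{};
    // "index into vtable": an indirect call the optimizer cannot see through
    return table[a.index() * 2 + b.index()](a, b);
}
```

Because the entries are only reached through a function pointer, nothing inside add_entry tells the compiler that the indices were already tested by the caller.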
I would not be surprised if the profiler attributed both the time for these type checks and the time of your inlined visitors to std::__detail::__variant::__gen_vtable_impl<...>, rather than giving you the full 800-plus character name of the deeply nested lambda.
The only generic optimization potential I see here would be to omit the checks for bad variant in the lambdas. Since the lambdas are called through a function pointer only with matching variants, the compiler will have a very hard time statically discovering that the checks are redundant.
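For comparison, here is a user-level sketch of a dispatch in which the index test and the access sit in the same function. This is my own illustration rather than something libstdc++ or libc++ does: unchecked_get and add_via_switch are hypothetical names, and I have not verified what any particular compiler generates for it. The idea is only that, without the function pointer in between, the optimizer at least gets to see both the switch and the check inside std::get_if together.

```cpp
#include <cstddef>
#include <variant>

using V = std::variant<int, double>;

// Hypothetical helper: read alternative I without a throwing check.
template <std::size_t I>
const auto& unchecked_get(const V& v) {
    // std::get_if returns nullptr on mismatch instead of throwing.
    // The caller guarantees the alternative through the switch below,
    // so the pointer is assumed to be non-null here.
    return *std::get_if<I>(&v);
}

double add_via_switch(const V& a, const V& b) {
    if (a.valueless_by_exception() || b.valueless_by_exception())
        throw std::bad_variant_access{};
    // Same flattened index as a 2 x 2 table, but resolved with direct
    // branches instead of an indirect call.
    switch (a.index() * 2 + b.index()) {
        case 0:  return unchecked_get<0>(a) + unchecked_get<0>(b);
        case 1:  return unchecked_get<0>(a) + unchecked_get<1>(b);
        case 2:  return unchecked_get<1>(a) + unchecked_get<0>(b);
        default: return unchecked_get<1>(a) + unchecked_get<1>(b);
    }
}
```

This does not scale to many alternatives the way std::visit does, so treat it as an experiment to compare against in the disassembly, not as a replacement.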
I had a look at the same example compiled with clang and libc++. In libc++ the redundant type checks are eliminated, so libstdc++ is not quite optimal yet.
decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<1ul, 1ul>::__dispatch<std::__1::__variant_detail::__visitation::__variant::__value_visitor<test(std::__1::variant<int, double>, std::__1::variant<int, double>)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<test(std::__1::variant<int, double>, std::__1::variant<int, double>)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&): # @"decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<1ul, 1ul>::__dispatch<std::__1::__variant_detail::__visitation::__variant::__value_visitor<test(std::__1::variant<int, double>, std::__1::variant<int, double>)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<test(std::__1::variant<int, double>, std::__1::variant<int, double>)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&)"
; no redundant check here
movsd xmm0, qword ptr [rsi] # xmm0 = mem[0],zero
addsd xmm0, qword ptr [rdx]
ret
Maybe you can check what code is actually generated in your production software, just in case it is not similar to what I found with my example.
(...) el and particularly the type of el is constant over the two inner loops and vis (and its type) is constant in the inner loop? – Tamatave
(...) std::variants with 8 possible types and still saw a compile time table with gcc 7.2 on Ubuntu. I also tried 8^5, but I had to terminate the compiler because my VM ran out of memory. – Tamatave