Is optimizing std::visit possible?

While using std::visit / std::variant I see in the profiler output that std::__detail::__variant::__gen_vtable_impl functions take the most time.

I did a test like this:

// 3 class families, all like this
class ElementDerivedN : public ElementBase
{
    ...
    std::variant<ElementDerived1*, ElementDerived2*, ...> GetVariant() override { return this; }
};

std::vector<Element*> elements;
std::vector<Visitor*> visitors;
std::vector<Third*>   thirds;

// prepare a hack to get a combined function object (overload set):
template<class... Ts> struct funcs : Ts... { using Ts::operator()...; };
template<class... Ts> funcs(Ts...) -> funcs<Ts...>;
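
As a side note, the funcs trick above is the usual C++17 overload-set pattern. A minimal, self-contained sketch of it on a single variant (with hypothetical lambdas, not the question's classes) looks like this:

```cpp
#include <string>
#include <variant>

// Same trick as `funcs`: inherit from all callables and pull in their
// call operators to form one overload set.
template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;

// Hypothetical demo, not the question's classes:
std::string describe(const std::variant<int, double>& v) {
    return std::visit(overloaded{
        [](int)    { return std::string("int"); },
        [](double) { return std::string("double"); }
    }, v);
}
```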

// demo functions:
struct Actions { template < typename R, typename S, typename T> void operator()( R*, S*, T* ) {} };
struct SpecialActionForElement1 { template < typename S, typename T > void operator()( ElementDerived1*, S*, T* ) {} };


for ( auto el: elements )
{
    for ( auto vis: visitors )
    {
        for ( auto th: thirds )
        {
            std::visit( funcs{ Actions(), SpecialActionForElement1() }, el->GetVariant(), vis->GetVariant(), th->GetVariant() );
        }
    }
}

As said, std::__detail::__variant::__gen_vtable_impl<...> takes most of the time.

Q: Since the n-dimensional function table generated for each std::visit call is the same from call to call, it would be nice to keep it between calls of std::visit. Is that possible?

Maybe I am on the wrong path, if so, let me know!
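
For illustration only, here is what the kind of per-call reuse asked about could look like by hand, for one variant dimension (names and types are made up for the sketch). The table lives in a static constexpr object, so it is built once at compile time and shared by all calls:

```cpp
#include <array>
#include <variant>

using Var = std::variant<int, double>;

// One entry per alternative: each thunk fixes the index I at compile time.
template<class Visitor, std::size_t I>
void thunk(Visitor& vis, Var& v) { vis(std::get<I>(v)); }

// The table is a static constexpr object: built once, reused by every call --
// conceptually the same caching a library implementation can do internally.
template<class Visitor>
void my_visit(Visitor vis, Var& v) {
    static constexpr std::array<void (*)(Visitor&, Var&), 2> table = {
        &thunk<Visitor, 0>, &thunk<Visitor, 1>
    };
    table[v.index()](vis, v);  // single indexed indirect call
}
```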

EDIT: The compiler is gcc 7.3 from a standard Fedora installation, with whatever standard library g++ uses by default (libstdc++).

build options:

g++ --std=c++17 -fno-rtti main.cpp -O3 -g -o go
Doubt answered 7/3, 2018 at 7:46 Comment(15)
Do you use polymorphic objects inside std::variant? Maybe you can simplify your data by avoiding std::variant altogether. variant is actually an expensive thing, not really optimal. If you use simple inheritance, you give the compiler a chance to optimize your code, e.g. by de-virtualization.Foretoken
@VictorGubin: Yes, that is known. The reason is that the standard visitor pattern cannot be implemented with templates (multiple dispatch), because virtual function templates are not possible in C++. Using variant/visit is the workaround for that. And as seen, this solution takes around 60% more time to dispatch for N=2. That is acceptable for me, but maybe it can be optimized (the reason for this question). The design of my concrete software is much simpler and easier to read with variant/visit than with a classic visitor pattern implementation. The simplification in the example does not reflect my real software's needs!Doubt
From my point of view (and not only mine; Thomas Kyte's, for example), if you are thinking about optimization, you should start from the design and then move to the low level. I don't think low levelForetoken
@VictorGubin: You can turn any question into an XY one. But this question asks about potential optimization options for a concrete piece of functionality. The observed behavior leads me to assume that an object generates a temporary which could be kept from call to call. So what does this have to do with my application? I did measurements and found that time is consumed exactly at this point, so I think it is worth asking for an improvement. Maybe this is an idea others can also use to make their code faster. This question is not "My program is too slow, please help".Doubt
If there were an example that we could compile, we might be able to look at the generated code and experiment a little with it to suggest an optimization. I tried reading the libstdc++ implementation, but it did not yield many useful insights by itself. Also, how large are the vectors approximately (just an order of magnitude)?Tamatave
Have you considered some form of multi-level dispatch, since el (and particularly its type) is constant over the two inner loops, and vis (and its type) is constant in the inner loop?Tamatave
Also, for optimization it is always useful to know the compiler and compiler options, as well as the standard library. Do you use gcc with libstdc++?Tamatave
@PaulR: The constness of the outer loops is only example code, so a partial dispatch is not possible in the real world.Doubt
@PaulR: added the info to my post. Thanks for the hint.Doubt
@Doubt Looks like you misunderstood me. Your design is really complicated, and I don't think you will be able to optimize it at a low level while leaving it as it is. This is not a memmove function, which can be implemented with CPU AVX instructions. Take a look at boost.org/doc/libs/1_47_0/doc/html/boost/signals2/signal.html maybe it can help.Foretoken
@VictorGubin: I have no problem with speed; it works quite well and the slowdown is accepted. I only want to find out whether std::visit can be improved by keeping an internal temporary (if it is implemented this way) from call to call. It looks like (I did not find all details of the implementation; the code is quite complex) the visit function creates an n-dimensional function table. Maybe I am wrong, so this question is perhaps only of academic interest. There is no speed problem nor a general optimization problem with my application.Doubt
@Doubt So, what if we open the <variant> header and examine the implementation? template<typename _Visitor, typename... _Variants> constexpr decltype(auto) visit(_Visitor&& __visitor, _Variants&&... __variants) {..... constexpr auto& __vtable = __detail::__variant::__gen_vtable< _Result_type, _Visitor&&, _Variants&&...>::_S_vtable; .... } So, how would you like to optimize it, if it takes most of the time according to your profiling results?Foretoken
Have a look at Mach7.Leenaleeper
I am curious: how many different classes are there in each variant in your code? I just had a look at code with 4096 simple generated visitors for visiting 4 std::variants with 8 possible types each, and still saw a compile-time table with gcc 7.2 on Ubuntu. I also tried 8^5, but had to terminate the compiler because my VM ran out of memory.Tamatave
@PaulR: Each variant contains 3 types and should give a 3x3 matrix. That is what I wonder about, too!Doubt

I just had a look at a simpler example. The table is generated at compile time. The time is probably spent in the lambdas generated in std::__detail::__variant::__gen_vtable_impl<...>. For some reason these lambdas, which basically call the visitor, do not omit the check for the actual type of the variant.

This function lets the compiler create code for four different versions of the visiting lambda, inlined into lambdas created deep down in std::visit, and stores the pointers to these lambdas in a static array:

double test(std::variant<int, double> v1, std::variant<int, double> v2) {
    return std::visit([](auto a, auto b) -> double {
        return a + b;
        }, v1, v2);
}

This is created in test:

  (...) ; load variant tags and check for bad variant
  lea rax, [rcx+rax*2] ; compute index in array
  mov rdx, rsi
  mov rsi, rdi
  lea rdi, [rsp+15]
  ; index into vtable with rax
  call [QWORD PTR std::__detail::__variant::(... bla lambda bla ...)::S_vtable[0+rax*8]]

This is generated for the <double, double> visitor:

std::__detail::__variant::__gen_vtable_impl<std::__detail::__variant::_Multi_array<double (*)(test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&, std::variant<int, double>&, test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&)>, std::tuple<test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&, test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&>, std::integer_sequence<unsigned long, 1ul, 1ul> >::__visit_invoke(test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}, test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&, test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&):
; whew, that is a long name :-)
; redundant checks are performed whether we are accessing variants of the correct type:
      cmp BYTE PTR [rdx+8], 1
      jne .L15
      cmp BYTE PTR [rsi+8], 1
      jne .L15
; the actual computation:
      movsd xmm0, QWORD PTR [rsi]
      addsd xmm0, QWORD PTR [rdx]
      ret

I would not be surprised if the profiler attributed both the time for these type checks and the time of your inlined visitors to std::__detail::__variant::__gen_vtable_impl<...>, rather than giving you the full 800-plus character name of the deeply nested lambda.

The only generic optimization potential I see here would be to omit the checks for a bad variant in the lambdas. Since the lambdas are called through a function pointer, and only with matching variants, the compiler will have a very hard time statically discovering that the checks are redundant.
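
In user code, one way to sidestep both the function-pointer table and the tag checks for a small variant is to dispatch manually with a switch on index(). A minimal sketch (the function name and types are illustrative, not from the question; std::get_if avoids the throwing check, and __builtin_unreachable is a GCC/Clang extension that assumes the variant is never valueless):

```cpp
#include <variant>

// Manual dispatch: the compiler sees the concrete alternative on each branch,
// so no redundant tag check survives and no indirect call is needed.
double add(const std::variant<int, double>& a, const std::variant<int, double>& b) {
    auto as_double = [](const std::variant<int, double>& v) -> double {
        switch (v.index()) {
            case 0:  return *std::get_if<0>(&v);  // int alternative
            case 1:  return *std::get_if<1>(&v);  // double alternative
            default: __builtin_unreachable();     // assumes never valueless (GCC/Clang)
        }
    };
    return as_double(a) + as_double(b);
}
```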

I had a look at the same example compiled with clang and libc++. In libc++ the redundant type checks are eliminated, so libstdc++ is not quite optimal yet.

decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<1ul, 1ul>::__dispatch<std::__1::__variant_detail::__visitation::__variant::__value_visitor<test(std::__1::variant<int, double>, std::__1::variant<int, double>)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<test(std::__1::variant<int, double>, std::__1::variant<int, double>)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&): # @"decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<1ul, 1ul>::__dispatch<std::__1::__variant_detail::__visitation::__variant::__value_visitor<test(std::__1::variant<int, double>, std::__1::variant<int, double>)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<test(std::__1::variant<int, double>, std::__1::variant<int, double>)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&)"
  ; no redundant check here
  movsd xmm0, qword ptr [rsi] # xmm0 = mem[0],zero
  addsd xmm0, qword ptr [rdx]
  ret

Maybe you can check what code is actually generated in your production software, just in case it is not similar to what I found with my example.

Tamatave answered 7/3, 2018 at 14:7 Comment(2)
Thanks for your work! As I found in my assembly, there is no statically generated table but the generator function. So I will break down my example code to the point where the compiler generates the static solution. Maybe I will find what breaks the optimization.Doubt
This may be related to size or time limits for constexpr computation, see this answer. Maybe you can tune the -fconstexpr-depth option.Tamatave

© 2022 - 2024 — McMap. All rights reserved.