Is optimizing std::visit possible?

While using std::visit / std::variant I see in the profiler output that std::__detail::__variant::__gen_vtable_impl functions take the most time.

I did a test like this:

// 3 class families, all like this
class ElementDerivedN : public ElementBase
{
    ...
    std::variant<ElementDerived1*, ElementDerived2*, ...> GetVariant() override { return this; }
};

std::vector<Element*> elements;
std::vector<Visitor*> visitors;
std::vector<Third*>   thirds;

// prepare a hack to get a combined function object (overload set):
template<class... Ts> struct funcs : Ts... { using Ts::operator()...; };
template<class... Ts> funcs(Ts...) -> funcs<Ts...>;
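
As a side note, the funcs trick above is the usual C++17 overload-set pattern. A minimal, self-contained sketch of it on a single variant (with hypothetical lambdas, not the question's classes) looks like this:

```cpp
#include <string>
#include <variant>

// Same trick as `funcs`: inherit from all callables and pull in their
// call operators to form one overload set.
template<class... Ts> struct overloaded : Ts... { using Ts::operator()...; };
template<class... Ts> overloaded(Ts...) -> overloaded<Ts...>;

// Hypothetical demo, not the question's classes:
std::string describe(const std::variant<int, double>& v) {
    return std::visit(overloaded{
        [](int)    { return std::string("int"); },
        [](double) { return std::string("double"); }
    }, v);
}
```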

// demo functions:
struct Actions { template < typename R, typename S, typename T> void operator()( R*, S*, T* ) {} };
struct SpecialActionForElement1 { template < typename S, typename T > void operator()( ElementDerived1*, S*, T* ) {} };


for ( auto el: elements )
{
    for ( auto vis: visitors )
    {
        for ( auto th: thirds )
        {
            std::visit( funcs{ Actions(), SpecialActionForElement1() }, el->GetVariant(), vis->GetVariant(), th->GetVariant() );
        }
    }
}

As said, std::__detail::__variant::__gen_vtable_impl<...> takes most of the time.

Q: Since the n-dimensional function table generated for each std::visit call is the same from call to call, it would be nice to keep it between calls of std::visit. Is that possible?

Maybe I am on the wrong path, if so, let me know!
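
For illustration only, here is what the kind of per-call reuse asked about could look like by hand, for one variant dimension (names and types are made up for the sketch). The table lives in a static constexpr object, so it is built once at compile time and shared by all calls:

```cpp
#include <array>
#include <variant>

using Var = std::variant<int, double>;

// One entry per alternative: each thunk fixes the index I at compile time.
template<class Visitor, std::size_t I>
void thunk(Visitor& vis, Var& v) { vis(std::get<I>(v)); }

// The table is a static constexpr object: built once, reused by every call --
// conceptually the same caching a library implementation can do internally.
template<class Visitor>
void my_visit(Visitor vis, Var& v) {
    static constexpr std::array<void (*)(Visitor&, Var&), 2> table = {
        &thunk<Visitor, 0>, &thunk<Visitor, 1>
    };
    table[v.index()](vis, v);  // single indexed indirect call
}
```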

EDIT: The compiler is gcc 7.3 from a standard Fedora installation, with whatever standard library g++ uses by default (libstdc++).

build options:

g++ --std=c++17 -fno-rtti main.cpp -O3 -g -o go
Doubt answered 7/3, 2018 at 7:46 Comment(15)
Do you use polymorphic objects inside std::variant? Maybe you can simplify your data by avoiding std::variant altogether. variant is actually an expensive thing, not really optimal. If you use simple inheritance, you give the compiler a chance to optimize your code, e.g. by de-virtualization.Foretoken
@VictorGubin: Yes, that is known. The reason is that the standard visitor pattern cannot be implemented with templates (multiple dispatch), because virtual function templates are not possible in C++. Using variant/visit is the workaround for that. And as seen, this solution takes around 60% more time to dispatch for N=2. That is acceptable for me, but maybe it can be optimized (the reason for this question). The design of my concrete software is much simpler and easier to read with variant/visit than with a classic visitor pattern implementation. The simplification in the example does not reflect my real software's needs!Doubt
From my point of view (and not only mine; Thomas Kyte's, for example), if you are thinking about optimization, you should start from the design and then move to the low level. I don't think low levelForetoken
@VictorGubin: You can turn any question into an XY one. But this question asks about potential optimization options for a concrete piece of functionality. The observed behavior leads me to assume that an object generates a temporary which could be kept from call to call. So what does this have to do with my application? I did measurements and found that time is consumed exactly at this point, so I think it is worth asking for an improvement. Maybe this is an idea others can also use to make their code faster. This question is not "My program is too slow, please help".Doubt
If there were an example that we could compile, we might be able to look at the generated code and experiment a little with it to suggest an optimization. I tried reading the libstdc++ implementation, but it did not yield many useful insights by itself. Also, how large are the vectors approximately (just an order of magnitude)?Tamatave
Have you considered some form of multi-level dispatch, since el (and particularly its type) is constant over the two inner loops, and vis (and its type) is constant in the inner loop?Tamatave
Also, for optimization it is always useful to know the compiler and compiler options, as well as the standard library. Do you use gcc with libstdc++?Tamatave
@PaulR: The constness of the outer loops is only example code, so a partial dispatch is not possible in the real world.Doubt
@PaulR: added the info to my post. Thanks for the hint.Doubt
@Doubt Looks like you misunderstood me. Your design is really complicated, and I don't think you will be able to optimize it at a low level while leaving it as it is. This is not a memmove function, which can be implemented with CPU AVX instructions. Take a look at boost.org/doc/libs/1_47_0/doc/html/boost/signals2/signal.html maybe it can help.Foretoken
@VictorGubin: I have no problem with speed; it works quite well and the slowdown is accepted. I only want to find out whether std::visit can be improved by keeping an internal temporary (if it is implemented this way) from call to call. It looks like (I did not find all details of the implementation; the code is quite complex) the visit function creates an n-dimensional function table. Maybe I am wrong, so this question is perhaps only of academic interest. There is no speed problem nor a general optimization problem with my application.Doubt
@Doubt So, what if we open the <variant> header and examine the implementation? template<typename _Visitor, typename... _Variants> constexpr decltype(auto) visit(_Visitor&& __visitor, _Variants&&... __variants) {..... constexpr auto& __vtable = __detail::__variant::__gen_vtable< _Result_type, _Visitor&&, _Variants&&...>::_S_vtable; .... } So, how would you like to optimize it, if it takes most of the time according to your profiling results?Foretoken
Have a look at Mach7.Leenaleeper
I am curious: how many different classes are there in each variant in your code? I just had a look at code with 4096 simple generated visitors for visiting 4 std::variants with 8 possible types each, and still saw a compile-time table with gcc 7.2 on Ubuntu. I also tried 8^5, but had to terminate the compiler because my VM ran out of memory.Tamatave
@PaulR: Each variant contains 3 types and should give a 3x3 matrix. That is what I wonder about, too!Doubt

I just had a look at a simpler example. The table is generated at compile time. The time is probably spent in the lambdas generated in std::__detail::__variant::__gen_vtable_impl<...>. For some reason these lambdas, which basically call the visitor, do not omit the check for the actual type of the variant.

This function lets the compiler create code for four different versions of the visiting lambda, inlined into lambdas created deep down in std::visit, and stores the pointers to these lambdas in a static array:

double test(std::variant<int, double> v1, std::variant<int, double> v2) {
    return std::visit([](auto a, auto b) -> double {
        return a + b;
        }, v1, v2);
}

This is created in test:

  (...) ; load variant tags and check for bad variant
  lea rax, [rcx+rax*2] ; compute index in array
  mov rdx, rsi
  mov rsi, rdi
  lea rdi, [rsp+15]
  ; index into vtable with rax
  call [QWORD PTR std::__detail::__variant::(... bla lambda bla ...)::S_vtable[0+rax*8]]

This is generated for the <double, double> visitor:

std::__detail::__variant::__gen_vtable_impl<std::__detail::__variant::_Multi_array<double (*)(test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&, std::variant<int, double>&, test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&)>, std::tuple<test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&, test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&>, std::integer_sequence<unsigned long, 1ul, 1ul> >::__visit_invoke(test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}, test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&, test(std::variant<int, double>, std::variant<int, double>)::{lambda(auto:1, auto:2)#1}&&):
; whew, that is a long name :-)
; redundant checks are performed whether we are accessing variants of the correct type:
      cmp BYTE PTR [rdx+8], 1
      jne .L15
      cmp BYTE PTR [rsi+8], 1
      jne .L15
; the actual computation:
      movsd xmm0, QWORD PTR [rsi]
      addsd xmm0, QWORD PTR [rdx]
      ret

I would not be surprised if the profiler attributed both the time for these type checks and the time of your inlined visitors to std::__detail::__variant::__gen_vtable_impl<...>, rather than giving you the full 800-plus character name of the deeply nested lambda.

The only generic optimization potential I see here would be to omit the checks for a bad variant in the lambdas. Since the lambdas are called through a function pointer, and only with matching variants, the compiler will have a very hard time statically discovering that the checks are redundant.
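
In user code, one way to sidestep both the function-pointer table and the tag checks for a small variant is to dispatch manually with a switch on index(). A minimal sketch (the function name and types are illustrative, not from the question; std::get_if avoids the throwing check, and __builtin_unreachable is a GCC/Clang extension that assumes the variant is never valueless):

```cpp
#include <variant>

// Manual dispatch: the compiler sees the concrete alternative on each branch,
// so no redundant tag check survives and no indirect call is needed.
double add(const std::variant<int, double>& a, const std::variant<int, double>& b) {
    auto as_double = [](const std::variant<int, double>& v) -> double {
        switch (v.index()) {
            case 0:  return *std::get_if<0>(&v);  // int alternative
            case 1:  return *std::get_if<1>(&v);  // double alternative
            default: __builtin_unreachable();     // assumes never valueless (GCC/Clang)
        }
    };
    return as_double(a) + as_double(b);
}
```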

I had a look at the same example compiled with clang and libc++. In libc++ the redundant type checks are eliminated, so libstdc++ is not quite optimal yet.

decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<1ul, 1ul>::__dispatch<std::__1::__variant_detail::__visitation::__variant::__value_visitor<test(std::__1::variant<int, double>, std::__1::variant<int, double>)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<test(std::__1::variant<int, double>, std::__1::variant<int, double>)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&): # @"decltype(auto) std::__1::__variant_detail::__visitation::__base::__dispatcher<1ul, 1ul>::__dispatch<std::__1::__variant_detail::__visitation::__variant::__value_visitor<test(std::__1::variant<int, double>, std::__1::variant<int, double>)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&>(std::__1::__variant_detail::__visitation::__variant::__value_visitor<test(std::__1::variant<int, double>, std::__1::variant<int, double>)::$_0>&&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&, std::__1::__variant_detail::__base<(std::__1::__variant_detail::_Trait)0, int, double>&)"
  ; no redundant check here
  movsd xmm0, qword ptr [rsi] # xmm0 = mem[0],zero
  addsd xmm0, qword ptr [rdx]
  ret

Maybe you can check what code is actually generated in your production software, just in case it is not similar to what I found with my example.

Tamatave answered 7/3, 2018 at 14:7 Comment(2)
Thanks for your work! As I found in my assembly, there is no statically generated table but the generator function. So I will break down my example code to the point where the compiler generates the static solution. Maybe I will find what breaks the optimization.Doubt
This may be related to size or time limits for constexpr computation, see this answer. Maybe you can tune the -fconstexpr-depth option.Tamatave

© 2022 - 2024 — McMap. All rights reserved.