applying std algorithms over structs?

We have Boost.PFR and we have the tuple iterator. If we combine the two, we might have a way of applying std algorithms over structs. Does a solution already exist? What I'm looking for is:

S a, b;
auto const ra(to_range(a)), rb(to_range(b));
std::transform(ra.begin(), ra.end(), rb.begin(), [](auto&& a)noexcept{return a;});

This would allow us to use the newer <execution> features to process structs out of sequence or in parallel.

Eladiaelaeoptene asked 10/4, 2021 at 11:01 Comment(2)
Looks like a simple replacement of the standard tuple functions with PFR ones (in that tuple iterator) will do. But note that a will be a std::variant.Oloroso
yes, I was hoping someone might have made something more substantial :)Eladiaelaeoptene

So to illustrate the points I tried to make in the comments to the other answer, let's write something like your transform:

A Direct Implementation

I skipped the notion of iterators and the standard library, since it is encumbered with the whole "iterator value type must be fixed" requirement and other burdens.

Instead, let's do it "functionally".

#include <boost/pfr.hpp>
#include <algorithm>   // std::min, std::max
#include <type_traits> // std::decay_t
#include <utility>     // std::forward, std::index_sequence
namespace pfr = boost::pfr;

template <typename Op, typename... T>
void transform(Op f, T&&... operands) {
    // invoke f with the N-th field of every operand
    auto apply = [&]<int N>() {
        f(pfr::get<N>(std::forward<T>(operands))...);
        return 1;
    };

    // only visit as many fields as the shortest struct has
    constexpr auto size = std::min({pfr::tuple_size<std::decay_t<T>>::value...});
    // optionally assert that sizes match:
    //static_assert(size == std::max({pfr::tuple_size<std::decay_t<T>>::value...}));

    [=]<auto... N>(std::index_sequence<N...>) {
        return (apply.template operator()<N>() + ...);
    }
    (std::make_index_sequence<size>{});
}

I already generalized a bit by not fixing the arity: it's more like an n-ary zip or visitor now. To get the transform you wanted, you'd pass it an operation like

auto binary = [](auto const& a, auto& b) {
    b = a;
};

Let's demo this, highlighting mixed-type members, asymmetric types, as well as mixed-length structs:

struct S1 { int a; double b; long c; float d; };
struct S2 { double a; double b; double c; double d; };
struct S3 { double a; double b; };

Test cases:

// binary is the lambda shown above; puts needs <cstdio>, and
// __PRETTY_FUNCTION__ is a GCC/Clang extension
int main() {

    auto n_ary = [](auto&... fields) {
        puts(__PRETTY_FUNCTION__); // show the deduced field types
        return (... = fields);
    };

    S1 a;
    S2 b;
    S3 c;

    // all directions
    transform(binary, a, b);
    transform(binary, b, a);

    // mixed sizes
    transform(binary, b, c);
    transform(binary, c, a);

    // why settle for binary?
    transform(n_ary, a, b);
    transform(n_ary, a, b, c);
    transform(n_ary, c, b, a);
}

See it Live On Compiler Explorer

Already, the disassembly suggests everything is getting inlined and optimized away. Literally. Only the puts calls remain:

main:
    sub     rsp, 8
    mov     edi, OFFSET FLAT:.LC0
    call    puts
    mov     edi, OFFSET FLAT:.LC1
    call    puts
    mov     edi, OFFSET FLAT:.LC2
    call    puts
    ...
    ...
    xor     eax, eax
    add     rsp, 8
    ret

giving the output

main()::<lambda(const auto:12&, auto:13&)> [with auto:12 = int; auto:13 = double]
main()::<lambda(const auto:12&, auto:13&)> [with auto:12 = double; auto:13 = double]
main()::<lambda(const auto:12&, auto:13&)> [with auto:12 = long int; auto:13 = double]
main()::<lambda(const auto:12&, auto:13&)> [with auto:12 = float; auto:13 = double]
main()::<lambda(const auto:12&, auto:13&)> [with auto:12 = double; auto:13 = int]
main()::<lambda(const auto:12&, auto:13&)> [with auto:12 = double; auto:13 = double]
main()::<lambda(const auto:12&, auto:13&)> [with auto:12 = double; auto:13 = long int]
main()::<lambda(const auto:12&, auto:13&)> [with auto:12 = double; auto:13 = float]
main()::<lambda(const auto:12&, auto:13&)> [with auto:12 = double; auto:13 = double]
main()::<lambda(const auto:12&, auto:13&)> [with auto:12 = double; auto:13 = double]
main()::<lambda(const auto:12&, auto:13&)> [with auto:12 = double; auto:13 = int]
main()::<lambda(const auto:12&, auto:13&)> [with auto:12 = double; auto:13 = double]
main()::<lambda(auto:14& ...)> [with auto:14 = {int, double}]
main()::<lambda(auto:14& ...)> [with auto:14 = {double, double}]
main()::<lambda(auto:14& ...)> [with auto:14 = {long int, double}]
main()::<lambda(auto:14& ...)> [with auto:14 = {float, double}]
main()::<lambda(auto:14& ...)> [with auto:14 = {int, double, double}]
main()::<lambda(auto:14& ...)> [with auto:14 = {double, double, double}]
main()::<lambda(auto:14& ...)> [with auto:14 = {double, double, int}]
main()::<lambda(auto:14& ...)> [with auto:14 = {double, double, double}]

Proof Of Work

Let's do some "useful" calculations that we can check. Also, let's rename the transform function to nway_visit to reflect its more generic nature:

auto binary = [](auto& a, auto& b) { return a *= b; };
auto n_ary  = [](auto&... fields) { return (... *= fields); };
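
To pin down the fold direction: (... *= fields) is a unary left fold. A tiny standalone sketch of what it expands to for three fields:

double x = 2, y = 3, z = 4;
(x *= y) *= z; // x == 24 afterwards; y and z are only read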

So both operations multiply-assign into their left-most operand; n_ary does so as a left fold over all fields. Given some distinctly chosen initializers

S1 a {1,2,3,4};
S2 b {2,3,4,5};
S3 c {3,4};

we want to be able to see the data flow through the data structures. So, let's optionally do some debug tracing of that:

// assumes <iostream> and <iomanip> are included and a constexpr bool
// output_enabled flag is defined elsewhere in the program
#define DEMO(expr)                                                             \
    void(expr);                                                                \
    if constexpr (output_enabled) {                                            \
        std::cout << "After " << std::left << std::setw(26) << #expr;          \
        std::cout << " a:" << pfr::io(a) << "\tb:" << pfr::io(b)               \
                  << "\tc:" << pfr::io(c) << "\n";                             \
    }

    DEMO("initialization");

    // all directions
    DEMO(nway_visit(binary, a, b));
    DEMO(nway_visit(binary, b, a));

    // mixed sizes
    DEMO(nway_visit(binary, b, c));
    DEMO(nway_visit(binary, c, a));

    // why settle for binary?
    DEMO(nway_visit(n_ary, a, b));
    DEMO(nway_visit(n_ary, a, b, c));
    DEMO(nway_visit(n_ary, c, b, a));

    return long(c.a + c.b) % 37; // prevent whole program optimization...

As the cherry on top, let's make absolutely sure that (with output disabled) the compiler cannot optimize the whole program away for lack of observable effects:

return long(c.a + c.b) % 37; // prevent whole program optimization...

The demo is Live On Compiler Explorer with output enabled, and once more with output disabled, showing the disassembly:

main:
        mov     eax, 13
        ret

WOW

Holy smokes. That's optimization. The whole program is statically evaluated and just returns the exit code 13. Let's see whether that's the correct exit code:

Output Enabled:

After "initialization"           a:{1, 2, 3, 4} b:{2, 3, 4, 5}  c:{3, 4}
After nway_visit(binary, a, b)   a:{2, 6, 12, 20}   b:{2, 3, 4, 5}  c:{3, 4}
After nway_visit(binary, b, a)   a:{2, 6, 12, 20}   b:{4, 18, 48, 100}  c:{3, 4}
After nway_visit(binary, b, c)   a:{2, 6, 12, 20}   b:{12, 72, 48, 100} c:{3, 4}
After nway_visit(binary, c, a)   a:{2, 6, 12, 20}   b:{12, 72, 48, 100} c:{6, 24}
After nway_visit(n_ary, a, b)    a:{24, 432, 576, 2000} b:{12, 72, 48, 100} c:{6, 24}
After nway_visit(n_ary, a, b, c) a:{1728, 746496, 576, 2000}    b:{12, 72, 48, 100} c:{6, 24}
After nway_visit(n_ary, c, b, a) a:{1728, 746496, 576, 2000}    b:{12, 72, 48, 100} c:{124416, 1289945088}

So, the return value should be (124416 + 1289945088) modulo 37, i.e. 1290069504 % 37, which a pocket calculator confirms is 13.
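
And if you don't trust a pocket calculator, the compiler can verify the arithmetic at compile time:

static_assert((124416L + 1289945088L) % 37 == 13);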

From Here: Parallel tasks etc.

Your original motivation included getting the standard library's parallel execution options for free. As you know, I'm skeptical of the usefulness of that.

However, little stops you from getting this behaviour from this algorithm:

#include <boost/asio.hpp>

boost::asio::thread_pool ctx; // or, e.g., system_executor

auto run_task = [&](auto&... fields) {
    // [=] copies the current field values into the task
    boost::asio::post(ctx, [=] { long_running_task(fields...); });
};
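
One caveat worth spelling out: the pool has to be joined before anyone reads the results. A hypothetical call site (long_running_task remains a placeholder) might look like:

nway_visit(run_task, a, b); // posts one task per field pair
ctx.join();                 // block until all posted tasks complete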

Hope this is good inspiration. And thanks for making me look at PFR. It's pretty sweet.

Belie answered 10/4, 2021 at 22:15 Comment(2)
And here's a hacked-up version just to show that std::variant does not need to incur allocation overhead: godbolt.org/z/5M69KcdYf (don't use this code, it's inconsistent for generic use; in particular I "emulated" the *= behaviour by assigning the result back to the left-most operand). Much less optimized, but no heap in sight. (Same for boost::variant godbolt.org/z/6jobWqvsE, although I had to comment out the ternary invocations because boost::apply_visitor doesn't support that)Belie
This really is an eye-opening answer. I share your scepticism as far as std::execution::par is concerned, but std::execution::unseq sometimes produces jaw-dropping SIMD code.Eladiaelaeoptene

You can already get this functionality with pfr::structure_tie, which yields a tuple of references.
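
For illustration, a minimal sketch (S is a stand-in aggregate; because structure_tie returns a std::tuple of references, tuple assignment writes straight through to the fields):

#include <boost/pfr.hpp>

struct S { int x; double y; };

int main() {
    S a{1, 2.0}, b{3, 4.5};
    boost::pfr::structure_tie(a) = boost::pfr::structure_tie(b); // a is now {3, 4.5}
}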

This would allow us to use the newer <execution> features to process structs out of sequence or in parallel.

That would only make sense if the per-element processing has considerable execution cost. In that case, you're probably already fine with a bespoke

boost::pfr::for_each_field(var, [](auto& field) { field += 1; });

invocation.
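
For reference, a self-contained version of that one-liner (Var is a made-up aggregate):

#include <boost/pfr.hpp>
#include <iostream>

struct Var { int i; double d; };

int main() {
    Var var{1, 2.5};
    boost::pfr::for_each_field(var, [](auto& field) { field += 1; });
    std::cout << boost::pfr::io(var) << "\n"; // prints {2, 3.5}
}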

Belie answered 10/4, 2021 at 15:38 Comment(8)
nice, but the tuple iterator is just a toy implementation. I threw something together and it works for my purposes.Eladiaelaeoptene
Looks like it makes the assumption that all members are (convertible to) the first element's type? You didn't mention such a (vastly simplifying) requirement. Maybe I misreadBelie
no, you didn't, I just came to the same conclusions as you, that any alternative would be too slow. A variant would slow things down to a crawl. If you have a nice solution for the simplifying assumption, I'm very interested to see it.Eladiaelaeoptene
"A variant would slow things down to a crawl." - need proof. I say: of course not. Given static types the compiler could inline and unrioll 100%Belie
AFAIK it allocates memory, and that's death to performance.Eladiaelaeoptene
Still underestimating the optimizer. Also, no need for the variant, unless you insist on calling the value type something "fixed".Belie
yes, it could be some kind of a reference holder. Well, I have a feeling you'll write something for us :) Be sure to update your answer, when you do, I already gave +1, I can't do more.Eladiaelaeoptene
@Eladiaelaeoptene I dived into PFR for the first time. What a nice library. I did some tests/demos to verify my intuitions about optimizability if you avoid the variants (while keeping full variability of static types). The results do not disappoint one bitBelie
