Compiler optimization in C++

Asked 25/2, 2020 at 2:6 Answered 9/10, 2024 at 9:18

c++optimization runtime compiler-optimization

I am trying to implement two classes A and B which contain data stored in std::unique_ptr container, and A could transform to B with some calculation. The class A and class B are shown as below. The input parameter of constructor in class B is designed passing a object A. In order to measure the run-time performance, I use std::chrono library, and add a function "print_construction_time" in class B to show construction time.

CPU: Intel® Core™ i7-6700HQ 2.6GHz

RAM: 16GB

OS: Windows 10 1909

IDE: Microsoft Visual Studio Community 2019 Version 16.4.4

c/c++ optimization setting: Maximum Optimization (Favor Speed) (/O2)

class A
{
public:
    A(int input_size, int input_value)              //  constructor
    {
        this->data = std::make_unique<int[]>(input_size);
        this->size = input_size;
        for (int loop_number = 0; loop_number < size; loop_number++) {
            data[loop_number] = input_value;
        }
    }
    std::unique_ptr<int[]> get_data()
    {
        //  deep copy
        auto return_data = std::make_unique<int[]>(size);
        for (int loop_number = 0; loop_number < size; loop_number++) {
            return_data[loop_number] = data[loop_number];
        }
        return return_data;
    }
    int get_size()
    {
        return this->size;
    }
private:
    int size;
    std::unique_ptr<int[]> data;
};

class B
{
public:
    B(A &input_object)                              //  constructor
    {
        this->size = input_object.get_size();
        this->data = std::make_unique<int[]>(this->size);
        this->start = std::chrono::high_resolution_clock::now();
        //  version 1
        for (int loop_number = 0; loop_number < input_object.get_size(); loop_number++) {
            this->data[loop_number] = transform_from_A(input_object.get_data()[loop_number]);
        }
        this->stop = std::chrono::high_resolution_clock::now();  //  for execution time measurement
    }
    std::unique_ptr<int[]> get_data()
    {
        //  deep copy
        auto return_data = std::make_unique<int[]>(size);
        for (int loop_number = 0; loop_number < size; loop_number++) {
            return_data[loop_number] = data[loop_number];
        }
        return return_data;
    }
    void print_construction_time()
    {
        auto duration = std::chrono::duration_cast<std::chrono::microseconds>(stop - start);
        std::cout << "Duration: " << duration.count() << "microseconds" << std::endl;
    }
private:
    int size;
    std::unique_ptr<int[]> data;
    std::chrono::time_point<std::chrono::steady_clock> start, stop;
    int transform_from_A(int input_value)
    {
        return input_value + 1; // For example
    }
};

The main function is here.

int main()
{
    A a_object(10000, 6);
    B b_object(a_object);
    b_object.print_construction_time();
    
    return 0;
}

For testing, I run this code three times and the execution time is 123407us, 112033us and 107586us. Next, I modify the for loop block in the constructor of class B into version 2 which designed with cached data.

        //  version 2   <=  tremendously faster than version 1
        auto data_cached = input_object.get_data();
        for (int loop_number = 0; loop_number < input_object.get_size(); loop_number++) {
            this->data[loop_number] = transform_from_A(data_cached[loop_number]);
        }

The measurement result of version 2 is 27us, 32us and 43us. I am curious that why the compiler seems not perform the Maximum Optimization in speed automatically through cache tricks based on the condition of the same calculation result and it need to be done by human. I think that this kind of optimization maybe could be done either by compiler automatically or by popping up the modify suggestion in editor environment.

Note: I also modify the optimization setting for testing. The run-time performance is similar between optimization setting in Maximum Optimization (Favor Speed) (/O2) and Optimizations (Favor Speed) (/Ox).

Feb. 27. 2020 Update

I also tried to modify the return type of A::get_data() in class A with adding const keyword of std::unique_ptr.

    const std::unique_ptr<int[]> get_data()    //    <= Add const keyword of std::unique_ptr
    {
        //  deep copy
        auto return_data = std::make_unique<int[]>(size);
        for (int loop_number = 0; loop_number < size; loop_number++) {
            return_data[loop_number] = data[loop_number];
        }
        return return_data;
    }

The measurement result of execution time is as similar as above. They are 101486us, 100538us and 120620us in version 1 and 55us, 30us and 27us in version 2. The optimization setting is Optimizations (Favor Speed) (/Ox) and IDE is updated to Version 16.4.5 .

Moreover, the case of "both const" (not only std::unique_ptr but its content), that is const std::unique_ptr<const int[]> also be considered.

    const std::unique_ptr<const int[]> get_data()
    {
        //  deep copy
        auto return_data = std::make_unique<int[]>(size);
        for (int loop_number = 0; loop_number < size; loop_number++) {
            return_data[loop_number] = data[loop_number];
        }
        return return_data;
    }

The measurement result of execution time is also as similar as above. They are 114754us, 127327us and 106122us in version 1 and 32us, 34us, and 44us in version 2. The optimization setting is also Optimizations (Favor Speed) (/Ox).

Origan answered 25/2, 2020 at 2:6 Comment(13)

So you are asking why the compiler doesn't eliminate all of the copying in A::get_data? – Soilure 25/2, 2020 at 2:24

@Soilure Thank you for your prompt reply. I am curious about why the compiler doesn't estimate iterating of all of the copying in A::get_data is needed or not. – Origan 25/2, 2020 at 2:32

You've tripped over something interesting here: std::chrono::time_point<std::chrono::steady_clock> start, stop; defines with steady_clock but this->start = high_resolution_clock::now(); assigns high_resolution_clock. Looks like high_resolution_clock is steady_clock in the MSVC Standard library. – Soilure 25/2, 2020 at 2:33

I'm not the man to answer that one, it seems. The only sort-of observable behaviour I see is the memory consumption, but clearly the compiler's not making the leap to caching. – Soilure 25/2, 2020 at 2:39

@Soilure In the run-time performance measure part, I refer to geeksforgeeks.org/measure-execution-time-function-cpp about using chrono library. The type of provided example is using auto keyword. The IntelliSense in Visual Studio shows the type is steady_clock. – Origan 25/2, 2020 at 2:40

I knew the standard library could play games with exactly what backed the high resolution clock, but I'd never seen so obvious a demonstration so I'd never given it much thought. Then I tried to compile the code in GCC and the compiler started howling over type mis-matches. – Soilure 25/2, 2020 at 2:45

Sidenote: You have to be a bit more careful with the cut and pasting. The code is a bit confused. Sometimes it's using fully-qualified names and sometimes it's leaving out the namespaces. Makes for a somewhat odd-looking read. – Soilure 25/2, 2020 at 2:47

GCC and Clang don't seem to pick up on this at O2 either. – Soilure 25/2, 2020 at 2:48

@Soilure Thank you for notifying. I have modified the code to fully-qualified names. – Origan 25/2, 2020 at 2:59

Why would you expect the compiler to notice that your (non-const!) method happens to have no side effects so that the allocations can all be merged? Is there some simpler case where such an optimization does happen? – Allspice 27/2, 2020 at 5:35

@DavisHerring Thanks for your response. As your consideration, it may be not appropriate to generate cache automatically in "non-const" condition. I also tried the "const" case and update the question. – Origan 27/2, 2020 at 10:30

I wonder how my code compares: compiler explorer 🤔 (edited: no need for std::move) – Malapropos 4/10, 2024 at 17:29

What is the actual use-case? The most obvious code for this would be to remove get_data and have an A::operator[] that returns the i'th value. That will work the same as version 1 - even if you modify A inside the loop. If you want to modify A in the loop, and use the "cached" value then something like version 2 is necessary, but call it deep_copy - not get_data. – Delinquency 7/10, 2024 at 10:2

The "hidden" issue is that your codes of version 1 and version 2 are not equivalent and you should not be surprised of their different running times.

        //  version 1
        for (int loop_number = 0; loop_number < input_object.get_size(); loop_number++) {
            this->data[loop_number] = transform_from_A(input_object.get_data()[loop_number]);
        }
        
        //  version 2   <=  tremendously faster than version 1
        auto data_cached = input_object.get_data();
        for (int loop_number = 0; loop_number < input_object.get_size(); loop_number++) {
            this->data[loop_number] = transform_from_A(data_cached[loop_number]);
        }

Your function input_object.get_data() is being called in a loop for many times (10K in this your example) in version 1, while in version 2 it's called only once.

EDIT: To clarify, this your input_object.get_data() function is doing some memory manipulation, i.e. a deep-copy of your data object (be careful that it's the one from your base-class that is called), and not just returning an object in the memory. So naming its result as a data_cached adds more ambiguity and masks a bug.

Jenifer answered 4/10, 2024 at 17:15 Comment(0)

First, small tip: write a benchmark instead of using std::chrono, e.g.:

#include <memory>

#include <benchmark/benchmark.h>

class A {
public:
  A(int input_size, int input_value) //  constructor
  {
    this->data = std::make_unique<int[]>(input_size);
    this->size = input_size;
    for (int loop_number = 0; loop_number < size; loop_number++) {
      data[loop_number] = input_value;
    }
  }
  std::unique_ptr<int[]> get_data() {
    //  deep copy
    auto return_data = std::make_unique<int[]>(size);
    for (int loop_number = 0; loop_number < size; loop_number++) {
      return_data[loop_number] = data[loop_number];
    }
    return data;
  }
  int get_size() { return this->size; }

private:
  int size;
  std::unique_ptr<int[]> data;
};

class B1 {
public:
  B1(A &input_object) //  constructor
  {
    this->size = input_object.get_size();
    this->data = std::make_unique<int[]>(this->size);
    //  version 1
    for (int loop_number = 0; loop_number < input_object.get_size();
         loop_number++) {
      this->data[loop_number] =
          transform_from_A(input_object.get_data()[loop_number]);
    }
  }

private:
  int size;
  std::unique_ptr<int[]> data;
  int transform_from_A(int input_value) {
    return input_value + 1; // For example
  }
};

class B2 {
public:
  B2(A &input_object) //  constructor
  {
    this->size = input_object.get_size();
    this->data = std::make_unique<int[]>(this->size);
    //  version 1
    auto data_cached = input_object.get_data();
    for (int loop_number = 0; loop_number < input_object.get_size();
         loop_number++) {
      this->data[loop_number] = transform_from_A(data_cached[loop_number]);
    }
  }

private:
  int size;
  std::unique_ptr<int[]> data;
  int transform_from_A(int input_value) {
    return input_value + 1; // For example
  }
};

static void constructB1(benchmark::State &state) {
  A a_object(10000, 6);
  for (auto _ : state) {
    B1 b_object(a_object);
    benchmark::DoNotOptimize(b_object);
  }
}

BENCHMARK(constructB1);

static void constructB2(benchmark::State &state) {
  A a_object(10000, 6);
  for (auto _ : state) {
    B2 b_object(a_object);
    benchmark::DoNotOptimize(b_object);
  }
}

BENCHMARK(constructB2);

Then of course version B1 is always an order of magnitude slower than B2, because it has O(N^2) complexity compared to B2's linear. There is nothing that compiler can optimize out here, because you are explicitly saying that you need a copy of underlying data. The returned copy can be modified (in your particular case - destroyed), and also the original data could change between the calls - so there is no link between data stored in class A and your copy.

The underlying problem seems to be that you are trying to "share" a unique ownership. If you need to "copy" your unique_ptr and give it away, then indeed the deep-copy of the memory is needed, however if you only need to share "a view", you can simply return a const int *:

  const int* get_data_view() {
    return data.get();
  }

and change your constructors accordingly:

// v1
transform_from_A(input_object.get_data_view()[loop_number]);
// v2
auto data_cached = input_object.get_data_view();

you will then get the same timing results for both versions.

If you can afford to change your class A member to std::shared_ptr<int[]> data; (as suggested by @Hans Olson) then it could be simplified further to

    std::shared_ptr<int[]> getData() {
        return data;
    }
    // if you want to have both const and non-const versions
    std::shared_ptr<const int[]> getData() const {
        return data;
    }

With unique_ptr this is not possible, because it's non-copyable, and const ref tricks like

    const std::unique_ptr<const int[]>& getData() {
        return reinterpret_cast<const std::unique_ptr<const int[]> &>(data);
    }

are violating type aliasing rules.

Plywood answered 9/10, 2024 at 9:18 Comment(3)

For the point about "a view" you could also use std::shared_ptr instead of std::unique_ptr. (and don't change the contents after constructor). – Delinquency 9/10, 2024 at 13:32

You mean to change the type of class member std::unique_ptr<int[]> data; to std::shared_ptr<int[]> data; ? Yes, if it's acceptable from the design POV. Because to return shared_ptr constructed from std::unique_ptr<int[]> data; you would need to std::move(data) which I believe is not desirable. – Plywood 9/10, 2024 at 13:45

Yes, and possibly change to std::shared_ptr<const int[]> data to enforce that it isn't changed. – Delinquency 9/10, 2024 at 14:8

I don't have enough reputation yet to add comments, otherwise I would have added this in the comments section.

I guess what @Davis Herring means with "(non-const!) method" is not the const on your returned data, but the const on the method itself.

The const on the returned data only indicates that the caller is not allowed to modify the returned data, thus the owner is safe to keep its data.
The non-const method still allows to change the object during the call which could affect the returned result. For example iterate through a list and return a different pointer at every call.

The const on the method indicates that the called object cannot change. When the object does not change between similar calls, then the compiler knows that calling the same method again will produce the same response, thus the duplicate calls can be optimized out.

So try to use

std::unique_ptr<const int[]> get_data() const

const std::unique_ptr<const int[]> get_data() const

instead of

std::unique_ptr<const int[]> get_data()

const std::unique_ptr<const int[]> get_data()

Also mark transform_from_A as const or static to indicate it does not modify your B instance.

I guess smart compilers may be able to detect which non-const methods are actually const and can optimize that on their own but you have a better chance when you tell what your code does (and doesn't).

So far with the theory how the compiler could read your code and how it may optimize it.
I have no idea how your compiler actually optimizes your code, it could do less than expected or it could do more.

If you are a bit familiar with assembler, I recommend to compare the generated assembler code. It may give you a better hint which code is optimized in a better way than just comparing the execution times.

Tolland answered 7/10, 2024 at 16:7 Comment(0)

Hot tags

Godot Unity Godot Help Programming Godot 4.X GUI GDScript 3D 2D Physics CSharp Godot 3.X VR XR Projects C++

Recommended topics

Hot tags