Why this performance deterioration?
Asked Answered
P

2

7

I need to reduce the memory used by my native Windows C++ application, without compromising its performances.

My main data structure is composed by several thousands of instances, dynamically allocated, of the following Line class:

struct Properties
{
    // sizeof(Properties) == 28
};


// Version 1
class Line
{
    virtual void parse(xml_node* node, const Data& data)
    {
        parse_internal(node, data);
        create();
    }

    virtual void parse_internal(xml_node*, const Data&);
    void create();

    Properties p;
};

But since I notice that I could get rid of the class member p, because I only need it within the parse method, I changed the Line implementation:

// Version 2
class Line
{
    virtual void parse(xml_node* node, const Data& data)
    {
        Properties p;

        parse_internal(node, data, &p);
        create(&p);
    }

    virtual void parse_internal(xml_node*, const Data&, Properties*);
    void create(Properties*);
};

This reduced the memory allocated of several megabytes, but it increased the elapsed time by more than 50 milliseconds.

I wonder how is this possible considering that the application has been compiled for release version with speed optimization fully on. Is it due to the argument passing? Is it due to the stack allocation of my struct Properties?

Update:

The method Line::parse is called just once for each instance. The data structure is composed by a std::vector of Lines. Multiple threads manage a different subset of this vector.

Pameliapamelina answered 21/10, 2016 at 7:12 Comment(26)
Have you tried making it a unique_ptr in your parse method? Is your parse method called multiple times for the same Line?Stomatitis
is your application multithreaded? If no, you could use a static class member in that case.Threat
@Stomatitis Do you mean make Properties an unique_ptr? If so, why? Parse is called just once, but base class methods are called too.Pameliapamelina
@Jean-FrançoisFabre My application is multithreaded.Pameliapamelina
High performance or low resource (memory) usage, pick one. That's the usual deal. You often can not get both.Malvin
@Someprogrammerdude But the struct Properties is allocaded just once in both cases. The first version because of the new Line(), the second version on the stack within the parse method.Pameliapamelina
@Pameliapamelina you yourself posited that maybe it was the stack usage. Though it is doubtful, try allocating it on the free store within the method.Stomatitis
@Pameliapamelina would it be acceptable to protect your properties single object by a semaphore? or more elaborate: make a list of properties object (not a lot) and allocate one slot for each thread requiring it, semaphore protected and blocking only if no slots available.Threat
@Jean-FrançoisFabre what I could try is to use one struct per thread.Pameliapamelina
Worth updating your question to include the apparent fact that parse is only invoked once per Line instance. If that is not the case, that information is even more relevant.Anole
@Pameliapamelina the important fact to keep in mind is that you don't want the "allocation" process to take more time than allocation of the object in the stack.Threat
@Ap31 No, what I have are 'nested' calls of parse_internal, because the base class method is invoked too. But that leads to more argument passing and not to stack allocation.Pameliapamelina
@Pameliapamelina thanks for clarifying the caller behavior of parse.Anole
@Pameliapamelina maybe Properties are similiar for most of Lines, and you could create a lookup table with Properties objects? Then just come up with a smart way of identifying the actual Properties the Line needs (e.g. let Line hold a PropertiesId being a hash?)Waddington
@Waddington But than I would need to synchronize this lookup table with semaphores.Pameliapamelina
I think what you are doing is correct I have no idea why this would be any slower. I would probably pass the Properties by const& rather than pointer but that would not make it run faster. Can you provide a minimal program that reproduces this behavior(so we can really see how this gets called/instantiated)?Brutus
Also in verison 2 does parse() need to be virtual? Probably won't affect speed though.Brutus
I guess it will be hard to see what makes it slower without actually looking at the assembly. Do you allocate all lines at the same time or each over time? (maybe the "construction" of Properties is still in the i-cache while you construct the lines and at parse it's not anymore so you have to fetch that code again)Photina
Several thousand objects of size 28? That's about 64Kb of data, then. It won't even use RAM - that sort of data fits in cache.Revis
Does Line contain any other data members? What is sizeof(Line)? What objects does parse modify? It is possible that, since you mention multi-threading, the removal of 28 bytes from the Line structure is causing false-sharing in the access of different Line members from multiple threads.Kershner
@Kershner sizeof(Line) == 136 including the 28 of Properties.Pameliapamelina
You state that the run time increased by more than 50ms. But you didn't indicate the time before the change: was it 10ms, 50ms, 1000ms, or more? I guess the vector doesn't have several thousand, but 100,000 elements or more (several mega-bytes divided by 28) - which means we are discussing less than 500 ns per parse-call.Suave
@Kershner To test the idea of multi-threaded false-sharing the simplest idea would be to add Properties to Line in the modified variant - but not use it. If that restores performance it was false-sharing (possibly because the multi-threaded split of the data was adapted to the size of the objects).Suave
@HansOlsson Yes, OP should try that.Kershner
@HansOlsson Yes I was thinking to do the same test, and yes, the number of elements are actually around 140k in my example.Pameliapamelina
And similarly for testing the idea that the new interface to (and/or implementation of) the recursive function parse_internal is causing the slowdown the simple test is to use the new parse_internal with the old memory layout.Suave
S
0

You write that parse_internal is recursive. That means it gets 3 arguments in the changed variant, instead of 2 in the original - and is called recursively a few times.

You also have to access members using pointer syntax instead of element de-referencing (and possibly verify that the Properties pointer is non-null). To eliminate the pointer issues you can use a reference argument to parse_internal.

Is there a reason to have parse_internal as a virtual member function, or could you change it to be static (in the modified variant)?

Suave answered 21/10, 2016 at 14:24 Comment(0)
R
0

In your first version of the implementation, the Properties p; object is a global variable inside your Line class, which means that it is created when you create each and every time that your Line objects are created (140k as you've mentioned in comments), and destructed/de-allocated only when those objects are being destructed themselves.

In your second version of the implementation, the Properties p; object is a local variable inside your parse() function, which means that it is getting initialized upon the call to that function, and also gets de-allocated at the end of that function.

So my guess is that with this your Version 2 when you are measuring the overall performance, it includes that de-allocation time while in the measurement during your Version 1 it's not counted.

Of course you can avoid that destruction of the Properties p; object inside your parse() function by declaring it as static, but in that case your performance should be somewhat very close to your first measurement.

Instead, in order to improve the overall performance, I'd suggest you to create that Properties p; object only once, back inside the Line class as a member-variable, but inside your parse() function to change/reassign its content (i.e. those 28 bytes) appropriately. That way you would avoid the costly object instantiation and de-allocation time.

By the way, I hope that your struct Properties with 28 bytes is already well-aligned.

EDIT: If your objective is not to improve the overall performance but to decrease the memory used by your application, the only thing you could do in my opinion is to go with your Version 2 but instead of creating that object Properties p; inside your parse() function, just pass the actual values it's going to contain. I.e. not to store them in an object but to directly read or generate them from wherever their source is (I suppose another object that you have omitted for the sake of simplicity of your original post).

Raquel answered 24/9 at 11:1 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.