Measuring the bandwidth of a multi-threaded application
What is the easiest and most effective way to measure the bandwidth that my application (multi-threaded, written with OpenMP) is using? I ran STREAM to get the maximum sustainable bandwidth, and I'd now like to know whether or not I am saturating the whole available bandwidth.

I found a couple of related questions (e.g. Main memory bandwidth measurement), but I could not find the answer to this question.

Sadly, I cannot use VTune, but I can use PAPI counters.

My primary goal is to find out if the poor scalability of my application is connected to the saturation of the memory bandwidth.

Thanks

Rockwell answered 25/8, 2014 at 22:59 Comment(0)
There are a number of ways of getting (from the command line) the bandwidth over the whole application, but it sounds like there are a number of kernels you'd like to look at individually. In that case, wrapping parts of your code with PAPI calls is a perfectly sensible way to go.

You can use PAPI event counters on your system (papi_avail) to find the total number of load/store instructions, and if you know the sizes of your loads/stores you can get the memory bandwidth. Alternatively, you can count misses in your caches and multiply by the line sizes to infer the actual amount of data transferred across the system. There is documentation in various places on the PAPI wiki, e.g. here for the high-level interface, and here are some useful formulas for derived quantities.
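For instance, a minimal sketch of the second approach (miss count times line size), assuming a 64-byte cache line, might look like the function below; the function name is mine, and you should substitute your CPU's actual line size and last-level miss counter (e.g. PAPI_L2_TCM or PAPI_L3_TCM):

/* Sketch: derive bandwidth from last-level cache misses.
 * Assumes a 64-byte cache line; the line size and the counter
 * you use depend on your hardware. */
double derived_bandwidth_MBps(long long llc_misses, double seconds) {
    const double line_size = 64.0;   /* bytes per cache line (assumption) */
    return (double)llc_misses * line_size / (seconds * 1024.0 * 1024.0);
}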

Here's a simple coded-up example, doing a matrix-vector multiplication the sensible way and the cache-unfriendly transposed way. Note that calling PAPI_read_counters resets the counters, which is what we want here.

#include <stdio.h>
#include <stdlib.h>
typedef char * caddr_t;  /* workaround: some papi.h versions expect caddr_t to be defined */
#include <papi.h>
#include <sys/time.h>

int init(float ***a, float **x, float **y, int size);
void report_results(char *tname, long_long *values, const int n, double wtime);
void sensible_matvec(float **a, float *x, float *y, int size);
void wrong_order_matvec(float **a, float *x, float *y, int size);
void tick(struct timeval *t);
double tock(struct timeval *t);

#define NUM_EVENTS 3
int main(int argc, char **argv) {
    const int matsize = 4096;

    float **a, *x, *y;
    init(&a, &x, &y, matsize);

    int events[NUM_EVENTS] = {PAPI_L1_DCM, PAPI_LST_INS, PAPI_FP_INS};
    long_long values[NUM_EVENTS];

    double walltime;
    struct timeval t;

    if (PAPI_start_counters(events, NUM_EVENTS) != PAPI_OK) {
       fprintf(stderr, "Error starting PAPI counters; aborting\n");
       exit(1);
    }

    tick(&t);
    sensible_matvec(a, x, y, matsize);
    PAPI_read_counters(values, NUM_EVENTS);
    walltime = tock(&t);

    report_results("Sensible", values, NUM_EVENTS, walltime);

    tick(&t);
    wrong_order_matvec(a, x, y, matsize);
    PAPI_stop_counters(values, NUM_EVENTS);
    walltime = tock(&t);

    report_results("Wrong order", values, NUM_EVENTS, walltime);

    return 0;
}

void report_results(char *tname, long_long *values, const int n, double wtime) {
    long_long total_mem = values[1];
    long_long total_flops = values[2];
    long_long l1misses = values[0];
    printf("Test %s: time elapsed = %f, memory accesses = %lld, flop = %lld\n",
            tname, wtime, total_mem, total_flops);
    printf("\tMemory bandwidth (MB/sec) = %f\n", 1.0*total_mem*sizeof(float)/(wtime*1024*1024));
    printf("\tL1 cache miss rate = %f\n", 1.0*l1misses/total_mem);
    printf("\tMFLOPS = %lf\n\n", 1.0*total_flops/(wtime*1024*1024));
}

int alloc2d(float ***a, int n);
int free2d(float ***a, int n);
int alloc1d(float **x, int n);
int free1d(float **x, int n);

int init(float ***a, float **x, float **y, int size) {
    if (alloc2d(a,size))
        return -2;

    if (alloc1d(x,size)) {
        free2d(a,size);
        return -2;
    }

    if (alloc1d(y,size)) {
        free2d(a,size);
        free1d(x,size);
        return -3;
    }

    for (int i=0; i<size; i++) {
        (*x)[i] = (float)i;
        (*y)[i] = 0.;
    }

    for (int i=0; i<size; i++) {
        for (int j=0; j<size; j++) {
            (*a)[i][j] = i;
        }
    }

    return 0;
}
void sensible_matvec(float **a, float *x, float *y, int size) {
    for (int i=0; i<size; i++) {
        for (int j=0; j<size; j++) {
            y[i] += a[i][j]*x[j];
        }
    }
}

void wrong_order_matvec(float **a, float *x, float *y, int size) {
    for (int j=0; j<size; j++) {
        for (int i=0; i<size; i++) {
            y[i] += a[i][j]*x[j];
        }
    }
}

void tick(struct timeval *t) {
    gettimeofday(t, NULL);
}


double tock(struct timeval *t) {
    struct timeval now;
    gettimeofday(&now, NULL);
    return (double)(now.tv_sec - t->tv_sec) + ((double)(now.tv_usec - t->tv_usec)/1000000.);

}


void freeall(float ***a, float **x, float **y, int size) {
    free2d(a, size);
    free1d(x, size);
    free1d(y, size);
    return;
}

int alloc2d(float ***a, int n) {
    float *data = (float *)malloc(n*n*sizeof(float));
    if (data == NULL) return -1;

    *a = (float **)malloc(n*sizeof(float *));
    if (*a == NULL) { free(data); return -1; }

    for (int i=0; i<n; i++)
        (*a)[i] = &(data[i*n]);

    return 0;
}
int free2d(float ***a, int n) {
    free (&((*a)[0][0]));
    free(*a);

    return 0;
}


int alloc1d(float **a, int n) {
    *a = (float *)malloc(n*sizeof(float));
    if (*a == NULL) return -1;

    return 0;
}

int free1d(float **a, int n) {
    free(*a);

    return 0;
}

Compiling and running gives:

$ gcc -o papi-test papi-test.c -I${PAPI_INC_DIR} -L${PAPI_LIB_DIR} -lpapi -Wall -std=c99
$ ./papi-test
Test Sensible: time elapsed = 0.121877, memory accesses = 302020775, flop = 33580481
    Memory bandwidth (MB/sec) = 9453.119330
    L1 cache miss rate = 0.003921
    MFLOPS = 262.763624

Test Wrong order: time elapsed = 0.537639, memory accesses = 302026751, flop = 39629352
    Memory bandwidth (MB/sec) = 2142.963254
    L1 cache miss rate = 0.094045
    MFLOPS = 70.295301
Factitious answered 26/8, 2014 at 15:6 Comment(17)
What are the commands to get the bandwidth used by the whole application? I'd also be interested in that. – Rockwell
According to the site you mentioned, memory bandwidth is computed as follows: ((PAPI_Lx_TCM * Lx_linesize) / PAPI_TOT_CYC) * Clock(MHz), where Lx is the last-level cache. We've actually been using a similar formula: (PAPI_Lx_TCM * Lx_linesize) / Time(s). However, we do not think this covers things such as prefetching and cache writeback, which also contribute to used bandwidth (source) – Incus
@CristianoSousa - yes, it depends on what you want to measure. Final-level cache traffic gives you a better measure of the actual traffic over the memory subsystem, which is maybe what you care about, but it also gives "credit" for pulling over data which you don't actually use, which might not be what you want, depending on why you're doing this measurement. And prefetching is a whole other set of measurement challenges. – Factitious
@Rockwell - if you just want some of the same counter data aggregated over the entire application, you can use perf stat (may need root). – Factitious
+1 because your answer is better than mine in terms of answering the OP's question. But could you tell me how knowing the bandwidth by itself is useful? Your example of matrix*vector is a good one because it's one where it's easy to write on an envelope the number of floating-point operations (2.0*n^2) and memory reads (O(n^2)), and it's memory bound. But if one only knew the throughput, how useful would it be? I guess if it's larger than the max main-memory throughput I would know it's not totally memory bound. Wouldn't the rate of CPU stalls be a much more useful metric? – Impressionism
@Zboson - Is counting stalls more useful? Probably, usually (and there's a PAPI interface to that counter - PAPI_MEM_SCY), but it all depends on the question that someone's trying to answer. If the question is "are there tuning opportunities", stalls are a pretty good thing to target, as are cache misses at various levels; but if the question is the yes/no question "Have I reached the limits of this hardware", measuring total bandwidth is a pretty reasonable place to look, too. – Factitious
@Zboson the main reason we want to measure used memory bandwidth is to see if we are saturating the available bandwidth, in order to prove that increasing the number of threads won't increase performance (source). – Incus
@JonathanDursi Measuring PAPI_MEM_SCY would be an interesting approach, but the only stall counter available on our system is PAPI_STL_ICY. – Incus
@JonathanDursi Do you think that perf stat's output, combined with STREAM's, will be enough for making my point? – Rockwell
@Rockwell - probably not, because the numbers you get from perf stat will be over the entire program run - including a lot of things (initialization, etc.) which aren't running at maximum bandwidth, bringing the averages down. You'll probably need to wrap function calls just around your kernels. – Factitious
@JonathanDursi regarding pulling data I don't actually use: I suppose you are referring to unused data from the retrieved cache lines? IMO this must be taken into consideration, as it contributes to used memory bandwidth. Furthermore, if the CPU has a write-through cache, we must take into account writes to any cache level. It seems to me that a counter that measures any activity on the bus would be helpful. VTune seems to offer such a counter. – Incus
@CristianoSousa - As with Zboson's question in this thread, it all depends on what the underlying question is you're trying to find an answer to. For this particular OP's question, which is a somewhat unusual use case, yes, you want to include absolutely everything that goes over the wire for whatever reason. But that's not the typical memory-performance profiling use case. – Factitious
@CristianoSousa - and even in this case, it might still not be what one wants. If one is in fact completely saturating the memory bandwidth, but it turns out one was only using 1/4 of the data in each cache line with each request, so that with better placement of data one could make more effective use of memory requests, that would be a very important piece of data to have; indeed, showing that one was saturating the bandwidth to the LLC would be actively misleading. Which is to say, it all depends. – Factitious
@JonathanDursi, regarding your last comment: that is "impossible", since this is an irregular algorithm... On top of that, I think that regardless of whether we use the whole cache line or not, we need to measure that for real bandwidth-usage values. – Rockwell
@JonathanDursi Jonathan, it seems to me that the bottom line is that, as we can't accurately measure writebacks and prefetching, there is no way to measure bandwidth usage with PAPI counters. Am I right? – Rockwell
Sure you can; PAPI has (for instance) PAPI_PRF_DM (data prefetch cache miss), and there's the perf tool, which has events for LLC-prefetches and a whole suite of writeback: events. You can even use perf annotate to ascribe those to lines of code, so that you can calculate per-routine bandwidths if you also measure the times of those routines. Note you'll still need IC + TLB data, etc. But I still have grave doubts whether that's going to get you what you want; because you can prove you're saturating the bandwidth, but you'll still have to prove you're making optimum use of that bandwidth. – Factitious
@JonathanDursi proving that I am saturating bandwidth is all I care about at this point; making a scientific proof that I am making optimal use of the bandwidth seems to me, by all means, impossible (and that's not our concern at the moment, we "simply" want to prove that the algorithm doesn't scale because of BW saturation). – Rockwell

To measure the bandwidth of your application you need to know how much memory is being read and/or written; let's call that the numerator. You also need to know how much time it takes to read and/or write it; let's call that the denominator. The bandwidth is then the numerator divided by the denominator.
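For instance, for a simple copy of n floats (each element read once and written once), the arithmetic looks like this sketch (the function name is mine):

#include <stddef.h>   /* size_t */

/* Sketch: bandwidth of a simple copy of n floats.
 * A copy reads n*sizeof(float) bytes and writes the same amount,
 * so the numerator is 2*n*sizeof(float) bytes. */
double copy_bandwidth_GBps(size_t n, double seconds) {
    double bytes = 2.0 * (double)n * sizeof(float);   /* reads + writes */
    return bytes / seconds / 1.0e9;
}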

If your application is complicated, then it might not be so easy to calculate how much memory is being read and/or written. Additionally, if your application is doing many other operations, it might not be easy to calculate the time; you would have to subtract off the time of the other operations. Therefore, when measuring maximum throughput, simple algorithms are usually used.

If you want to pick a simple algorithm to compare with your application, then you should see if your application only writes data, only reads data, or both reads and writes.

If you're only writing data you can use a write (memset) test:

#pragma omp parallel for
for(int i=0; i<n; i++) {
    x[i] = k;
}

If you're both reading and writing data you can do a simple copy (memcpy) test:

#pragma omp parallel for
for(int i=0; i<n; i++) {
    y[i] = x[i];
}

In fact if you look at the STREAM source code that's basically what it does for the copy test.

If you're only reading data you can do a reduction like this (make sure to compile with -ffast-math if you want this to vectorize):

float sum = 0.0f;
#pragma omp parallel for reduction(+:sum)
for(int i=0; i<n; i++) {
    sum += x[i]*y[i];
}
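For reference, a plausible compile line for these snippets (assuming GCC; the file name is hypothetical) is:

$ gcc -O3 -fopenmp -ffast-math -o bw-test bw-test.c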

STREAM's tests are all read-and-write tests. I have written my own bandwidth tool which does writes only, reads and writes, and reads only.

Unfortunately, the tests which write data won't get close to the peak bandwidth. The reason is that in order to write data, the CPU first has to read the cache line being written into the cache. This is the reason that STREAM does not get anywhere close to the peak bandwidth on my system. In order to get the peak bandwidth when doing writes, you need to do non-temporal stores, which write the data without first reading it into the cache.

For example, with SSE, and assuming that x and y are 16-byte-aligned float arrays, you could do the read-and-write test like this:

#include <immintrin.h>   /* SSE intrinsics; assumes n is a multiple of 4 */

#pragma omp parallel for
for(int i=0; i<n/4; i++) {
    __m128 v = _mm_load_ps(&x[4*i]);   /* aligned 4-float load */
    _mm_stream_ps(&y[4*i], v);         /* non-temporal store: bypasses the cache */
}

If you look at Agner Fog's asmlib you will see this is exactly what he does for memset and memcpy for large arrays. In fact, his asmlib and the example I just gave both get 85% (45 GB/s out of 51 GB/s) of the bandwidth on my system, whereas the STREAM tests only get about 45% of the bandwidth.

These tests assume that your algorithm is memory bound and, for the comparison, that you read an array much larger than the largest (last-level) cache. If your algorithm reuses data that's still in the cache, then the read test won't get close to the peak bandwidth either, because of the loop-carried dependency in the reduction. To fix that you have to unroll 3-10 times, depending on the operation and the hardware (see the sketch below). Also, if you're writing to arrays which fit in the cache and which you will reuse, then you don't want to do non-temporal stores. That's why Agner Fog's asmlib only uses non-temporal stores for large arrays.
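Here's a minimal sketch of that unrolling for the read-only reduction, with four independent accumulators (my own illustration; it assumes n is a multiple of 4):

/* Break the loop-carried dependency in the reduction by keeping
 * four independent partial sums (assumes n % 4 == 0). */
float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
for (int i = 0; i < n; i += 4) {
    s0 += x[i+0]*y[i+0];
    s1 += x[i+1]*y[i+1];
    s2 += x[i+2]*y[i+2];
    s3 += x[i+3]*y[i+3];
}
float sum = (s0 + s1) + (s2 + s3);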

Impressionism answered 26/8, 2014 at 8:24 Comment(9)
I am sorry, I don't see how to generalize your suggestion to the point that I can measure the bandwidth usage of my application. My application is very complex, with many parallel zones and kernels. Data reuse happens in some kernels. My hunch is that my application is memory bound, but even that claim I can't back up with a scientific method, since I don't know any. – Rockwell
In that case you have to isolate regions of your code and determine, e.g., how many floating-point operations you are doing, and compare to the peak GFLOPS or bandwidth. That's how I would do it. That's easy to do for something like matrix multiplication or image convolutions. It should be possible for something like ray tracing as well, which is more complicated. – Impressionism
I only use integer operations; is there any such thing as GIOPS? And BTW, is there any method to determine if an application is memory bound, or do we guide ourselves by intuition? Thanks – Rockwell
I don't know of terms for integers, but the concepts are the same. To answer whether you're memory bound or not requires that you know how many operations you're trying to calculate. I can write code that reads, writes, multiplies, and adds n floats (e.g. n=4 for Core 2) every clock cycle, so that it totally saturates the memory throughput but is not memory bound. Maybe you can find an upper/lower bound on the number of integer operations your algorithm does and use that? – Impressionism
@a3mlord, BTW, IOPS refers to input/output operations, not integers. – Impressionism
@Rockwell maybe there is a profiling tool which can tell you how much time your application spends doing reads/writes compared to calculating integers. That could be a way to determine if you're memory bound. – Impressionism
Can you specify your suggestions a bit further? What profiling tool, and how do I analyze this data? Now, if I am not using that, let's assume that I can give an upper bound on the number of operations that I compute. Let's also assume that it is 2^n, where n is an input parameter of the application. From this, how can I get to the answer? I still don't know how many reads from memory I have per second, right? – Rockwell
No, somebody else will have to answer your question as to a profiling tool. I don't have enough knowledge about that. However, if you know the upper bound on the number of calculations, just divide it by the time it takes to calculate them. Then divide by the peak integer operations per second to get the efficiency. – Impressionism
Thanks Z boson, although that's not the answer that I was looking for. I am looking for a practical way to get there (which includes specifying the profiling tool and what to do to get there). – Rockwell