Most efficient way to calculate lexicographic index
Asked Answered
S

4

9

Can anybody find any potentially more efficient algorithms for accomplishing the following task?:

For any given permutation of the integers 0 thru 7, return the index which describes the permutation lexicographically (indexed from 0, not 1).

For example,

  • The array 0 1 2 3 4 5 6 7 should return an index of 0.
  • The array 0 1 2 3 4 5 7 6 should return an index of 1.
  • The array 0 1 2 3 4 6 5 7 should return an index of 2.
  • The array 1 0 2 3 4 5 6 7 should return an index of 5039 (that's 7!-1 or factorial(7)-1).
  • The array 7 6 5 4 3 2 1 0 should return an index of 40319 (that's 8!-1). This is the maximum possible return value.

My current code looks like this:

int lexic_ix(int* A){
    int value = 0;
    for(int i=0 ; i<7 ; i++){
        int x = A[i];
        for(int j=0 ; j<i ; j++)
            if(A[j]<A[i]) x--;
        value += x*factorial(7-i);  // actual unrolled version doesn't have a function call
    }
    return value;
}

I'm wondering if there's any way I can reduce the number of operations by removing that inner loop, or if I can reduce conditional branching in any way (other than unrolling - my current code is actually an unrolled version of the above), or if there are any clever bitwise hacks or filthy C tricks to help.

I already tried replacing

if(A[j]<A[i]) x--;

with

x -= (A[j]<A[i]);

and I also tried

x = A[j]<A[i] ? x-1 : x;

Both replacements actually led to worse performance.

And before anyone says it - YES this is a huge performance bottleneck: currently about 61% of the program's runtime is spent in this function, and NO, I don't want to have a table of precomputed values.

Aside from those, any suggestions are welcome.

Scute answered 31/5, 2014 at 1:8 Comment(4)
If you're worried about performance, you also need to show the code for the function factorial().Milquetoast
I don't actually have a function called factorial. As mentioned, I have the loop unrolled entirely which allows me to write the literals in place.Scute
@Milquetoast No he doesn't. Factorial can be constant time with a lookup table, or template expansion, or he can just pre-calculate it before he runs and divide by N-i every iteration.Pressurecook
@Pressurecook He can do any of those things, but we don't know unless we see the function definition.Milquetoast
M
2

Don't know if this helps but here's an other solution :

int lexic_ix(int* A, int n){ //n = last index = number of digits - 1
    int value = 0;
    int x = 0;
    for(int i=0 ; i<n ; i++){
        int diff = (A[i] - x); //pb1
        if(diff > 0)
        {
            for(int j=0 ; j<i ; j++)//pb2
            {
                if(A[j]<A[i] && A[j] > x)
                {
                    if(A[j]==x+1)
                    {
                      x++;
                    }
                    diff--;
                }
            }
            value += diff;
        }
        else
        {
          x++;
        }
        value *= n - i;
    }
    return value;
}

I couldn't get rid of the inner loop, so complexity is o(n log(n)) in worst case, but o(n) in best case, versus your solution which is o(n log(n)) in all cases.

Alternatively, you can replace the inner loop by the following to remove some worst cases at the expense of another verification in the inner loop :

int j=0;
while(diff>1 && j<i)
{
  if(A[j]<A[i])
  {
    if(A[j]==x+1)
    {
      x++;
    }
    diff--;
  }
  j++;
}

Explanation :

(or rather "How I ended with that code", I think it is not that different from yours but it can make you have ideas, maybe) (for less confusion I used characters instead and digit and only four characters)

abcd 0  = ((0 * 3 + 0) * 2 + 0) * 1 + 0
abdc 1  = ((0 * 3 + 0) * 2 + 1) * 1 + 0
acbd 2  = ((0 * 3 + 1) * 2 + 0) * 1 + 0
acdb 3  = ((0 * 3 + 1) * 2 + 1) * 1 + 0
adbc 4  = ((0 * 3 + 2) * 2 + 0) * 1 + 0
adcb 5  = ((0 * 3 + 2) * 2 + 1) * 1 + 0 //pb1
bacd 6  = ((1 * 3 + 0) * 2 + 0) * 1 + 0
badc 7  = ((1 * 3 + 0) * 2 + 1) * 1 + 0
bcad 8  = ((1 * 3 + 1) * 2 + 0) * 1 + 0 //First reflexion
bcda 9  = ((1 * 3 + 1) * 2 + 1) * 1 + 0
bdac 10 = ((1 * 3 + 2) * 2 + 0) * 1 + 0
bdca 11 = ((1 * 3 + 2) * 2 + 1) * 1 + 0
cabd 12 = ((2 * 3 + 0) * 2 + 0) * 1 + 0
cadb 13 = ((2 * 3 + 0) * 2 + 1) * 1 + 0
cbad 14 = ((2 * 3 + 1) * 2 + 0) * 1 + 0
cbda 15 = ((2 * 3 + 1) * 2 + 1) * 1 + 0 //pb2
cdab 16 = ((2 * 3 + 2) * 2 + 0) * 1 + 0
cdba 17 = ((2 * 3 + 2) * 2 + 1) * 1 + 0
[...]
dcba 23 = ((3 * 3 + 2) * 2 + 1) * 1 + 0

First "reflexion" :

An entropy point of view. abcd have the fewest "entropy". If a character is in a place it "shouldn't" be, it creates entropy, and the earlier the entropy is the greatest it becomes.

For bcad for example, lexicographic index is 8 = ((1 * 3 + 1) * 2 + 0) * 1 + 0 and can be calculated that way :

value = 0;
value += max(b - a, 0); // = 1; (a "should be" in the first place [to create the less possible entropy] but instead it is b)
value *= 3 - 0; //last index - current index
value += max(c - b, 0); // = 1; (b "should be" in the second place but instead it is c)
value *= 3 - 1;
value += max(a - c, 0); // = 0; (a "should have been" put earlier, so it does not create entropy to put it there)
value *= 3 - 2;
value += max(d - d, 0); // = 0;

Note that the last operation will always do nothing, that's why "i

First problem (pb1) :

For adcb, for example, the first logic doesn't work (it leads to an lexicographic index of ((0* 3+ 2) * 2+ 0) * 1 = 4) because c-d = 0 but it creates entropy to put c before b. I added x because of that, it represents the first digit/character that isn't placed yet. With x, diff cannot be negative. For adcb, lexicographic index is 5 = ((0 * 3 + 2) * 2 + 1) * 1 + 0 and can be calculated that way :

value = 0; x=0;
diff = a - a; // = 0; (a is in the right place)
diff == 0 => x++; //x=b now and we don't modify value
value *= 3 - 0; //last index - current index
diff = d - b; // = 2; (b "should be" there (it's x) but instead it is d)
diff > 0 => value += diff; //we add diff to value and we don't modify x
diff = c - b; // = 1; (b "should be" there but instead it is c) This is where it differs from the first reflexion
diff > 0 => value += diff;
value *= 3 - 2;

Second problem (pb2) :

For cbda, for example, lexicographic index is 15 = ((2 * 3 + 1) * 2 + 1) * 1 + 0, but the first reflexion gives : ((2 * 3 + 0) * 2 + 1) * 1 + 0 = 13 and the solution to pb1 gives ((2 * 3 + 1) * 2 + 3) * 1 + 0 = 17. The solution to pb1 doesn't work because the two last characters to place are d and a, so d - a "means" 1 instead of 3. I had to count the characters placed before that comes before the character in place, but after x, so I had to add an inner loop.

Putting it all together :

I then realised that pb1 was just a particular case of pb2, and that if you remove x, and you simply take diff = A[i], we end up with the unnested version of your solution (with factorial calculated little by little, and my diff corresponding to your x).

So, basically, my "contribution" (I think) is to add a variable, x, which can avoid doing the inner loop when diff equals 0 or 1, at the expense of checking if you have to increment x and doing it if so.

I also checked if you have to increment x in the inner loop (if(A[j]==x+1)) because if you take for example badce, x will be b at the end because a comes after b, and you will enter the inner loop one more time, encountering c. If you check x in the inner loop, when you encounter d you have no choice but doing the inner loop, but x will update to c, and when you encounter c you will not enter the inner loop. You can remove this check without breaking the program

With the alternative version and the check in the inner loop it makes 4 different versions. The alternative one with the check is the one in which you enter the less the inner loop, so in terms of "theoretical complexity" it is the best, but in terms of performance/number of operations, I don't know.

Hope all of this helps (since the question is rather old, and I didn't read all the answers in details). If not, I still had fun doing it. Sorry for the long post. Also I'm new on Stack Overflow (as a member), and not a native speaker, so please be nice, and don't hesitate to let me know if I did something wrong.

Midcourse answered 7/1, 2016 at 13:14 Comment(0)
P
0

Linear traversal of memory already in cache really doesn't take much times at all. Don't worry about it. You won't be traversing enough distance before factorial() overflows.

Move the 8 out as a parameter.

int factorial ( int input )
{
    return input ? input * factorial (input - 1) : 1;
}

int lexic_ix ( int* arr, int N )
{
    int output = 0;
    int fact = factorial (N);
    for ( int i = 0; i < N - 1; i++ )
    {
        int order = arr [ i ];
        for ( int j = 0; j < i; j++ )
            order -= arr [ j ] < arr [ i ];
        output += order * (fact /= N - i);
    }
    return output;
}

int main()
{
    int arr [ ] = { 11, 10, 9, 8, 7 , 6 , 5 , 4 , 3 , 2 , 1 , 0 };

    const int length = 12;
    for ( int i = 0; i < length; ++i )
        std::cout << lexic_ix ( arr + i, length - i  ) << std::endl;
}
Pressurecook answered 31/5, 2014 at 2:0 Comment(4)
Thanks for the input. However, 1 is actually incorrect. 7-i is indeed the intended parameter. Try your code on the array [7 6 5 4 3 2 1 0]. I intend for this to yield 40319, but instead your code never yields a value above 5039 (if I'm not mistaken). Also your 3rd point has already been taken into account - I don't check the 8th element, I only check up until index 6 (the 7th element).Scute
Oops, i read your limit thing wrong. Anyway, my code did return 40319 for 8 element reverse sort.Pressurecook
You got 5039 because you used 7 instead of 8. N is supposed to designate number of elements. I'm gonna edit it to include a basic factorial implementation and some tests.Pressurecook
Oops, you're definitely right there - all outputs are correct. Unfortunately this runs @ about 1/3 the speed of my original implementation, even after unrolling and writing in runtime constants (ie N, factorial(N)).Scute
J
0

Say, for a M-digit sequence permutation, from your code, you can get the lexicographic SN formula which is something like: Am-1*(m-1)! + Am-2*(m-2)! + ... + A0*(0)! , where Aj range from 0 to j. You can calculate SN from A0*(0)!, then A1*(1)!, ..., then Am-1 * (m-1)!, and add these together(suppose your integer type does not overflow), so you do not need calculate factorials recursively and repeatedly. The SN number is a range from 0 to M!-1 (because Sum(n*n!, n in 0,1, ...n) = (n+1)!-1)

If you are not calculating factorials recursively, I cannot think of anything that could make any big improvement.

Sorry for posting the code a little bit late, I just did some research, and find this: http://swortham.blogspot.com.au/2011/10/how-much-faster-is-multiplication-than.html according to this author, integer multiplication can be 40 times faster than integer division. floating numbers are not so dramatic though, but here is pure integer.

int lexic_ix ( int arr[], int N )
{
    // if this function will be called repeatedly, consider pass in this pointer as parameter
    std::unique_ptr<int[]> coeff_arr = std::make_unique<int[]>(N);
    for ( int i = 0; i < N - 1; i++ )
    {
        int order = arr [ i ];
        for ( int j = 0; j < i; j++ )
            order -= arr [ j ] < arr [ i ];
        coeff_arr[i] = order; // save this into coeff_arr for later multiplication
    }
    // 
    // There are 2 points about the following code:
    // 1). most modern processors have built-in multiplier, \
    //    and multiplication is much faster than division
    // 2). In your code, you are only the maximum permutation serial number,
    //     if you put in a random sequence, say, when length is 10, you put in
    //     a random sequence, say, {3, 7, 2, 9, 0, 1, 5, 8, 4, 6}; if you look into
    //     the coeff_arr[] in debugger, you can see that coeff_arr[] is:
    //     {3, 6, 2, 6, 0, 0, 1, 2, 0, 0}, the last number will always be zero anyway.
    //     so, you will have good chance to reduce many multiplications.
    //     I did not do any performance profiling, you could have a go, and it will be
    //     much appreciated if you could give some feedback about the result.
    //
    long fac = 1;
    long sn = 0;
    for (int i = 1; i < N; ++i) // start from 1, because coeff_arr[N-1] is always 0 
    {
        fac *= i;
        if (coeff_arr[N - 1 - i])
            sn += coeff_arr[N - 1 - i] * fac;
    }
    return sn;
}

int main()
{
    int arr [ ] = { 3, 7, 2, 9, 0, 1, 5, 8, 4, 6 }; // try this and check coeff_arr

    const int length = 10;
    std::cout << lexic_ix(arr, length ) << std::endl;
    return 0;
}
Jurdi answered 31/5, 2014 at 2:58 Comment(6)
Okay so a million samples of your code runs @ ~0.086s. Mine runs at ~0.060s. It was a nice idea, but its essentially the same code - mine just computes each coefficient one by one, while yours precomputes all at first.Scute
Thank you for the feedback, that is interesting. In your code, you do multiplication and division for each coefficient, but in my code you have no division, and, because that many coefficients are zeroes so you can skip many multiplications. I would like to do a performance profiling myself sometime. By the way, could you tell me the method(better in code) you did the profiling?Jurdi
I did the profiling myself, the result is quite the contrary, I generated all possible permutations for 9 and 10 digits, the total number of permutations is 362,880 and 3,628,800 respectively. Mine code is more than 20% faster than your code. I guess that maybe you only used some particular number, maybe the maximum serial number which is N!-1, for this case, there's no extra zeros in the number, so it will not skip extra zeros. I will post another answer include the testing code and result.Jurdi
The difference may be that I did all of my testing with multiple runs of 1 million samples across a randomly shuffled set of 8 integers (as 8 integers is really what I need the performance boost for). I also tried yours both as it's written and unrolled with constants written inline. I'm using g++4.9.0 with the -O3 flag. It might also be important to note that my compiler doesn't recognize the "make_unique" call as it's not standard C++11, so I had to replace it with a plain int array (because I honestly don't know much about unique ptrs). I just retested and am still getting the same results.Scute
Also, I used the <chrono> library's high_resolution_clock::time_point type and the high_resolution_clock::now() function to get the current time. I tried sampling and summing the million timings and also taking a single timing across all 1 million sample runs, both came back with comparable results. My desktop is a 3.0 gHz Quad-Core with 8gb RAM & cache sizes of 32k, 256k, and 6144kScute
It might also be worth noting that the article you linked to is slightly outdated, plus it refers to floating pt arithmetic - much different than integer arithmetic. The performance difference is closer to 20% for int-multiply vs int-divide, depending on your processor.Scute
J
0

This is the whole profiling code, I only run the test in Linux, code was compiled using G++8.4, with '-std=c++11 -O3' compiler options. To be fair, I slightly rewrote your code, pre-calculate the N! and pass it into the function, but it seems this does not help much.

The performance profiling for N = 9 (362,880 permutations) is:

  • Time durations are: 34, 30, 25 milliseconds
  • Time durations are: 34, 30, 25 milliseconds
  • Time durations are: 33, 30, 25 milliseconds

The performance profiling for N=10 (3,628,800 permutations) is:

  • Time durations are: 345, 335, 275 milliseconds
  • Time durations are: 348, 334, 275 milliseconds
  • Time durations are: 345, 335, 275 milliseconds

The first number is your original function, the second is the function re-written that gets N! passed in, the last number is my result. The permutation generation function is very primitive and runs slowly, but as long as it generates all permutations as testing dataset, that is alright. By the way, these tests are run on a Quad-Core 3.1Ghz, 4GBytes desktop running Ubuntu 14.04.

EDIT: I forgot a factor that the first function may need to expand the lexi_numbers vector, so I put an empty call before timing. After this, the times are 333, 334, 275.

EDIT: Another factor that could influence the performance, I am using long integer in my code, if I change those 2 'long' to 2 'int', the running time will become: 334, 333, 264.

#include <iostream>
#include <vector>
#include <chrono>
using namespace std::chrono;

int factorial(int input)
{
    return input ? input * factorial(input - 1) : 1;
}

int lexic_ix(int* arr, int N)
{
    int output = 0;
    int fact = factorial(N);
    for (int i = 0; i < N - 1; i++)
    {
        int order = arr[i];
        for (int j = 0; j < i; j++)
            order -= arr[j] < arr[i];
        output += order * (fact /= N - i);
    }
    return output;
}

int lexic_ix1(int* arr, int N, int N_fac)
{
    int output = 0;
    int fact = N_fac;
    for (int i = 0; i < N - 1; i++)
    {
        int order = arr[i];
        for (int j = 0; j < i; j++)
            order -= arr[j] < arr[i];
        output += order * (fact /= N - i);
    }
    return output;
}

int lexic_ix2( int arr[], int N , int coeff_arr[])
{
    for ( int i = 0; i < N - 1; i++ )
    {
        int order = arr [ i ];
        for ( int j = 0; j < i; j++ )
            order -= arr [ j ] < arr [ i ];
        coeff_arr[i] = order;
    }
    long fac = 1;
    long sn = 0;
    for (int i = 1; i < N; ++i)
    {
        fac *= i;
        if (coeff_arr[N - 1 - i])
            sn += coeff_arr[N - 1 - i] * fac;
    }
    return sn;
}

std::vector<std::vector<int>> gen_permutation(const std::vector<int>& permu_base)
{
    if (permu_base.size() == 1)
        return std::vector<std::vector<int>>(1, std::vector<int>(1, permu_base[0]));

    std::vector<std::vector<int>> results;
    for (int i = 0; i < permu_base.size(); ++i)
    {
        int cur_int = permu_base[i];
        std::vector<int> cur_subseq = permu_base;
        cur_subseq.erase(cur_subseq.begin() + i);
        std::vector<std::vector<int>> temp = gen_permutation(cur_subseq);
        for (auto x : temp)
        {
            x.insert(x.begin(), cur_int);
            results.push_back(x);
        }
    }
    return results;
}

int main()
{
    #define N 10
    std::vector<int> arr;
    int buff_arr[N];
    const int length = N;
    int N_fac = factorial(N);
    for(int i=0; i<N; ++i)
        arr.push_back(N-i-1); // for N=10, arr is {9, 8, 7, 6, 5, 4, 3, 2, 1, 0}
    std::vector<std::vector<int>> all_permus = gen_permutation(arr);

    std::vector<int> lexi_numbers;
    // This call is not timed, only to expand the lexi_numbers vector 
    for (auto x : all_permus)
        lexi_numbers.push_back(lexic_ix2(&x[0], length, buff_arr));

    lexi_numbers.clear();
    auto t0 = high_resolution_clock::now();
    for (auto x : all_permus)
        lexi_numbers.push_back(lexic_ix(&x[0], length));
    auto t1 = high_resolution_clock::now();
    lexi_numbers.clear();
    auto t2 = high_resolution_clock::now();
    for (auto x : all_permus)
        lexi_numbers.push_back(lexic_ix1(&x[0], length, N_fac));
    auto t3 = high_resolution_clock::now();
    lexi_numbers.clear();
    auto t4 = high_resolution_clock::now();
    for (auto x : all_permus)
        lexi_numbers.push_back(lexic_ix2(&x[0], length, buff_arr));
    auto t5 = high_resolution_clock::now();

std::cout << std::endl << "Time durations are: " << duration_cast<milliseconds> \
    (t1 -t0).count() << ", " << duration_cast<milliseconds>(t3 - t2).count() << ", " \
        << duration_cast<milliseconds>(t5 - t4).count() <<" milliseconds" << std::endl;
    return 0;
}
Jurdi answered 2/6, 2014 at 9:29 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.