Run-length decoding in MATLAB
Asked Answered
A

4

13

For clever usage of linear indexing or accumarray, I've sometimes felt the need to generate sequences based on run-length encoding. As there is no built-in function for this, I am asking for the most efficient way to decode a sequence encoded in RLE.

Specification:

As to make this a fair comparison I would like to set up some specifications for the function:

  • If optional second argument values of same length is specified, the output should be according to those values, otherwise just the values 1:length(runLengths).
  • Gracefully handle:
    • zeros in runLengths
    • values being a cell array.
  • Output vector should have same column/row format as runLengths

In short: The function should be equivalent to the following code:

function V = runLengthDecode(runLengths, values)
[~,V] = histc(1:sum(runLengths), cumsum([1,runLengths(:).']));
if nargin>1
    V = reshape(values(V), 1, []);
end
V = shiftdim(V, ~isrow(runLengths));
end

Examples:

Here are a few test cases

runLengthDecode([0,1,0,2])
runLengthDecode([0,1,0,4], [1,2,4,5].')
runLengthDecode([0,1,0,2].', [10,20,30,40])
runLengthDecode([0,3,1,0], {'a','b',1,2})

and their output:

>> runLengthDecode([0,1,0,2])
ans =
     2     4     4

>> runLengthDecode([0,1,0,4], [1,2,4,5].')
ans =    
     2     5     5     5     5

>> runLengthDecode([0,1,0,2].', [10,20,30,40])
ans =
    20
    40
    40

>> runLengthDecode([0,3,1,0],{'a','b',1,2})
ans = 
    'b'    'b'    'b'    [1]
Aphoristic answered 13/2, 2015 at 14:7 Comment(28)
It seems you just need to decorate this question's accepted answer with varargins.Demos
@Divakar: Dang! Why can't questions with great answers simply be titled what I'm looking for!Aphoristic
@Divakar: Wait: The zeros don't really work, but it looks like a good first step.Aphoristic
A mask would make it work, use that as pre-processing at the start to alter both inputs.Demos
@Divakar: Now that I've used the tag to search for run-length encoding, there's even a similar one you answered.Aphoristic
So.. is this a duplicate or not? I was devising an answer...Canasta
Anyway, your function seems to be efficient already, right?Canasta
@LuisMendo: I'm uncertain if we should close it as a duplicate. I was mainly thinking of a simple copy-paste solution if anyone was looking for run-length-decoding. And as I couldn't find one directly (only looking for the text, not the tag; now searching for the tag I've found plenty) I was thinking to produce a question that could be found more easily...Aphoristic
@LuisMendo: So gnovice's answer will be the fastest? It looks like you used it in a recent answer, so I assume.Aphoristic
Hm. Neither of the other questions really clarifies what is the fastest solution and most of them don't handle zeros correctly. So I'm not perfectly content with closing the question.Aphoristic
Shall we create a community wiki answer to measure performance? We could include all other linked answers and compare. Maybe you can define some test cases. And I must tell you, the requirement to preserve runLenghts orientation in the result is painful :-)Canasta
@LuisMendo: Can I convert it somehow? I've never done community-wikis. Yes, the transposing is annoying, but I wanted to make the functionality more predictable :)Aphoristic
Please be more specific than 'most efficient'. At least give (the size of) your input and the 'efficiency' of your example solution.Lapel
@Aphoristic Creating a community wiki answer is easy. Just create the answer, and mark a "wiki" flag somewhere. Important points are: 1) Define test cases. 2) Decide who runs the tests: they should be done in the same computer, for consistency. Do you feel like doing it? Are you familiar with timeit? If not, I could write the benchmarking codeCanasta
@LuisMendo: Oh, an answer to this question? You mean the tests should be run on ideone.com or on the persons own computer?Aphoristic
@LuisMendo: I've recently done a benchmarking-code for euclidean distances, so we could harvest it.Aphoristic
@Aphoristic Yes, a wiki answer to this question. I'm not familiar with running code in ideone.com. I was thinking about a computer. Specifically your computer :-PCanasta
@LuisMendo: I could do it. So the benefit of making it a community answer is that if (theoretically) someone comes up with yet another competing solution, he can benchmark it and add it to the community-answer?Aphoristic
@Aphoristic I don't think there's any benefit; it just makes more sense, as no-one is "the author" and so no-one gets the reputation associated with that answer. And yes, everyone feels more free to edit that answer. Although in this case, if you're going to run the tests, that's not an advantage reallyCanasta
@Aphoristic I think this question should have received more attention, so I'm considering setting a bounty on it (after the compulsory two-day period). What do you think? I hope the system won't get suspicious that I set a bounty on a question which only I have answered :-)Canasta
@LuisMendo: Pro: If we attract more different answers, the runtime comparison would look more definite and the search for the most efficient solution can be ended, as we're being more exhaustive on the space of all possible solutions. Contra: I doubt that anybody will be able to improve much on gnovice's answer. My two cents: I have really no idea if a bounty would change anything.Aphoristic
@Aphoristic Hm. I think I'll give it a try in one or two daysCanasta
@Aphoristic Meta has clarified there's no problem in me offering a bounty on a question I have answered. So here it comesCanasta
@LuisMendo: I'll just put on this and go into waiting mode. ;-)Aphoristic
With ML2015a repelem was added. Seems it does not support cells but I don't have this version to try it.Ozonosphere
@Daniel: I guess it was time to introduce repelem; finally.Aphoristic
@Aphoristic Can you add repelem to your benchmarking?Lunetta
@Dan: I actually don't have access to a MATLAB installation anymore. But feel free to edit the answer if you have!Aphoristic
A
6

To find out which one is the most efficient solution, we provide a test-script that evaluates the performance. The first plot depicts runtimes for growing length of the vector runLengths, where the entries are uniformly distributed with maximum length 200. A modification of gnovice's solution is the fastest, with Divakar's solution being second place. Speed comparison

This second plot uses nearly the same test data except it includes an initial run of length 2000. This mostly affects both bsxfun solutions, whereas the other solutions will perform quite similarly.

Speed comparison

The tests suggest that a modification of gnovice's original answer will be the most performant.


If you want to do the speed comparison yourself, here's the code used to generate the plot above.

function theLastRunLengthDecodingComputationComparisonYoullEverNeed()
Funcs =  {@knedlsepp0, ...
          @LuisMendo1bsxfun, ...
          @LuisMendo2cumsum, ...
          @gnovice3cumsum, ...
          @Divakar4replicate_bsxfunmask, ...
          @knedlsepp5cumsumaccumarray
          };    
%% Growing number of runs, low maximum sizes in runLengths
ns = 2.^(1:25);
paramGenerators{1} = arrayfun(@(n) @(){randi(200,n,1)}, ns,'uni',0);
paramGenerators{2} = arrayfun(@(n) @(){[2000;randi(200,n,1)]}, ns,'uni',0);
for i = 1:2
    times = compareFunctions(Funcs, paramGenerators{i}, 0.5);
    finishedComputations = any(~isnan(times),2);
    h = figure('Visible', 'off');
    loglog(ns(finishedComputations), times(finishedComputations,:));
    legend(cellfun(@func2str,Funcs,'uni',0),'Location','NorthWest','Interpreter','none');
    title('Runtime comparison for run length decoding - Growing number of runs');
    xlabel('length(runLengths)'); ylabel('seconds');
    print(['-f',num2str(h)],'-dpng','-r100',['RunLengthComparsion',num2str(i)]);
end
end

function times = compareFunctions(Funcs, paramGenerators, timeLimitInSeconds)
if nargin<3
    timeLimitInSeconds = Inf;
end
times = zeros(numel(paramGenerators),numel(Funcs));
for i = 1:numel(paramGenerators)
    Params = feval(paramGenerators{i});
    for j = 1:numel(Funcs)
        if max(times(:,j))<timeLimitInSeconds
            times(i,j) = timeit(@()feval(Funcs{j},Params{:}));
        else
            times(i,j) = NaN;
        end
    end
end
end
%% // #################################
%% // HERE COME ALL THE FANCY FUNCTIONS
%% // #################################
function V = knedlsepp0(runLengths, values)
[~,V] = histc(1:sum(runLengths), cumsum([1,runLengths(:).']));%'
if nargin>1
    V = reshape(values(V), 1, []);
end
V = shiftdim(V, ~isrow(runLengths));
end

%% // #################################
function V = LuisMendo1bsxfun(runLengths, values)
nn = 1:numel(runLengths);
if nargin==1 %// handle one-input case
    values = nn;
end
V = values(nonzeros(bsxfun(@times, nn,...
    bsxfun(@le, (1:max(runLengths)).', runLengths(:).'))));
if size(runLengths,1)~=size(values,1) %// adjust orientation of output vector
    V = V.'; %'
end
end

%% // #################################
function V = LuisMendo2cumsum(runLengths, values)
if nargin==1 %// handle one-input case
    values = 1:numel(runLengths);
end
[ii, ~, jj] = find(runLengths(:));
V(cumsum(jj(end:-1:1))) = 1;
V = values(ii(cumsum(V(end:-1:1))));
if size(runLengths,1)~=size(values,1) %// adjust orientation of output vector
    V = V.'; %'
end
end

%% // #################################
function V = gnovice3cumsum(runLengths, values)
isColumnVector =  size(runLengths,1)>1;
if nargin==1 %// handle one-input case
    values = 1:numel(runLengths);
end
values = reshape(values(runLengths~=0),1,[]);
if isempty(values) %// If there are no runs
    V = []; return;
end
runLengths = nonzeros(runLengths(:));
index = zeros(1,sum(runLengths));
index(cumsum([1;runLengths(1:end-1)])) = 1;
V = values(cumsum(index));
if isColumnVector %// adjust orientation of output vector
    V = V.'; %'
end
end
%% // #################################
function V = Divakar4replicate_bsxfunmask(runLengths, values)
if nargin==1   %// Handle one-input case
    values = 1:numel(runLengths);
end

%// Do size checking to make sure that both values and runlengths are row vectors.
if size(values,1) > 1
    values = values.'; %//'
end
if size(runLengths,1) > 1
    yes_transpose_output = false;
    runLengths = runLengths.'; %//'
else
    yes_transpose_output = true;
end

maxlen = max(runLengths);

all_values = repmat(values,maxlen,1);
%// OR all_values = values(ones(1,maxlen),:);

V = all_values(bsxfun(@le,(1:maxlen)',runLengths)); %//'

%// Bring the shape of V back to the shape of runlengths
if yes_transpose_output
    V = V.'; %//'
end
end
%% // #################################
function V = knedlsepp5cumsumaccumarray(runLengths, values)
isRowVector = size(runLengths,2)>1;
%// Actual computation using column vectors
V = cumsum(accumarray(cumsum([1; runLengths(:)]), 1));
V = V(1:end-1);
%// In case of second argument
if nargin>1
    V = reshape(values(V),[],1);
end
%// If original was a row vector, transpose
if isRowVector
    V = V.'; %'
end
end
Aphoristic answered 13/2, 2015 at 14:7 Comment(18)
Do you think it's a good idea to use feval within timeit? Won't that time affected by 'feval's overhead? I usually build an anonymous function or handle and feed that into timeitCanasta
@LuisMendo: The performance difference is not measurable for me if compared to timeit(@()Funcs{j}(Params{:})), sometimes even feval is faster.Aphoristic
@LuisMendo: Great, I'll just add the resulting plot.Aphoristic
@Aphoristic We should perhaps add a version based on gnovice's approach. I can do that if you want (I don't want to keep throwing you things to do!), I mean write the function in the wiki answer.Canasta
@LuisMendo: Ok, just tell me when you've got the code so I can regenerate the plot. I just quickly add my edit that exports the figure without need to open the matlab gui.Aphoristic
@Aphoristic It turns out Gnovice's approach doesn't work when runLengths has zeros. Of course that could be fixed, but doing so requires significant modification. So I think it's not worth the effortCanasta
@LuisMendo: How about what Divakar suggested to simply go with a preprocessing mask before anything else: values = values(runLengths~=0); runLengths = nonzeros(runLengths);?Aphoristic
@Aphoristic Of course! I hadn't realized it was so simple!Canasta
@LuisMendo: The only annoying part is making runLengths==0 work.Aphoristic
I've edited this answer to incorporate Gnovice's approach with the suggestions by Divakar and by you. I also changed my two solutions: there was an error in the final if, which sometimes didn't preserve orientation. Sorry about that. I said that part was painful... :-)Canasta
@LuisMendo: Great! Then here I come!Aphoristic
@LuisMendo: Finally. I did make a few alterations to gnovice's code, to get rid of the second copy of runLengths2 and to make it work for runLengths==0. Seems it is quite a bit faster than the other solutions. The memory overhead can't be seen from the plot, but I guess with the possible exception of bsxfun if there is one run that's a lot longer than the others, they will all perform quite the same. I would have liked if the code of this version came out cleaner, but well, that's a tradeoff for speed I guess.Aphoristic
@Aphoristic This is what I got after adding mine into the list, for ns = 2.^(1:20).Demos
@Divakar: I get similar results on my machine. You came quite close! I'd be happy if you'd add it to the wiki! Wait? So on your machine Luis' answers are slower than histc? Odd...Aphoristic
@Aphoristic Yup, that looks odd. Could you do one pass on this at your end ? Also, I have just updated tiny bit of my function code and here's the updated runtimes - postimg.org/image/yj6f9xk0pDemos
@Divakar: For this test-case on my machine your solution is faster than gnovice's!! I will add the code to the wiki. (I guess I should however add somehow, that there will be significant performance drops if there is one runlength that is relatively large, like in paramGenerators = arrayfun(@(n) @(){[2000;randi(200,n,1)]}, ns,'uni',0);)Aphoristic
@Aphoristic Exactly, that's really the catch with repmatting and using mask!Demos
@Divakar: I did add your code and a second plot with a slight modification of the input data just to show the limits of bsxfun. Oh, you already commented. Nevermind.Aphoristic
C
5

Approach 1

This should be reasonably fast. It uses bsxfun to create a matrix of size numel(runLengths)xnumel(runLengths), so it may not be suitable for huge input sizes.

function V = runLengthDecode(runLengths, values)
nn = 1:numel(runLengths);
if nargin==1 %// handle one-input case
    values = nn;
end
V = values(nonzeros(bsxfun(@times, nn,...
    bsxfun(@le, (1:max(runLengths)).', runLengths(:).'))));
if size(runLengths,1)~=size(values,1) %// adjust orientation of output vector
    V = V.';
end

Approach 2

This approach, based on cumsum, is an adaptation of that used in this other answer. It uses less memory than approach 1.

function V = runLengthDecode2(runLengths, values)
if nargin==1 %// handle one-input case
    values = 1:numel(runLengths);
end
[ii, ~, jj] = find(runLengths(:));
V(cumsum(jj(end:-1:1))) = 1;
V = values(ii(cumsum(V(end:-1:1))));
if size(runLengths,1)~=size(values,1) %// adjust orientation of output vector
    V = V.';
end
Canasta answered 13/2, 2015 at 15:1 Comment(1)
@Aphoristic Yes, gnovice's approach is probably fasterCanasta
A
5

As of R2015a, the function repelem is the best choice to do this:

function V = runLengthDecode(runLengths, values)
if nargin<2
    values = 1:numel(runLengths);
end
V = repelem(values, runLengths);
end

For versions before R2015a this is a fast alternative:

Alternative to repelem:

I had the feeling gnovice's approach could be improved upon by using a better zero-runLength fix than the preprocessing mask. So I gave accumarray a shot. It seems this gives the solution yet another boost:

function V = runLengthDecode(runLengths, values)
%// Actual computation using column vectors
V = cumsum(accumarray(cumsum([1; runLengths(:)]), 1));
V = V(1:end-1);
%// In case of second argument
if nargin>1
    V = reshape(values(V),[],1);
end
%// If original was a row vector, transpose
if size(runLengths,2)>1
    V = V.'; %'
end
end
Aphoristic answered 19/2, 2015 at 20:1 Comment(5)
Looks promising indeed from the new runtime plots!Demos
@LuisMendo: Well, thanks for all the internet coins then! :-)Aphoristic
@LuisMendo: It seems it wasn't enough incentive for anybody else to add their solutions though...Aphoristic
@Aphoristic Well, Divakar did contribute, and the question has been viewed over 250 times! :-)Canasta
@LuisMendo: Oh yeah, forgot you set the bounty before Divakar joined in.Aphoristic
D
4

The solution presented here basically does the run-length decoding in two steps -

  1. Replicate all values upto the maximum number of runLengths.
  2. Use bsxfun's masking capability to select from each column the corresponding runlengths.

Rest of the stuffs inside the function code are to take care of the input and output sizes to satisfy the requirements set in the question.

The function code listed next would be a "cleaned-up" version of one of my previous answers to a similar problem. Here's the code -

function V = replicate_bsxfunmask(runLengths, values)

if nargin==1   %// Handle one-input case
    values = 1:numel(runLengths);
end

%// Do size checking to make sure that both values and runlengths are row vectors.
if size(values,1) > 1
    values = values.'; %//'
end
if size(runLengths,1) > 1
    yes_transpose_output = false;
    runLengths = runLengths.'; %//'
else
    yes_transpose_output = true;
end

maxlen = max(runLengths);

all_values = repmat(values,maxlen,1);
%// OR all_values = values(ones(1,maxlen),:);

V = all_values(bsxfun(@le,(1:maxlen)',runLengths)); %//'

%// Bring the shape of V back to the shape of runlengths
if yes_transpose_output
    V = V.'; %//'
end

return;

The code listed next would be a hybrid (cumsum + replicate_bsxfunmask) and would be suitable when you have a good number of outliers or really huge outliers. Also to keep it simple, for now this works on numeric arrays only. Here's the implementation -

function out = replicate_bsxfunmask_v2(runLengths, values)

if nargin==1                       %// Handle one-input case
    values = 1:numel(runLengths);
end

if size(values,1) > 1
    values = values.';  %//'
end

if size(runLengths,1) > 1
    yes_transpose_output = true;
    runLengths = runLengths.';  %//'
else
    yes_transpose_output = false;
end

%// Regularize inputs
values = values(runLengths>0);
runLengths = runLengths(runLengths>0);

%// Main portion of code
thresh = 200; %// runlengths threshold that are to be processed with cumsum

crunLengths = cumsum(runLengths); %%// cumsums of runlengths
mask = runLengths >= thresh; %// mask of runlengths above threshold
starts = [1 crunLengths(1:end-1)+1]; %// starts of each group of runlengths

mask_ind = find(mask); %// indices of mask

post_mark = starts(mask);
negt_mark = crunLengths(mask)+1;

if  ~isempty(negt_mark) && negt_mark(end) > crunLengths(end)
    negt_mark(end) = [];
end

%// Create array & set starts markers for starts of runlengths above thresh
marked_out = zeros(1,crunLengths(end));
marked_out(post_mark) = mask_ind;
marked_out(negt_mark) = marked_out(negt_mark) -1*mask_ind(1:numel(negt_mark));

%// Setup output array with the cumsumed version of marked array
out = cumsum(marked_out);

%// Mask for final ouput to decide between large and small runlengths
thresh_mask = out~=0;

%// Fill output array with cumsum and then rep-bsxfun based approaches
out(thresh_mask) = values(out(thresh_mask));

values = values(~mask);
runLengths = runLengths(~mask);

maxlen = max(runLengths);
all_values = repmat(values,maxlen,1);
out(~thresh_mask) = all_values(bsxfun(@le,(1:maxlen)',runLengths)); %//'

if yes_transpose_output
    out = out.';  %//'
end

return;
Demos answered 15/2, 2015 at 19:9 Comment(5)
I was expecting an answer from you! Great job, very fast!Canasta
Perhaps you could save some time by using the trick you suggested regarding Gnovices' approach? That is, use only the non-zero values of runLengths. That way you would avoid all-zero columns in bsxfun(@le,(1:maxlen)',runLengths). Not sure it that would speed or slow things, thoughCanasta
This is good code until memory becomes an issue, or there are large outliers in the runlengths as pointed out by @knedlsepp... You can mitigate this by checking max array size left with memory then blocklooping over blocks of size: max array divided by max runlength... I've implemented the solution if anyone cares, I could post it if necessary. There could even be ways of adaptively blocklooping in function of the maximum runlength in the next block which would be computationally very efficient I think and even smarter, but no time to explore that...Spinal
@Spinal Well I did try with a hybrid approach (cumsum + replicate_bsxfunmask) and the runtimes seemed to be really bad, still I have added that here. Still, would be interesting to see what you have in mind, so maybe post what you tried?Demos
@LuisMendo Well I did try with a hybrid approach (cumsum + replicate_bsxfunmask) and the runtimes seemed to be really bad, at least for the original data-generated model of randi(200,n,1), still have added that here!Demos

© 2022 - 2024 — McMap. All rights reserved.