Data structure for unique element storage
Asked Answered
C

2

6

I'm looking for a data structure which should preferably perform equal O(1)? for any number of elements when adding/removing/retrieving elements.

Here are some additional guidelines,

  • retrieving elements should not involve slow keys()
  • elements must be always unique and defined
  • element order is not significant
  • addition or removal of element should not involve iteration over other elements
  • gaps in retrieved list of elements are tolerable and can be represented with undef value

Please suggest better solution than,

sub uniqArrayFactory {
  my $members = [];
  my $seen = {};
  my $gaps = [];

  return sub {
    my (%arg) = @_;

    return $members if $arg{members};
    my $m;
    if (defined ($m = $arg{del})) {

      return if !$seen->{$m};
      ${ $seen->{$m} } = undef;
      push @$gaps, delete($seen->{$m});
    }
    elsif (defined ($m = $arg{add})) {

      return if $seen->{$m};
      if (@$gaps) {
        $seen->{$m} = pop @$gaps;
        ${ $seen->{$m} } = $m;
      }
      else {
        push @$members, $m;
        $seen->{$m} = \( $members->[-1] );
      }
    }
    return $m;
  };
}

UPDATE (usage)

my $fa = uniqArrayFactory();

$fa->(add => 10);
$fa->(del => 10);
my $members = $fa->(mebers => 1);
Confiscate answered 19/5, 2016 at 12:2 Comment(10)
Can you add some examples of how you would call that please?Almire
@Almire please check update.Confiscate
... am I being dense, or are you describing a hash?Compose
@Compose yes I am, but currently not happy with keys performance, and it would get invoked a lot.Confiscate
I'm sure you won't be able to write something in Perl that is faster than the built-in keys operator, which is written in C. I think you need to explain more about what you're doing that is giving you unacceptable performance. keys is a trivial operation and does little more than return a pre-existing list of C strings reformatted as Perl scalarsHatbox
@Hatbox returning the ready array ref $members is faster as return [keys %$seen]. So, if the factory must be very fast for the $factory->(members => 1) then the keys isn't the fastest solution.Geosphere
keys() is very far from O(1), not to mention what jm666 wrote. I see this as a worthy memory/speed tradeoff in case when elements are often being retrieved.Confiscate
At some point, you will iterate the member array which is O(n) anyway. So I don't understand why you insist on returning the member array in O(1).Wellbalanced
@Wellbalanced that is a valid point so I did some benchmarks which include member iteration, gist.github.com/mpapec/29198302b7cca8d3c74be45ff4b12230 (keys() and a few flavors of each()) and differences are significant.Confiscate
@Сухой27 Interesting, I didn't realize that keys and each are so slow. for my $v (keys %hash1) is a bit faster but still considerably slower than your uarray.Wellbalanced
W
2

keys and each are surprisingly slow indeed. But if you store each element as a value of a hash and use values, things get a low faster. With

use strict;
use warnings;
use Benchmark qw(:all);

my $i;
my $fa;
my %hash;

my %compare = (
  uarray => sub {
    $fa->(add => $i++);
    my $memb = $fa->(members => 1);
    for my $v (@$memb) { next if !defined $v; }
  },
  hash => sub {
    $hash{ $i } = $i;
    for my $v (values %hash) {}
    $i++;
  },
);

$i = 0; $fa = uniqArrayFactory(); %hash = ();
cmpthese(10000, \%compare);

sub uniqArrayFactory {
  my $members = [];
  my $seen = {};
  my $gaps = [];

  return sub {
    my (%arg) = @_;

    return $members if exists $arg{members};
    my $m;
    if (defined ($m = $arg{del})) {

      return if !$seen->{$m};
      ${ $seen->{$m} } = undef;
      push @$gaps, delete($seen->{$m});
    }
    elsif (defined ($m = $arg{add})) {

      return if $seen->{$m};
      if (@$gaps) {
        $seen->{$m} = pop @$gaps;
        ${ $seen->{$m} } = $m;
      }
      else {
        push @$members, $m;
        $seen->{$m} = \( $members->[-1] );
      }
    }
    return $m;
  };
}

I get:

         Rate   hash uarray
hash   3205/s     --    -6%
uarray 3401/s     6%     --
Wellbalanced answered 19/5, 2016 at 23:28 Comment(2)
Great to see values() performing much better, although performance gap is greater on v5.20.3 (above 20%). Which perl version did you use?Confiscate
I was using 5.18 on OS X. With 5.22 on Ubuntu I get -13% and 15%.Wellbalanced
C
1

Ironically, maybe Tie::IxHash, which was motivated by the desire to retrieve the keys of a hash in a specified order, is as close as you're going to get to what you want.

In the Tie::IxHash implementation, keys and values are stored in array references. keys returns a copy of the set of keys, but something like (tied %hash)->[1] would give you direct access to it.

Removing elements in a Tie::IxHash is O(n). A possible workaround for that would be to replace values with undef rather than deleting them. That is, preferring

$ixhash{$obsolete_key} = undef;

to

delete $ixhash{$obsolete_key};

Or if you were able to pool your deletions -- if you could organize your code so that you usually called delete on several keys around the same time and in between other operations on the hash -- then there are opportunities for improving on Tie::IxHash.

Chat answered 19/5, 2016 at 18:27 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.