Data structure for unique element storage

Asked 19/5, 2016 at 12:2 Answered 19/5, 2016 at 23:28

I'm looking for a data structure which should preferably perform equal O(1)? for any number of elements when adding/removing/retrieving elements.

Here are some additional guidelines,

retrieving elements should not involve slow keys()
elements must be always unique and defined
element order is not significant
addition or removal of element should not involve iteration over other elements
gaps in retrieved list of elements are tolerable and can be represented with undef value

Please suggest better solution than,

sub uniqArrayFactory {
  my $members = [];
  my $seen = {};
  my $gaps = [];

  return sub {
    my (%arg) = @_;

    return $members if $arg{members};
    my $m;
    if (defined ($m = $arg{del})) {

      return if !$seen->{$m};
      ${ $seen->{$m} } = undef;
      push @$gaps, delete($seen->{$m});
    }
    elsif (defined ($m = $arg{add})) {

      return if $seen->{$m};
      if (@$gaps) {
        $seen->{$m} = pop @$gaps;
        ${ $seen->{$m} } = $m;
      }
      else {
        push @$members, $m;
        $seen->{$m} = \( $members->[-1] );
      }
    }
    return $m;
  };
}

UPDATE (usage)

my $fa = uniqArrayFactory();

$fa->(add => 10);
$fa->(del => 10);
my $members = $fa->(mebers => 1);

Confiscate answered 19/5, 2016 at 12:2 Comment(10)

Can you add some examples of how you would call that please? – Almire 19/5, 2016 at 12:10

@Almire please check update. – Confiscate 19/5, 2016 at 12:24

... am I being dense, or are you describing a hash? – Compose 19/5, 2016 at 12:43

@Compose yes I am, but currently not happy with keys performance, and it would get invoked a lot. – Confiscate 19/5, 2016 at 12:47

I'm sure you won't be able to write something in Perl that is faster than the built-in keys operator, which is written in C. I think you need to explain more about what you're doing that is giving you unacceptable performance. keys is a trivial operation and does little more than return a pre-existing list of C strings reformatted as Perl scalars – Hatbox 19/5, 2016 at 13:5

@Hatbox returning the ready array ref $members is faster as return [keys %$seen]. So, if the factory must be very fast for the $factory->(members => 1) then the keys isn't the fastest solution. – Geosphere 19/5, 2016 at 14:33

keys() is very far from O(1), not to mention what jm666 wrote. I see this as a worthy memory/speed tradeoff in case when elements are often being retrieved. – Confiscate 19/5, 2016 at 15:51

At some point, you will iterate the member array which is O(n) anyway. So I don't understand why you insist on returning the member array in O(1). – Wellbalanced 19/5, 2016 at 19:6

@Wellbalanced that is a valid point so I did some benchmarks which include member iteration, gist.github.com/mpapec/29198302b7cca8d3c74be45ff4b12230 (keys() and a few flavors of each()) and differences are significant. – Confiscate 19/5, 2016 at 20:28

@Сухой27 Interesting, I didn't realize that keys and each are so slow. for my $v (keys %hash1) is a bit faster but still considerably slower than your uarray. – Wellbalanced 19/5, 2016 at 22:49

keys and each are surprisingly slow indeed. But if you store each element as a value of a hash and use values, things get a low faster. With

use strict;
use warnings;
use Benchmark qw(:all);

my $i;
my $fa;
my %hash;

my %compare = (
  uarray => sub {
    $fa->(add => $i++);
    my $memb = $fa->(members => 1);
    for my $v (@$memb) { next if !defined $v; }
  },
  hash => sub {
    $hash{ $i } = $i;
    for my $v (values %hash) {}
    $i++;
  },
);

$i = 0; $fa = uniqArrayFactory(); %hash = ();
cmpthese(10000, \%compare);

sub uniqArrayFactory {
  my $members = [];
  my $seen = {};
  my $gaps = [];

  return sub {
    my (%arg) = @_;

    return $members if exists $arg{members};
    my $m;
    if (defined ($m = $arg{del})) {

      return if !$seen->{$m};
      ${ $seen->{$m} } = undef;
      push @$gaps, delete($seen->{$m});
    }
    elsif (defined ($m = $arg{add})) {

      return if $seen->{$m};
      if (@$gaps) {
        $seen->{$m} = pop @$gaps;
        ${ $seen->{$m} } = $m;
      }
      else {
        push @$members, $m;
        $seen->{$m} = \( $members->[-1] );
      }
    }
    return $m;
  };
}

I get:

         Rate   hash uarray
hash   3205/s     --    -6%
uarray 3401/s     6%     --

Wellbalanced answered 19/5, 2016 at 23:28 Comment(2)

Great to see values() performing much better, although performance gap is greater on v5.20.3 (above 20%). Which perl version did you use? – Confiscate 20/5, 2016 at 5:36

I was using 5.18 on OS X. With 5.22 on Ubuntu I get -13% and 15%. – Wellbalanced 20/5, 2016 at 12:15

Ironically, maybe Tie::IxHash, which was motivated by the desire to retrieve the keys of a hash in a specified order, is as close as you're going to get to what you want.

In the Tie::IxHash implementation, keys and values are stored in array references. keys returns a copy of the set of keys, but something like (tied %hash)->[1] would give you direct access to it.

Removing elements in a Tie::IxHash is O(n). A possible workaround for that would be to replace values with undef rather than deleting them. That is, preferring

$ixhash{$obsolete_key} = undef;

delete $ixhash{$obsolete_key};

Or if you were able to pool your deletions -- if you could organize your code so that you usually called delete on several keys around the same time and in between other operations on the hash -- then there are opportunities for improving on Tie::IxHash.

Chat answered 19/5, 2016 at 18:27 Comment(0)

Recommended topics

Hot tags