Case Insensitive Unique Array Elements in Perl

Asked 25/10, 2012 at 17:8 Answered 25/10, 2012 at 17:11

I am using the uniq function exported by the List::MoreUtils module to find the unique elements in an array. However, I want it to find the unique elements in a case-insensitive way. How can I do that?

I have dumped the output of the Array using Data::Dumper:

#! /usr/bin/perl

use strict;
use warnings;
use Data::Dumper qw(Dumper);
use List::MoreUtils qw(uniq);
use feature "say";

my @elements=<array is formed here>;

my @words=uniq @elements;

say Dumper \@words;

Output:

$VAR1 = [
          'John',
          'john',
          'JohN',
          'JOHN',
          'JoHn',
          'john john'
        ];

Expected output should be: john, john john

There should be only 2 elements; all the rest should be filtered out, since they are the same word and differ only in case.

How can I remove the duplicate elements ignoring the case?

Syringe answered 25/10, 2012 at 17:8 Comment(0)

Lowercase the elements with lc in a map statement before passing them to uniq:

my @uniq_no_case = uniq map lc, @elements;

The reason List::MoreUtils' uniq is case sensitive is that it relies on the deduping characteristics of hashes, whose keys are also case sensitive. The code for uniq looks like this:

sub uniq {
    my %seen = ();
    grep { not $seen{$_}++ } @_;
}

If you want to use this sub directly in your own code, you could incorporate lc in there:

sub uniq_no_case {
    my %seen = ();
    grep { not $seen{$_}++ } map lc, @_;
}
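As a quick check, here is a minimal sketch that runs this sub against the values shown in the question's Dumper output. The sample array is copied from that output; everything else is plain Perl with no modules loaded.

```perl
use strict;
use warnings;

# Same idea as the sub above: lowercase every argument, then keep first occurrences.
sub uniq_no_case {
    my %seen;
    return grep { not $seen{$_}++ } map { lc $_ } @_;
}

# Sample data taken from the question's Dumper output.
my @elements = ('John', 'john', 'JohN', 'JOHN', 'JoHn', 'john john');
my @words    = uniq_no_case(@elements);

print join(', ', @words), "\n";   # john, john john
```

Note that the returned values are the lowercased strings, not the original spellings.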

Explanation of how this works:

@_ contains the arguments to the subroutine, and they are fed to a grep statement. grep returns every element for which the code block returns true. The code block has a few finer points:

$seen{$_}++ returns 0 the first time an element is seen. The value is still incremented to 1, but only after it is returned (as opposed to ++$seen{$_}, which would increment first and then return).
By negating the result of the increment, we get true the first time a key is seen and false every time after that. Hence, the list is deduped.
grep as the last statement in the sub returns a list, which in turn is returned by the sub.
map lc, @_ simply applies the lc function to all elements in @_.
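
The post-increment behaviour described above can be seen in isolation. This is only an illustrative sketch; the hash names and the key are made up for the demonstration.

```perl
use strict;
use warnings;

my %seen;

# Post-increment returns the old value (0 for a key not yet present), then stores the new one.
my $first  = $seen{john}++;   # 0, and $seen{john} is now 1
my $second = $seen{john}++;   # 1, and $seen{john} is now 2

# Pre-increment stores the new value first and returns it, so even the first result is true.
my %seen_pre;
my $pre = ++$seen_pre{john};  # 1

printf "post: %d then %d; pre: %d\n", $first, $second, $pre;
# "not 0" is true, so grep keeps the first occurrence;
# "not 1" is false, so every repeat is dropped.
```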

Leatheroid answered 25/10, 2012 at 17:9 Comment(7)

And this is the same uniq function exported by List::MoreUtils module? – Syringe 25/10, 2012 at 17:13

Indeed it is. Although since the sub is so simple and short, you can just copy and paste it, and save yourself loading the module. – Leatheroid 25/10, 2012 at 17:15

Thanks. I will understand the subroutine and then use it directly :) Can you explain the grep syntax a little? The hash, %seen is using the elements of the array as a key and checking for their occurrence. But, I am not sure, how this entire syntax works. – Syringe 25/10, 2012 at 17:22

@NeonFlash Added an explanation in my answer. It is a fairly cleverly written sub, I think. – Leatheroid 25/10, 2012 at 17:30

@NeonFlash If this answer solves your problem to your satisfaction, don't forget to accept it by clicking the checkmark. – Leatheroid 25/10, 2012 at 17:57

This version of the syntax is slightly more malleable: my @uniq_no_case = uniq map {lc $_} @elements; – Whorl 8/6, 2017 at 20:45

Having this line instead will preserve the case of the array: grep { not $seen{lc $_}++ } @_; – Schaffner 11/8, 2022 at 2:10

Use a hash to keep track of the words you have already seen, but also normalize them for upper/lower case:

my %seen;
my @unique;
for my $w (@words) {
  next if $seen{lc($w)}++;
  push(@unique, $w);
}
# @unique has the unique words

Note that this will preserve the case of the original words.

UPDATE: As noted in the comments, it's not clear exactly what the OP needs, but I wrote the solution this way to illustrate a general technique for selecting unique representatives from a list under some "equivalence relation." In this case the equivalence relation is: word $a is equivalent to word $b if and only if lc($a) eq lc($b).

Most equivalence relationships can be expressed in this way, that is, the relationship is defined by a classifier function f() such that $a is equivalent to $b if and only if f($a) eq f($b). For instance, if we want to say that two words are equivalent if they have the same length, then f() would be length().

So now you might see why I wrote the algorithm this way - the classifier function may not produce values that are part of the original list. In the case of f = length, we want to select words, but f of a word is a number.
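
The general technique might be sketched as follows. unique_by is a hypothetical helper name chosen for this illustration (it is not part of List::MoreUtils), and the sample words are made up. The classifier is passed in as a code reference.

```perl
use strict;
use warnings;

# unique_by is a hypothetical helper, not a List::MoreUtils function.
# It keeps the first element of each equivalence class defined by the classifier $f.
sub unique_by {
    my ($f, @items) = @_;
    my %seen;
    return grep { not $seen{ $f->($_) }++ } @items;
}

# Made-up sample data for the demonstration.
my @words = ('John', 'john', 'Anna', 'JOHN', 'Bob', 'john john');

my @by_case   = unique_by(sub { lc $_[0] }, @words);      # John, Anna, Bob, john john
my @by_length = unique_by(sub { length $_[0] }, @words);  # John, Bob, john john

print "by case:   @by_case\n";
print "by length: @by_length\n";
```

Because the classifier's value is only used as the hash key, the elements returned are always the original words, whatever f() produces.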

Borchers answered 25/10, 2012 at 17:11 Comment(5)

Using lc inside the hash access is much nicer than the other solution given, as it preserves the (first matching) case from the input. – Helyn 26/10, 2012 at 11:54

@Helyn What on earth are you talking about? There is no difference between using lc before and inside the hash. – Leatheroid 26/10, 2012 at 13:8

I meant, as opposed to the map lc ... solution given in the other answer. This one is nicer as it returns values in their original case, not in forced-lower case. – Helyn 26/10, 2012 at 13:36

Aha, I see now. However, that's not what the OP requested. Besides, who's to say that the original case is desirable? Usually, names are ucfirst(lc). – Leatheroid 26/10, 2012 at 14:14

I'm sure that the uniq() library has more support and efficiency than this version. – Whorl 8/6, 2017 at 20:46
