Fast Way to Find Difference between Two Strings of Equal Length in Perl
Asked Answered
B

5

6

Given pairs of string like this.

    my $s1 = "ACTGGA";
    my $s2 = "AGTG-A";

   # Note the string can be longer than this.

I would like to find position and character in in $s1 where it differs with $s2. In this case the answer would be:

#String Position 0-based
# First col = Base in S1
# Second col = Base in S2
# Third col = Position in S1 where they differ
C G 1
G - 4

I can achieve that easily with substr(). But it is horribly slow. Typically I need to compare millions of such pairs.

Is there a fast way to achieve that?

Baldric answered 17/1, 2011 at 2:26 Comment(3)
Could you post your substr example with a benchmark? Then we could use it as a baseline against which to compare our potential solutions. Also, these aren't Unicode strings, right? (They seem like genetic information...) Will the input always be in a narrow subset of characters (i.e. [ACTG-])?Findley
TimToady's classic answer perlmonks.org/?node_id=840593: $matches = ($first ^ $second) =~ tr/\0//;Jovanjove
@snoopy: that gives a count of how many characters are the same, not what's wanted hereCitronella
C
24

Stringwise ^ is your friend:

use strict;
use warnings;
my $s1 = "ACTGGA";
my $s2 = "AGTG-A";

my $mask = $s1 ^ $s2;
while ($mask =~ /[^\0]/g) {
    print substr($s1,$-[0],1), ' ', substr($s2,$-[0],1), ' ', $-[0], "\n";
}

EXPLANATION:

The ^ (exclusive or) operator, when used on strings, returns a string composed of the result of an exclusive or on each bit of the numeric value of each character. Breaking down an example into equivalent code:

"AB" ^ "ab"
( "A" ^ "a" ) . ( "B" ^ "b" )
chr( ord("A") ^ ord("a") ) . chr( ord("B") ^ ord("b") )
chr( 65 ^ 97 ) . chr( 66 ^ 98 )
chr(32) . chr(32)
" " . " "
"  "

The useful feature of this here is that a nul character ("\0") occurs when and only when the two strings have the same character at a given position. So ^ can be used to efficiently compare every character of the two strings in one quick operation, and the result can be searched for non-nul characters (indicating a difference). The search can be repeated using the /g regex flag in scalar context, and the position of each character difference found using $-[0], which gives the offset of the beginning of the last successful match.

Citronella answered 17/1, 2011 at 2:59 Comment(5)
Very neat example of using @-, by the way.Hebdomadal
It'd be nice if you explained what was going on here.Notochord
thanks for the suggested edit to add an explanation, @carandraug; I've done it somewhat differently.Citronella
9 years later, perldoc -f ^ says If the "bitwise" feature is enabled via "use feature 'bitwise'" or "use v5.28", then this operator always treats its operands as numbersMississippian
@Mississippian yes, with the bitwise feature, use the explict string operator ^. instead of ^Citronella
A
4

Use binary bit ops on the complete strings.

Things like $s1 & $s2 or $s1 ^ $s2 run incredibly fast, and work with strings of arbitrary length.

Anywise answered 17/1, 2011 at 2:52 Comment(0)
H
3

I was bored on Thanksgiving break 2012 and answered the question and more. It will work on strings of equal length. It will work if they are not. I added a help, opt handling just for fun. I thought someone might find it useful. If you are new to PERL add don't know. Don't add any code in your script below DATA to the program. Have fun.

./diftxt -h

    usage: diftxt [-v ] string1 string2
                   -v = Verbose 
                  diftxt [-V|--version]
                  diftxt [-h|--help]  "This help!"
Examples:  diftxt test text
           diftxt "This is a test" "this is real"

    Place Holders:  space = "·" , no charater = "ζ"

cat ./diftxt ----------- cut ✂----------

#!/usr/bin/perl -w

use strict;
use warnings;
use Getopt::Std;
my %options=();
getopts("Vhv", \%options);
my $helptxt='
        usage: diftxt [-v ] string1 string2
                       -v = Verbose 
                      diftxt [-V|--version]
                      diftxt [-h|--help]  "This help!"
    Examples:  diftxt test text
               diftxt "This is a test" "this is real"

        Place Holders:  space = "·" , no charater = "ζ"';
my $Version = "inital-release 1.0 - Quincey Craig 11/21/2012";

print "$helptxt\n\n" if defined $options{h};
print "$Version\n" if defined $options{V};
if (@ARGV == 0 ) {
 if (not defined $options{h}) {usage()};
 exit;
}

my $s1 = "$ARGV[0]";
my $s2 = "$ARGV[1]";
my $mask = $s1 ^ $s2;

#  setup unicode output to STDOUT
binmode DATA, ":utf8";
my $ustring = <DATA>;
binmode STDOUT, ":utf8";

my $_DIFF = '';
my $_CHAR1 = '';
my $_CHAR2 = '';

sub usage
{
        print "\n";
        print "usage: diftxt [-v ] string1 string2\n";
        print "               -v = Verbose \n";
        print "       diftxt [-V|--version]\n";
        print "       diftxt [-h|--help]\n\n";
        exit;
}

sub main
{
 print "\nOrig\tDiff\tPos\n----\t----\t----\n" if defined $options{v};
 while ($mask =~ /[^\0]/g) {
### redirect stderr to allow for test of empty variable with error message from substr   
    open STDERR, '>/dev/null';
    if (substr($s2,$-[0],1) eq "") {$_CHAR2 = "\x{03B6}";close STDERR;} else {$_CHAR2 = substr($s2,$-[0],1)};
    if (substr($s2,$-[0],1) eq " ") {$_CHAR2 = "\x{00B7}"};
      $_CHAR1 = substr($s1,$-[0],1);
    if ($_CHAR1 eq "") {$_CHAR1 = "\x{03B6}"} else {$_CHAR1 = substr($s1,$-[0],1)};
    if ($_CHAR1 eq " ") {$_CHAR1 = "\x{00B7}"};
### Print verbose Data  
   print $_CHAR1, "\t", $_CHAR2, "\t", $+[0], "\n" if defined $options{v};
### Build difference list 
   $_DIFF = "$_DIFF$_CHAR2";
### Build mask 
   substr($s1,"$-[0]",1) = "\x{00B7}";
 } ### end loop

 print "\n" if defined $options{v};
 print "$_DIFF, ";
 print "Mask: \"$s1\"\n";
} ### end main
if ($#ARGV == 1) {main()};
__DATA__
Harpsichord answered 22/11, 2012 at 4:43 Comment(0)
C
0

I see you are trying to sequence DNA. You can use the bioperl lib to find better algorithms for your application.

Cholecalciferol answered 9/4 at 4:47 Comment(2)
This does not provide an answer to the question. Once you have sufficient reputation you will be able to comment on any post; instead, provide answers that don't require clarification from the asker. - From ReviewKersey
As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center.Gumboil
V
-3

This is the easiest form you can get

my $s1 = "ACTGGA";
my $s2 = "AGTG-A";

my @s1 = split //,$s1;
my @s2 = split //,$s2;

my $i = 0;
foreach  (@s1) {
    if ($_ ne $s2[$i]) {
        print "$_, $s2[$i] $i\n";
    }
    $i++;
}
Vetter answered 17/1, 2011 at 2:43 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.