Getting all string combinations by given maximal Hamming distance (number of mismatches) in Java
Asked Answered
N

1

6

Is there an algorithm go generate all possible string combinations of a string (DNA Sequence) by a given number of maximal allowed positions that can variate (maximal Mismatches, maximal Hamming distance)?

The alphabet is {A,C,T,G}.

Example for a string AGCC and maximal number of Mismatches 2:

Hamming distance is 0
  {AGCC}
Hamming distance is 1
  {CGCC, TGCC, GGCC, AACC, ACCC, ATCC, AGAC, AGTC, ..., AGCG}
Hamming distance is 2
  {?}

One possible approach would be to generate a set with all permutations of a given String, iterate over them and remove all strings with greater Hamming distance that it should be.

That approach is very ressource-eating, by a given String of 20 characters and maximal Hamming distance of 5.

Is there another, more efficient approcahes / implementations for that?

Neese answered 9/10, 2013 at 10:36 Comment(2)
recursively call the function that generate combinations for distance 1 for all values that was returned and put to the Set to avoid duplicationsWills
Thank you, I'll also try that kind of solution.Neese
M
8

Just use a normal permutation generation algorithm, except that you pass around the distance, decrementing it when you've got a different character.

static void permute(char[] arr, int pos, int distance, char[] candidates)
{
   if (pos == arr.length)
   {
      System.out.println(new String(arr));
      return;
   }
   // distance > 0 means we can change the current character,
   //   so go through the candidates
   if (distance > 0)
   {
      char temp = arr[pos];
      for (int i = 0; i < candidates.length; i++)
      {
         arr[pos] = candidates[i];
         int distanceOffset = 0;
         // different character, thus decrement distance
         if (temp != arr[pos])
            distanceOffset = -1;
         permute(arr, pos+1, distance + distanceOffset, candidates);
      }
      arr[pos] = temp;
   }
   // otherwise just stick to the same character
   else
      permute(arr, pos+1, distance, candidates);
}

Call with:

permute("AGCC".toCharArray(), 0, 1, "ACTG".toCharArray());

Performance note:

For a string length of 20, distance of 5 and a 5-character alphabet, there are already over 17 million candidates (assuming my code is correct).

The above code takes less than a second to go through them on my machine (without printing), but don't expect any generator to be able to generate much more than that in a reasonable amount of time, as there are simply too many possibilities.

Moten answered 9/10, 2013 at 10:46 Comment(1)
Thank you, that seems to work very well, unless println() is very slow :)Neese

© 2022 - 2024 — McMap. All rights reserved.