Moving some code from Python to C++.
BASEPAIRS = { "T": "A", "A": "T", "G": "C", "C": "G" }
Thinking maps might be overkill? What would you use?
Moving some code from Python to C++.
BASEPAIRS = { "T": "A", "A": "T", "G": "C", "C": "G" }
Thinking maps might be overkill? What would you use?
If you are into optimization, and assuming the input is always one of the four characters, the function below might be worth a try as a replacement for the map:
char map(const char in)
{ return ((in & 2) ? '\x8a' - in : '\x95' - in); }
It works based on the fact that you are dealing with two symmetric pairs. The conditional works to tell apart the A/T pair from the G/C one ('G' and 'C' happen to have the second-least-significant bit in common). The remaining arithmetics performs the symmetric mapping. It's based on the fact that a = (a + b) - b is true for any a,b.
You can use the following syntax:
#include <map>
std::map<char, char> my_map = {
{ 'A', '1' },
{ 'B', '2' },
{ 'C', '3' }
};
If you are into optimization, and assuming the input is always one of the four characters, the function below might be worth a try as a replacement for the map:
char map(const char in)
{ return ((in & 2) ? '\x8a' - in : '\x95' - in); }
It works based on the fact that you are dealing with two symmetric pairs. The conditional works to tell apart the A/T pair from the G/C one ('G' and 'C' happen to have the second-least-significant bit in common). The remaining arithmetics performs the symmetric mapping. It's based on the fact that a = (a + b) - b is true for any a,b.
While using a std::map
is fine or using a 256-sized char table would be fine, you could save yourself an enormous amount of space agony by simply using an enum
. If you have C++11 features, you can use enum class
for strong-typing:
// First, we define base-pairs. Because regular enums
// Pollute the global namespace, I'm using "enum class".
enum class BasePair {
A,
T,
C,
G
};
// Let's cut out the nonsense and make this easy:
// A is 0, T is 1, C is 2, G is 3.
// These are indices into our table
// Now, everything can be so much easier
BasePair Complimentary[4] = {
T, // Compliment of A
A, // Compliment of T
G, // Compliment of C
C, // Compliment of G
};
Usage becomes simple:
int main (int argc, char* argv[] ) {
BasePair bp = BasePair::A;
BasePair complimentbp = Complimentary[(int)bp];
}
If this is too much for you, you can define some helpers to get human-readable ASCII characters and also to get the base pair compliment so you're not doing (int)
casts all the time:
BasePair Compliment ( BasePair bp ) {
return Complimentary[(int)bp]; // Move the pain here
}
// Define a conversion table somewhere in your program
char BasePairToChar[4] = { 'A', 'T', 'C', 'G' };
char ToCharacter ( BasePair bp ) {
return BasePairToChar[ (int)bp ];
}
It's clean, it's simple, and its efficient.
Now, suddenly, you don't have a 256 byte table. You're also not storing characters (1 byte each), and thus if you're writing this to a file, you can write 2 bits per Base pair instead of 1 byte (8 bits) per base pair. I had to work with Bioinformatics Files that stored data as 1 character each. The benefit is it was human-readable. The con is that what should have been a 250 MB file ended up taking 1 GB of space. Movement and storage and usage was a nightmare. Of coursse, 250 MB is being generous when accounting for even Worm DNA. No human is going to read through 1 GB worth of base pairs anyhow.
O(1)
, then yes, this whole premise is O(1) and is also maximally compressed for very little effort. –
Leptophyllous BasePair CharToPair(char p) { for (i : 0..3) { if (BasePairToChar(i) == p) return BasePair(i); } }
? –
Unquote inline
to the function definitions to further save on runtime overhead. –
Mauriciomaurie Until I was really concerned about performance, I would use a function, that takes a base and returns its match:
char base_pair(char base)
{
switch(base) {
case 'T': return 'A';
... etc
default: // handle error
}
}
If I was concerned about performance, I would define a base as one fourth of a byte. 0 would represent A, 1 would represent G, 2 would represent C, and 3 would represent T. Then I would pack 4 bases into a byte, and to get their pairs, I would simply take the complement.
Here's the map solution:
#include <iostream>
#include <map>
typedef std::map<char, char> BasePairMap;
int main()
{
BasePairMap m;
m['A'] = 'T';
m['T'] = 'A';
m['C'] = 'G';
m['G'] = 'C';
std::cout << "A:" << m['A'] << std::endl;
std::cout << "T:" << m['T'] << std::endl;
std::cout << "C:" << m['C'] << std::endl;
std::cout << "G:" << m['G'] << std::endl;
return 0;
}
std::endl
does both a \n
(newline) and std::flush
. –
Cappella A table out of char array:
char map[256] = { 0 };
map['T'] = 'A';
map['A'] = 'T';
map['C'] = 'G';
map['G'] = 'C';
/* .... */
char
is not a data type for DNA structures (there's a reason DNA-listing files are 4x bigger in ASCII). –
Leptophyllous BASEPAIRS = { "T": "A", "A": "T", "G": "C", "C": "G" } What would you use?
Maybe:
static const char basepairs[] = "ATAGCG";
// lookup:
if (const char* p = strchr(basepairs, c))
// use p[1]
;-)
This is the fastest, simplest, smallest space solution I can think of. A good optimizing compiler will even remove the cost of accessing the pair and name arrays. This solution works equally well in C.
#include <iostream>
enum Base_enum { A, C, T, G };
typedef enum Base_enum Base;
static const Base pair[4] = { T, G, A, C };
static const char name[4] = { 'A', 'C', 'T', 'G' };
static const Base base[85] =
{ -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, A, -1, C, -1, -1,
-1, G, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, T };
const Base
base2 (const char b)
{
switch (b)
{
case 'A': return A;
case 'C': return C;
case 'T': return T;
case 'G': return G;
default: abort ();
}
}
int
main (int argc, char *args)
{
for (Base b = A; b <= G; b++)
{
std::cout << name[b] << ":"
<< name[pair[b]] << std::endl;
}
for (Base b = A; b <= G; b++)
{
std::cout << name[base[name[b]]] << ":"
<< name[pair[base[name[b]]]] << std::endl;
}
for (Base b = A; b <= G; b++)
{
std::cout << name[base2(name[b])] << ":"
<< name[pair[base2(name[b])]] << std::endl;
}
};
base[] is a fast ascii char to Base (i.e. int between 0 and 3 inclusive) lookup that is a bit ugly. A good optimizing compiler should be able to handle base2() but I'm not sure if any do.
crusty old 'C/Asm' coder here...
Take a look at the binary representation for ASCII symbols A,T,C and G :
By using c=(c>>1)&3;
we can convert any symbol into a number 0..3 and this will work regardless of whether the input is provided in lower or upper case.
Now look at the pairs AT and CG:
To convert any of these symbols to its partner, you simply flip the high bit (XOR 2) ... and this works for converting them back again.
This is far simpler than any lookup, map, table or math.
Additional benefits
Because each symbol is only 2 bits wide, you can store 4 symbols in an byte. You can convert all four symbols using XOR 10101010b; (0xCC in hex)
If you're on a 64bit architecture, you can process 32 symbols at a time, with a single bitwise XOR. So, storage and processing can be super efficient.
Similarly, input ASCII can be converted 8 characters at a time using c=(c>>1)& 0x0303030303030303;
over 64bits of char data.
Output
When you need to output ASCII again, this is likely AFTER all your processing is done. Simply define const Symbol="ACTG";
and lookup character Symbol[value&3];
for every 2 bits of your sequence. Value>>=2; will discard each character and present the next.
This indirect is still way faster than any map or dictionary. It doesn't even need to access memory, as "ACTG" fits in a 32bit register.
Note: None of these features require any conditionals, they are branchless :)
© 2022 - 2024 — McMap. All rights reserved.