Simple dictionary in C++
Asked Answered
L

9

90

Moving some code from Python to C++.

BASEPAIRS = { "T": "A", "A": "T", "G": "C", "C": "G" }

Thinking maps might be overkill? What would you use?

Launcher answered 1/3, 2013 at 5:56 Comment(7)
Why map might be overkill?Weekday
What are you planning on doing with them?Belgravia
Can I define a map as a constant with the base values somehow in the class definiton?Launcher
@WHOEVENCARES Why couldn't you?Emplacement
For that I'd use an IF ladder.Muir
This question has almost no context unless you realize that we're talking about chemistry. In the future, these questions should be worded to give context so a broader group of users can answer it.Pyemia
@Pyemia actually it's genetics.:DEntresol
G
28

If you are into optimization, and assuming the input is always one of the four characters, the function below might be worth a try as a replacement for the map:

char map(const char in)
{ return ((in & 2) ? '\x8a' - in : '\x95' - in); }

It works based on the fact that you are dealing with two symmetric pairs. The conditional works to tell apart the A/T pair from the G/C one ('G' and 'C' happen to have the second-least-significant bit in common). The remaining arithmetics performs the symmetric mapping. It's based on the fact that a = (a + b) - b is true for any a,b.

Geter answered 1/3, 2013 at 6:32 Comment(1)
@WHOEVENCARES I am not sure if it would be faster than the pure conditional Benjamin Lindley proposed. However, at least the subtraction part of my function could be performed in a vector register for several characters in parallel.Geter
A
155

You can use the following syntax:

#include <map>

std::map<char, char> my_map = {
    { 'A', '1' },
    { 'B', '2' },
    { 'C', '3' }
};
Algy answered 1/3, 2013 at 6:28 Comment(0)
G
28

If you are into optimization, and assuming the input is always one of the four characters, the function below might be worth a try as a replacement for the map:

char map(const char in)
{ return ((in & 2) ? '\x8a' - in : '\x95' - in); }

It works based on the fact that you are dealing with two symmetric pairs. The conditional works to tell apart the A/T pair from the G/C one ('G' and 'C' happen to have the second-least-significant bit in common). The remaining arithmetics performs the symmetric mapping. It's based on the fact that a = (a + b) - b is true for any a,b.

Geter answered 1/3, 2013 at 6:32 Comment(1)
@WHOEVENCARES I am not sure if it would be faster than the pure conditional Benjamin Lindley proposed. However, at least the subtraction part of my function could be performed in a vector register for several characters in parallel.Geter
L
24

While using a std::map is fine or using a 256-sized char table would be fine, you could save yourself an enormous amount of space agony by simply using an enum. If you have C++11 features, you can use enum class for strong-typing:

// First, we define base-pairs. Because regular enums
// Pollute the global namespace, I'm using "enum class". 
enum class BasePair {
    A,
    T,
    C,
    G
};

// Let's cut out the nonsense and make this easy:
// A is 0, T is 1, C is 2, G is 3.
// These are indices into our table
// Now, everything can be so much easier
BasePair Complimentary[4] = {
    T, // Compliment of A
    A, // Compliment of T
    G, // Compliment of C
    C, // Compliment of G
};

Usage becomes simple:

int main (int argc, char* argv[] ) {
    BasePair bp = BasePair::A;
    BasePair complimentbp = Complimentary[(int)bp];
}

If this is too much for you, you can define some helpers to get human-readable ASCII characters and also to get the base pair compliment so you're not doing (int) casts all the time:

BasePair Compliment ( BasePair bp ) {
    return Complimentary[(int)bp]; // Move the pain here
}

// Define a conversion table somewhere in your program
char BasePairToChar[4] = { 'A', 'T', 'C', 'G' };
char ToCharacter ( BasePair bp ) {
    return BasePairToChar[ (int)bp ];
}

It's clean, it's simple, and its efficient.

Now, suddenly, you don't have a 256 byte table. You're also not storing characters (1 byte each), and thus if you're writing this to a file, you can write 2 bits per Base pair instead of 1 byte (8 bits) per base pair. I had to work with Bioinformatics Files that stored data as 1 character each. The benefit is it was human-readable. The con is that what should have been a 250 MB file ended up taking 1 GB of space. Movement and storage and usage was a nightmare. Of coursse, 250 MB is being generous when accounting for even Worm DNA. No human is going to read through 1 GB worth of base pairs anyhow.

Leptophyllous answered 1/3, 2013 at 6:25 Comment(6)
But this still needs linear lookup time for char to base pair conversionUnquote
@Unquote If by "Linear Lookup Time" you mean O(1), then yes, this whole premise is O(1) and is also maximally compressed for very little effort.Leptophyllous
@Unquote Could you please explain how this is linear time? Genuinely interested.Emplacement
@Rapptz, it is smt like BasePair CharToPair(char p) { for (i : 0..3) { if (BasePairToChar(i) == p) return BasePair(i); } }?Unquote
@ThePhD, I fail to understand your char to pair conversion.Unquote
I would recommend adding inline to the function definitions to further save on runtime overhead.Mauriciomaurie
E
13

Until I was really concerned about performance, I would use a function, that takes a base and returns its match:

char base_pair(char base)
{
    switch(base) {
        case 'T': return 'A';
        ... etc
        default: // handle error
    }
}

If I was concerned about performance, I would define a base as one fourth of a byte. 0 would represent A, 1 would represent G, 2 would represent C, and 3 would represent T. Then I would pack 4 bases into a byte, and to get their pairs, I would simply take the complement.

Expiratory answered 1/3, 2013 at 6:1 Comment(0)
L
12

Here's the map solution:

#include <iostream>
#include <map>

typedef std::map<char, char> BasePairMap;

int main()
{
    BasePairMap m;
    m['A'] = 'T';
    m['T'] = 'A';
    m['C'] = 'G';
    m['G'] = 'C';

    std::cout << "A:" << m['A'] << std::endl;
    std::cout << "T:" << m['T'] << std::endl;
    std::cout << "C:" << m['C'] << std::endl;
    std::cout << "G:" << m['G'] << std::endl;

    return 0;
}
Labonte answered 1/3, 2013 at 6:2 Comment(2)
Unrelated but you are overflushing the stream.Emplacement
@Rapptz: I wondered what you meant with 'overflushing', until i found out that std::endl does both a \n (newline) and std::flush.Cappella
U
8

A table out of char array:

char map[256] = { 0 };
map['T'] = 'A'; 
map['A'] = 'T';
map['C'] = 'G';
map['G'] = 'C';
/* .... */
Unquote answered 1/3, 2013 at 5:58 Comment(16)
That's a very strange map.Emplacement
This is an incredibly wasteful map. But, it gets... the job done... ?Leptophyllous
@ThePhD, incredibly? uses 256 bytes. What is the overhead of std::map? what is the lookup time?Unquote
You do realize this could be accomplished with a simple enumeration and a function? If you wanted to get slightly more fancy, you could use that enumeration into a table of other enums and use only 4 bytes. If you really wanted to grind some optomization gears, you'd pack ATCG into 2 bits (like what most sensible binary representations of DNA do). char is not a data type for DNA structures (there's a reason DNA-listing files are 4x bigger in ASCII).Leptophyllous
@Unquote As far as I remember in std:map lookup time is log(n), while in you map is n/2Revivalist
@spin_eight, this is a direct access table, lookup is O(1)Unquote
@Unquote oh, yeah, char is index!, thanks for pointing to my mistake!Revivalist
@Unquote ideone.com/4nWn1B for your memory usage assumption. I couldn't get the upperbound because it's strictly implementation defined.Emplacement
@Rapptz, all I'm saying is I wouldn't worry about 256 bytes in any case.Unquote
@ThePhD storing internally as 2 bits doesn't help when you are reading from .fasta ASCII files, not to mention working with bits in C++ in such a manner requires shifting that defeats the purpose of using a simple human readable lookup map.Launcher
@WHOEVENCARES No shifting is required.Emplacement
@WHOEVENCARES Oh, you're working with FASTA. Well, good luck to you. I used mostly binary files after a while of FASTA and those other formats, permanently sticking to BAM files and its related. Good luck to you.Leptophyllous
I always heard bit fields were the slowest lookups, because of the way memory is addressed. So would it make since to convert something like a BAM file to ascii in memory first? Btw for this answer the char array can be size [84].Launcher
This is a bad case of premature optimization. Don't do this.Americanism
@Overv, can you elaborate on this?Unquote
The only advantage this method has is that it's fast, the other methods more clearly demonstrate the intention, are type safe and consume less memory.Americanism
R
2

BASEPAIRS = { "T": "A", "A": "T", "G": "C", "C": "G" } What would you use?

Maybe:

static const char basepairs[] = "ATAGCG";
// lookup:
if (const char* p = strchr(basepairs, c))
    // use p[1]

;-)

Riverine answered 1/3, 2013 at 6:5 Comment(1)
I find this solution classy, even if not the fastest :DZygapophysis
B
2

This is the fastest, simplest, smallest space solution I can think of. A good optimizing compiler will even remove the cost of accessing the pair and name arrays. This solution works equally well in C.

#include <iostream>

enum Base_enum { A, C, T, G };
typedef enum Base_enum Base;
static const Base pair[4] = { T, G, A, C };
static const char name[4] = { 'A', 'C', 'T', 'G' };
static const Base base[85] = 
  { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1,  A, -1,  C, -1, -1,
    -1,  G, -1, -1, -1, -1, -1, -1, -1, -1, 
    -1, -1, -1, -1,  T };

const Base
base2 (const char b)
{
  switch (b)
    {
    case 'A': return A;
    case 'C': return C;
    case 'T': return T;
    case 'G': return G;
    default: abort ();
    }
}

int
main (int argc, char *args) 
{
  for (Base b = A; b <= G; b++)
    {
      std::cout << name[b] << ":" 
                << name[pair[b]] << std::endl;
    }
  for (Base b = A; b <= G; b++)
    {
      std::cout << name[base[name[b]]] << ":" 
                << name[pair[base[name[b]]]] << std::endl;
    }
  for (Base b = A; b <= G; b++)
    {
      std::cout << name[base2(name[b])] << ":" 
                << name[pair[base2(name[b])]] << std::endl;
    }
};

base[] is a fast ascii char to Base (i.e. int between 0 and 3 inclusive) lookup that is a bit ugly. A good optimizing compiler should be able to handle base2() but I'm not sure if any do.

Buckie answered 1/3, 2013 at 6:48 Comment(3)
But this solution assumes the input comes as numbers 0, 1, 2, 3, rather than ASCII characters. You'd still have to perform a mapping of the input, correct?Geter
Good point. I fixed it with the fastest mapping I could think of, and a prettier but possibly slower switch-based mapping.Buckie
Good tip to know if you are dealing with C++ The "smallest space solution" is 49 lines long ;)Particularly
O
0

crusty old 'C/Asm' coder here...

Take a look at the binary representation for ASCII symbols A,T,C and G :

  • A = 0100 0001
  • C = 0100 0011
  • T = 0101 0100
  • G = 0100 0111

By using c=(c>>1)&3; we can convert any symbol into a number 0..3 and this will work regardless of whether the input is provided in lower or upper case.

Now look at the pairs AT and CG:

  • A = 00 C = 01
  • T = 10 G = 11

To convert any of these symbols to its partner, you simply flip the high bit (XOR 2) ... and this works for converting them back again.

This is far simpler than any lookup, map, table or math.

Additional benefits

Because each symbol is only 2 bits wide, you can store 4 symbols in an byte. You can convert all four symbols using XOR 10101010b; (0xCC in hex)

If you're on a 64bit architecture, you can process 32 symbols at a time, with a single bitwise XOR. So, storage and processing can be super efficient.

Similarly, input ASCII can be converted 8 characters at a time using c=(c>>1)& 0x0303030303030303; over 64bits of char data.

Output

When you need to output ASCII again, this is likely AFTER all your processing is done. Simply define const Symbol="ACTG"; and lookup character Symbol[value&3]; for every 2 bits of your sequence. Value>>=2; will discard each character and present the next.

This indirect is still way faster than any map or dictionary. It doesn't even need to access memory, as "ACTG" fits in a 32bit register.

Note: None of these features require any conditionals, they are branchless :)

Oys answered 15/6 at 7:24 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.