Simple dictionary in C++

L

9

90

Moving some code from Python to C++.

BASEPAIRS = { "T": "A", "A": "T", "G": "C", "C": "G" }

Thinking maps might be overkill? What would you use?

Launcher answered 1/3, 2013 at 5:56 Comment(7)

Why map might be overkill? – Weekday 1/3, 2013 at 5:57

What are you planning on doing with them? – Belgravia 1/3, 2013 at 5:57

Can I define a map as a constant with the base values somehow in the class definiton? – Launcher 1/3, 2013 at 5:58

@WHOEVENCARES Why couldn't you? – Emplacement 1/3, 2013 at 5:58

For that I'd use an IF ladder. – Muir 1/3, 2013 at 20:4

This question has almost no context unless you realize that we're talking about chemistry. In the future, these questions should be worded to give context so a broader group of users can answer it. – Pyemia 19/10, 2016 at 17:39

@Pyemia actually it's genetics.:D – Entresol 6/5, 2018 at 16:22

G

28

If you are into optimization, and assuming the input is always one of the four characters, the function below might be worth a try as a replacement for the map:

char map(const char in)
{ return ((in & 2) ? '\x8a' - in : '\x95' - in); }

It works based on the fact that you are dealing with two symmetric pairs. The conditional works to tell apart the A/T pair from the G/C one ('G' and 'C' happen to have the second-least-significant bit in common). The remaining arithmetics performs the symmetric mapping. It's based on the fact that a = (a + b) - b is true for any a,b.

Geter answered 1/3, 2013 at 6:32 Comment(1)

@WHOEVENCARES I am not sure if it would be faster than the pure conditional Benjamin Lindley proposed. However, at least the subtraction part of my function could be performed in a vector register for several characters in parallel. – Geter 1/3, 2013 at 6:55

A

155

You can use the following syntax:

#include <map>

std::map<char, char> my_map = {
    { 'A', '1' },
    { 'B', '2' },
    { 'C', '3' }
};

Algy answered 1/3, 2013 at 6:28 Comment(0)

G

28

If you are into optimization, and assuming the input is always one of the four characters, the function below might be worth a try as a replacement for the map:

char map(const char in)
{ return ((in & 2) ? '\x8a' - in : '\x95' - in); }

It works based on the fact that you are dealing with two symmetric pairs. The conditional works to tell apart the A/T pair from the G/C one ('G' and 'C' happen to have the second-least-significant bit in common). The remaining arithmetics performs the symmetric mapping. It's based on the fact that a = (a + b) - b is true for any a,b.

Geter answered 1/3, 2013 at 6:32 Comment(1)

@WHOEVENCARES I am not sure if it would be faster than the pure conditional Benjamin Lindley proposed. However, at least the subtraction part of my function could be performed in a vector register for several characters in parallel. – Geter 1/3, 2013 at 6:55

L

24

While using a std::map is fine or using a 256-sized char table would be fine, you could save yourself an enormous amount of space agony by simply using an enum. If you have C++11 features, you can use enum class for strong-typing:

// First, we define base-pairs. Because regular enums
// Pollute the global namespace, I'm using "enum class". 
enum class BasePair {
    A,
    T,
    C,
    G
};

// Let's cut out the nonsense and make this easy:
// A is 0, T is 1, C is 2, G is 3.
// These are indices into our table
// Now, everything can be so much easier
BasePair Complimentary[4] = {
    T, // Compliment of A
    A, // Compliment of T
    G, // Compliment of C
    C, // Compliment of G
};

Usage becomes simple:

int main (int argc, char* argv[] ) {
    BasePair bp = BasePair::A;
    BasePair complimentbp = Complimentary[(int)bp];
}

If this is too much for you, you can define some helpers to get human-readable ASCII characters and also to get the base pair compliment so you're not doing (int) casts all the time:

BasePair Compliment ( BasePair bp ) {
    return Complimentary[(int)bp]; // Move the pain here
}

// Define a conversion table somewhere in your program
char BasePairToChar[4] = { 'A', 'T', 'C', 'G' };
char ToCharacter ( BasePair bp ) {
    return BasePairToChar[ (int)bp ];
}

It's clean, it's simple, and its efficient.

Now, suddenly, you don't have a 256 byte table. You're also not storing characters (1 byte each), and thus if you're writing this to a file, you can write 2 bits per Base pair instead of 1 byte (8 bits) per base pair. I had to work with Bioinformatics Files that stored data as 1 character each. The benefit is it was human-readable. The con is that what should have been a 250 MB file ended up taking 1 GB of space. Movement and storage and usage was a nightmare. Of coursse, 250 MB is being generous when accounting for even Worm DNA. No human is going to read through 1 GB worth of base pairs anyhow.

Leptophyllous answered 1/3, 2013 at 6:25 Comment(6)

But this still needs linear lookup time for char to base pair conversion – Unquote 1/3, 2013 at 6:31

@Unquote If by "Linear Lookup Time" you mean O(1), then yes, this whole premise is O(1) and is also maximally compressed for very little effort. – Leptophyllous 1/3, 2013 at 6:33

@Unquote Could you please explain how this is linear time? Genuinely interested. – Emplacement 1/3, 2013 at 6:33

@Rapptz, it is smt like BasePair CharToPair(char p) { for (i : 0..3) { if (BasePairToChar(i) == p) return BasePair(i); } }? – Unquote 1/3, 2013 at 6:35

@ThePhD, I fail to understand your char to pair conversion. – Unquote 1/3, 2013 at 6:39

I would recommend adding inline to the function definitions to further save on runtime overhead. – Mauriciomaurie 26/3, 2021 at 12:29

E

13

Until I was really concerned about performance, I would use a function, that takes a base and returns its match:

char base_pair(char base)
{
    switch(base) {
        case 'T': return 'A';
        ... etc
        default: // handle error
    }
}

If I was concerned about performance, I would define a base as one fourth of a byte. 0 would represent A, 1 would represent G, 2 would represent C, and 3 would represent T. Then I would pack 4 bases into a byte, and to get their pairs, I would simply take the complement.

Expiratory answered 1/3, 2013 at 6:1 Comment(0)

L

12

Here's the map solution:

#include <iostream>
#include <map>

typedef std::map<char, char> BasePairMap;

int main()
{
    BasePairMap m;
    m['A'] = 'T';
    m['T'] = 'A';
    m['C'] = 'G';
    m['G'] = 'C';

    std::cout << "A:" << m['A'] << std::endl;
    std::cout << "T:" << m['T'] << std::endl;
    std::cout << "C:" << m['C'] << std::endl;
    std::cout << "G:" << m['G'] << std::endl;

    return 0;
}

Labonte answered 1/3, 2013 at 6:2 Comment(2)

Unrelated but you are overflushing the stream. – Emplacement 1/3, 2013 at 6:22

@Rapptz: I wondered what you meant with 'overflushing', until i found out that std::endl does both a \n (newline) and std::flush. – Cappella 16/7, 2020 at 14:5

U

8

A table out of char array:

char map[256] = { 0 };
map['T'] = 'A'; 
map['A'] = 'T';
map['C'] = 'G';
map['G'] = 'C';
/* .... */

Unquote answered 1/3, 2013 at 5:58 Comment(16)

That's a very strange map. – Emplacement 1/3, 2013 at 6:0

This is an incredibly wasteful map. But, it gets... the job done... ? – Leptophyllous 1/3, 2013 at 6:5

@ThePhD, incredibly? uses 256 bytes. What is the overhead of std::map? what is the lookup time? – Unquote 1/3, 2013 at 6:7

You do realize this could be accomplished with a simple enumeration and a function? If you wanted to get slightly more fancy, you could use that enumeration into a table of other enums and use only 4 bytes. If you really wanted to grind some optomization gears, you'd pack ATCG into 2 bits (like what most sensible binary representations of DNA do). char is not a data type for DNA structures (there's a reason DNA-listing files are 4x bigger in ASCII). – Leptophyllous 1/3, 2013 at 6:10

@Unquote As far as I remember in std:map lookup time is log(n), while in you map is n/2 – Revivalist 1/3, 2013 at 6:11

@spin_eight, this is a direct access table, lookup is O(1) – Unquote 1/3, 2013 at 6:12

@Unquote oh, yeah, char is index!, thanks for pointing to my mistake! – Revivalist 1/3, 2013 at 6:15

@Unquote ideone.com/4nWn1B for your memory usage assumption. I couldn't get the upperbound because it's strictly implementation defined. – Emplacement 1/3, 2013 at 6:20

@Rapptz, all I'm saying is I wouldn't worry about 256 bytes in any case. – Unquote 1/3, 2013 at 6:29

@ThePhD storing internally as 2 bits doesn't help when you are reading from .fasta ASCII files, not to mention working with bits in C++ in such a manner requires shifting that defeats the purpose of using a simple human readable lookup map. – Launcher 1/3, 2013 at 6:29

@WHOEVENCARES No shifting is required. – Emplacement 1/3, 2013 at 6:35

@WHOEVENCARES Oh, you're working with FASTA. Well, good luck to you. I used mostly binary files after a while of FASTA and those other formats, permanently sticking to BAM files and its related. Good luck to you. – Leptophyllous 1/3, 2013 at 6:39

I always heard bit fields were the slowest lookups, because of the way memory is addressed. So would it make since to convert something like a BAM file to ascii in memory first? Btw for this answer the char array can be size [84]. – Launcher 1/3, 2013 at 6:42

This is a bad case of premature optimization. Don't do this. – Americanism 1/3, 2013 at 14:39

@Overv, can you elaborate on this? – Unquote 1/3, 2013 at 20:56

The only advantage this method has is that it's fast, the other methods more clearly demonstrate the intention, are type safe and consume less memory. – Americanism 2/3, 2013 at 0:37

R

2

BASEPAIRS = { "T": "A", "A": "T", "G": "C", "C": "G" } What would you use?

Maybe:

static const char basepairs[] = "ATAGCG";
// lookup:
if (const char* p = strchr(basepairs, c))
    // use p[1]

;-)

Riverine answered 1/3, 2013 at 6:5 Comment(1)

I find this solution classy, even if not the fastest :D – Zygapophysis 6/7, 2023 at 12:56

B

2

This is the fastest, simplest, smallest space solution I can think of. A good optimizing compiler will even remove the cost of accessing the pair and name arrays. This solution works equally well in C.

#include <iostream>

enum Base_enum { A, C, T, G };
typedef enum Base_enum Base;
static const Base pair[4] = { T, G, A, C };
static const char name[4] = { 'A', 'C', 'T', 'G' };
static const Base base[85] = 
  { -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, 
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
    -1, -1, -1, -1, -1,  A, -1,  C, -1, -1,
    -1,  G, -1, -1, -1, -1, -1, -1, -1, -1, 
    -1, -1, -1, -1,  T };

const Base
base2 (const char b)
{
  switch (b)
    {
    case 'A': return A;
    case 'C': return C;
    case 'T': return T;
    case 'G': return G;
    default: abort ();
    }
}

int
main (int argc, char *args) 
{
  for (Base b = A; b <= G; b++)
    {
      std::cout << name[b] << ":" 
                << name[pair[b]] << std::endl;
    }
  for (Base b = A; b <= G; b++)
    {
      std::cout << name[base[name[b]]] << ":" 
                << name[pair[base[name[b]]]] << std::endl;
    }
  for (Base b = A; b <= G; b++)
    {
      std::cout << name[base2(name[b])] << ":" 
                << name[pair[base2(name[b])]] << std::endl;
    }
};

base[] is a fast ascii char to Base (i.e. int between 0 and 3 inclusive) lookup that is a bit ugly. A good optimizing compiler should be able to handle base2() but I'm not sure if any do.

Buckie answered 1/3, 2013 at 6:48 Comment(3)

But this solution assumes the input comes as numbers 0, 1, 2, 3, rather than ASCII characters. You'd still have to perform a mapping of the input, correct? – Geter 1/3, 2013 at 6:53

Good point. I fixed it with the fastest mapping I could think of, and a prettier but possibly slower switch-based mapping. – Buckie 1/3, 2013 at 7:32

Good tip to know if you are dealing with C++ The "smallest space solution" is 49 lines long ;) – Particularly 13/7, 2019 at 8:54

O

0

crusty old 'C/Asm' coder here...

Take a look at the binary representation for ASCII symbols A,T,C and G :

A = 0100 0001
C = 0100 0011
T = 0101 0100
G = 0100 0111

By using c=(c>>1)&3; we can convert any symbol into a number 0..3 and this will work regardless of whether the input is provided in lower or upper case.

Now look at the pairs AT and CG:

A = 00 C = 01
T = 10 G = 11

To convert any of these symbols to its partner, you simply flip the high bit (XOR 2) ... and this works for converting them back again.

This is far simpler than any lookup, map, table or math.

Additional benefits

Because each symbol is only 2 bits wide, you can store 4 symbols in an byte. You can convert all four symbols using XOR 10101010b; (0xCC in hex)

If you're on a 64bit architecture, you can process 32 symbols at a time, with a single bitwise XOR. So, storage and processing can be super efficient.

Similarly, input ASCII can be converted 8 characters at a time using c=(c>>1)& 0x0303030303030303; over 64bits of char data.

Output

When you need to output ASCII again, this is likely AFTER all your processing is done. Simply define const Symbol="ACTG"; and lookup character Symbol[value&3]; for every 2 bits of your sequence. Value>>=2; will discard each character and present the next.

This indirect is still way faster than any map or dictionary. It doesn't even need to access memory, as "ACTG" fits in a 32bit register.

Note: None of these features require any conditionals, they are branchless :)

Oys answered 15/6 at 7:24 Comment(0)

Recommended topics

Hot tags