How do I convert three-letter amino acid codes to one-letter codes with Python or R?

I have a FASTA file as shown below. I would like to convert the three-letter codes to one-letter codes. How can I do this with Python or R?

>2ppo
ARGHISLEULEULYS
>3oot
METHISARGARGMET

desired output

>2ppo
RHLLK
>3oot
MHRRM

Your suggestions would be appreciated!

Flaminius answered 6/10, 2012 at 13:39 Comment(7)
How is ARGHISLEULEULYS converted to RHLLK? What is the logic? - Berenice
@Tichodroma: ARG = R, HIS = H, LEU = L, etc. - Anse
@Anse etc.? It would be useful to add the complete translation list to the question, or at least link to it. I'd like to help with this question, but I can't unless I get all the necessary information. - Berenice
@Tichodroma: en.wikipedia.org/wiki/… - Anse
Ah, so you need to split the string into an array and take every 3rd element of the array as your final string? - Fiddling
How about: stat.ethz.ch/pipermail/bioconductor/2008-January/020958.html - Connacht
I'm curious where you found such a file. I've never seen a FASTA file using three-letter amino acid codes like that. - Cheyennecheyne
T
19

Biopython already has built-in dictionaries to help with such translations. The following commands will show you the whole list of available dictionaries:

import Bio.SeqUtils  # a bare "import Bio" does not load the SeqUtils subpackage
help(Bio.SeqUtils.IUPACData)

The predefined dictionary you are looking for:

Bio.SeqUtils.IUPACData.protein_letters_3to1['Ala']
Touching answered 5/1, 2014 at 21:55 Comment(2)
This ought to be the chosen answer. A small note: in Python 3 at least, the dictionary actually lives in the module Bio.Data.IUPACData, and Bio.SeqUtils imports it from there. So if you only want protein_letters_3to1 in the current namespace, you can do: from Bio.Data.IUPACData import protein_letters_3to1 - Chromosphere
While this answer is useful, it only works (for me as of Jan 8 2023) for codes capitalized like "Ala", not "ALA" as in the question. I found that Bio.PDB.Polypeptide.three_to_one(sequence) works with all-capitalized codes. - Dermatitis
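The capitalization issue raised in the comments can be worked around by upper-casing the dictionary keys once. This is only a sketch: the import is the one quoted in the comment above, while UPPER_3TO1, three_to_one_any_case and the small fallback table (used only when Biopython is not installed) are names and data assumed here for illustration.

```python
try:
    # Biopython's dictionary; its keys are capitalized like 'Ala'.
    from Bio.Data.IUPACData import protein_letters_3to1
except ImportError:
    # Hypothetical partial fallback so the sketch still runs without Biopython.
    protein_letters_3to1 = {"Ala": "A", "Arg": "R", "His": "H",
                            "Leu": "L", "Lys": "K", "Met": "M"}

# Upper-case the keys once so 'ALA', 'Ala' and 'ala' all resolve.
UPPER_3TO1 = {code.upper(): letter for code, letter in protein_letters_3to1.items()}

def three_to_one_any_case(seq):
    """Convert a run of three-letter codes, in any capitalization, to one-letter codes."""
    seq = seq.upper()
    if len(seq) % 3 != 0:
        raise ValueError("Input length should be a multiple of three")
    return "".join(UPPER_3TO1[seq[i:i + 3]] for i in range(0, len(seq), 3))

print(three_to_one_any_case("ARGHISLEULEULYS"))
print(three_to_one_any_case("MetHisArgArgMet"))
```

Normalizing the table rather than each lookup key keeps the per-residue work to a plain dictionary access.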
A
18

Use a dictionary to look up the one letter codes:

d = {'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
     'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N', 
     'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W', 
     'ALA': 'A', 'VAL':'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M'}

And a simple function to match the three letter codes with one letter codes for the entire string:

def shorten(x):
    if len(x) % 3 != 0: 
        raise ValueError('Input length should be a multiple of three')

    y = ''
    for i in range(len(x) // 3):
        y += d[x[3 * i : 3 * i + 3]]
    return y

Testing your example:

>>> shorten('ARGHISLEULEULYS')
'RHLLK'
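To apply the same chunk-and-lookup idea to a whole FASTA file, one option is to pass header lines (those starting with >) through unchanged and convert every other line. The following is a minimal sketch, not part of the original answer: CODES, shorten_line, convert_fasta and the input file name are assumptions made for illustration.

```python
import io

# Standard 20 amino acids, three-letter (upper case) -> one-letter.
CODES = {'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'CYS': 'C',
         'GLN': 'Q', 'GLU': 'E', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I',
         'LEU': 'L', 'LYS': 'K', 'MET': 'M', 'PHE': 'F', 'PRO': 'P',
         'SER': 'S', 'THR': 'T', 'TRP': 'W', 'TYR': 'Y', 'VAL': 'V'}

def shorten_line(line):
    """Convert one sequence line of three-letter codes to one-letter codes."""
    if len(line) % 3 != 0:
        raise ValueError('Sequence length should be a multiple of three: %r' % line)
    return ''.join(CODES[line[i:i + 3]] for i in range(0, len(line), 3))

def convert_fasta(lines):
    """Yield converted lines; headers ('>...') and blank lines pass through."""
    for raw in lines:
        line = raw.strip()  # drop the trailing newline before chunking
        if not line or line.startswith('>'):
            yield line
        else:
            yield shorten_line(line.upper())

# In-memory stand-in for open('input.fasta'); the file name is hypothetical.
example = io.StringIO('>2ppo\nARGHISLEULEULYS\n>3oot\nMETHISARGARGMET\n')
result = list(convert_fasta(example))
print('\n'.join(result))
```

With a real file, the same generator can be fed an open file handle, for example inside a with open(...) block, and the yielded lines written to an output file.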
Anse answered 6/10, 2012 at 13:55 Comment(4)
Thank you very much for your answer. I am new to Python. How can I pass the input file to your code? - Flaminius
@user1725152: That depends on the format of the input file. But I imagine it could be something like for line in inputfile: print(shorten(line)). - Anse
len(x) / 3 returns a float, so if you get the error TypeError: 'float' object cannot be interpreted as an integer, simply change it to: for i in range(int(len(x)/3)): - Stepup
@universvm: Thanks for the comment. This is from 2012, so it was written in Python 2, where len(x) / 3 would return an int. Updated the answer to use integer division. - Anse
I
7

Here is a way to do it in R:

# Variables:
foo <- c("ARGHISLEULEULYS","METHISARGARGMET")

# Code maps:
code3 <- c("Ala", "Arg", "Asn", "Asp", "Cys", "Glu", "Gln", "Gly", "His", 
"Ile", "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", 
"Tyr", "Val")
code1 <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I", "L", "K", 
"M", "F", "P", "S", "T", "W", "Y", "V")

# For each code replace 3letter code by 1letter code:
for (i in 1:length(code3))
{
    foo <- gsub(code3[i],code1[i],foo,ignore.case=TRUE)
}

Results in:

> foo
[1] "RHLLK" "MHRRM"

Note that I changed the variable name as variable names are not allowed to start with a number in R.

Inharmonic answered 6/10, 2012 at 14:1 Comment(1)
This isn't good. Take TRPHISGLU as an example: you expect the algorithm to translate it as {TRP}{HIS}{GLU} -> WHE, but what really happens with your algorithm is TRP{HIS}{GLU} -> TR{PHE} -> TRF. You do need to split foo into substrings of three characters to avoid such interactions. - Jessamyn
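The interaction described in this comment can be reproduced in a few lines. The sketch below mimics the gsub loop in Python (so it is an analogue, not the R code itself) and compares it with chunk-wise lookup; replace_in_order and replace_by_chunks are names invented for this illustration.

```python
import re

CODE3 = ["Ala", "Arg", "Asn", "Asp", "Cys", "Glu", "Gln", "Gly", "His", "Ile",
         "Leu", "Lys", "Met", "Phe", "Pro", "Ser", "Thr", "Trp", "Tyr", "Val"]
CODE1 = ["A", "R", "N", "D", "C", "E", "Q", "G", "H", "I",
         "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V"]

def replace_in_order(seq):
    """Mimic the gsub loop: replace each code anywhere in the string, in table order."""
    for three, one in zip(CODE3, CODE1):
        seq = re.sub(three, one, seq, flags=re.IGNORECASE)
    return seq

def replace_by_chunks(seq):
    """Split into fixed three-character chunks first, then look each one up."""
    lookup = {three.upper(): one for three, one in zip(CODE3, CODE1)}
    return "".join(lookup[seq[i:i + 3].upper()] for i in range(0, len(seq), 3))

naive = replace_in_order("TRPHISGLU")
chunked = replace_by_chunks("TRPHISGLU")
print(naive)    # TRF: GLU and HIS are replaced first, which creates a spurious PHE
print(chunked)  # WHE
```

Sequential replacement ignores codon-like frame boundaries, so any earlier substitution can create or destroy a later match. Fixed-width chunking avoids this because every residue is looked up exactly once.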
O
6
>>> src = "ARGHISLEULEULYS"
>>> trans = {'ARG':'R', 'HIS':'H', 'LEU':'L', 'LYS':'K'}
>>> "".join(trans[src[x:x+3]] for x in range(0, len(src), 3))
'RHLLK'

You just need to add the rest of the entries to the trans dict.

Edit:

To make the rest of trans, you can do this. File table:

Ala A
Arg R
Asn N
Asp D
Cys C
Glu E
Gln Q
Gly G
His H
Ile I
Leu L
Lys K
Met M
Phe F
Pro P
Ser S
Thr T
Trp W
Tyr Y
Val V

Read it:

trans = dict((l.upper(), s) for l, s in
             [row.strip().split() for row in open("table").readlines()])
Orthopsychiatry answered 6/10, 2012 at 13:53 Comment(0)
U
6

Biopython has a nice solution

>>> from Bio.PDB.Polypeptide import *
>>> three_to_one('ALA')
'A'

For your example, I would solve it with this one-liner:

>>> from Bio.PDB.Polypeptide import *
>>> str3aa = 'ARGHISLEULEULYS'
>>> "".join([three_to_one(aa3) for aa3 in [ "".join(g) for g in zip(*(iter(str3aa),) * 3)]])
'RHLLK'

They may criticize me for this type of one liner :), but deep in my heart I am still in love with PERL.

Upheave answered 18/6, 2014 at 7:3 Comment(1)
For future visitors: Bio.PDB.Polypeptide.three_to_one() was dropped in Biopython 1.80 in favor of a dictionary in a new module, Bio.Data.PDBData.protein_letters_3to1. The Polypeptide module uses it as well. - Human
E
4

You may want to look into installing Biopython, since you are parsing a .fasta file and then converting to one-letter codes. Unfortunately, Biopython only has the function seq3 (in the Bio.SeqUtils module), which does the inverse of what you want. Example output in IDLE:

>>> from Bio.SeqUtils import seq3
>>> seq3("MAIVMGRWKGAR*")
'MetAlaIleValMetGlyArgTrpLysGlyAlaArgTer'

Unfortunately, there is no 'seq1' function (yet...), but I thought this might be helpful to you in the future. As for your problem, Junuxx is correct: create a dictionary and use a for loop to read the string in blocks of three and translate. Here is a function similar to the one he provided that is all-inclusive and handles lower case as well.

def AAcode_3_to_1(seq):
    '''Turn a three letter protein into a one letter protein.

    The 3 letter code can be upper, lower, or any mix of cases.
    The seq input length should be a multiple of 3, or else a
    ValueError is raised.

    >>> AAcode_3_to_1('METHISARGARGMET')
    'MHRRM'

    '''
    d = {'CYS': 'C', 'ASP': 'D', 'SER': 'S', 'GLN': 'Q', 'LYS': 'K',
         'ILE': 'I', 'PRO': 'P', 'THR': 'T', 'PHE': 'F', 'ASN': 'N',
         'GLY': 'G', 'HIS': 'H', 'LEU': 'L', 'ARG': 'R', 'TRP': 'W', 'TER': '*',
         'ALA': 'A', 'VAL': 'V', 'GLU': 'E', 'TYR': 'Y', 'MET': 'M', 'XAA': 'X'}

    if len(seq) % 3 != 0:
        raise ValueError("Sequence length was not a multiple of 3!")

    upper_seq = seq.upper()
    single_seq = ''
    # Integer division: range() needs an int in Python 3
    for i in range(len(upper_seq) // 3):
        single_seq += d[upper_seq[3*i:3*i+3]]
    return single_seq
Ethel answered 8/10, 2012 at 22:43 Comment(1)
You'll be able to use Bio.SeqUtils.seq1 as of the next release, Biopython 1.61 (or run from the GitHub repository if you like being on the leading edge). - Cheyennecheyne
J
3

Using R:

convert <- function(l) {

  map <- c("A", "R", "N", "D", "C", "E", "Q", "G", "H", "I",
           "L", "K", "M", "F", "P", "S", "T", "W", "Y", "V")

  names(map) <- c("ALA", "ARG", "ASN", "ASP", "CYS", "GLU", "GLN",
                  "GLY", "HIS", "ILE", "LEU", "LYS", "MET", "PHE",
                  "PRO", "SER", "THR", "TRP", "TYR", "VAL")

  sapply(strsplit(l, "(?<=[A-Z]{3})", perl = TRUE),
         function(x) paste(map[x], collapse = ""))
}

convert(c("ARGHISLEULEULYS", "METHISARGARGMET"))
# [1] "RHLLK" "MHRRM"
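For readers working in Python, the same split-then-lookup structure can be sketched with re.findall. This is an analogue of the approach above, not a translation of the lookbehind trick; ONE_LETTER and this convert function are names assumed for the example.

```python
import re

ONE_LETTER = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLU": "E", "GLN": "Q", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def convert(sequences):
    """Convert each upper-case three-letter sequence to one-letter codes."""
    converted = []
    for seq in sequences:
        # findall returns consecutive, non-overlapping three-letter chunks
        chunks = re.findall(r"[A-Z]{3}", seq)
        # If anything was skipped, the input was not a clean run of codes
        if "".join(chunks) != seq:
            raise ValueError("Not a clean run of three-letter codes: %r" % seq)
        converted.append("".join(ONE_LETTER[chunk] for chunk in chunks))
    return converted

result = convert(["ARGHISLEULEULYS", "METHISARGARGMET"])
print(result)
```

The check after findall guards against leftover characters, which the regex would otherwise silently drop.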
Jessamyn answered 6/10, 2012 at 18:53 Comment(3)
+1 for the clever method of splitting a string into 3-character substrings. It demonstrates something interesting about how regex matching works. - Koziarz
@fodel Thank you very much for your answer. I have more than 1000 sequences in a text file. First I have to import this file into R and then change the three-letter codes to one-letter codes. I have shown the desired output. If you can, please help me. - Flaminius
The function I showed you takes a vector of sequences as input. How to read a FASTA file into a vector of sequences in R is a different question. A quick Google search and I can point you to at least three different packages: Biostrings (readFASTA), seqinr (read.fasta), bio3d (read.fasta). - Jessamyn
I
2

Another way to do it is with the seqinr and iPAC packages in R.

# install.packages("seqinr")
# source("https://bioconductor.org/biocLite.R")
# biocLite("iPAC")

library(seqinr)
library(iPAC)

#read in file
fasta = read.fasta(file = "test_fasta.fasta", seqtype = "AA", as.string = T, set.attributes = F)
#split string
n = 3
fasta1 = lapply(fasta, function(x) substring(x, seq(1, nchar(x), n), seq(n, nchar(x), n)))
#convert the three letter code for each element in the list 
fasta2 = lapply(fasta1, function(x) paste(sapply(x, get.SingleLetterCode), collapse = ""))

# > fasta2
# $`2ppo`
# [1] "RHLLK"
#
# $`3oot`
# [1] "MHRRM"
Inconsumable answered 28/8, 2015 at 11:19 Comment(0)
B
2

For those who land here in 2017 and beyond:

Here's a single-line Linux bash command to convert three-letter amino acid codes to single-letter codes in a text file. I know this is not very elegant, but I hope it helps anyone who is searching for the same thing and wants a single-line command.

sed 's/ALA/A/g;s/CYS/C/g;s/ASP/D/g;s/GLU/E/g;s/PHE/F/g;s/GLY/G/g;s/HIS/H/g;s/HID/H/g;s/HIE/H/g;s/ILE/I/g;s/LYS/K/g;s/LEU/L/g;s/MET/M/g;s/ASN/N/g;s/PRO/P/g;s/GLN/Q/g;s/ARG/R/g;s/SER/S/g;s/THR/T/g;s/VAL/V/g;s/TRP/W/g;s/TYR/Y/g;s/MSE/X/g' < input_file_three_letter_code.txt > output_file_single_letter_code.txt

Solution for the original question above, as a single command line. The input file is given to the first sed, because an input redirect on the last command would replace the piped data:

sed 's/.\{3\}/& /g' input_file_three_letter_code.txt | sed 's/ALA/A/g;s/CYS/C/g;s/ASP/D/g;s/GLU/E/g;s/PHE/F/g;s/GLY/G/g;s/HIS/H/g;s/HID/H/g;s/HIE/H/g;s/ILE/I/g;s/LYS/K/g;s/LEU/L/g;s/MET/M/g;s/ASN/N/g;s/PRO/P/g;s/GLN/Q/g;s/ARG/R/g;s/SER/S/g;s/THR/T/g;s/VAL/V/g;s/TRP/W/g;s/TYR/Y/g;s/MSE/X/g' | sed 's/ //g' > output_file_single_letter_code.txt

Explanation:

[1] sed 's/.\{3\}/& /g' will split the sequence. It adds a space after every 3rd letter.

[2] The second 'sed' command in the pipe will take the output of above and convert to single letter code. Add any non-standard residue as s/XYZ/X/g; to this command.

[3] The third 'sed' command, sed 's/ //g' will remove white-space.

Burlington answered 7/11, 2017 at 15:40 Comment(0)
H
1
my %aa_hash=(
  Ala=>'A',
  Arg=>'R',
  Asn=>'N',
  Asp=>'D',
  Cys=>'C',
  Glu=>'E',
  Gln=>'Q',
  Gly=>'G',
  His=>'H',
  Ile=>'I',
  Leu=>'L',
  Lys=>'K',
  Met=>'M',
  Phe=>'F',
  Pro=>'P',
  Ser=>'S',
  Thr=>'T',
  Trp=>'W',
  Tyr=>'Y',
  Val=>'V',
  Sec=>'U',                       #http://www.uniprot.org/manual/non_std;Selenocysteine (Sec) and pyrrolysine (Pyl)
  Pyl=>'O',
);


    while(<>){
            chomp;
            my $aa=$_;
            warn "ERROR!! $aa invalid or not found in hash\n" if !$aa_hash{$aa};
            print "$aa\t$aa_hash{$aa}\n";
    }

Use this Perl script to convert three-letter amino acid codes to single-letter codes. It expects one code per line, capitalized as in the hash keys (for example Ala).

Hamlen answered 5/7, 2013 at 6:51 Comment(0)
C
0

Python 3 solutions.

In my work, the annoying part is that the amino acid codes can also refer to modified residues, which often appear in PDB/mmCIF files, for example

'Tih'-->'A'.

So the mapping can contain more than 22 pairs. Third-party tools in Python such as

Bio.SeqUtils.IUPACData.protein_letters_3to1

cannot handle these. My easiest solution is to use http://www.ebi.ac.uk/pdbe-srv/pdbechem to find the mapping, and to add each unusual code to the dict in my own function whenever I encounter one.

def three_to_one(three_letter_code):
    mapping = {'Aba':'A','Ace':'X','Acr':'X','Ala':'A','Aly':'K','Arg':'R','Asn':'N','Asp':'D','Cas':'C',
           'Ccs':'C','Cme':'C','Csd':'C','Cso':'C','Csx':'C','Cys':'C','Dal':'A','Dbb':'T','Dbu':'T',
           'Dha':'S','Gln':'Q','Glu':'E','Gly':'G','Glz':'G','His':'H','Hse':'S','Ile':'I','Leu':'L',
           'Llp':'K','Lys':'K','Men':'N','Met':'M','Mly':'K','Mse':'M','Nh2':'X','Nle':'L','Ocs':'C',
           'Pca':'E','Phe':'F','Pro':'P','Ptr':'Y','Sep':'S','Ser':'S','Thr':'T','Tih':'A','Tpo':'T',
           'Trp':'W','Tyr':'Y','Unk':'X','Val':'V','Ycm':'C','Sec':'U','Pyl':'O'} # you can add more
    return mapping[three_letter_code[0].upper() + three_letter_code[1:].lower()]
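The "add entries as you encounter them" workflow can also be expressed as a converter that accepts extra codes at call time. This is a sketch under assumptions: STANDARD, sequence_to_one_letter and the extra and unknown parameters are hypothetical names, and the modified-residue entries in the usage lines are taken from the mapping above.

```python
# Standard 20 amino acids; modified residues are supplied by the caller.
STANDARD = {'ALA': 'A', 'ARG': 'R', 'ASN': 'N', 'ASP': 'D', 'CYS': 'C',
            'GLN': 'Q', 'GLU': 'E', 'GLY': 'G', 'HIS': 'H', 'ILE': 'I',
            'LEU': 'L', 'LYS': 'K', 'MET': 'M', 'PHE': 'F', 'PRO': 'P',
            'SER': 'S', 'THR': 'T', 'TRP': 'W', 'TYR': 'Y', 'VAL': 'V'}

def sequence_to_one_letter(seq, extra=None, unknown=None):
    """Convert a run of three-letter codes, with optional extra mappings.

    extra:   dict of additional codes (any capitalization), e.g. {'Mse': 'M'}
    unknown: letter to emit for unmapped codes; if None, raise KeyError instead
    """
    table = dict(STANDARD)
    if extra:
        table.update({code.upper(): letter for code, letter in extra.items()})
    if len(seq) % 3 != 0:
        raise ValueError('Input length should be a multiple of three')
    letters = []
    for i in range(0, len(seq), 3):
        code = seq[i:i + 3].upper()
        if code in table:
            letters.append(table[code])
        elif unknown is not None:
            letters.append(unknown)
        else:
            raise KeyError(code)
    return ''.join(letters)

print(sequence_to_one_letter('AlaMseSer', extra={'Mse': 'M'}))
print(sequence_to_one_letter('ALATIHGLY', unknown='X'))
```

Raising on unmapped codes by default makes a missing modified residue visible immediately, while the unknown parameter allows a placeholder letter when that is acceptable.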

The other solution is to retrieve the mapping online (But the url and the html pattern may change through time):

import re
import urllib.request

def three_to_one_online(three_letter_code):
    url = "http://www.ebi.ac.uk/pdbe-srv/pdbechem/chemicalCompound/show/" + three_letter_code
    with urllib.request.urlopen(url) as response:
        # Raw string so that \s reaches the regex engine without invalid-escape warnings
        single_letter_code = re.search(r'\s*<td\s*>\s*<h3>One-letter code.*</h3>\s*</td>\s*<td>\s*([A-Z])\s*</td>', response.read().decode('utf-8')).group(1)
    return single_letter_code

Here I use re directly instead of an HTML parser, for simplicity.

Hope this helps.

Celebes answered 7/5, 2018 at 18:43 Comment(0)
