Protein sequence from uniprot protein id python

I was wondering if there is a way to get protein sequences from UniProt protein IDs. I checked a few online tools, but they only return one sequence at a time, and I have 5536 values. Is there a package in Biopython to do this?

Hexagram answered 29/9, 2018 at 15:4 Comment(0)
S
10

All UniProt sequences can be accessed at "http://www.uniprot.org/uniprot/" + UniprotID + ".fasta". You can obtain any sequence with:

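import requests as r
from Bio import SeqIO
from io import StringIO

cID = 'P04637'

baseUrl = "http://www.uniprot.org/uniprot/"
currentUrl = baseUrl + cID + ".fasta"
response = r.get(currentUrl)  # a plain download is a GET request, not a POST
response.raise_for_status()   # stop here if the download failed
cData = response.text         # already a str, so ''.join(response.text) is not needed

Seq = StringIO(cData)
pSeq = list(SeqIO.parse(Seq, 'fasta'))  # list of SeqRecord objects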

Here cID is a single accession. To fetch many sequences, loop over a list of IDs; if the list is big, add a delay between downloads so you don't saturate the server. Hope it helps.
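
A loop with a delay between downloads might look like the following minimal sketch. The names fetch_fasta, fasta_to_sequence and fetch_many, and the one-second delay, are illustrative choices and not part of requests or Biopython. The URL pattern is the one from this answer and is assumed to still resolve; UniProt has since reorganised its web services, so it may need adjusting.

```python
import time

BASE_URL = "http://www.uniprot.org/uniprot/"  # pattern from the answer; assumed to still resolve


def fetch_fasta(uniprot_id):
    """Download one FASTA record as text (needs the requests package)."""
    import requests  # imported here so the other helpers work without it

    response = requests.get(BASE_URL + uniprot_id + ".fasta")
    response.raise_for_status()
    return response.text


def fasta_to_sequence(fasta_text):
    """Join the sequence lines of a single-record FASTA string."""
    lines = fasta_text.splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def fetch_many(ids, fetch=fetch_fasta, delay=1.0):
    """Return {id: sequence}, pausing `delay` seconds between downloads."""
    sequences = {}
    for i, uniprot_id in enumerate(ids):
        if i > 0:
            time.sleep(delay)  # be polite to the server
        sequences[uniprot_id] = fasta_to_sequence(fetch(uniprot_id))
    return sequences
```

Calling fetch_many(['P04637', 'P02701']) would return a dict mapping each accession to its sequence string. With a one-second delay, 5536 IDs need roughly an hour and a half of waiting on top of the download time, so for lists of that size a bulk download may be the better option.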

Stowell answered 30/1, 2019 at 4:11 Comment(1)
Why do I have to do this: cData=''.join(response.text)? - Gleam
D
2

One of the fastest and easiest ways to fetch many sequences from UniProt in Python is to use the pyfaidx package. It is a simple but well-tested tool, built upon a well-known algorithm from SAMtools. It is also citable in academic publications.

Simply download a FASTA file with all sequences (or just a chosen subset) from https://www.uniprot.org/downloads, unpack the file if needed, install pyfaidx (e.g. with pip install pyfaidx --user or from bioconda) and load the sequences with the Fasta constructor:

from pyfaidx import Fasta
sequences = Fasta('uniprot_sprot.fasta')

The first loading may take a while, but afterwards all operations will be very fast. Now sequences is a dict-like object, so you can access the entry you need with:

p53 = sequences['sp|P04637|P53_HUMAN']
print(p53)

which shows the sequence:

MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD

This sequence object is, however, much more than a string: it provides many handy utility functions and attributes (long_name, unpadded_len, slices with start and end, complement(), reverse() and so on; see the documentation for more).

If you want to access the sequence by the UniprotID instead of the full identifier from fasta file, use:

def extract_id(header):
    return header.split('|')[1]

sequences = Fasta('uniprot_sprot.fasta', key_function=extract_id)
print(sequences['P04637'])

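PS. Just one caveat: watch out for 1-based indexing. The start and end attributes that pyfaidx reports on a slice are 1-based coordinates, whereas the Python slice indices you pass in are 0-based as usual.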
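
For a long list of accessions, such as the 5536 in the question, the lookup can be wrapped in a small helper. This is only a sketch: collect_sequences is a hypothetical name, and it assumes sequences is any dict-like object keyed by accession, such as the Fasta object built with key_function=extract_id. IDs that are missing from the downloaded file are collected in a list instead of raising KeyError.

```python
def collect_sequences(sequences, ids):
    """Look up each ID in a dict-like `sequences`; return (found, missing)."""
    found, missing = {}, []
    for uniprot_id in ids:
        if uniprot_id in sequences:
            # str() turns a record into its plain sequence string
            found[uniprot_id] = str(sequences[uniprot_id])
        else:
            missing.append(uniprot_id)
    return found, missing


# Stand-in for a pyfaidx Fasta object, so the sketch runs without the package
demo = {"P04637": "MEEPQSDPSV", "P02701": "MVHATSPLLL"}
found, missing = collect_sequences(demo, ["P04637", "Q00000", "P02701"])
print(len(found), missing)
```

Missing IDs are worth checking by hand: uniprot_sprot.fasta holds only the reviewed Swiss-Prot entries, so unreviewed accessions will not be in it.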

Dawes answered 30/9, 2018 at 14:1 Comment(0)
Y
1

You can also get sequences from the SwissProt/UniProt database via the NCBI Entrez server. One way to fetch files from NCBI Entrez and read the sequences is the Python package biotite:

>>> import biotite.database.entrez as entrez
>>> import biotite.sequence as seq
>>> import biotite.sequence.io.fasta as fasta
>>> # Find UIDs for SwissProt/UniProt entries
>>> query =   entrez.SimpleQuery("Avidin", "Protein Name") \
...         & entrez.SimpleQuery("Gallus gallus", "Organism") \
...         & entrez.SimpleQuery("srcdb_swiss-prot", "Properties")
>>> print(query)
((Avidin[Protein Name]) AND ("Gallus gallus"[Organism])) AND (srcdb_swiss-prot[Properties])
>>> uids = entrez.search(query, db_name="protein")
>>> print(uids)
['158515411']
>>> # Download FASTA file containing the sequence(s)
>>> # from NCBI Entrez database
>>> file_name = entrez.fetch_single_file(
...     uids, "avidin.fa", db_name="protein", ret_type="fasta"
... )
>>> # Read file (in recent biotite versions, read() is a class method
>>> # that returns the file object)
>>> fasta_file = fasta.FastaFile.read(file_name)
>>> print(fasta_file)
>sp|P02701.3|AVID_CHICK RecName: Full=Avidin; Flags: Precursor
MVHATSPLLLLLLLSLALVAPGLSARKCSLTGKWTNDLGSNMTIGAVNSRGEFTGTYITAVTATSNEIKE
SPLHGTQNTINKRTQPTFGFTVNWKFSESTTVFTGQCFIDRNGKEVLKTMWLLRSSVNDIGDDWKATRVG
INIFTRLRTQKE
>>> # Convert first sequence in file to 'ProteinSequence' object
>>> # (named 'protein' so it does not shadow the 'seq' module imported above)
>>> protein = fasta.get_sequence(fasta_file)
>>> print(protein)
MVHATSPLLLLLLLSLALVAPGLSARKCSLTGKWTNDLGSNMTIGAVNSRGEFTGTYITAVTATSNEIKESPLHGTQNTINKRTQPTFGFTVNWKFSESTTVFTGQCFIDRNGKEVLKTMWLLRSSVNDIGDDWKATRVGINIFTRLRTQKE
Yam answered 9/10, 2018 at 14:47 Comment(1)
THANK YOU for this. I was trying to find a way to programmatically extract the sequences for my protein IDs for hours. Wonderful find, that biotite package. Will definitely add that to my toolkit. The code works wonders. - Sherysherye
E
0

You can probably iterate over your list of values, calling the required method from the library each time.

Eolithic answered 29/9, 2018 at 15:5 Comment(2)
I am not aware of any library. I am still searching for one. That's the reason I posted for help. - Hexagram
If you can make an HTTP/REST/SOAP request to an online application to get the info you need on one particular item, you could write a Python script that iterates over every piece of data you have and sends it to that endpoint. I don't know about your particular problem, but since these are web applications, they likely work by accepting some POST or GET request. Just make that request from Python while iterating over your dataset, and store the results. If you have more info on the applications you mention, we may be able to help more. - Eolithic
I
0

Try the code below to get all the protein sequences listed in the query parameter:

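import urllib.parse
import urllib.request

# Legacy endpoint from the original answer; it may no longer be available
url = 'https://www.uniprot.org/uploadlists/'
params = {
    'from': 'ACC+ID',
    'to': 'ACC',
    'format': 'txt',
    'query': 'P13368 P20806 Q9UM73 P97793 Q17192'
}
data = urllib.parse.urlencode(params).encode('utf-8')  # POST body must be bytes in Python 3
request = urllib.request.Request(url, data)
contact = ""  # your contact email address, useful for debugging
request.add_header('User-Agent', 'Python %s' % contact)
with urllib.request.urlopen(request) as response:
    page = response.read().decode('utf-8')
print(page)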
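
With thousands of IDs it may be safer to send the query in batches rather than as one very long string. Below is a minimal sketch that reuses the form parameters from the code above; chunked, build_payloads and the batch size of 500 are illustrative choices, not UniProt requirements.

```python
from urllib.parse import urlencode


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_payloads(ids, size=500):
    """Return one URL-encoded request body (bytes) per batch of IDs."""
    payloads = []
    for batch in chunked(ids, size):
        params = {
            'from': 'ACC+ID',
            'to': 'ACC',
            'format': 'txt',
            'query': ' '.join(batch),  # space-separated, as in the answer
        }
        payloads.append(urlencode(params).encode('utf-8'))
    return payloads


ids = ['P13368', 'P20806', 'Q9UM73', 'P97793', 'Q17192']
payloads = build_payloads(ids, size=2)
print(len(payloads))  # 3 batches: 2 + 2 + 1 IDs
```

Each payload can then be sent in place of data in the request above, ideally with a short pause between batches.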
Inattention answered 1/10, 2018 at 12:10 Comment(0)
