I was wondering if there is a way to get protein sequences from UniProt protein IDs. I checked a few online tools, but they only return one sequence at a time, and I have 5536 values. Is there a package in Biopython to do this?
All sequences in UniProt can be accessed at "http://www.uniprot.org/uniprot/" + UniprotID + ".fasta". You can obtain any sequence with:
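import requests as r
from Bio import SeqIO
from io import StringIO
cID='P04637'
baseUrl="http://www.uniprot.org/uniprot/"
currentUrl=baseUrl+cID+".fasta"
# a plain download is a GET request, not a POST
response = r.get(currentUrl)
# stop early on an HTTP error instead of parsing an error page
response.raise_for_status()
# response.text is already a str, so no join is needed
cData=response.text
Seq=StringIO(cData)
# parse the FASTA text into a list of SeqRecord objects
pSeq=list(SeqIO.parse(Seq,'fasta'))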
As written, cID is a single entry; to handle a list, loop over it and run the download for each ID. If you loop through a big list, add a delay between downloads so you do not saturate the server. Hope it helps.
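A minimal sketch of such a loop with a delay, using only the standard library. The helper names (download_fasta, fasta_to_sequence, fetch_sequences) and the one-second default delay are assumptions made for illustration; only the URL pattern comes from the answer above.

```python
import time
import urllib.request

# URL pattern taken from the answer above
BASE_URL = "http://www.uniprot.org/uniprot/"


def download_fasta(uniprot_id):
    """Download the FASTA text for one ID (performs a network request)."""
    with urllib.request.urlopen(BASE_URL + uniprot_id + ".fasta") as response:
        return response.read().decode("utf-8")


def fasta_to_sequence(fasta_text):
    """Drop header lines (starting with '>') and join the rest into one string."""
    lines = fasta_text.splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def fetch_sequences(ids, download=download_fasta, delay=1.0):
    """Return {id: sequence}, sleeping `delay` seconds between downloads.

    `download` is injectable (hypothetical design choice) so the loop can be
    tried out without touching the server.
    """
    sequences = {}
    for i, uniprot_id in enumerate(ids):
        if i > 0:
            time.sleep(delay)  # be polite: pause before every request but the first
        sequences[uniprot_id] = fasta_to_sequence(download(uniprot_id))
    return sequences
```

Calling fetch_sequences(my_ids) makes one request per ID, so for several thousand IDs a bulk download or a batch query is likely to be faster.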
One of the fastest and easiest ways to fetch many sequences from UniProt in Python is to use the pyfaidx package. It is a simple but well-tested tool, built upon the well-known faidx algorithm from SAMtools. It is also citable in academic publications.
Simply download a FASTA file with all sequences (or just a chosen subset) from https://www.uniprot.org/downloads, unpack the file if needed, install pyfaidx (e.g. with pip install pyfaidx --user or from bioconda) and load the sequences with the Fasta constructor:
from pyfaidx import Fasta
sequences = Fasta('uniprot_sprot.fasta')
The first load may take a while because pyfaidx builds an index of the file, but afterwards all operations will be very fast. Now sequences is a dict-like object, so you can access the entry you need with:
p53 = sequences['sp|P04637|P53_HUMAN']
print(p53)
which shows the sequence:
MEEPQSDPSVEPPLSQETFSDLWKLLPENNVLSPLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMPEAAPPVAPAPAAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKMFCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKGEPHHELPPGSTKRALPNNTSSSPQPKKKPLDGEYFTLQIRGRERFEMFRELNEALELKDAQAGKEPGGSRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD
This sequence object is, however, much more than a string: it provides many handy attributes and utility functions (long_name, unpadded_len, slices with start and end attributes, complement(), reverse() and so on; see the documentation for more).
If you want to access the sequence by the UniprotID instead of the full identifier from fasta file, use:
def extract_id(header):
    return header.split('|')[1]
sequences = Fasta('uniprot_sprot.fasta', key_function=extract_id)
print(sequences['P04637'])
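PS. Just one caveat: watch out for 1-based indexing. Slicing uses ordinary 0-based Python indices, but the start and end attributes of the resulting slice are reported as 1-based coordinates.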
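With a long list of IDs, some may be absent from the downloaded file. The sketch below looks up many accessions in a dict-like object and records the ones that are missing. The helper collect_sequences is hypothetical, and it assumes the object raises KeyError for an unknown key, as a plain dict does.

```python
def extract_id(header):
    """Return the accession from a header such as 'sp|P04637|P53_HUMAN'."""
    return header.split('|')[1]


def collect_sequences(sequences, accessions):
    """Look up each accession in a dict-like object.

    Returns (found, missing): a dict of accession -> sequence string,
    and a list of accessions that raised KeyError.
    """
    found, missing = {}, []
    for accession in accessions:
        try:
            found[accession] = str(sequences[accession])
        except KeyError:
            missing.append(accession)
    return found, missing
```

Here sequences would be the object created with key_function=extract_id, and accessions the list of UniProt IDs read from your own file.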
You can also get sequences from the SwissProt/UniProt database via the NCBI Entrez server. One way to fetch files from NCBI Entrez and read the sequences is the Python package biotite:
>>> import biotite.database.entrez as entrez
>>> import biotite.sequence as seq
>>> import biotite.sequence.io.fasta as fasta
>>> # Find UIDs for SwissProt/UniProt entries
>>> query = entrez.SimpleQuery("Avidin", "Protein Name") \
... & entrez.SimpleQuery("Gallus gallus", "Organism") \
... & entrez.SimpleQuery("srcdb_swiss-prot", "Properties")
>>> print(query)
((Avidin[Protein Name]) AND ("Gallus gallus"[Organism])) AND (srcdb_swiss-prot[Properties])
>>> uids = entrez.search(query, db_name="protein")
>>> print(uids)
['158515411']
>>> # Download FASTA file containing the sequence(s)
>>> # from NCBI Entrez database
>>> file_name = entrez.fetch_single_file(
... uids, "avidin.fa", db_name="protein", ret_type="fasta"
... )
>>> # Read file
>>> fasta_file = fasta.FastaFile()
>>> fasta_file.read(file_name)
>>> print(fasta_file)
>sp|P02701.3|AVID_CHICK RecName: Full=Avidin; Flags: Precursor
MVHATSPLLLLLLLSLALVAPGLSARKCSLTGKWTNDLGSNMTIGAVNSRGEFTGTYITAVTATSNEIKE
SPLHGTQNTINKRTQPTFGFTVNWKFSESTTVFTGQCFIDRNGKEVLKTMWLLRSSVNDIGDDWKATRVG
INIFTRLRTQKE
>>> # Convert first sequence in file to 'ProteinSequence' object
>>> # Use a name that does not shadow the imported 'seq' module
>>> avidin = fasta.get_sequence(fasta_file)
>>> print(avidin)
MVHATSPLLLLLLLSLALVAPGLSARKCSLTGKWTNDLGSNMTIGAVNSRGEFTGTYITAVTATSNEIKESPLHGTQNTINKRTQPTFGFTVNWKFSESTTVFTGQCFIDRNGKEVLKTMWLLRSSVNDIGDDWKATRVGINIFTRLRTQKE
You can probably iterate over your list of values, calling the required method from the library each time.
Try the code below to get the protein sequences for all IDs listed in the query parameter:
import urllib.parse
import urllib.request
url = 'https://www.uniprot.org/uploadlists/'
params = {
    'from': 'ACC+ID',
    'to': 'ACC',
    'format': 'txt',
    'query': 'P13368 P20806 Q9UM73 P97793 Q17192'
}
# in Python 3 the POST body must be bytes
data = urllib.parse.urlencode(params).encode('utf-8')
request = urllib.request.Request(url, data)
contact = ""  # contact email address, so the server admins can reach you when debugging
request.add_header('User-Agent', 'Python %s' % contact)
with urllib.request.urlopen(request) as response:
    page = response.read().decode('utf-8')
print(page)
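For several thousand IDs it may be safer to send the query in batches rather than as one very long string. A minimal sketch of building the space-separated query values follows. The helper chunk_query_strings is hypothetical, and the chunk size of 500 is an arbitrary assumption, not a documented server limit.

```python
def chunk_query_strings(ids, chunk_size=500):
    """Yield space-separated query strings with at most `chunk_size` IDs each."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(ids), chunk_size):
        yield " ".join(ids[start:start + chunk_size])
```

Each yielded string can be used as the 'query' value of params, with one request per chunk and a short pause between requests.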