Using BioPython, you can give the joined list of Pubmed IDs to Entrez.efetch
and that will perform a single URL lookup:
from Bio import Entrez
Entrez.email = '[email protected]'
pmids = [17284678,9997]
handle = Entrez.efetch(db="pubmed", id=','.join(map(str, pmids)),
rettype="xml", retmode="text")
records = Entrez.read(handle)
abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
for pubmed_article in records['PubmedArticle']]
abstract_dict = dict(zip(pmids, abstracts))
This gives as result:
{9997: 'Electron paramagnetic resonance and magnetic susceptibility studies of Chromatium flavocytochrome C552 and its diheme flavin-free subunit at temperatures below 45 degrees K are reported. The results show that in the intact protein and the subunit the two low-spin (S = 1/2) heme irons are distinguishable, giving rise to separate EPR signals. In the intact protein only, one of the heme irons exists in two different low spin environments in the pH range 5.5 to 10.5, while the other remains in a constant environment. Factors influencing the variable heme iron environment also influence flavin reactivity, indicating the existence of a mechanism for heme-flavin interaction.',
17284678: 'Eimeria tenella is an intracellular protozoan parasite that infects the intestinal tracts of domestic fowl and causes coccidiosis, a serious and sometimes lethal enteritis. Eimeria falls in the same phylum (Apicomplexa) as several human and animal parasites such as Cryptosporidium, Toxoplasma, and the malaria parasite, Plasmodium. Here we report the sequencing and analysis of the first chromosome of E. tenella, a chromosome believed to carry loci associated with drug resistance and known to differ between virulent and attenuated strains of the parasite. The chromosome--which appears to be representative of the genome--is gene-dense and rich in simple-sequence repeats, many of which appear to give rise to repetitive amino acid tracts in the predicted proteins. Most striking is the segmentation of the chromosome into repeat-rich regions peppered with transposon-like elements and telomere-like repeats, alternating with repeat-free regions. Predicted genes differ in character between the two types of segment, and the repeat-rich regions appear to be associated with strain-to-strain variation.'}
Edit:
In the case of pmids without corresponding abstracts, watch out with the fix you suggested:
abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract'] ['AbstractText'][0]
for pubmed_article in records['PubmedArticle'] if 'Abstract' in
pubmed_article['MedlineCitation']['Article'].keys()]
Suppose you have the list of Pubmed IDs pmids = [1, 2, 3]
, but pmid 2 doesn't have an abstract, so abstracts = ['abstract of 1', 'abstract of 3']
This will cause problems in the final step where I zip
both lists together to make a dict:
>>> abstract_dict = dict(zip(pmids, abstracts))
>>> print(abstract_dict)
{1: 'abstract of 1',
2: 'abstract of 3'}
Note that abstracts are now out of sync with their corresponding Pubmed IDs, because you didn't filter out the pmids without abstracts and zip
truncates to the shortest list
.
Instead, do:
abstract_dict = {}
without_abstract = []
for pubmed_article in records['PubmedArticle']:
pmid = int(str(pubmed_article['MedlineCitation']['PMID']))
article = pubmed_article['MedlineCitation']['Article']
if 'Abstract' in article:
abstract = article['Abstract']['AbstractText'][0]
abstract_dict[pmid] = abstract
else:
without_abstract.append(pmid)
print(abstract_dict)
print(without_abstract)
code
--------------------------------------------------------------------------- KeyError Traceback (most recent call last) <ipython-input-34-382b1b6b6529> in <module>() 1 abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0] ----> 2 for pubmed_article in records['PubmedArticle'] 3 ] KeyError: 'Abstract' – Mcmurry