Is there any way to get abstracts for a given list of PubMed IDs?

I have a list of PMIDs and I want to get the abstracts for all of them with a single URL request:

    pmids = [17284678, 9997]
    abstract_dict = {}
    url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
           "db=pubmed&id=17284678,9997&retmode=text&rettype=xml")

I need the result in this format:

    abstract_dict = {"pmid1": "abstract1", "pmid2": "abstract2"}

I can build this dictionary by requesting each ID separately and updating the dictionary each time, but to save time I want to pass all the IDs in one URL and extract only the abstracts from the response.

Mcmurry answered 29/11, 2017 at 18:7 Comment(0)

Using BioPython, you can give the joined list of Pubmed IDs to Entrez.efetch and that will perform a single URL lookup:

from Bio import Entrez

Entrez.email = '[email protected]'

pmids = [17284678,9997]
handle = Entrez.efetch(db="pubmed", id=','.join(map(str, pmids)),
                       rettype="xml", retmode="text")
records = Entrez.read(handle)
abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
             for pubmed_article in records['PubmedArticle']]


abstract_dict = dict(zip(pmids, abstracts))

This gives as result:

{9997: 'Electron paramagnetic resonance and magnetic susceptibility studies of Chromatium flavocytochrome C552 and its diheme flavin-free subunit at temperatures below 45 degrees K are reported. The results show that in the intact protein and the subunit the two low-spin (S = 1/2) heme irons are distinguishable, giving rise to separate EPR signals. In the intact protein only, one of the heme irons exists in two different low spin environments in the pH range 5.5 to 10.5, while the other remains in a constant environment. Factors influencing the variable heme iron environment also influence flavin reactivity, indicating the existence of a mechanism for heme-flavin interaction.',
 17284678: 'Eimeria tenella is an intracellular protozoan parasite that infects the intestinal tracts of domestic fowl and causes coccidiosis, a serious and sometimes lethal enteritis. Eimeria falls in the same phylum (Apicomplexa) as several human and animal parasites such as Cryptosporidium, Toxoplasma, and the malaria parasite, Plasmodium. Here we report the sequencing and analysis of the first chromosome of E. tenella, a chromosome believed to carry loci associated with drug resistance and known to differ between virulent and attenuated strains of the parasite. The chromosome--which appears to be representative of the genome--is gene-dense and rich in simple-sequence repeats, many of which appear to give rise to repetitive amino acid tracts in the predicted proteins. Most striking is the segmentation of the chromosome into repeat-rich regions peppered with transposon-like elements and telomere-like repeats, alternating with repeat-free regions. Predicted genes differ in character between the two types of segment, and the repeat-rich regions appear to be associated with strain-to-strain variation.'}
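For readers who want the single URL request from the question without BioPython, here is a minimal standard-library sketch. The helper names `build_efetch_url` and `parse_abstracts` are made up for illustration, and the sketch assumes `retmode=xml` so that the response is a `PubmedArticleSet` XML document. It joins all `AbstractText` elements of an article and skips articles that have none.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmids):
    """Build one efetch URL that covers every PMID in the list (hypothetical helper)."""
    params = {"db": "pubmed", "id": ",".join(map(str, pmids)), "retmode": "xml"}
    # safe="," keeps the comma-separated ID list readable in the URL.
    return EFETCH_URL + "?" + urlencode(params, safe=",")


def parse_abstracts(xml_text):
    """Map each PMID in a PubmedArticleSet document to its abstract text."""
    root = ET.fromstring(xml_text)
    abstract_dict = {}
    for article in root.iter("PubmedArticle"):
        pmid = int(article.findtext("MedlineCitation/PMID"))
        parts = article.findall("MedlineCitation/Article/Abstract/AbstractText")
        if parts:  # articles without an abstract are skipped
            abstract_dict[pmid] = " ".join(
                "".join(part.itertext()).strip() for part in parts
            )
    return abstract_dict
```

Fetching would then be something like `parse_abstracts(urllib.request.urlopen(build_efetch_url(pmids)).read())`, after `import urllib.request`. The parsing is keyed by the PMID found in each record, so it does not depend on the order of the response.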

Edit:

In the case of pmids without corresponding abstracts, watch out with the fix you suggested:

abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
             for pubmed_article in records['PubmedArticle']
             if 'Abstract' in pubmed_article['MedlineCitation']['Article'].keys()]

Suppose you have the list of Pubmed IDs pmids = [1, 2, 3], but pmid 2 doesn't have an abstract, so abstracts = ['abstract of 1', 'abstract of 3']

This will cause problems in the final step where I zip both lists together to make a dict:

>>> abstract_dict = dict(zip(pmids, abstracts))
>>> print(abstract_dict)
{1: 'abstract of 1', 
 2: 'abstract of 3'}

Note that abstracts are now out of sync with their corresponding Pubmed IDs, because you didn't filter out the pmids without abstracts and zip truncates to the shortest list.

Instead, do:

abstract_dict = {}
without_abstract = []

for pubmed_article in records['PubmedArticle']:
    pmid = int(str(pubmed_article['MedlineCitation']['PMID']))
    article = pubmed_article['MedlineCitation']['Article']
    if 'Abstract' in article:
        abstract = article['Abstract']['AbstractText'][0]
        abstract_dict[pmid] = abstract
    else:
        without_abstract.append(pmid)

print(abstract_dict)
print(without_abstract)
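One caveat with `['AbstractText'][0]`: `AbstractText` is a list, and structured abstracts (for example with separate background, methods and results sections) have more than one element, so index 0 returns only the first section. A hedged variant of the loop above that joins every section is sketched below. The function name `collect_abstracts` is made up for illustration, and the sketch only assumes that `records` has the same nested layout as the result of `Entrez.read`.

```python
def collect_abstracts(records):
    """Return ({pmid: full abstract text}, [pmids without an abstract])."""
    abstract_dict = {}
    without_abstract = []
    for pubmed_article in records['PubmedArticle']:
        citation = pubmed_article['MedlineCitation']
        pmid = int(str(citation['PMID']))
        article = citation['Article']
        if 'Abstract' in article:
            # Join every section instead of taking only element [0].
            sections = article['Abstract']['AbstractText']
            abstract_dict[pmid] = ' '.join(str(section) for section in sections)
        else:
            without_abstract.append(pmid)
    return abstract_dict, without_abstract
```

Usage would be `abstract_dict, without_abstract = collect_abstracts(records)`. For single-section abstracts the result is the same text as with `[0]`.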

Pattani answered 30/11, 2017 at 9:20 Comment(4)

I tried your code. It works for some articles but raises a KeyError for articles that do not have an abstract. The failing statement is abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0] for pubmed_article in records['PubmedArticle']], and the traceback ends with KeyError: 'Abstract' – Mcmurry 30/11, 2017 at 19:47

abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0] for pubmed_article in records['PubmedArticle'] if 'Abstract' in pubmed_article['MedlineCitation']['Article'].keys()] – Mcmurry 30/11, 2017 at 20:4

Note that you now will need to filter out the Pubmed IDs without abstracts from pmids, otherwise abstract_dict will be out of sync. See my edit. – Pattani 1/12, 2017 at 8:19

The code you added correctly separates the pmids with abstracts from those without. I am a bit curious, though: an article without an abstract still has a title, so I want to get the titles for the pmids that have no abstract. I chose this approach so that no pmid is left out just because it lacks an abstract. – Mcmurry 1/12, 2017 at 16:53

from Bio import Entrez

Entrez.email = '[email protected]'
pmids = [29090559, 29058482, 28991880, 28984387, 28862677, 28804631, 28801717,
         28770950, 28768831, 28707064, 28701466, 28685492, 28623948, 28551248]
handle = Entrez.efetch(db="pubmed", id=','.join(map(str, pmids)),
                       rettype="xml", retmode="text")
records = Entrez.read(handle)
# Use the abstract when present, otherwise fall back to the article title.
abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
             if 'Abstract' in pubmed_article['MedlineCitation']['Article']
             else pubmed_article['MedlineCitation']['Article']['ArticleTitle']
             for pubmed_article in records['PubmedArticle']]
abstract_dict = dict(zip(pmids, abstracts))
print(abstract_dict)
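Pairing by position with zip assumes the response contains one record per requested ID, in the same order. A sketch of the same title fallback keyed by each record's own PMID, which avoids that assumption, could look like this. The function name `abstracts_or_titles` is made up for illustration, and `records` is assumed to have the layout returned by `Entrez.read`.

```python
def abstracts_or_titles(records):
    """Map each record's own PMID to its first abstract section, or to its title."""
    result = {}
    for pubmed_article in records['PubmedArticle']:
        citation = pubmed_article['MedlineCitation']
        article = citation['Article']
        if 'Abstract' in article:
            text = article['Abstract']['AbstractText'][0]
        else:
            # No abstract: fall back to the article title.
            text = article['ArticleTitle']
        result[int(str(citation['PMID']))] = str(text)
    return result
```

With this, `abstract_dict = abstracts_or_titles(records)` replaces the list comprehension and the zip step.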

Mcmurry answered 1/12, 2017 at 17:1 Comment(0)
