scikit-bio extract genomic features from gff3 file
Asked Answered
N

1

7

Is it possible in scikit-bio to extract genomic features stored in a gff3 formatted file from a genome fasta file?

Example:


genome.fasta

>sequence1
ATGGAGAGAGAGAGAGAGAGGGGGCAGCATACGCATCGACATACGACATACATCAGATACGACATACTACTACTATGA

annotation.gff3

#gff-version 3
sequence1   source  gene    1   78  .   +   .   ID=gene1
sequence1   source  mRNA    1   78  .   +   .   ID=transcript1;parent=gene1
sequence1   source  CDS 1   6   .   +   0   ID=CDS1;parent=transcript1
sequence1   source  CDS 73  78  .   +   0   ID=CDS2;parent=transcript1

The desired sequence for the mRNA feature (transcript1) would be the concatination of the two child CDS features. So in this case this would be 'ATGGAGCTATGA'.

Nordrheinwestfalen answered 11/7, 2016 at 7:59 Comment(1)
As of scikit-bio 0.5.0, reading gff3 files is not supported. If this is a feature you'd like to see added to the project, please consider submitting a feature request on the issue tracker: github.com/biocore/scikit-bio/issuesPotato
F
1

This feature has been added to scikit-bio, however the version available in bioconda does not support it yet (2017-12-15). The format file for gff3 is present in the Github repository.

You can clone the repo and install it locally using:

$ git clone https://github.com/biocore/scikit-bio.git
$ cd scikit-bio
$ python setup.py install

Following the example given in the file the following code should work:

import io
from skbio.metadata import IntervalMetadata
from skbio.io import read

gff = io.StringIO(open("annotations.gff3", "r").read())
im = read(gff, format='gff3', into=IntervalMetadata, seq_id="sequence1")

print(im)

For me this this raises a FormatIdentificationWarning, but the entries are reported correctly:

4 interval features
-------------------
Interval(interval_metadata=<140154121000104>, bounds=[(0, 78)], fuzzy=[(False, False)], metadata={'source': 'source', 'type': 'gene', 'score': '.', 'strand': '+', 'ID': 'gene1'})
Interval(interval_metadata=<140154121000104>, bounds=[(0, 78)], fuzzy=[(False, False)], metadata={'source': 'source', 'type': 'mRNA', 'score': '.', 'strand': '+', 'ID': 'transcript1', 'parent': 'gene1'})
Interval(interval_metadata=<140154121000104>, bounds=[(0, 6)], fuzzy=[(False, False)], metadata={'source': 'source', 'type': 'CDS', 'score': '.', 'strand': '+', 'phase': 0, 'ID': 'CDS1', 'parent': 'transcript1'})
Interval(interval_metadata=<140154121000104>, bounds=[(72, 78)], fuzzy=[(False, False)], metadata={'source': 'source', 'type': 'CDS', 'score': '.', 'strand': '+', 'phase': 0, 'ID': 'CDS2', 'parent': 'transcript1'})

In the example in the code, the GFF3 and the FASTA file are concatenated in the input string used for the read function. Maybe that can fix this issue. Also I am not 100% sure how you can use the intervals returned to extract the feature.

Fire answered 15/12, 2017 at 16:2 Comment(0)

© 2022 - 2024 — McMap. All rights reserved.