Parsing a file with D

I am new to D and would like to parse a biological file of the form

>name1
acgcgcagagatatagctagatcg
aagctctgctcgcgct
>name2
acgggggcttgctagctcgatagatcga
agctctctttctccttcttcttctagagaga
>name3
gag ggagag

such that I can capture the 'headers' (name1, name2, name3) together with the corresponding 'sequence' data (the acgcg... lines that follow each header).

This is what I have so far, but it only iterates line by line and prints the header captures:

import std.stdio;
import std.stream;
import std.regex;


int main(string[] args){
  auto filename = args[1];
  auto entry_name = regex(r"^>(.*)"); //captures header only
  auto fasta_regex = regex(r"(\>.+\n)([^\>]+\n)"); //captures header and corresponding sequence

  try {
    Stream file = new BufferedFile(filename);
    foreach(ulong n, char[] line; file) {
      auto name_capture = match(line,entry_name);
      writeln(name_capture.captures[1]);
    }

    file.close();
  }
  catch (FileException xy){
    writefln("Error reading the file: ");
  }

  catch (Exception xx){
    writefln("Exception occurred: " ~ xx.toString());
  }
  return 0;
}

I would like to know a nice way of extracting the header and the sequence data so that I can build an associative array in which each item corresponds to one entry in the file:

[name1:acgcgcagagatatagctagatcgaagctctgctcgcgct,name2:acgggggcttgctagctcgatagatcgaagctctctttctccttcttcttctagagaga,.....]
Hurlee answered 24/1, 2012 at 19:19 Comment(1)
D seems to be popular among bioinformaticians :) - Maineetloire

The header is on its own line, right? So why not check for it and use an appender to accumulate the sequence lines that follow:

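import std.array; // for appender
string[string] map; // header name -> concatenated sequence
auto current = appender!(char[]);
string name;
foreach(ulong n, char[] line; file) {
      auto entry = match(line,entry_name);
      if(!entry.empty){//we are in a header line

          if(name.length){//store what was collected for the previous header
              map[name]=current.data.idup;//idup because current's buffer is reused
          }
          name = entry.captures[1].idup;//header text without the leading '>'
          current.clear();
      }else{
          current.put(line);
      }
}
if(name.length) map[name]=current.data.idup;//remember the last entry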

map is where you'll store the values (a string[string] will do)
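
For anyone who wants to try the loop above without a file at hand, here is a minimal self-contained sketch of the same idea. It is an illustration under assumptions, not the answerer's code: the helper name parseFasta and the inline sample data are hypothetical, and it uses matchFirst from std.regex on an in-memory list of lines instead of a stream.

```d
import std.array : appender;
import std.regex : matchFirst, regex;
import std.stdio : writeln;
import std.string : splitLines;

// Hypothetical helper: builds a header -> sequence map from any range of lines.
string[string] parseFasta(Lines)(Lines lines)
{
    auto headerRe = regex(`^>(.*)`);     // capture group 1 is the name after '>'
    string[string] map;
    auto current = appender!(char[])();  // accumulates the current sequence
    string name;
    bool inEntry = false;                // true once the first header was seen

    foreach (line; lines)
    {
        auto m = matchFirst(line, headerRe);
        if (!m.empty)
        {
            // A new header closes the previous entry, if there was one.
            if (inEntry)
                map[name] = current.data.idup;
            name = m[1].idup;
            current.clear();
            inEntry = true;
        }
        else if (inEntry)
        {
            current.put(line);
        }
    }
    if (inEntry)
        map[name] = current.data.idup;   // do not forget the last entry
    return map;
}

void main()
{
    // Shortened sample in the format from the question.
    auto sample = ">name1\nacgcgcag\naagctc\n>name2\nacggg\n";
    auto map = parseFasta(sample.splitLines());
    writeln(map["name1"]); // acgcgcagaagctc
    writeln(map["name2"]); // acggg
}
```

Lines that appear before the first header are ignored in this sketch; whether that is the right behaviour depends on how strict you want the parser to be.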

Myrtamyrtaceous answered 24/1, 2012 at 20:33 Comment(5)
Many thanks! The header is on its own line. I don't understand where c.hit comes from :) Also, why should we allocate entry_name as a match object? (It is supposed to be a regex.) And finally, what is the type of map[name]? Sorry, I am a n00b at this at the moment. - Hurlee
Getting some compile errors (dmd2): expression entry_name of type Regex!(char) does not have a boolean value; read_file.d(37): Error: undefined identifier map, did you mean function main?; read_file.d(39): Error: cannot implicitly convert expression (c.hit()) of type char[] to string; read_file.d(42): Error: no property 'append' for type 'Appender!(char[])'; read_file.d(45): Error: undefined identifier map, did you mean function main?; read_file.d(49): Error: undefined identifier FileExceptio - Hurlee
Many thanks! I have realised I need to cast(string) current.data.dup to convert it to type string, i.e. after declaring map as an associative array: string[string] map - Hurlee
@Hurlee You don't dup a string and then cast it to string to get a string. You idup it. dup returns a mutable copy of the array that it's called on. idup returns an immutable copy of the array that it's called on. - Seadon
Nice tip, so map[name] = current.data.idup should do the trick? Otherwise dmd 2.057 produces this on compile: "Error: cannot implicitly convert expression (_adDupT(& D11TypeInfo_Aa6__initZ,current.data())) of type char[] to string" - Hurlee
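
The dup/idup distinction from the comments can be seen in a few lines. This is only a small illustration; the variable names are made up for the example.

```d
import std.stdio : writeln;

void main()
{
    char[] buffer = "acgt".dup;   // dup: mutable copy of the literal
    string frozen = buffer.idup;  // idup: immutable copy, independent of buffer
    char[] scratch = buffer.dup;  // dup: another mutable copy

    buffer[0] = 'N';              // modifying the buffer touches neither copy
    writeln(buffer, " ", frozen, " ", scratch); // prints "Ncgt acgt acgt"
}
```

Because a string is immutable(char)[], idup yields something that can be stored directly as a value in a string[string] without a cast.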

Here is my solution without regular expressions (I do not think we need a regexp for such simple input):

import std.stdio;
import std.stream;

int main(string[] args) {
  int ret = 0;
  string fileName = args[1];
  string header;
  char[] sequence;
  string[string] content;
  try {  
    auto file = new BufferedFile(fileName);
    foreach(ulong lineNumber, char[] line; file) {
      if (line.length > 0 && line[0] == '>') { // guard against empty lines
        if (header.length > 0) {
          content[header] = sequence.idup;
          sequence.length = 0;
        } // if
        // we have a new header, and new sequence will start after it
        header = line[1..$].idup;
        content[header] = "";
      } else {
          sequence ~= line;
      } // else
    } // foreach
    content[header] = sequence.idup;
    file.close();
  }
  catch (OpenException oe){
    writefln("Error opening file: " ~ oe.toString());
  }
  catch (Exception e){
    writefln("Exception: " ~ e.toString());
  }
  writeln(content);
  return ret;
} // main() function

/+ -------------------------- BEGIN OUTPUT ------------------------------- +
["name3":"gag ggagag", "name1":"acgcgcagagatatagctagatcgaagctctgctcgcgct", "name2":"acgggggcttgctagctcgatagatcgaagctctctttctccttcttcttctagagaga"]
 + -------------------------- END OUTPUT --------------------------------- +/
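
A note for readers on a current compiler: as far as I know, std.stream has since been removed from Phobos, so the program above may not build as written. The following is a sketch of the same regex-free approach using File.byLine from std.stdio; it is not the original answer's code, and the helper name readFasta is hypothetical.

```d
import std.algorithm.searching : startsWith;
import std.stdio : File, writeln;

// Hypothetical helper: same logic as above, for any range of lines.
string[string] readFasta(Lines)(Lines lines)
{
    string[string] content;
    string header;
    char[] sequence;
    bool haveHeader = false;

    foreach (line; lines)
    {
        if (line.startsWith(">"))
        {
            // Store the finished entry before starting the next one.
            if (haveHeader)
                content[header] = sequence.idup;
            header = line[1 .. $].idup;  // drop the leading '>'
            sequence.length = 0;
            haveHeader = true;
        }
        else if (haveHeader)
        {
            sequence ~= line;            // byLine reuses its buffer, so copy by appending
        }
    }
    if (haveHeader)
        content[header] = sequence.idup;
    return content;
}

void main(string[] args)
{
    if (args.length < 2)
    {
        writeln("usage: ", args[0], " <fasta-file>");
        return;
    }
    try
    {
        // byLine strips the line terminator by default.
        writeln(readFasta(File(args[1]).byLine()));
    }
    catch (Exception e)
    {
        writeln("Error reading file: ", e.msg);
    }
}
```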
Crete answered 25/1, 2012 at 10:11 Comment(0)
