How to linearize FASTA sequences in a file that contains more than just FASTA records?

I am running ipcress for in silico PCR and the results look like this:

Ipcress result
Experiment: Primer1
Primers: B A
Target: QLOD02000001.1:filter(unmasked), whole genome shotgun sequence
Matches: 20/20 20/20
Product: 2601 bp (range 100-5000)
Result type: revcomp
ipcress: QLOD02000001.1:filter(unmasked) Primer1 2601 B 91258 0 A 93839 0 revcomp
>F-RK1_product_1 seq QLOD02000001.1:filter(unmasked) start 91258 length 2601
AAGCGGATTGAGAAGTGGTGGTGGTAGTAGCAGTCATGTGGGTAACGAAGACTACAACAGCAGTATTATA
ATTAGGAAAAGGTTTGAAGAAAAGATGAGGCTTGAAAGGGACGACGACGACGACAAGATCTTCAATCCCA
CCAAGTACTTTGTCCAAGAAGTTGTTAATTGCTTTGATGAGTCTGACCTCTACAGAACT...

Ipcress result
Experiment: Primer2
Primers: B A
Target: QLOD02000001.1:filter(unmasked), whole genome shotgun sequence
Matches: 20/20 20/20
Product: 854 bp (range 100-5000)
Result type: revcomp
ipcress: QLOD02000001.1:filter(unmasked) Primer2 854 B 149835 0 A 150669 0 revcomp
>F-RK3_product_1 seq QLOD02000001.1:filter(unmasked) start 149835 length 854
AGGATGACATGGGAATCTGGGACCTCAACCATTTTGTCTAGCTCTCTCCCAAGAGAAAGCGACGAAAATG
ACATGGGTTTGGCTCTGTATTGTTTAACAAATTTAAGTGGCTTAAAAACTCTAC....

Is there a way to linearize only the FASTA sequences, that is, join each wrapped sequence onto a single line while leaving every other line untouched? I would like my final file to look like this:

Ipcress result
Experiment: Primer1
Primers: B A
Target: QLOD02000001.1:filter(unmasked), whole genome shotgun sequence
Matches: 20/20 20/20
Product: 2601 bp (range 100-5000)
Result type: revcomp
ipcress: QLOD02000001.1:filter(unmasked) Primer1 2601 B 91258 0 A 93839 0 revcomp
>F-RK1_product_1 seq QLOD02000001.1:filter(unmasked) start 91258 length 2601
AAGCGGATTGAGAAGTGGTGGTGGTAGTAGCAGTCATGTGGGTAACGAAGACTACAACAGCAGTATTATAATTAGGAAAAGGTTTGAAGAAAAGATGAGGCTTGAAAGGGACGACGACGACGACAAGATCTTCAATCCCACCAAGTACTTTGTCCAAGAAGTTGTTAATTGCTTTGATGAGTCTGACCTCTACAGAACT...

Ipcress result
Experiment: Primer2
Primers: B A
Target: QLOD02000001.1:filter(unmasked), whole genome shotgun sequence
Matches: 20/20 20/20
Product: 854 bp (range 100-5000)
Result type: revcomp
ipcress: QLOD02000001.1:filter(unmasked) Primer2 854 B 149835 0 A 150669 0 revcomp
>F-RK3_product_1 seq QLOD02000001.1:filter(unmasked) start 149835 length 854
AGGATGACATGGGAATCTGGGACCTCAACCATTTTGTCTAGCTCTCTCCCAAGAGAAAGCGACGAAAATGACATGGGTTTGGCTCTGTATTGTTTAACAAATTTAAGTGGCTTAAAAACTCTAC....
Glow answered 29/11, 2023 at 10:9 Comment(2)
How do you know when the FASTA data ends? The empty line? - Eanore
Please edit the question to show us the code for your latest attempt and where you got stuck. See also: How to Ask and help center. - Ias

If you are asking how to unwrap lines between a line which starts with > (a FASTA header) and an empty line, that is quite easy:

awk '/^>/ { wrap=1; print; next }
   wrap && /^$/ { print wrapped; wrapped = ""; wrap = 0 }
   wrap { wrapped = wrapped $0; next }
   1
   END { if (wrap) print wrapped }' file >newfile

Recall that Awk examines one line at a time. When we see a FASTA header, we set wrap to 1 to remember that we are now inside a sequence, print the header, and skip to the next input line. On subsequent lines, if wrap is set and the line is empty, we print whatever we have collected so far (the collecting itself happens in the following rule), reset the buffer, and stop collecting; the empty line itself then falls through and is printed by the final rule. Otherwise, if wrap is still set, we append the current line to wrapped and skip to the next input line. Anything not covered by the previous cases is simply printed. (The Awk idiom 1 is shorthand for this.) Finally, if the input ends while we are still collecting, the END block prints what is left in wrapped.

Demo: https://ideone.com/ZCkKss
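
If you want to reuse this, here is a rough sketch of the same idea wrapped in a shell function. The name unwrap_fasta is made up for illustration. It adds one thing the ipcress output above does not strictly need: it also flushes the collected sequence when a new header arrives without an empty line in between, as would happen in plain multi-FASTA. Treat it as a starting point under those assumptions, not a tested tool.

```shell
# unwrap_fasta is a hypothetical helper name, not part of ipcress or awk.
# Reads the named files (or stdin) and joins wrapped sequence lines.
unwrap_fasta() {
  awk '
    # Header: flush a sequence that is still pending, then start collecting.
    /^>/ { if (wrap) print wrapped; wrapped = ""; wrap = 1; print; next }
    # Empty line while collecting: emit the joined sequence and stop collecting.
    wrap && /^$/ { print wrapped; wrapped = ""; wrap = 0 }
    # Still collecting: append this line to the buffer, print nothing yet.
    wrap { wrapped = wrapped $0; next }
    # Everything else (metadata and the empty line itself) passes through.
    { print }
    # Input ended in the middle of a sequence: emit what is left.
    END { if (wrap) print wrapped }
  ' "$@"
}
```

Usage would be something like unwrap_fasta file >newfile, or with the ipcress output piped in on stdin.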

Eanore answered 29/11, 2023 at 10:27 Comment(0)

Using any awk:

$ awk '!NF{ORS=RS; print} {print} /^>/{ORS=""} END{print RS}' file
Ipcress result
Experiment: Primer1
Primers: B A
Target: QLOD02000001.1:filter(unmasked), whole genome shotgun sequence
Matches: 20/20 20/20
Product: 2601 bp (range 100-5000)
Result type: revcomp
ipcress: QLOD02000001.1:filter(unmasked) Primer1 2601 B 91258 0 A 93839 0 revcomp
>F-RK1_product_1 seq QLOD02000001.1:filter(unmasked) start 91258 length 2601
AAGCGGATTGAGAAGTGGTGGTGGTAGTAGCAGTCATGTGGGTAACGAAGACTACAACAGCAGTATTATAATTAGGAAAAGGTTTGAAGAAAAGATGAGGCTTGAAAGGGACGACGACGACGACAAGATCTTCAATCCCACCAAGTACTTTGTCCAAGAAGTTGTTAATTGCTTTGATGAGTCTGACCTCTACAGAACT...

Ipcress result
Experiment: Primer2
Primers: B A
Target: QLOD02000001.1:filter(unmasked), whole genome shotgun sequence
Matches: 20/20 20/20
Product: 854 bp (range 100-5000)
Result type: revcomp
ipcress: QLOD02000001.1:filter(unmasked) Primer2 854 B 149835 0 A 150669 0 revcomp
>F-RK3_product_1 seq QLOD02000001.1:filter(unmasked) start 149835 length 854
AGGATGACATGGGAATCTGGGACCTCAACCATTTTGTCTAGCTCTCTCCCAAGAGAAAGCGACGAAAATGACATGGGTTTGGCTCTGTATTGTTTAACAAATTTAAGTGGCTTAAAAACTCTAC....
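
In case the ORS toggling is not obvious, here is the same one-liner spelled out with comments. The logic is unchanged; only the layout and the wrapper function name join_seq_ors (made up for this illustration) are new. It assumes, as in the ipcress output, that blank lines only occur right after a sequence.

```shell
# join_seq_ors is a hypothetical wrapper name around the one-liner above.
join_seq_ors() {
  awk '
    # Blank line: switch back to newline-terminated output and print once,
    # which ends the joined sequence line.
    !NF  { ORS = RS; print }
    # Print every line followed by the current ORS. For the blank line
    # this second print emits the blank line itself.
    { print }
    # After a header has been printed, stop emitting newlines so the
    # sequence lines that follow are glued together.
    /^>/ { ORS = "" }
    # Terminate the last sequence line at end of input.
    END  { print RS }
  ' "$@"
}
```

The order of the rules matters: the header is printed while ORS is still a newline, and only afterwards is ORS emptied.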

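or, reading the file in paragraph mode and assuming the FASTA header is always the 9th line of each record: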

$ awk -v RS= -F'\n' '{for (i=1; i<=NF; i++) printf "%s%s", $i, (i<10 ? ORS : ""); print ORS}' file
Ipcress result
Experiment: Primer1
Primers: B A
Target: QLOD02000001.1:filter(unmasked), whole genome shotgun sequence
Matches: 20/20 20/20
Product: 2601 bp (range 100-5000)
Result type: revcomp
ipcress: QLOD02000001.1:filter(unmasked) Primer1 2601 B 91258 0 A 93839 0 revcomp
>F-RK1_product_1 seq QLOD02000001.1:filter(unmasked) start 91258 length 2601
AAGCGGATTGAGAAGTGGTGGTGGTAGTAGCAGTCATGTGGGTAACGAAGACTACAACAGCAGTATTATAATTAGGAAAAGGTTTGAAGAAAAGATGAGGCTTGAAAGGGACGACGACGACGACAAGATCTTCAATCCCACCAAGTACTTTGTCCAAGAAGTTGTTAATTGCTTTGATGAGTCTGACCTCTACAGAACT...

Ipcress result
Experiment: Primer2
Primers: B A
Target: QLOD02000001.1:filter(unmasked), whole genome shotgun sequence
Matches: 20/20 20/20
Product: 854 bp (range 100-5000)
Result type: revcomp
ipcress: QLOD02000001.1:filter(unmasked) Primer2 854 B 149835 0 A 150669 0 revcomp
>F-RK3_product_1 seq QLOD02000001.1:filter(unmasked) start 149835 length 854
AGGATGACATGGGAATCTGGGACCTCAACCATTTTGTCTAGCTCTCTCCCAAGAGAAAGCGACGAAAATGACATGGGTTTGGCTCTGTATTGTTTAACAAATTTAAGTGGCTTAAAAACTCTAC....
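
The i<10 test in the paragraph-mode command relies on the fixed ipcress layout. If the number of metadata lines might vary, a possible sketch is to look for the header within each record instead. The function name join_seq_paragraph is made up, and this assumes every record has at most one header, with sequence lines following it.

```shell
# join_seq_paragraph is a hypothetical name. RS= selects paragraph mode
# (records separated by blank lines) and -F'\n' makes each line a field.
join_seq_paragraph() {
  awk -v RS= -F'\n' '
    {
      seen = 0                        # no header seen yet in this record
      for (i = 1; i <= NF; i++) {
        if (seen) printf "%s", $i     # sequence line: no newline, so lines join
        else      print $i            # metadata or header: keep its own line
        if ($i ~ /^>/) seen = 1       # lines after the header are sequence
      }
      if (seen) print ""              # end the joined sequence line
      print ""                        # blank line between records
    }
  ' "$@"
}
```

Like the original, this prints a blank line after every record, including the last one.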
Skill answered 29/11, 2023 at 12:10 Comment(1)
This is really elegant, thanks for an inspiring example. - Eanore