Hum-drum.
This started as really ugly Perl code. But, like a flower or a cocoon
or a flower named Cocoon, it blossomed into something beautiful after
an hour of reduction, that wonderful sweet process where everything
clicks and hums. Java folks call it “refactor,” Haskell folks call it
“second-level static typed meromorphism arrow structure on the
category Dog”: the fact of the matter is that this is a universal,
pan-language feeling. Anyway, what this does is take a Genbank full
listing for an organism’s genes, for example Salmonella’s genes,
and convert that into a tab-delimited file with–for each gene–the
name, start and stop of the gene, and the description. If you have the
Salmonella genome, you can then pull all the genes from
Salmonella using slurp_ecogene.pl or whatever file we’ll have
1000 revisions from now.
A word of caution: The Genbank links go to extremely large web pages in the order of megabytes, megabytes, like all four of them.
use strict; use warnings;
open(my $h, shift);
my @lines = <$h>;
close($h);
sub infer {
my ($i, $start, $stop) = @_;
# Lucky for us, the next line is always the /gene. Then we'll
# grope around in the dark and try to find /product.
my ($gene) = $lines[++$i] =~ /gene="(.*?)"/;
$i++ until $lines[$i] =~ /product=/;
# Some products span more than one line. We'll need to remove
# initial whitespace and newlines before extracting $desc.
my $desc;
for (join '::', @lines[$i .. $i+3]) {
s/::\s+//g; s/(\r|\n)/ /g;
($desc) = /product="(.*?)"/;
}
return "$gene\t$start\t$stop\t$desc";
}
# Doing a foreach loop over <$h> (linear) and parsing (decidedly
# nonlinear) is hard, so we'll sacrifice elegance for sanity.
foreach my $i (0 .. $#lines) {
local $_ = $lines[$i];
next unless /^\s+CDS/;
my ($start, $stop) = /(\d+)..(\d+)/;
# We're matching for complement(#..#). It's good to know, so
# we'll mark it with a negative $start.
$start *= -1 if /complement\(/;
print infer($i, $start, $stop), "\n";
}