The Dark Balloon

A weblog by Hao Lian.
A journey into the soft of night.
A terrible secret guarded by golems.

§
Hum-drum.

This started as really ugly Perl code. But, like a flower or a cocoon or a flower named Cocoon, it blossomed into something beautiful after an hour of reduction, that wonderful sweet process where everything clicks and hums. Java folks call it “refactor,” Haskell folks call it “second-level static typed meromorphism arrow structure on the category Dog”: the fact of the matter is that this is a universal, pan-language feeling. Anyway, what this does is take a Genbank full listing for an organism’s genes, for example Salmonella’s genes, and convert that into a tab-delimited file with–for each gene–the name, start and stop of the gene, and the description. If you have the Salmonella genome, you can then pull all the genes from Salmonella using slurp_ecogene.pl or whatever file we’ll have 1000 revisions from now. A word of caution: The Genbank links go to extremely large web pages in the order of megabytes, megabytes, like all four of them.

use strict; use warnings;

open(my $h, shift);
my @lines = <$h>;
close($h);

sub infer {
    my ($i, $start, $stop) = @_;

    # Lucky for us, the next line is always the /gene. Then we'll
    # grope around in the dark and try to find /product.
    my ($gene) = $lines[++$i] =~ /gene="(.*?)"/;
    $i++ until $lines[$i] =~ /product=/;

    # Some products span more than one line. We'll need to remove
    # initial whitespace and newlines before extracting $desc.
    my $desc;
    for (join '::', @lines[$i .. $i+3]) {
        s/::\s+//g; s/(\r|\n)/ /g;
        ($desc) = /product="(.*?)"/;
    }
    return "$gene\t$start\t$stop\t$desc";
}

# Doing a foreach loop over <$h> (linear) and parsing (decidedly
# nonlinear) is hard, so we'll sacrifice elegance for sanity.
foreach my $i (0 .. $#lines) {
    local $_ = $lines[$i];
    next unless /^\s+CDS/;
    my ($start, $stop) = /(\d+)..(\d+)/;

    # We're matching for complement(#..#). It's good to know, so
    # we'll mark it with a negative $start.
    $start *= -1 if /complement\(/;
    print infer($i, $start, $stop), "\n";
}
[(2008 January 20) .]

Recent comments (HAO, Jammies.) • (Tim, Jammies.) • (Prashanth, Wedding.) • (Hao, Hands.) • (Prashanth, Hands.).

Recent posts (03/18, The Daily Show: Oscar Romero and textbooks.) • (02/03, Butter-related greetings.) • (01/18, Happy Martin Luther King, Jr. day.) • (01/18, Chances, part one.) • (01/02, Jammies.).