Math and CS in Biology

Source: Lior Pachter’s blog, Dec 2014

Here is a current example from (computational) biology where it is not yet clear what the “right” thinking should be, despite experts working hard at it, and one that is useful to highlight because of the people involved. With the vast number of human genomes being sequenced (some estimates are as high as 400,000 in the coming year), there is an increasingly pressing fundamental question about how the (human) genome should be represented and stored. This is ostensibly a computer science question: genomes should perhaps be compressed in ways that allow for efficient search and retrieval. But I’d argue that fundamentally it is a math question.

This is because what the question is really asking is how one should think about genome sequences that are related mostly via recombination and only slightly by mutation, and what the “right” mathematical structures for this challenge are. The answer matters not only for the technology (how to store genomes), but much more importantly for the foundations of population and statistical genetics.

Without the right abstractions for genomes, the task of coherently organizing and interpreting genomic information is hopeless. David Haussler (with coauthors) and Richard Durbin have both written about this problem in papers that are hard to describe in any way other than as math papers; see “Mapping to a Reference Genome Structure” and “Efficient haplotype matching and storage using the positional Burrows-Wheeler transform (PBWT)”. Perhaps it is no coincidence that both David Haussler and Richard Durbin studied mathematics.
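
To give a concrete flavor of the second paper, here is a minimal Python sketch of the sorting idea at the core of the positional Burrows-Wheeler transform: sweep across the sites, and at each site stably partition the haplotypes by their allele there, so that haplotypes sharing a long match ending at the current site end up adjacent. This is an illustration under stated assumptions, not Durbin's implementation: the function name `pbwt_prefix_order` and the toy input are hypothetical, and haplotypes are assumed to be equal-length lists of biallelic 0/1 values.

```python
from typing import List, Sequence


def pbwt_prefix_order(haplotypes: Sequence[Sequence[int]]) -> List[int]:
    """Return haplotype indices sorted by their reversed prefixes.

    Hypothetical helper for illustration only. Assumes every haplotype
    has the same length and contains only the alleles 0 and 1.
    """
    if not haplotypes:
        return []
    num_sites = len(haplotypes[0])
    order = list(range(len(haplotypes)))
    for site in range(num_sites):
        zeros: List[int] = []
        ones: List[int] = []
        # Stable partition: the relative order from earlier sites is kept
        # inside each allele class, so after this pass the haplotypes are
        # sorted by their alleles read backwards from the current site.
        for idx in order:
            if haplotypes[idx][site] == 0:
                zeros.append(idx)
            else:
                ones.append(idx)
        order = zeros + ones
    return order


# Toy input (made up for this sketch): four haplotypes over three sites.
toy_haplotypes = [
    [0, 1, 0],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]
toy_order = pbwt_prefix_order(toy_haplotypes)
```

On the toy input, haplotypes 0 and 3 are identical and therefore end up next to each other in `toy_order`. The point of the sketch is that the ordering itself is a mathematical object (a sort by reversed prefixes), and the storage and search benefits follow from it.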

