Haplotype Threading

The biological motivation of this game is to find a set of progenitor haplotypes capable of describing a population with the fewest number of recombinations.
Here’s the basic idea. The screen is divided into two sets, an upper set of templates, and a lower set of examples. The goal of the game is to find the set of templates that best describe all of the examples, and thus achieves tha lowest possible score. The scoring is on a per example basis, and it is basically a count of how many switches between templates are needed to construct it. For example, the first example is a exact match to the first template, so no switches are required. The third example starts on the 4th template, then switches to the first or third template, and finally swicthes back to the 4th to finish, thus requiring only 2 switches.
Kyle has written a dynamic programming based algorithm to find the optimal (lowest) score for threading an example through a given set of templates. One aspect of the game play is for the player to somehow design a template set and see its impact on the score. Initially, the player might just click on a template cell to flip its state.
Can we make an interesting game out of this problem?
February 5th, 2007 at 1:45 pm
Assume that the number of unique examples is E, each example is of length N, and there are K templates. I submit that the total number of switches in the worst-case best solution is bounded from above by N*(E-K)+E*(K-1). We can construct K templates satisfying this bound as follows.
Consider the following: Divide each example Si into E segments of length N/E, {Si1, Si2, …, SiE}. We construct each template by concatenating a sequence of E segments, one from each original examples. For instance, template T1 can be S11, S22, S33, …, SEE. Template T2 can be S21, S32, …, SE(E-1), S1E. And template TK = SK1, S(K+1)2, …, SE(E-K+1), S1(E-K+2), …, S(K-1)E. Note that each example contributes exactly K of its segments to the K templates, exactly one per template. These K segments contribute K-1 switches to each example. For the remaining E-K segments, the worst case is occurs when there is a switch at every position, which gives a maximum number of switches N*(1-K/E) per example. It is difficult to acheive this many switches for every position of every sample, thus it is a strict upper bound. As a result, the number of switches for each example is no more than N*(1-K/E) + K-1, and an upper bound on the maximum number of total switches for all examples is E*(N*(1-K/E)+K-1) = N*(E-K) + E*(K-1).
Note that the above solution is not unique. There may also be a tigher upper bound. For most examples we expect the minimum number of switches should be much lower. My comment only applies to how large this “optimal” solution might be.
In fact there exists another way to assemble the K templates with an upperbound of N*(E-K) + E-2 switches if E>K and N*(E-K) if E=K. The discontinuity is due to boundary effect. I’ll explain it in more detail later.Â
February 5th, 2007 at 1:49 pm
An alternative optimization criterion is to minimize the maximum number of switches per sample. The goal is to avoid having all switches in one sample. Intuitively, we expect every sample have similar number of switches.
February 19th, 2007 at 10:34 am
[…] Our First Playable Game! In the last week, we were able to get our first prototype game up and running. I was going to say limping, but it is much better than that. This is an implementation of the Haplotype Threading game. […]