IT博客汇
  • 首页
  • 精华
  • 技术
  • 设计
  • 资讯
  • 扯淡
  • 权利声明
  • 登录 注册

    Computational Biology: Phylogeny Tree

    MarkNV发表于 2013-09-07 18:17:00
    love 0

    Tree of life

    A phylogeny tree is also called evolutionary tree. It is used to depict the evolutionary relationships among various biological species by similarities and differences in their physical or genetic characteristics. It can also be used to describe linguistic evolution history.

    tree of life

    This (from Wikipedia) is a phylogeny tree showing how bacteria, archaea, eukaryota evolve. It's important to point out that we don't know whether the tree is correct (the true tee) or not, because we could only infer the history of this period of time by some archaeology evidence. In computational biology, however, our task is to simulate to evolution process and generate tree of life then estimate its accuracy. We could still find relatively convincing information from phylogeny trees, though the result may not exactly match the true history due to evolution is a extremely complex process (not even tree structure in fact).

    1. How to Represent Phylogenetic Tress?

    1.1 Rooted Trees

    rooted tree

    Above is a rooted tree for six different species or languages.

    Note:

    • Swapping siblings nodes doesn't change the tree's topology.

    1.1.1 Newick Notation

    We could use the Newick notation to represent any rooted tree. It's super simple, easy. The Newick representation of the rooted tree above is: $(a, (b, ((c, d), (e, f))))$ or other versions by swapping siblings, e.g. $(a, (b, ((f, e), (c, d))))$ so on so forth.

    1.1.2 Clades

    The other representation form of an rooted tree is using clades. To find the clades set, we just need to pick every internal node of the rooted tree then write down the leaves it contains. Given a rooted tree $T=(a, (b, ((c, d), (e, f))))$, we can get $Clade(T)=\lbrace \lbrace a\rbrace , \lbrace b\rbrace , \lbrace c\rbrace , \lbrace d\rbrace , \lbrace e\rbrace , \lbrace f\rbrace , \lbrace c, d\rbrace , \lbrace e, f\rbrace , \lbrace c, d, e, f\rbrace , \lbrace b, c, d, e, f\rbrace , \lbrace a, b, c, d, e, f\rbrace \rbrace$.

    Note:

    • Trivial clades are singleton sets and the set contains all taxa. So the others are called non-trivial clades.
    • Two clades are compatible iff they can exist in on single tree.
    • A given set of subset of Clades(T) $X$ is compatible if $X\subseteq Clades(T)$
    • If set $A$ is compatible, then for any two elements $X$ and $Y$ in $A$, either $X$ and $Y$ are disjoint or one contains the other. That is every pair of elements in A are compatible.

    1.2 Unrooted Trees

    unrooted tree

    Above is an unrooted tree according to the rooted tree given in section 1.1.

    1.2.1 Newick Notation

    Unrooted trees are also able to be represented by Newick notation, however, not unique (doesn't means swapping siblings here). That is, different rooted trees can have same unrooted version. e.g. Unrooted tree for $(a, (b, (c, d)))$ is same to $(b, (a, (c, d)))$ and is also same to $((a, b), (c, d))$.

    1.2.2 Bipartitions

    Clades do not work for unrooted tree, but we can use a new method called bipartition to represent an unrooted tree. The process is for each internal edge write down the left and right part of taxa if it was deleted. So the bipartitions for the unrooted tree above is $\lbrace (ab|cdef), (abcd|ef), (abef|cd) \rbrace$.

    Note: * A given set of subset of C(T) $X$ is compatible if $X\subseteq C(T)$ * if $C(T_1) \subseteq C(T_2)$ we define that $T_1$ refines $T_2$ or $T_1$ is a refinement of $T_2$, which also means that we can get $T_2$ by contracting some edges in $T_1$, so we also say $T_2$ is a contraction of $T_1$.

    1.3 Converting Rooted and Unrooted Trees

    1.3.1 Rooted to Unrooted

    We can root an unrooted tree in everywhere, however, we usually root it on internal edges of nodes. Rooting on a node is like picking a tree on that node, and rooting a tree on an edge means we should "bend" the edge and pick up the tree on the "break point".

    1.3.2 Unrooted to Rooted

    Unrooting a tree is a reversing to rooting a tree, just "unbend" an edge then put the tree down.

    1.4 Something Else

    • A polytomy is a rooted tree has vertex with more than two children, or in an unrooted tree a vertex with degree at least four.
    • A tree is said binary if it doesn't contain polytomies.
    • A binary tree is a fully resolved tree.

    2. How to Compute Trees?

    2.1 Rooted Trees

    Draw Hasse Diagram through the set of Clades, shows as below.

    hasse diagram

    2.2 Unrooted Trees

    The basic idea of generating unrooted tree by bipartitions is first we construct an rooted tree then we unroot the tree. The algorithm of constructing $T$ from $C(T)$ is:

    1. Find a leaf to root: for example as the unrooted above, $a$ $b$ are always on the same side of every bipartition, so we root tree at both $a$ and $b$.
    2. Write down the clades without $a$ and $b$. $Clades(T)=\lbrace \lbrace c, d \rbrace, \lbrace e, f \rbrace, \lbrace c, d, e, f \rbrace \rbrace$.
    3. Draw Hasse Diagram of the rooted tree.
    4. Add the tree root $(a, b)$
    5. Unroot the tree.

    hasse diagram

    3. Consensus Tree

    Consensus trees are used to represent what a set of tress have in common. There are three kinds of consensus trees: strict consensus tree, majority consensus tree and greedy consensus tree.

    We could construct these three kind of trees using this method:

    1. Write down the bipartitions of each tree in the set.
    2. Pick bipartitions appear in every tree and the tree constructed using exactly these bipartitions is called a strict consensus tree.
    3. Pick up additional bipartitions which appear in more than half of the trees. Then the tree constructed using exactly these bipartitions is called a majority consensus tree.
    4. We now have several bipartitions appear in less than half of the trees. Then we obey the following algorithm to construct a greedy consensus tree: keep adding bipartitions until you get a fully resolved tree or exam the entire list.

    Note:

    • The greedy consensus tree depend on the ordering of the remaining bipartitions.
    • The compatibility tree of a set of trees (all on the same set of leaves) is the minimal common refinement, if it exists.

    4. Evaluate Trees

    False Positive: report additional bipartitions that is not belong to the true tree.

    False Negative: miss report some bipartitions supposed to be in the true tree.

    Robinson-Foulds distance: Sum of all false positives and false negatives.

    Note:

    • If the true tree and estimated tree are both binary, then the FP and FN will be equal.
    • If the estimated tree is not binary, when the true tree is binary then the FP rate will be less than FN error rate.
    • If the estimated tree is not binary, when the true tree is not binary then the FP rate will be larger than FN error rate.
    • Let $T$ be the true tree, and $T_1$ and $T_2$ be two estimated trees from the same leaf set. If $T_1$ refines $T_2$, then FN rate of $T_1$ will be less than or equal to that of $T_2$, and the FP rate of $T_1$ will be at least that of $T_2$.
    • The Robinson-Foulds rate (RF distance / 2n - 6) only useful when estimated tree and true tree are both binary.

    Acknowledgement:

    This article is my learning notes of course CS394C Algorithms for Computational Biology (Fall 2013), lectured by Prof. Tandy Warnow.



沪ICP备19023445号-2号
友情链接