A phylogeny tree is also called evolutionary tree. It is used to depict the evolutionary relationships among various biological species by similarities and differences in their physical or genetic characteristics. It can also be used to describe linguistic evolution history.
This (from Wikipedia) is a phylogeny tree showing how bacteria, archaea, eukaryota evolve. It's important to point out that we don't know whether the tree is correct (the true tee) or not, because we could only infer the history of this period of time by some archaeology evidence. In computational biology, however, our task is to simulate to evolution process and generate tree of life then estimate its accuracy. We could still find relatively convincing information from phylogeny trees, though the result may not exactly match the true history due to evolution is a extremely complex process (not even tree structure in fact).
Above is a rooted tree for six different species or languages.
Note:
We could use the Newick notation to represent any rooted tree. It's super simple, easy. The Newick representation of the rooted tree above is: $(a, (b, ((c, d), (e, f))))$ or other versions by swapping siblings, e.g. $(a, (b, ((f, e), (c, d))))$ so on so forth.
The other representation form of an rooted tree is using clades. To find the clades set, we just need to pick every internal node of the rooted tree then write down the leaves it contains. Given a rooted tree $T=(a, (b, ((c, d), (e, f))))$, we can get $Clade(T)=\lbrace \lbrace a\rbrace , \lbrace b\rbrace , \lbrace c\rbrace , \lbrace d\rbrace , \lbrace e\rbrace , \lbrace f\rbrace , \lbrace c, d\rbrace , \lbrace e, f\rbrace , \lbrace c, d, e, f\rbrace , \lbrace b, c, d, e, f\rbrace , \lbrace a, b, c, d, e, f\rbrace \rbrace$.
Note:
Above is an unrooted tree according to the rooted tree given in section 1.1.
Unrooted trees are also able to be represented by Newick notation, however, not unique (doesn't means swapping siblings here). That is, different rooted trees can have same unrooted version. e.g. Unrooted tree for $(a, (b, (c, d)))$ is same to $(b, (a, (c, d)))$ and is also same to $((a, b), (c, d))$.
Clades do not work for unrooted tree, but we can use a new method called bipartition to represent an unrooted tree. The process is for each internal edge write down the left and right part of taxa if it was deleted. So the bipartitions for the unrooted tree above is $\lbrace (ab|cdef), (abcd|ef), (abef|cd) \rbrace$.
Note: * A given set of subset of C(T) $X$ is compatible if $X\subseteq C(T)$ * if $C(T_1) \subseteq C(T_2)$ we define that $T_1$ refines $T_2$ or $T_1$ is a refinement of $T_2$, which also means that we can get $T_2$ by contracting some edges in $T_1$, so we also say $T_2$ is a contraction of $T_1$.
We can root an unrooted tree in everywhere, however, we usually root it on internal edges of nodes. Rooting on a node is like picking a tree on that node, and rooting a tree on an edge means we should "bend" the edge and pick up the tree on the "break point".
Unrooting a tree is a reversing to rooting a tree, just "unbend" an edge then put the tree down.
Draw Hasse Diagram through the set of Clades, shows as below.
The basic idea of generating unrooted tree by bipartitions is first we construct an rooted tree then we unroot the tree. The algorithm of constructing $T$ from $C(T)$ is:
Consensus trees are used to represent what a set of tress have in common. There are three kinds of consensus trees: strict consensus tree, majority consensus tree and greedy consensus tree.
We could construct these three kind of trees using this method:
Note:
False Positive: report additional bipartitions that is not belong to the true tree.
False Negative: miss report some bipartitions supposed to be in the true tree.
Robinson-Foulds distance: Sum of all false positives and false negatives.
Note:
Acknowledgement:
This article is my learning notes of course CS394C Algorithms for Computational Biology (Fall 2013), lectured by Prof. Tandy Warnow.