In this thesis, we address two graph theoretical problems connected to two different biological problems both related to symbiosis (two organisms live in symbiosis if they have a close and long term interaction). The first problem is related to the size of a minimum cover by "chain subgraphs" of a bipartite graph. A chain graph is a bipartite graph whose nodes can be ordered by neighbourhood inclusion. In biological terms, the size of a minimum cover by chain subgraphs represents the number of genetic factors involved in the phenomenon of Cytoplasmic Incompatibility (CI) induced by some parasitic bacteria in their insect hosts. CI results in the impossibility to give birth to an healthy offspring when an infected male mates with an uninfected female. In the first half of the thesis we address three related problems. One is the enumeration of all the maximal edge induced chain subgraphs of a bipartite graph G, for which we provide a polynomial delay algorithm with a delay of O(n^2m) where n is the number of nodes and m the number of edges of G. Furthermore, we show that (n/2)! and 2^(\sqrt{m} \log m) bound the number of maximal chain subgraphs of G and use them to establish the input-sensitive complexity of the algorithm. The second problem we treat is finding the minimum number of chain subgraphs needed to cover all the edges of a bipartite graph. To solve this NP-hard problem, we provide an exact exponential algorithm which runs in time O^*((2+c)^m), for every c>0, by a procedure which uses our algorithm and an inclusion-exclusion technique (by O^* we denote standard big O notation but omitting polynomial factors). Notice that, since a cover by chain subgraphs is a family of subsets of edges, the existence of an algorithm whose complexity is close to 2^m is not obvious. Indeed, the basic search space would have size 2^(2^m), which corresponds to all families of subsets of edges of a graph on $m$ edges. The third problem is the enumeration of all minimal covers by chain sugbgraphs. We show that it is possible to enumerate all such minimal covers of G in time O([(M+1)|S|]^[\log((M+1)|S|)]) where S is the number of minimal covers of G and M the maximum number of chain graphs in a minimal cover. We then present the relation between the second problem and the computation of the interval order dimension of a bipartite poset. We give an interpretation of our results in the context of poset and interval poset dimension. Indeed, we can compute the interval dimension of a bipartite poset P in O^*((2+c)^p) where p is the number of incomparable pairs of P. Finally, we extend further these results to the problem of computing the poset dimension. By a classical result in poset theory (Trotter "split" operation), we then obtain a procedure which solves this problem in O^*((2+c)^(p/2)). To improve our results on the poset dimension and to perform better than O(\sqrt{2}^p), i.e. the minimum time to run the inclusion-exclusion formula on which these results are based, we introduce an associated graph GCP for each poset, called the graph of critical pairs. In this way we obtain two algorithms, an exponential and a polynomial space one. These algorithms compute the poset dimension in 2^q and O(2.9977 ^q) time respectively where q is the number of critical pairs of P (intuitively, critical pairs are the fundamental incomparable pairs to consider). In the second part of the thesis, we deal with the Reconciliation Model of two phylogenetic trees and the exploration of its optimal solutions space. Phylogenetic tree reconciliation is the approach commonly used to investigate the coevolution of sets of organisms such as hosts and symbionts. Given a phylogenetic tree for each such set, respectively denoted by H and S, together with a mapping phi of the leaves of S to the leaves of H, a reconciliation is a mapping rho of the internal nodes of S to the nodes of H which extends \phi with some constraints. Depending on the mapping of a node and its children, four types of events can be identified: "cospeciation" (when host and symbiont evolve together), "duplication" (when the symbiont evolves into different species but not the host), "loss" (when the host evolves into two new species but not the symbiont, leading to the loss of the symbiont in one of the two new host species) and "host switch" (when the symbiont evolves into two new species with one species remaining with its current host while the other switches, that is jumps to another host species). Given a cost for each kind of event, c_c, c_d, c_l, and c_h respectively, we can assign a total cost to each reconciliation. The set of reconciliations with minimum cost is denoted by Rec(H,P,phi,C), where C = (c_c,c_d,c_l,c_h), and its elements are said to represent "parsimonious" reconciliations. However, their number can be often huge. Without further information, any biological interpretation of the underlying coevolution would require that all the parsimonious reconciliations are enumerated and examined. The latter is however impossible without providing some sort of high level view of the situation. In this thesis, we approached this problem by introducing two equivalence relations to collect similar reconciliations and reduce the optimal solutions set to a smaller set of representatives of these equivalence classes. We introduce as well a new distance among optimal reconciliations DH and we compare it with the distances already present in the literature. We show that we can embed the set of parsimonious reconciliations Rec(H,P,phi,C) into the discrete k dimensional hypercube H^k = {0,1}^k and that DH coincides with the "Hamming Distance" on H^k. Finally, we present a series of results on reconciliations based on the conditions c_c <= c_d and c_l > 0 which lead to prove that a reconciliation is characterized by its set of host switches. The equivalence relations and the distance DH are all based on the host-switch events. We performed experiments on some real datasets and we present some of the results obtained to show the efficacy of the two equivalence relations. We comment these results under the light of the chosen cost vector C. The most outstanding results we obtain is in the case of the dataset related to the parasite Wolbachia where we pass from ~4.08 x 10^{42} parsimonious reconciliations to ~ 1.15 x 10^{3} representatives.

Enumeration algorithms and graph theoretical models to address biological problems related to symbiosis / Gastaldello, Mattia. - (2018 Feb 16).

Enumeration algorithms and graph theoretical models to address biological problems related to symbiosis

GASTALDELLO, MATTIA
16/02/2018

Abstract

In this thesis, we address two graph theoretical problems connected to two different biological problems both related to symbiosis (two organisms live in symbiosis if they have a close and long term interaction). The first problem is related to the size of a minimum cover by "chain subgraphs" of a bipartite graph. A chain graph is a bipartite graph whose nodes can be ordered by neighbourhood inclusion. In biological terms, the size of a minimum cover by chain subgraphs represents the number of genetic factors involved in the phenomenon of Cytoplasmic Incompatibility (CI) induced by some parasitic bacteria in their insect hosts. CI results in the impossibility to give birth to an healthy offspring when an infected male mates with an uninfected female. In the first half of the thesis we address three related problems. One is the enumeration of all the maximal edge induced chain subgraphs of a bipartite graph G, for which we provide a polynomial delay algorithm with a delay of O(n^2m) where n is the number of nodes and m the number of edges of G. Furthermore, we show that (n/2)! and 2^(\sqrt{m} \log m) bound the number of maximal chain subgraphs of G and use them to establish the input-sensitive complexity of the algorithm. The second problem we treat is finding the minimum number of chain subgraphs needed to cover all the edges of a bipartite graph. To solve this NP-hard problem, we provide an exact exponential algorithm which runs in time O^*((2+c)^m), for every c>0, by a procedure which uses our algorithm and an inclusion-exclusion technique (by O^* we denote standard big O notation but omitting polynomial factors). Notice that, since a cover by chain subgraphs is a family of subsets of edges, the existence of an algorithm whose complexity is close to 2^m is not obvious. Indeed, the basic search space would have size 2^(2^m), which corresponds to all families of subsets of edges of a graph on $m$ edges. The third problem is the enumeration of all minimal covers by chain sugbgraphs. We show that it is possible to enumerate all such minimal covers of G in time O([(M+1)|S|]^[\log((M+1)|S|)]) where S is the number of minimal covers of G and M the maximum number of chain graphs in a minimal cover. We then present the relation between the second problem and the computation of the interval order dimension of a bipartite poset. We give an interpretation of our results in the context of poset and interval poset dimension. Indeed, we can compute the interval dimension of a bipartite poset P in O^*((2+c)^p) where p is the number of incomparable pairs of P. Finally, we extend further these results to the problem of computing the poset dimension. By a classical result in poset theory (Trotter "split" operation), we then obtain a procedure which solves this problem in O^*((2+c)^(p/2)). To improve our results on the poset dimension and to perform better than O(\sqrt{2}^p), i.e. the minimum time to run the inclusion-exclusion formula on which these results are based, we introduce an associated graph GCP for each poset, called the graph of critical pairs. In this way we obtain two algorithms, an exponential and a polynomial space one. These algorithms compute the poset dimension in 2^q and O(2.9977 ^q) time respectively where q is the number of critical pairs of P (intuitively, critical pairs are the fundamental incomparable pairs to consider). In the second part of the thesis, we deal with the Reconciliation Model of two phylogenetic trees and the exploration of its optimal solutions space. Phylogenetic tree reconciliation is the approach commonly used to investigate the coevolution of sets of organisms such as hosts and symbionts. Given a phylogenetic tree for each such set, respectively denoted by H and S, together with a mapping phi of the leaves of S to the leaves of H, a reconciliation is a mapping rho of the internal nodes of S to the nodes of H which extends \phi with some constraints. Depending on the mapping of a node and its children, four types of events can be identified: "cospeciation" (when host and symbiont evolve together), "duplication" (when the symbiont evolves into different species but not the host), "loss" (when the host evolves into two new species but not the symbiont, leading to the loss of the symbiont in one of the two new host species) and "host switch" (when the symbiont evolves into two new species with one species remaining with its current host while the other switches, that is jumps to another host species). Given a cost for each kind of event, c_c, c_d, c_l, and c_h respectively, we can assign a total cost to each reconciliation. The set of reconciliations with minimum cost is denoted by Rec(H,P,phi,C), where C = (c_c,c_d,c_l,c_h), and its elements are said to represent "parsimonious" reconciliations. However, their number can be often huge. Without further information, any biological interpretation of the underlying coevolution would require that all the parsimonious reconciliations are enumerated and examined. The latter is however impossible without providing some sort of high level view of the situation. In this thesis, we approached this problem by introducing two equivalence relations to collect similar reconciliations and reduce the optimal solutions set to a smaller set of representatives of these equivalence classes. We introduce as well a new distance among optimal reconciliations DH and we compare it with the distances already present in the literature. We show that we can embed the set of parsimonious reconciliations Rec(H,P,phi,C) into the discrete k dimensional hypercube H^k = {0,1}^k and that DH coincides with the "Hamming Distance" on H^k. Finally, we present a series of results on reconciliations based on the conditions c_c <= c_d and c_l > 0 which lead to prove that a reconciliation is characterized by its set of host switches. The equivalence relations and the distance DH are all based on the host-switch events. We performed experiments on some real datasets and we present some of the results obtained to show the efficacy of the two equivalence relations. We comment these results under the light of the chosen cost vector C. The most outstanding results we obtain is in the case of the dataset related to the parasite Wolbachia where we pass from ~4.08 x 10^{42} parsimonious reconciliations to ~ 1.15 x 10^{3} representatives.
16-feb-2018
File allegati a questo prodotto
Non ci sono file associati a questo prodotto.

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11573/1072500
 Attenzione

Attenzione! I dati visualizzati non sono stati sottoposti a validazione da parte dell'ateneo

Citazioni
  • ???jsp.display-item.citation.pmc??? ND
  • Scopus ND
  • ???jsp.display-item.citation.isi??? ND
social impact