Français Anglais
Accueil Annuaire Plan du site
Home > Research results > Dissertations & habilitations
Research results
Ph.D de

Ph.D
Group : Bioinformatics

Ranking biological and biomedical data : algorithms and applications

Starts on 01/10/2012
Advisor : DENISE, Alain
[COHEN-BOULAKIA Sarah]

Funding :
Affiliation : Université Paris-Saclay
Laboratory : LRI-bio info

Defended on 25/09/2015, committee :
Directeur de thèse :
- Cohen-Boulakia Sarah, MdC HDR, LRI, Université Paris-Sud
- Denise Alain, PR, LRI, Université Paris-Sud

Rapporteurs :
- Amann Bernd, PR, LIP6, Université Pierre et Marie Curie
- Vialette Stephane, DR, LIGM, Université Paris-Est Marne-la-Vallée

Examinateurs :
- Belhajjame Khalid, MdC, LAMSADE, Université Paris-Dauphine
- Bidoit Nicole, PR, Examinateur, LRI, Université Paris-Sud
- Leser Ulf, PR, WBI, Université Humböldt, Berlin

Research activities :

Abstract :
The rank aggregation problem is to build consensus among a set of rankings (ordered elements). Although this problem has numerous applications (consensus among user votes, consensus between results ordered differently by different search engines ...), computing an optimal consensus is rarely feasible in cases of real applications (problem NP-Hard). Many approximation algorithms and heuristics were therefore designed. However, their performance (time and quality of product loss) are quite different and depend on the datasets to be aggregated. Several studies have compared these algorithms but they have generally not considered the case (yet common in real datasets) that elements can be tied in rankings (elements at the same rank). Choosing a consensus algorithm for a given dataset is therefore a particularly important issue to be studied (many applications) and it is an open problem in the sense that none of the existing studies address it.

More formally, a consensus ranking is a ranking that minimizes the sum of the distances between this consensus and the input rankings. Like much of the state-of-art, we have considered in our studies the generalized Kendall-Tau distance, and variants.

Specifically, this thesis has three contributions.

First, we propose new complexity results associated with cases encountered in the actual data that rankings may be incomplete and where multiple items can be classified equally (ties). We isolate the different "features" that can explain variations in the results produced by the aggregation algorithms (for example, using the generalized distance of Kendall-Tau or variants, pre-processing the datasets with unification or projection). We propose a guide to characterize the context and the need of a user to guide him into the choice of both a pre-treatment of its datasets but also the distance to choose to calculate the consensus. We finally adapt existing algorithms to this new context.

Second, we evaluate these algorithms on a large and varied set of datasets both real and synthetic reproducing actual features such as similarity between rankings, the presence of ties and different pre-treatments. This large evaluation comes with the proposal of a new method to generate synthetic data with similarities based on a Markov chain modeling. This evaluation led to the isolation of datasets features that impact the performance of the aggregation algorithms, and to design a guide to characterize the needs of a user and advise him in the choice of the algorithm to be use. A web platform to replicate and extend these analyzes is available (rank-aggregation-with-ties.lri.fr).

Finally, we demonstrate the value of using the rankings aggregation approach in two use cases.
We provide a tool to reformulating the text user queries through biomedical terminologies, to then query biological databases, and ultimately produce a consensus of results obtained for each reformulation (conqur-bio.lri.fr). We compare the results to the references platform and show a clear improvement in quality results.
We also calculate consensus between list of workflows established by experts in the context of similarity between scientific workflows. We note that the computed consensus agree with the

Ph.D. dissertations & Faculty habilitations
CAUSAL LEARNING FOR DIAGNOSTIC SUPPORT


CAUSAL UNCERTAINTY QUANTIFICATION UNDER PARTIAL KNOWLEDGE AND LOW DATA REGIMES


MICRO VISUALIZATIONS: DESIGN AND ANALYSIS OF VISUALIZATIONS FOR SMALL DISPLAY SPACES
The topic of this habilitation is the study of very small data visualizations, micro visualizations, in display contexts that can only dedicate minimal rendering space for data representations. For several years, together with my collaborators, I have been studying human perception, interaction, and analysis with micro visualizations in multiple contexts. In this document I bring together three of my research streams related to micro visualizations: data glyphs, where my joint research focused on studying the perception of small-multiple micro visualizations, word-scale visualizations, where my joint research focused on small visualizations embedded in text-documents, and small mobile data visualizations for smartwatches or fitness trackers. I consider these types of small visualizations together under the umbrella term ``micro visualizations.'' Micro visualizations are useful in multiple visualization contexts and I have been working towards a better understanding of the complexities involved in designing and using micro visualizations. Here, I define the term micro visualization, summarize my own and other past research and design guidelines and outline several design spaces for different types of micro visualizations based on some of the work I was involved in since my PhD.