Streamlining Big Genomic Data Analysis and Knowledge Extraction

Supervisors: Andreas Kranis, Amrey Krause

Project description:


Nucleic acid genotyping and sequencing technology has transformed livestock genetics and animal breeding, and offers exciting new opportunities to understand how traits of interest to animal production are controlled at the genomic level. Breeding organisations have embraced this opportunity and have established genomic evaluation programs across different farm animal species, in which a huge volume of data is generated on a routine basis. As big genomic data rapidly accumulates, the complexity of managing such a large volume of information becomes an inherent barrier to the effective uptake and use of the new technology. At the same time, the unprecedented quantity and quality of these genomic repositories present a unique opportunity to unravel the complexities of the biological processes that underlie the phenotypic expression of economically important traits.


The overarching aim of the project is to develop strategies and a data-centric framework to optimise the analysis of big genomic repositories and to extract novel biological knowledge.

The following aims will be addressed:

  • Identify genomic variants and regions that explain genetic variation for a variety of traits and across different chicken lines.
  • Create a curated database that collects and processes data from online repositories via the respective APIs and integrates them with in-house results.
  • Streamline the process of combining results from meta-analysis and functional information to rank the results of statistical association, with the ultimate objective of strengthening causal inference.
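
As an illustration of the third aim, the sketch below shows one simple way association results from several lines could be combined and ranked. It uses Fisher's method to merge per-line p-values and breaks ties in favour of variants that carry a functional annotation. This is only a minimal sketch under stated assumptions: the p-values are treated as independent and strictly positive, and the variant identifiers, field names and values are hypothetical placeholders rather than project data or the project's chosen method.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values (each > 0) with Fisher's method.

    The statistic X = -2 * sum(ln p) follows a chi-square distribution
    with 2k degrees of freedom, whose survival function has a closed
    form for even degrees of freedom.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i  # (X/2)^i / i!
        total += term
    return math.exp(-half_stat) * total


def rank_variants(variants):
    """Sort variants by combined p-value; annotated variants win ties."""
    scored = [
        {**v, "combined_p": fisher_combined_p(v["p_values"])}
        for v in variants
    ]
    return sorted(scored, key=lambda v: (v["combined_p"], not v["functional"]))


# Hypothetical input: per-line p-values and a functional annotation flag.
example_variants = [
    {"id": "snp_b", "p_values": [0.5, 0.6], "functional": False},
    {"id": "snp_a", "p_values": [0.01, 0.02], "functional": True},
]
ranked = rank_variants(example_variants)
for variant in ranked:
    print(variant["id"], variant["combined_p"])
```

In practice the per-line results would rarely be fully independent, so a weighted or correlation-aware meta-analysis and a richer functional score would likely replace both pieces of this sketch.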

We propose to meet these goals by harnessing the complementary expertise of The Roslin Institute, the Edinburgh Parallel Computing Centre (EPCC) and the industrial partner (Aviagen), with the ultimate objective of maximising the return on investment in DNA technologies.

To this end, we will use existing genomic data from one of the world’s largest poultry breeding programs to develop the proposed data science framework.

Training outcomes

The proposed project is an interdisciplinary collaboration that aims to exploit new, data-driven ways of delivering research. The first supervisor (AnK) has a background in the quantitative genetics and genomics of poultry populations, with experience in using very large genomic datasets. The second supervisor (AmK) is a data architect with extensive experience in the design and implementation of distributed data analysis technologies and infrastructure for big data analytics. This diverse supervisory panel will expose the student to different working environments and provide interdisciplinary skills in computational data science.

The current explosion of big and diverse data in biology requires a new breed of researchers with excellent data-handling and digital skills. This project aims to equip the student with the much sought-after quantitative and data-handling skills needed by future research leaders in the biosciences, including advanced statistical analysis, data management, programming, bioinformatics and the streamlining of complex computational approaches on large genomic data.