Tackling the problem of ‘Big Data’ in Molecular Biology and Genomics: Advanced sequence alignment visualisation and analysis for molecular function prediction

Supervisors: Professor Geoffrey J. Barton, Dr Ulrich Zachariae (Reader)

Project description:

This project should appeal to a student who will most likely have a background in computer science or another subject that provides strong experience in algorithm development. They will extend their skills to the challenging problem of visualising ‘big data’ in biology. They will also gain experience of communicating their achievements to a wide biological research audience, both by distributing their software to a large user base and through conventional seminars and meetings.

The global archive of DNA sequence data is currently doubling in size every nine months, and the trend for sequencing projects to move from the domain of large international consortia to that of individual laboratories continues to accelerate. To be useful, a sequence must be classified and analysed in the context of other sequences and derived data. Multiple alignments of protein or nucleic acid sequences provide the key scaffold to understand variation and to view annotations and predictions of the functions of genes and proteins in a concise format. Working with new alignments or established alignment collections such as Pfam requires powerful but easy-to-use interactive tools. The Jalview open-source, GPL-licensed multiple sequence alignment editor and analysis workbench developed in my group is widely regarded as the de facto standard desktop tool and web applet for these tasks, and the current version of Jalview (2.10.1) is installed on over 70,000 computers worldwide. Many protein families now contain tens of thousands of sequences. Although Jalview can read and manipulate alignments of this size, visualising and interpreting them on conventional computer screens is next to impossible.

The aim of this project will be to develop novel visualisation methods for large sequence alignments, making it easier to extract functional information from an alignment that can then be applied to different biological problems. Techniques to be explored will include programming the high-performance 3D hardware capabilities that are now common on personal computers, as well as the methods for downsampling large images that are used in mapping applications. The new techniques developed in this project will take advantage of the two BBSRC Bioinformatics and Biological Resources grants that support Jalview and the Dundee Resource for Protein Sequence Analysis and Structure Prediction in order to roll out the new techniques to the large audience that uses these resources. In this way, the research undertaken in this project will have an impact not only on the immediate biological applications carried out by the student, but also on the tens of thousands of Jalview users throughout the UK and the world.
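
As a rough illustration of the map-style downsampling idea mentioned above, the following Python sketch builds a "zoom pyramid" from a small alignment, in the way mapping applications store progressively coarser tiles. It is not Jalview code and does not describe the method the project will adopt: the function names downsample and build_pyramid are hypothetical, the alignment is assumed to be a list of equal-length strings, and summarising each tile by its most common symbol is only one of many possible choices.

```python
from collections import Counter


def downsample(alignment, row_block, col_block):
    """Summarise each row_block x col_block tile by its most common symbol.

    Assumes `alignment` is a list of equal-length strings (an assumption of
    this sketch, not a Jalview data structure).
    """
    n_rows = len(alignment)
    n_cols = len(alignment[0]) if alignment else 0
    coarse = []
    for r in range(0, n_rows, row_block):
        summary_row = []
        for c in range(0, n_cols, col_block):
            # Count every symbol that falls inside the current tile.
            counts = Counter(
                seq[c2]
                for seq in alignment[r:r + row_block]
                for c2 in range(c, min(c + col_block, n_cols))
            )
            summary_row.append(counts.most_common(1)[0][0])
        coarse.append("".join(summary_row))
    return coarse


def build_pyramid(alignment, factor=2):
    """Return zoom levels from full resolution down to a single cell."""
    if factor < 2:
        raise ValueError("factor must be at least 2")
    if not alignment or not alignment[0]:
        return []
    levels = [list(alignment)]
    # Keep halving (for factor=2) until one row and one column remain.
    while len(levels[-1]) > 1 or len(levels[-1][0]) > 1:
        levels.append(downsample(levels[-1], factor, factor))
    return levels


if __name__ == "__main__":
    toy = ["AAAA", "AAAA", "CCGG", "CCGG"]
    for depth, level in enumerate(build_pyramid(toy)):
        print(depth, level)
```

A viewer built along these lines could draw a coarse level when the whole alignment is on screen and switch to finer levels as the user zooms in. A real implementation would probably summarise tiles with something more informative than the majority symbol, such as a conservation score.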