Prediction of protein-protein interaction sites from genetic variation data

Supervisors: Professor Geoffrey J. Barton, Dr Andrei Pisliakov

Project description:

Our research has focused for more than 20 years on developing effective computational methods to predict the function, structure and specificity of proteins from the amino acid sequence. This has included work to characterise and predict protein-protein interactions from 3D structural information (e.g. [1]) as well as from sequences and related data (e.g. [2, 3]). Much of this experience is encapsulated in widely used software tools. These include the Jalview sequence analysis workbench, which has over 70,000 regular users worldwide, and JPred, which performs up to 250,000 predictions of secondary structure and other features from the amino acid sequence for scientists in laboratories in the UK and internationally. The papers that describe Jalview and JPred have together accumulated over 7,000 citations, which shows the broad and important impact of this methodological research.

Rapid advances in DNA sequencing technology over recent years have stimulated the large-scale sequencing of populations of single species. There are now publicly available data on variation for over 200,000 human individuals, human cancers, bacterial strains, major food crops (e.g. wheat and barley) and animals (e.g. cow). While most effort to date has focussed on exploiting these data to identify variants involved in genetic disease, the variation data also provide a completely new resource to inform details of protein structure, function and interactions within a species. Recent work from our group (MacGowan et al., 2017, submitted) has demonstrated that variation data can identify key residues important in protein-ligand and protein-protein interactions in over 200 protein domain families. This Ph.D. project will build on these findings, first by developing machine learning methods that combine this information with other indicators of protein-protein interaction to improve the accuracy and specificity of our established methods. The project will focus initially on sites within multi-domain proteins and will follow up the initial predictions with molecular dynamics (MD) simulations to explore which sites are most likely to be functionally important. The co-supervisor, Dr Andrei Pisliakov, is an expert in MD simulation techniques and has recently applied these methods to multi-domain proteins to probe the effect of observed mutations on domain-domain interactions and structure.

This project will train the student in software development and advanced bioinformatics research techniques, including machine learning, NoSQL technology and statistics, as well as established MD simulation methods. On completion of the Ph.D. the student will be well prepared for a research career in bioinformatics or biophysics, and will also have excellent transferable skills appropriate to careers in Big Data analytics or software engineering.


1. Jefferson, E.R., Walsh, T.P., and Barton, G.J., A comparison of SCOP and CATH with respect to domain-domain interactions. Proteins, 2008. 70(1): p. 54-62.

2. McDowall, M.D., Scott, M.S., and Barton, G.J., PIPs: human protein-protein interaction prediction database. Nucleic Acids Res, 2009. 37(Database issue): p. D651-6.

3. Scott, M.S. and Barton, G.J., Probabilistic prediction and ranking of human protein-protein interactions. BMC Bioinformatics, 2007. 8: p. 239.