Machine learning-guided deconvolution of plasma protein levels

Proteomic techniques now measure thousands of circulating proteins at population scale, driving a surge in biomarker studies and biological clocks. However, their potential impact, generalisability, and biological relevance are hard to assess without understanding the origins and roles of the thousands of proteins implicated in these studies. Here, we provide a data-driven identification of the foundations of protein variation that underlie their links to ageing and diseases, differ between sexes and ancestries, and help guide protein biomarker and drug target discovery. We use machine learning to systematically identify and quantify the foundations of plasma levels of ~3,000 protein targets among 43,240 participants of the UK Biobank. Out of >1,700 participant and sample characteristics, we identify a median of 19 factors (range: 1-36) that jointly explained an average of 23.7% (max. 79.9%) of the variance in plasma levels across protein targets. Proteins segregated into distinct clusters according to their explanatory factors, with modifiable characteristics explaining more variance than genetic variation (13.3% vs 9.8%). We identify proteins for which the factors underlying their variation differed by sex (n=1414 proteins) or across ancestries (n=86 proteins). We establish a knowledge graph that integrates our findings with genetic studies and drug characteristics to guide identification of potential markers of target engagement. We demonstrate the value of our resource 1) by identifying disease-specific biomarkers, like matrix metalloproteinase 12 for abdominal aortic aneurysm, and 2) by developing a framework for phenotype enrichment of protein signatures from independent studies to identify underlying sources of variation. All results are explorable via an interactive web portal (https://omicscience.org/apps/prot_foundation) and can be readily integrated into ongoing studies using an associated R package.
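To make the idea of phenotype enrichment of a protein signature concrete, the following is a minimal sketch that assumes a plain hypergeometric over-representation test. The abstract does not state which test the framework uses, so this is an illustration only. The function name and all counts are hypothetical and are not taken from the resource or its R package.

```python
from math import comb


def enrichment_p_value(n_universe, n_factor, n_signature, n_overlap):
    """One-sided hypergeometric p-value, P(overlap >= n_overlap).

    n_universe  : proteins measured in total (hypothetical count)
    n_factor    : proteins whose variance is linked to a given factor
    n_signature : proteins in the signature from an independent study
    n_overlap   : signature proteins that are also linked to the factor
    """
    max_overlap = min(n_factor, n_signature)
    total = comb(n_universe, n_signature)
    # Sum the probability mass for every overlap at least as large as observed.
    tail = sum(
        comb(n_factor, k) * comb(n_universe - n_factor, n_signature - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers (assumed): 20 proteins measured, 5 linked to a factor,
# a 4-protein signature of which 3 are linked to that factor.
p = enrichment_p_value(n_universe=20, n_factor=5, n_signature=4, n_overlap=3)
print(f"enrichment p-value: {p:.4f}")
```

A small p-value here would suggest that the signature is over-represented among proteins linked to the factor. A real analysis would also need multiple-testing correction across factors.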

Major axes of plasma protein variation. a Uniform Manifold Approximation and Projection (UMAP) mapping of the variance explained matrix across 2853 protein targets. Each protein has been assigned to a cluster based on k-means clustering and is coloured accordingly. b Number of protein targets included in each cluster. c–l Same UMAP plot but coloured according to the variance explained by the factor given on top of each plot. Proteins with strong contributions (>1%) are highlighted. pQTL = protein quantitative trait loci.
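The clustering step described in the caption can be sketched as follows, assuming a small toy variance explained matrix (rows are proteins, columns are factors). The values, the choice of k, and the deterministic farthest-point initialisation are all assumptions made for illustration. The caption specifies only that k-means was used, not how it was configured.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_point_init(rows, k):
    """Deterministic start: first row, then repeatedly the row farthest
    from all centroids chosen so far (an assumption, not k-means++)."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        nxt = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans(rows, k, n_iter=100):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    centroids = farthest_point_init(rows, k)
    labels = None
    for _ in range(n_iter):
        # Assign each protein to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: sq_dist(row, centroids[c]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments stable, so the algorithm has converged
        labels = new_labels
        # Move each centroid to the mean of its assigned proteins.
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical variance explained per protein by [age, sex, pQTL].
ve_matrix = [
    [0.30, 0.02, 0.01],
    [0.28, 0.03, 0.02],
    [0.33, 0.01, 0.00],
    [0.01, 0.02, 0.40],
    [0.02, 0.01, 0.35],
    [0.00, 0.03, 0.45],
]
labels, centroids = kmeans(ve_matrix, k=2)
print(labels)
```

In this toy matrix the first three proteins are dominated by age and the last three by pQTL effects, so they separate into two clusters. A UMAP projection of the same matrix would be used only for display, not for the cluster assignment.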

Data access

By using these results in your research, you agree to cite our publication. Our legal notice, data protection statement and data usage agreement apply.

Protein Atlas:
A data visualization tool for protein variance analysis
Interactive Knowledge Graph
Download app data: data.zip