How it works and what it means for biology and science.
I started writing this on the day of the announcement but got distracted and did not finish. As part of my New Year’s plan to finish my blogs :) I’m finishing this one now, although by now there are MANY great blogs and videos explaining AlphaFold2 (see resources below).
DeepMind claims to have solved the 50-year-old grand challenge of predicting the folded shape of a protein from its amino acid sequence alone. They produced their own video that frames the problem and their solution quite well.
AlphaFold: a solution to a 50-year-old grand challenge in biology
In a major scientific advance, the latest version of our AI system AlphaFold has been recognised as a solution to this…
What is protein folding?
First, what is a protein? Proteins are macromolecules (long chains of smaller molecules) that human “stuff” (tissues, organs, blood and more) is made of.
Amino acids are the organic molecules (carbon-based and small, a few dozen atoms at most) that are the building blocks of proteins. There are 20 standard amino acids, and a chain of them, chemically linked together, forms a protein.
For a typical protein length of about 300 amino acids, more than 20³⁰⁰ different polypeptide chains could theoretically be made. Source: NIH
And although there are 20³⁰⁰ unique combinations of possible protein chains, only about 20,000 proteins are encoded in the human genome.
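To get a feel for how large 20³⁰⁰ is, here is a quick back-of-the-envelope check. It only uses the numbers quoted above (20 amino acids, a chain length of 300, about 20,000 human proteins); the variable names are my own.

```python
import math

N_AMINO_ACIDS = 20       # standard amino acids
CHAIN_LENGTH = 300       # "typical" protein length quoted above
HUMAN_PROTEINS = 20_000  # approximate count quoted above

# Python integers have arbitrary precision, so this is exact.
n_chains = N_AMINO_ACIDS ** CHAIN_LENGTH
n_digits = len(str(n_chains))

# The same magnitude via logarithms: 300 * log10(20) is about 390.3.
log10_chains = CHAIN_LENGTH * math.log10(N_AMINO_ACIDS)

# Fraction of possible chains that the human genome uses, as a power of ten.
log10_fraction = math.log10(HUMAN_PROTEINS) - log10_chains

print(n_digits)                  # 391
print(round(log10_chains, 1))    # 390.3
print(round(log10_fraction, 1))  # -386.0
```

In other words, 20³⁰⁰ is a number with 391 digits, and the proteins our genome actually encodes are a vanishingly small slice of what is combinatorially possible.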
Second, what’s the folding about? Protein folding is the process by which the chain of amino acids bends and twists into the protein’s physical 3D shape and structure. This is a nice TED Talk on the protein folding problem.
Last, what’s the big deal? There are 180M known protein sequences, but just over 170k known protein structures, each painstakingly and laboriously determined by experiment. If we can predict the fold directly from the amino acid sequence, we have a transformative tool for understanding the fundamental processes of life.
What’s it useful for?
Simply put, a protein’s structure helps determine its function, its interactions, and its behavior.
It would vastly accelerate efforts to understand the building blocks of cells and enable quicker and more advanced drug discovery. -Nature
This whole book chapter on Protein Structure is good, but this section on The Use of Protein Structures in Inhibitor Discovery and Design is particularly helpful. As it pertains to how protein structure can help in drug discovery:
Structural based design can assist this continuous process by pinpointing opportunities for structural modifications that do not interfere with binding, but may improve affinity and selectivity, and which may favorably alter the properties of the molecule, such as solubility, hydrophobicity and ionization state.
Bottom line: there are a number of ways in which computational biology and physics can use the protein structure to determine where a drug may bind or interact in the desired manner. This video explains some of the physical phenomena and techniques involved in developing protein-based therapeutics.
How does AlphaFold work?
Before we get to AlphaFold’s particular solution, I was curious how the whole problem of going from an amino acid sequence to a protein shape is framed to begin with.
What are amino acid sequences? An amino acid sequence is the ordered list of amino acids that make up the protein, usually written with one letter per amino acid. Here’s an example amino acid sequence from UniProt, an open database of protein sequences (explanation of protein codes can be found here):
What exactly are the “shapes”? A shape is specified by the locations of the protein’s atoms in 3D space, which can equivalently be described by the distances and angles between them. The visualization is produced separately from those coordinates.
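If you don’t have the UniProt page open, here is a minimal sketch of how such a sequence is commonly stored and read. Sequence databases typically serve sequences as FASTA text: a header line starting with `>` followed by lines of one-letter amino acid codes. The toy record below is made up for illustration (it simply uses each of the 20 standard amino acids once) and is not a real UniProt entry; `parse_fasta` is my own helper, not a UniProt API.

```python
# One-letter to three-letter codes for the 20 standard amino acids.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def parse_fasta(text):
    """Return (header, sequence) for a single-record FASTA string."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("expected a FASTA header line starting with '>'")
    # Sequence lines are wrapped for readability, so join them back together.
    return lines[0][1:], "".join(lines[1:])


# Hypothetical record: made-up identifier and made-up sequence.
toy_fasta = """>toy|X00000|TOY_EXAMPLE made-up sequence, not a real entry
MGSLKAVTDE
WPQRHFCNIY
"""

header, sequence = parse_fasta(toy_fasta)
residues = [THREE_LETTER[code] for code in sequence]

print(header)
print(len(sequence), "residues, starting with", residues[:3])
```

The point is that the input to the prediction problem is nothing more than a string over a 20-letter alphabet.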
The primary information stored in the PDB archive consists of coordinate files for biological molecules. These files list the atoms in each protein, and their 3D location in space. These files are available in several formats (PDB, mmCIF, XML). A typical PDB formatted file includes a large “header” section of text that summarizes the protein, citation information, and the details of the structure solution, followed by the sequence and a long list of the atoms and their coordinates. The archive also contains the experimental observations that are used to determine these atomic coordinates.
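To make the “long list of the atoms and their coordinates” concrete, here is a minimal sketch of reading atom records from PDB-formatted text. In that format each `ATOM` line is fixed-width, with the x, y, z coordinates in columns 31 to 54. The toy lines are generated by a small helper of my own (`format_atom`) with made-up coordinates so that the columns line up; they are not from a real PDB entry, and the parser ignores several fields (occupancy, temperature factor, element) that real files carry. For real work a library such as Biopython is the safer route.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    serial: int
    name: str       # atom name, e.g. "CA" for the alpha carbon
    res_name: str   # three-letter residue (amino acid) name
    chain: str
    res_seq: int    # residue number within the chain
    x: float
    y: float
    z: float


def format_atom(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a toy ATOM line using the fixed columns the parser reads."""
    return (f"ATOM  {serial:>5} {name:<4} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atom(line):
    """Slice one fixed-width ATOM line into its fields (0-based slices)."""
    return Atom(
        serial=int(line[6:11]),
        name=line[12:16].strip(),
        res_name=line[17:20].strip(),
        chain=line[21],
        res_seq=int(line[22:26]),
        x=float(line[30:38]),
        y=float(line[38:46]),
        z=float(line[46:54]),
    )


def parse_pdb_atoms(text):
    """Keep only ATOM records; header and other record types are skipped."""
    return [parse_atom(ln) for ln in text.splitlines() if ln.startswith("ATOM  ")]


# Hypothetical file content with made-up coordinates (in angstroms).
toy_pdb = "\n".join([
    "HEADER    TOY EXAMPLE, NOT A REAL PDB ENTRY",
    format_atom(1, " N", "MET", "A", 1, 0.000, 1.400, 3.000),
    format_atom(2, " CA", "MET", "A", 1, 1.000, 2.000, 3.000),
    format_atom(3, " C", "MET", "A", 1, 2.400, 1.500, 3.000),
    format_atom(4, " CA", "GLY", "A", 2, 4.800, 2.000, 3.000),
    "END",
])

atoms = parse_pdb_atoms(toy_pdb)
alpha_carbons = [a for a in atoms if a.name == "CA"]
for atom in alpha_carbons:
    print(atom.res_name, atom.res_seq, (atom.x, atom.y, atom.z))
```

Keeping only the alpha carbons (one per amino acid) gives the backbone trace that many structure pictures are drawn from.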
Here is an example of a sequence and the visualization of the structure:
What is the CASP Challenge? CASP (Critical Assessment of protein Structure Prediction) is the contest in which AlphaFold was evaluated. The basic task it sets is to take amino acid sequences as input and predict the protein’s 3D structure, for example as distances and angles between the amino acids, which together define the shape and can be visualized separately.
How does AlphaFold work? The specific algorithm used by AlphaFold2 is not yet fully known. This is DeepMind’s paper on AlphaFold1, which we assume laid some of the foundations for the current AlphaFold solution. This is a nice video explaining AlphaFold1 that shows how to generate predictions and visualize them. You can also see the source code for AlphaFold1 here.
This article describes AlphaFold1 pretty well too and discusses its use of convolutional neural networks. This nice video explanation from Lex Fridman speculates that attention, or transformers, have replaced ConvNets in AlphaFold2.
Basically, AlphaFold is described as a two-step system. Step 1 is a machine learning model (ConvNets in AlphaFold1, Transformers in AlphaFold2) that predicts probabilities of distances between pairs of amino acids. Step 2 is an optimization algorithm (like gradient descent) that searches for a 3D structure whose distances best agree with those predictions.
There is some skepticism about AlphaFold2 and how well it will continue to work in predicting protein shapes for new amino acid sequences, but as of now it does appear to be a great breakthrough. Like most, we will await the release of the paper and the code to understand more about how it works.
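To make “distances and angles” concrete, here is a minimal sketch that computes both from 3D coordinates: a pairwise distance matrix, and a torsion (dihedral) angle defined by four consecutive points. The four toy points stand in for atoms and are made up; the function names are my own, and this is not CASP’s scoring code.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def distance_matrix(points):
    """Symmetric matrix of Euclidean distances between all pairs of points."""
    return [[math.dist(p, q) for q in points] for p in points]


def torsion_angle(p0, p1, p2, p3):
    """Signed angle in degrees between the planes (p0, p1, p2) and (p1, p2, p3)."""
    b0 = sub(p0, p1)
    b1 = sub(p2, p1)
    b2 = sub(p3, p2)
    norm = math.sqrt(dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond b1.
    v = sub(b0, tuple(dot(b0, b1) * x for x in b1))
    w = sub(b2, tuple(dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(dot(cross(b1, v), w), dot(v, w)))


# Four made-up points: the last one is rotated a quarter turn around
# the central p1-p2 axis relative to the first one.
toy_points = [(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 1.0)]

distances = distance_matrix(toy_points)
angle = torsion_angle(*toy_points)

print([round(d, 3) for d in distances[0]])
print(round(abs(angle), 1))  # 90.0
```

Distances and torsion angles do not change when the whole molecule is rotated or shifted, which is one reason they are a convenient way to describe a shape.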
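To illustrate the second step only, here is a much-simplified sketch of using gradient descent to find 3D coordinates that match a set of target distances. This is not DeepMind’s code. As I understand the AlphaFold1 paper, the real system optimizes over predicted distance distributions and backbone torsion angles, whereas this toy version assumes a single exact target distance per pair and moves raw coordinates directly. The “true” structure used to produce the targets is made up, and all function names are my own.

```python
import math
import random


def distance_matrix(coords):
    return [[math.dist(a, b) for b in coords] for a in coords]


def stress(coords, target):
    """Sum of squared differences between current and target distances."""
    n = len(coords)
    return sum((math.dist(coords[i], coords[j]) - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))


def fit_structure(target, n_steps=300, seed=0):
    """Gradient descent on stress, starting from random 3D coordinates."""
    n = len(target)
    lr = 1.0 / (2 * n)  # conservative step size that keeps this toy loss stable
    rng = random.Random(seed)
    coords = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
    for _ in range(n_steps):
        grads = [[0.0, 0.0, 0.0] for _ in range(n)]
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                d = math.dist(coords[i], coords[j]) or 1e-9
                # d(stress)/d(x_i) = 2 * (d_ij - target_ij) * (x_i - x_j) / d_ij
                scale = 2 * (d - target[i][j]) / d
                for k in range(3):
                    grads[i][k] += scale * (coords[i][k] - coords[j][k])
        coords = [[c - lr * g for c, g in zip(ci, gi)]
                  for ci, gi in zip(coords, grads)]
    return coords


# Made-up "true" chain of six points, used only to produce target distances.
toy_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 3.6, 0.0),
              (8.5, 4.5, 1.5), (9.0, 8.0, 3.0), (6.0, 9.5, 5.0)]
target = distance_matrix(toy_coords)

start = fit_structure(target, n_steps=0)   # random starting point
fitted = fit_structure(target, n_steps=300)

print("stress before:", round(stress(start, target), 3))
print("stress after: ", round(stress(fitted, target), 3))
```

Note that distances alone pin down a shape only up to rotation, translation, and mirror image, so the fitted coordinates will not equal `toy_coords` even when the distances match well.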
Resources
- CASP Contest
- Khan Academy Protein Structures
- ProteinNet: standardized dataset for machine learning of protein structure
- Protein Data Bank (PDB)
- Guide to understanding PDB Data
- Wiley Book on PDB Data Format
- PDBx/mmCIF Dictionary Resources
- PyMOL Molecular Protein Visualization System
- UniProt open database of protein sequences
- Explanation of protein codes
- Explanation of sequence annotation
- Structure of SARS Covid
- Design of Protein Therapeutics [ video ]
- Book chapter on protein structure, modeling and applications