# Normal Mode Analysis (NMA)¶

NMA analyses the oscillations of a structure. For proteins, it is useful for studying the large amplitude motions for a selected conformation. The main assumption is that the motions are harmonic. Thus, each normal mode, which is a concerted motion of many atoms, acts as a simple harmonic oscillator, and it is independent of all the other normal modes.

For a harmonic oscillator with a mass \(m\) supported on a spring with force constant \(k\), the potential energy of the system, \(V = kx^2\), for an extension \(x\) leads to the restoring force,

By substituting this Hooke’s Law force into Newton’s Law, \(\mathbf{F} = m\mathbf{a}\) leads to the differential equation,

The solution is,

with \(v\) being the frequency of the vibration.

In three dimensions and for a set of \(N\) atoms, one has the corresponding generalized equation,

where \(\mathbf{H}\) is the \(3N\)x\(3N\) Hessian symmetric matrix of force constants and \(\mathbf{X}\) is the \(3N\)x\(1\) vector of the positions of the atoms. The solutions of the equation may be cast in the form of an eigenvalue decomposition of \(\mathbf{H}\) where the eigenvectors are the mass weighted displacements of the normal coordinates; i.e. the independent vibrational motions of the collection of atoms. The corresponding eigenvalues are the negative of the squared normal mode frequencies of each mode. \(\mathbf{H}\) has exactly six zero eigenvalues for the translations and rotations of the molecule in three dimensional space.

In NMA of proteins, it is central to obtain a good representation of the protein for a proper analysis of the available modes of motion. One approach would be to obtain the Hessian matrix from the second derivatives of the energy for the conformation of interest following a very stringent minimization of the molecule. The latter is important as NMA is based on the harmonic assumption which is only a valid approximation at the bottom of the potential energy minima. For the complex potentials representing interactions in proteins in the water environment, these second derivatives may be obtained through numerical methods which are extremely costly.

Alternatively, \(\mathbf{H}\) may be approximated by the variance-covariance matrix of the protein, obtained from a molecular dynamics (MD) simulation of suitable length. This is relevant as it may be shown through employing statistical mechanics that the average displacements of nodes from a mean structure for an atom \(i, \Delta \mathbf{R}_{i}\) are related to those of all other atoms through the relationship,

It is important to select the length of the MD simulation for this procedure such that the molecule samples only the state of interest, [1] since when more than a single potential energy well along the conformational space of the molecule is sampled, the harmonic assumption would again fail.

As a third alternative for obtaining the Hessian, one may make use of the elastic network property of a folded protein. Here, the assumption is that, once a protein folds to a functionally relevant conformation, its total energy is represented by a simple harmonic potential such that residues are connected to their nearest neighbors via elastic springs. By further employing the assumption that the spring constants, \(\gamma\), are equivalent in all pairs of interactions, one arrives at the anisotropic network model (ANM). [2]

\(\mathbf{H}\) is thus composed of \(N\)x\(N\) super-elements, i.e.,

where each super-element \(\mathbf{H_{ij}}\) is a 3x3 matrix that holds the anisotropic information regarding the orientation of nodes \(i,j\):

Denoting the separation between nodes \(i\) and \(j\) by \(S_{ij}\), the elements of the off-diagonal super-elements \(\mathbf{H_{ij}}\) are given by:

and those of the diagonal super-elements \(\mathbf{H_{ij}}\) are obtained via,

In this representation of the protein, the structure is coarse-grained at the level of a residue, usually through the coordinates of \(\mathbf{C_\alpha}\) or \(\mathbf{C_\beta}\) atoms obtained from the protein data bank. Moreover, pairs of nodes are assumed to interact if they are within a pre-selected cut-off distance, \(r_{c}\). For large structures such as viruses, further coarse graining may be shown to well describe the most collective modes of motion.

The selection of \(r_{c}\) has been the cause of much research. While it is clear that there is a lower bound where the condition of six zero eigenvalues of \(\mathbf{H}\) should be satisfied, distances in the range 10-25 Å have been employed in the literature. It has been shown by a systematic study of increasing \(r_{c}\) that the collective modes of motion do not change beyond a certain value for proteins; i.e. selection of too large \(r_{c}\) is safer than a too small value. [3] The reason for this has been explained by expressing \(\mathbf{H}\) as the sum of an essential and a trivial part. The essential part of \(\mathbf{H}\) includes all the local information necessary for the correct representation of the modes. On the other hand, due to the symmetries in a protein originating from the orientational order of closely packed spheres, the effects from the long range neighbors cancel out.

References

[1] | C Atilgan, OB Okan, AR Atilgan, “Network-based Models as Tools Hinting at Non-evident Protein Functionality,” Annual Review of Biophysics, 41, 205-225 (2012). |

[2] | AR Atilgan, SR Durell, RL Jernigan, MC Demirel, O Keskin, I Bahar, “Anisotropy of Fluctuation Dynamics of Proteins with an Elastic Network Model,” Biophysical Journal, 80, 505-515. |

[3] | C Atilgan, OB Okan, AR Atilgan, “Orientational Order Governs Collectivity of Folded Proteins,” Proteins: Structure, Function, Bioinformatics, 78, 3363-3375 (2010). |

# Principal Component Analysis (PCA)¶

A molecular dynamics (MD) simulation of a protein provides the positional movements of each atom with respect to a fixed reference frame at a given time. The mean squared positional fluctuations (variances) of each atom are readily calculated once the total simulation and sampling times are set. Sufficiency of both total observation period and the sampling rate are crucial in collecting the data so as to identify biologically relevant motions. Let us monitor the variance of each residue’s \(\mathbf{C_\alpha}\) or \(\mathbf{C_\beta}\) atom during a MD simulation of a protein. Suppose that these variances do not change significantly in time, like a stationary process. This suggests that within the period of observation we have recorded the motion about one (native) conformation. Though constant in time for a given residue, the variances do change from one residue to another. It is important to distinguish the differences between the variances of different parts of the protein and to explain the root cause of these differences; e.g. while loop or disordered regions exhibit high values, relatively rigid parts, such as helices or sheets display lower variances.

PCA [4] operates on the variance-covariance matrix, \(\mathbf{C}\), of the protein, obtained from a MD simulation of any length; thus, the observed process need not be stationary. It is useful in distinguishing the different parts of the energy landscape sampled during the MD simulation. To obtain \(\mathbf{C}\), first the protein coordinates are superimposed on a reference structure, usually the initial coordinates, or the average coordinates. The displacement vector for each residue (described by the \(\mathbf{C_\alpha}\) or \(\mathbf{C_\beta}\) coordinates of the residue \(i\)) at a time point \(t, \Delta \mathbf{R}_{i}(t)\) is obtained. For a set of \(M\) recorded coordinates, these are organized in the trajectory fluctuation matrix of order \(3N\)x\(M\):

The \(3N\)x\(3N\) \(\mathbf{C}\) matrix is then obtained via the operation,

If a single energy well along the potential energy surface of a protein is sampled, then \(\mathbf{C}\) approximates the inverse Hessian, \(\mathbf{H}^{-1}\) , as the harmonic approximation applies in this case (see NMA for details). However, if different parts of the landscape are sampled, the decomposition of \(\mathbf{C}\) will carry information on all the regions entered during the simulation. Thus, the diagonalization,

yields the eigenvectors and the corresponding eigenvalues of the \(\mathbf{C}\) matrix. \(\mathbf{\Lambda}\) is the \(3N\)x\(3N\) diagonal matrix holding the eigenvalues \(\lambda_i\) with six zero values corresponding to the translations and rotations of the molecule. The \(i\mathrm{^{th}}\) row of the \(\mathbf{U}\) matrix holding the eigenvector corresponding to the \(i\mathrm{^{th}}\) eigenvalue. The trajectory \(\Delta \mathbf{R}\) may be projected onto the eigenvectors to obtain the principal components, \(q_i\), which are the rows of the \(3N\)x\(M\) \(\mathbf{Q}\) matrix.

Since a few principal components usually carry the largest amount of information of the trajectory, the different regions of the conformational space will manifest as more than one blob in a plot of \(q_i\) versus \(q_j\) where \(i\) and \(j\) are small. Furthermore, the size of the blobs in the plots will provide information on the width of the potential wells sampled. Finally, the time points when passage between different wells occur may be pinpointed by this method. The different implementations of the construction of the \(\mathbf{C}\) matrix and the various ways of decomposing it have been discussed in detail in the literature, [5] and implemented in MODE-TASK.

References

[4] | A Amadei, ABM Linssen, HJC Berendsen, “Essential Dynamics of Proteins,” Proteins: Structure, Function and Genetics, 17, 412-425 (1993). |

[5] | CC David, DJ Jacobs, “Principal component analysis: a method for determining the essential dynamics of proteins,” Methods in Molecular Biology, 1084, 193-226 (2014). |

# Kernel PCA (kPCA)¶

Standard PCA assumes that input data are linearly related. In cases where variables are not intrinsically linearly related the user has option to perform nonlinear generalization of PCA such as Kernel PCA. It is an extention of normal PCA, where different kernel fucntions (such as linear, RBF, polynomial, and cosine) are used to perform non-linear dimensional reduction.

\(N\) points are linearly non-separable in dimension (\(d\) < \(N\)). But there can be a hyperplane dividing them in a higher dimension given \(N\) points, \((Xi)\) it can be maped to an N-dimensional space with

\(\Phi\) is an arbitrary choosen fucntion.

MODE-TASK includes the tools to perform the Kernel PCA on MD trajectory, and offers the user different choices for the kernel. In Kernel PCA the input trajectory is first raised to a higher dimension by a kernel function and then PCA is performed on the elevated data. One should use Kernel PCA with caution as it is difficult to interpret the results since the input trajectory is mapped to a different feature space than conformational space. Nevertheless, Kernel PCA could be an invaluable tool in studying structural mechanisms behind protein dynamics in cases where conventional PCA is not helpful.

# Incremental PCA¶

A major bottleneck in the speed of PCA calculation is the availability of computer memory during the loading of a MD trajectory. IPCA is a memory efficient variant of PCA, where only most substantial singular vectors are used to project the input data to a lower dimension. The IPCA algorithm uses a batch data loading approach and the incremental storage of various variables, thus achieving higher memory efficiency. MODE-TASK has implemented IPCA on MD trajectory, available through scikit-learn Python library on the original algorithm [6] [7].

References

[6] | (1, 2) Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011;12:2825–30. |

[7] | Ross DA, Lim J, Lin R-S, Yang M-H. Incremental Learning for Robust Visual Tracking. Int. J. Comput. Vis. 2008;77:125–41. |

# Multi-dimensional scaling (MDS)¶

MDS is a technique of dimensionality reduction, where measure of dissimilarity in a dataset is used. It places each input point in \(N\)-dimensional space while trying to preserve the original distance matrix as much as possible. MODE-TASK implements metric and nonmetric types of MDS for MD trajectory by using the scikit-learn library [6]. The Euclidean distance between internal coordinates and pairwise RMSD between the MD frames is used as dissimilarity measures in MODE-TASK.

# t-Distributed Stochastic Neighbor Embedding (t-SNE)¶

t-SNE is another dimensionality reduction method for data of high dimensions [8]. t-SNE has been implemented for protein MD trajectories in MODE-TASK. Like MDS, the Euclidean distance between internal coordinates of a protein structure and pairwise RMSD between a set of atoms are used as measures of dissimilarity.

References

[8] | Van der Maaten L, Hinton G. Visualizing Data using t-SNE. J. Mach. Learn. Res. 2008;9:2579–605. |