PCA Scripts¶

Principal component analysis (PCA) is a useful statistical technique that has found applications in detection of correlated motion in MD data. Protein dynamics is manifested as a change in molecular structure, or conformation over a timescale. PCA extracts most important motions from a protein’s MD trajectory using a covariance/correlation matrix (C-matrix) constructed from atomic coordinates. Different types of coordinate systems (Cartesian or internal coordinates) can be employed to define atomic movement in each time frame of a trajectory. Modes describing the protein motions can be constructed by diagonalizing the C-matrix. It leads to a complete set of orthonormal (orthogonal and normalized) collective modes (eigenvectors) with eigenvalues (variance) that characterize the protein dynamics. The largest eigenvalues represent the most collective spatial motion. When the original mean centered data (MD trajectory) is projected on the eigenvectors, the results are called Principal Components (PC). Diagonalization of the C-matrix can be done by Eigenvalue decomposition (EVD) or Singular value decomposition (SVD), with the latter being computationally efficient.

As stated earlier, different representations of protein conformations can be used. One can choose Cartesian coordinates or internal coordinates such as the pairwise distance between atoms, 1-3 angle, torsional angles (\(\Phi\) or \(\Psi\)). Since decomposition of a C-matrix is memory intensive and very often the program will run out of memory, often a coarse graining is required such as selecting CA atoms. The user can select the subset of atoms from the trajectory for the analysis such as CA, backbone atoms or all protein’s atoms. It is highly recommended that the user should strip the water from the trajectory before hand, as it would result in faster loading and alleviate the memory issues.

PCA uses linear transformation which may not be sufficient in cases where variables are non-linearly related. Thus, the user has the option to perform Nonlinear generalization of PCA such as Kernel PCA (kPCA). Caution should be given while interpreting the kPCA results since it is mapped to a feature space which is inherently different than conformational space. Nevertheless, kPCA is useful in understanding the protein’s functions in terms of its conformational dynamics.

General Usage:

To perform PCA on a protein’s MD trajectory we need a sufficiently sampled MD trajectory and a corresponding topology file. This can be achieved by running the following command.

Command:: pca.py -t <MD trajectory> -p <topology file>
To see the all the available options run the following command:: pca.py -h

Inputs:

Input (*required)	Input type	Flag	Description
Trajectory file *	File	`-t`	MD trajectory input file (.xtc, .mdcrd etc.)
Topology file *	File	`-p`	Topology file (.gro, .pdb etc)
Output directory	String	`-out`	Name of the output directory. Default is out, suffixed by trajectory name
Atom group	String	`-ag`	Group of atoms for PCA. Default is CA atoms. Other options are: all = all atoms, backbone = backbone atoms, CA = C-alpha atoms, protein = protein atoms
Reference structure	File	`-r`	Reference structure for RMSD. Default: First frame of MD trajectory
PCA method	String	`-pt`	PCA method. Default is svd (Single Value Decomposition) PCA. Options are: evd, kpca, svd, ipca. If svd is selected, additional arguments can be passed by flag -st. If KernelPCA is selected kernel type can also be defined by flag -k
Number of components	Int	`-nc`	Number of components to keep in a PCA object. Default: All the components will be kept.
Kernel Type	String	`-kt`	Type of kernel for KernalPCA. Default is linear. Options are: linear, poly, rbf, sigmoid, cosine, precomputed
SVD solver type	String	`-st`	Type of svd_solver for SVD (Single Value Decomposition) PCA. Default is auto. Options are: auto, full, arpack, randomized

Outputs:

Output	Description
PC plots	2D Plot of first 3 PCs. It is grace formatted text file
PC plots (.png)	2D Plot of first 3 PCs. Same as above, but points are color coded according to MD time
Scree plot	Scree plot of contribution of first 100 modes (eigenvectors)
RMSD plot	RMSD of selected atoms over the MD time
RMSD Modes	Plot of contribution of each resdiues towards the first 3 modes (eigenvectors)

Besides the above-mentioned plots, it also prints useful information on the terminal such as, information about the trajectory, Kaiser-Meyer-Olkein (KMO) index of the trajectory, and cosine contents of the first few PCs. KMO value range from 0 to 1, 1 indicating that the MD has been sampled sufficiently. The cosine content of PCA projections can be used as an indicator if a simulation is converged. Squared cosine value should be more than 0.5.

Specific Examples:

PCA on Cartesian coordinates¶

Given a trajectory called trajectory.xtc and a topology file called complex.pdb, the following command is used:

pca.py -t trajectory.xtc -p complex.pdb

This will perform the singular value decomposition (SVD) based PCA on CA atoms by default. To use other methods, see the following examples.

SVD PCA¶

To perform SVD PCA on CA atoms of a MD trajectory

Command:: pca.py -t trajectory.xtc -p complex.pdb -ag CA -pt svd

To perform the SVD PCA on backbone atoms

Command:: pca.py -t trajectory.xtc -p complex.pdb -ag backbone -pt svd

Kernel PCA¶

To perform the Kernel PCA with linear kernel

Command:: pca.py -t trajectory.xtc -p complex.pdb -ag CA -pt kpca -kt linear

To perform the Kernel PCA with rbf kernel

Command:: pca.py -t trajectory.xtc -p complex.pdb -ag CA -pt kpca -kt rbf

Incremental PCA¶

Incremental PCA (IPCA) is a variant of usual PCA, which uses low-rank approximation of the input MD trajectory. It uses the amount of memory to store the input trajectory which is independent of trajectory size. IPCA is very useful in case the size of trajectory is larger than that may be handled by the available computer memory.

Command:: pca.py -t trajectory.xtc -p complex.pdb -ag CA -pt ipca

Eigenvalue decomposition (EVD) PCA¶

To perform the PCA by eigenvalue decomposition

Command:: pca.py -t trajectory.xtc -p complex.pdb -ag CA -pt evd

Detailed usage:

Run the following command to see the detailed usage and other options:: pca.py -h

PCA on internal coordinates¶

Users can also perform the PCA on internal coordinates of a MD trajectory. Options are available for different types of internal coordinates such as: pairwise distance between atoms, 1-3 angle between backbone atoms, psi angle, and phi angle.

General Usage:

Command:: internal_pca.py -t <MD trajectory> -p <topology file>

Inputs:

Input (*required)	Input type	Flag	Description
Trajectory file *	File	`-t`	MD trajectory input file (.xtc, .mdcrd, etc.)
Topology file *	File	`-p`	Topology file (.gro, .pdb, etc)
Output directory	String	`-out`	Name of the output directory. Default is out, suffixed by trajectory name
Atom group	String	`-ag`	Group of atom for PCA. Default is CA atoms. Other options are: all = all atoms, backbone = backbone atoms, CA = C-alpha atoms, protein = protein atoms
Coordinate Type	String	`-ct`	Internal cordinate type. Options are: distance, angles, phi, and psi

Outputs:

Output	Description
PC plots	2D Plot of first 3 PCs. It is a grace formatted text file
PC plots (.png)	2D Plot of first 3 PCs. Same as above, but points are color coded according to MD time
Scree plot	Scree plot of the contribution of the first 100 modes (eigenvectors)

Specific Examples:

PCA on pairwise distance between CA atoms:

To perform the PCA on pairwise distance between CA atoms of an MD trajectory trajectory.xtc and a topology file complex.pdb

Command:: internal_pca.py -t trajectory.xtc -p complex.pdb -ag CA -ct distance

PCA on psi angles:

Command:: internal_pca.py -t trajectory.xtc -p complex.pdb -ct psi

Detailed usage:

Run the following command to see the detailed usage and other options:: internal_pca.py -h

MDS (Multi-dimensional scaling) on MD trajectory¶

MDS is a tool to visualize the similarity or dissimilarity in a dataset. Two types of dissimilarity measures can be used in the case of a MD trajectory. The first is Euclidean distance between internal coordinates of a protein structure, the second is pairwise RMSD between a set of atoms over the frames of a MD trajectory.

General Usage:

command:: mds.py -t <MD trajectory> -p <topology file>

Inputs:

Input (*required)	Input type	Flag	Description
Trajectory file *	File	`-t`	MD trajectory input file (.xtc, .mdcrd, etc.)
Topology file *	File	`-p`	Topology file (.gro, .pdb, etc)
Output directory	String	`-out`	Name of the output directory. Default is out, suffixed by trajectory name
Atom group	String	`-ag`	Group of atoms for MDS. Default is CA atoms. Other options are: all = all atoms, backbone = backbone atoms, CA = C-alpha atoms, protein = protein atoms
MDS type	String	`-mt`	Type of MDS. Options are nm = non-metric, metric = metric
Dissimilarity type	String	`-dt`	Type of dissimilarity matrix to use. euc = Euclidean distance between internal coordinates, rmsd = pairwise RMSD. Default is rmsd
Coordinate type	String	`-ct`	Internal coordinate type. Default is pairwise distance. Only used if Dissimilarity type is Euclidean
Atom indices	String	`-ai`	Group of atoms for pairwise distance. Default is CA atoms. Other options are: all = all atoms,backbone = backbone atoms, alpha = C-alpha atoms,heavy = all non-hydrogen atoms, minimal = CA, CB, C, N, O atoms

Outputs:

Output	Description
PC plots	2D Plot of the first 3 PCs. It is a grace formatted text file
PC plots (.png)	2D Plot of the first 3 PCs. Same as above, but points are color coded according to MD time

Specific Examples:

MDS on pairwise RMSD:

To perform MDS on the pairwise RMSD between CA atoms

Command:: mds.py -t trajectory.xtc -p complex.pdb -dt rmsd -ag CA

MDS on internal coordinates:

To perform MDS on the pairwise distance between CA atoms

Command:: mds.py -t trajectory.xtc -p complex.pdb -dt euc -ag CA

Detailed usage:

Run the following command to see the detailed usage and other options:: mds.py -h

t-SNE on MD trajectory¶

t-distributed Stochastic Neighbor Embedding (t-SNE) is a tool for dimensionality reduction. It is a variant of stochastic neighbor embedding technique. t-SNE uses a measure of dissimilarity, which, in the case of MD trajectory, may be the Euclidean distance between internal coordinates or pairwise RMSD.

General Usage:

Command:: tsne.py -t <MD trajectory> -p <topology file>

Inputs:

Input (*required)	Input type	Flag	Description
Trajectory file *	File	`-t`	MD trajectory input file (.xtc, .mdcrd, etc.)
Topology file *	File	`-p`	Topology file (.gro, .pdb, etc)
Output directory	String	`-out`	Name of the output directory. Default is out, suffixed by trajectory name
Atom group	String	`-ag`	Group of atoms for t-SNE. Default is CA atoms. Other options are: all = all atoms, backbone = backbone atoms, CA = C-alpha atoms, protein = protein atoms
Coordinate type	String	`-ct`	Internal coordinates type. Default is pairwise distance . Only used if Dissimilarity type is Euclidean
Dissimilarity type	String	`-dt`	Type of dissimilarity matrix to use. euc = Euclidean distance between internal coordinates, rmsd = pairwise RMSD. Default is rmsd
Atom indices	String	`-ai`	Group of atoms for pairwise distance. Default is CA atoms. Other options are: all = all atoms, backbone = backbone atoms, alpha = C-alpha atoms, heavy = all non-hydrogen atoms, minimal = CA, CB, C, N, O atoms
PERPLEXITY	Float	`-pr`	[t-SNE parameters] The perplexity is related to the number of nearest neighbors that is used in other manifold learning algorithms Default is 30
LEARNING_RATE	Float	`-lr`	[t-SNE parameters] The learning rate for t-SNE. Default is 200
N_ITER	Int	`-ni`	[t-SNE parameters] Number of iteration to run. Default is 300

Outputs:

Output	Description
PC plots	2D Plot of the first 3 PCs. It is grace formatted text file
PC plots (.png)	2D Plot of the first 3 PCs. Same as above, but point are color coded according to MD time

Specific Example:

t-SNE on CA atoms: To perform t-SNE using the pairwise RMSD of CA atoms as index of dissimilarity.

command:: tsne.py -t trajectory.xtc -p complex.pdb -ag CA -dt rmsd

To perform t-SNE using the Euclidean space between pairwise distance of CA atoms as index of dissimilarity.

command:: tsne.py -t trajectory.xtc -p complex.pdb -ag CA -dt euc -ai alpha

Detailed usage:

Run the following command to see the detailed usage and other options:: tsne.py -h