Amino acid properties database
Values for 328 properties/descriptors were calculated for the 20 common amino acids with MOE 2010.10 [11] and were stored within an SQLite database. In particular, the database contains 11 categories of descriptors: i) 33 adjacency and distance matrix descriptors [12–16] (e.g., Balaban’s connectivity topological index [14]); ii) 41 atom/bond count descriptors [17, 18] (e.g., the number of double bonds); iii) 18 conformation dependent charge descriptors [19] (e.g., the water accessible surface area of polar atoms); iv) the 16 Kier and Hall connectivity and kappa shape indices [20, 21] (e.g., the Zagreb index); v) 21 MOPAC descriptors [22] (e.g., the ionization potential); vi) 48 partial charge descriptors (e.g., the total positive partial charge); vii) 12 pharmacophore feature descriptors (e.g., the number of hydrophobic atoms); viii) 11 potential energy descriptors (e.g., the solvation energy); ix) 16 physical properties [18, 23–27] (e.g., the molecular weight); x) 18 subdivided surface areas; xi) 94 surface area, volume, and shape descriptors (e.g., globularity). A detailed explanation of each descriptor is provided in the properties codebook which accompanies the tool. By drawing values from this database, Structuprint can visualize the distribution of a property across protein surfaces. Users can extend it by adding measurements for more chemical components or provide their own custom SQLite database in order to incorporate novel descriptors.
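As an illustration of how such a database could be queried or extended, the following minimal sketch uses Python's built-in sqlite3 module. The table name `amino_acid_properties`, its columns, the property name `formal_charge`, and the helper `property_lookup` are hypothetical, as the schema shipped with Structuprint is not described here. The three rows are illustrative (formal charges of the side chains at physiological pH) and are not taken from the MOE-derived database.

```python
import sqlite3

# Hypothetical schema: the layout of the real Structuprint database may differ.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE amino_acid_properties ("
    " residue TEXT NOT NULL,"
    " property TEXT NOT NULL,"
    " value REAL NOT NULL,"
    " PRIMARY KEY (residue, property))"
)

# Illustrative rows only (not values exported from MOE).
conn.executemany(
    "INSERT INTO amino_acid_properties (residue, property, value) VALUES (?, ?, ?)",
    [
        ("ASP", "formal_charge", -1.0),
        ("GLY", "formal_charge", 0.0),
        ("LYS", "formal_charge", 1.0),
    ],
)
conn.commit()


def property_lookup(connection, prop):
    """Return a {residue: value} mapping for one property."""
    cursor = connection.execute(
        "SELECT residue, value FROM amino_acid_properties WHERE property = ?",
        (prop,),
    )
    return dict(cursor.fetchall())


charges = property_lookup(conn, "formal_charge")
```

Under this assumed layout, adding a novel descriptor or a new chemical component amounts to inserting further rows with a new `property` or `residue` value.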
Algorithm
Generation of a mould of the surface of a protein
The main steps of the algorithm implemented by Structuprint are shown in Fig. 1. The tool first produces a mould of the protein structure’s surface in two steps. The structure is initially placed within a 3D grid with cell dimensions of 1 × 1 × 1 Å. Then, one dummy atom is inserted in each empty grid cell that neighbours a single protein atom. This process was previously described by Vlachakis et al. [28] and is extended here, with dummy atoms being assigned the identity of the amino acid to which their neighbouring protein atom belongs. This results in a fairly accurate approximation of the underlying protein surface at the level of residue atoms.
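The two steps can be sketched as follows. This is a simplified illustration under several assumptions rather than Structuprint's implementation: atoms are binned by flooring their coordinates, a cell "neighbours" an atom if it is one of the 26 cells surrounding the atom's cell, and when an empty cell neighbours atoms of several residues the first one encountered is kept. The function name `build_mould` is hypothetical.

```python
import math
from itertools import product


def build_mould(atoms):
    """Sketch of the mould step on a 1 x 1 x 1 Angstrom grid.

    atoms: iterable of (x, y, z, residue) tuples.
    Returns {grid_cell: residue} with one entry per dummy atom.
    """
    # Step 1: place the structure in the grid (one cell index per atom).
    occupied = {}
    for x, y, z, residue in atoms:
        cell = (math.floor(x), math.floor(y), math.floor(z))
        occupied.setdefault(cell, residue)

    # Step 2: insert a dummy atom in every empty cell next to an occupied
    # cell, labelled with the residue of the neighbouring protein atom.
    # Assumption: 26-cell neighbourhood, first neighbour wins on ties.
    offsets = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]
    mould = {}
    for (i, j, k), residue in occupied.items():
        for di, dj, dk in offsets:
            neighbour = (i + di, j + dj, k + dk)
            if neighbour not in occupied:
                mould.setdefault(neighbour, residue)
    return mould


single = build_mould([(0.5, 0.5, 0.5, "GLY")])
pair = build_mould([(0.5, 0.5, 0.5, "GLY"), (1.5, 0.5, 0.5, "ALA")])
```

For a single atom the sketch yields a shell of 26 dummy atoms. For two atoms in adjacent cells the shells merge into 34 dummy atoms, none of which coincides with a protein atom.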
Transformation of the mould into a sphere
The next step involves the conversion of the dummy-atom mould to a sphere. To this end, the algorithm calculates the coordinates of the centre of mass of the mould, $\mathbf{c}$ (i.e., the average position of all atoms), and the maximum distance of any atom $\mathbf{v}_i$ from the centre of mass (radius):
$$ \mathbf{c}=\left({x}_c,\kern0.75em {y}_c,\kern0.75em {z}_c\right)=\left(\frac{{\displaystyle {\sum}_{i=1}^n}{x}_i}{n},\kern0.75em \frac{{\displaystyle {\sum}_{i=1}^n}{y}_i}{n},\kern0.75em \frac{{\displaystyle {\sum}_{i=1}^n}{z}_i}{n}\right) $$
(1)
$$ radius=\underset{1\le i\le n}{ \max}\sqrt{{\left({x}_i-{x}_c\right)}^2+{\left({y}_i-{y}_c\right)}^2+{\left({z}_i-{z}_c\right)}^2\ } $$
(2)
The coordinates of each atom are normalized with respect to the centre of mass:
$$ {\mathbf{v}}_{\boldsymbol{i}}^{\boldsymbol{\hbox{'}}}=\left({x}_i^{\hbox{'}},\kern0.75em {y}_i^{\hbox{'}},\kern0.75em {z}_i^{\hbox{'}}\right) = \left({x}_i-{x}_c,\kern0.75em {y}_i-{y}_c,\kern0.75em {z}_i-{z}_c\right) $$
(3)
Then, to transfer the dummy atoms onto the surface of a sphere, each vector $\mathbf{v}'_i$ is scaled to a length equal to the radius:
$$ {\mathbf{w}}_{\boldsymbol{i}}=\left({x}_i^{\hbox{'}\hbox{'}},\kern0.75em {y}_i^{\hbox{'}\hbox{'}},\kern0.75em {z}_i^{\hbox{'}\hbox{'}}\right) = \frac{radius}{\sqrt{{x_i^{\hbox{'}}}^2+{y_i^{\hbox{'}}}^2+{z_i^{\hbox{'}}}^2}}\cdot {\mathbf{v}}_{\boldsymbol{i}}^{\boldsymbol{\hbox{'}}} $$
(4)
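Equations 1 to 4 translate directly into a few lines of code. The sketch below is an illustration using only the Python standard library. The function name `to_sphere` is hypothetical, and the error raised for an atom lying exactly at the centre of mass is an added assumption, since Eq. 4 is undefined in that case.

```python
import math


def to_sphere(points):
    """Apply Eqs. 1-4 to a list of (x, y, z) dummy-atom coordinates.

    Returns (centre, radius, projected_points).
    """
    n = len(points)
    # Eq. 1: centre of mass as the average position of all atoms.
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    cz = sum(p[2] for p in points) / n

    # Eq. 3: coordinates relative to the centre of mass.
    centred = [(x - cx, y - cy, z - cz) for x, y, z in points]

    # Eq. 2: radius is the largest distance from the centre of mass.
    radius = max(math.sqrt(x * x + y * y + z * z) for x, y, z in centred)

    # Eq. 4: rescale every centred vector to a length equal to the radius.
    projected = []
    for x, y, z in centred:
        length = math.sqrt(x * x + y * y + z * z)
        if length == 0.0:
            raise ValueError("an atom at the centre of mass has no direction")
        scale = radius / length
        projected.append((x * scale, y * scale, z * scale))
    return (cx, cy, cz), radius, projected


centre, radius, on_sphere = to_sphere(
    [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, -2.0, 0.0)]
)
```

In this toy input the centre of mass is the origin and the radius is 2, so the two atoms at distance 1 are pushed outwards to (2, 0, 0) and (-2, 0, 0), while the other two stay in place.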
Projection of the sphere onto a map
The Cartesian coordinates of each $\mathbf{w}_i$ are converted to latitude/longitude values (in units of radians) using the following set of equations:
$$ \begin{array}{l} latitud{e}_i={ \tan}^{-1}\frac{z_i^{\hbox{'}\hbox{'}}}{\sqrt{{x_i^{\hbox{'}\hbox{'}}}^2+{y_i^{\hbox{'}\hbox{'}}}^2}}\hfill \\ {} longitud{e}_i={ \tan}^{-1}\frac{y_i^{\hbox{'}\hbox{'}}}{x_i^{\hbox{'}\hbox{'}}}\hfill \end{array} $$
(5)
For the two-dimensional projection, several techniques were initially tested (e.g., the sinusoidal projection [29] and the Hammer projection [29, 30]), before deciding on the Miller cylindrical projection [29, 31]:
$$ {\mathbf{m}}_{\boldsymbol{i}}=\left( longitud{e}_i,\kern0.75em \frac{5}{4}\cdot \ln \left[ \tan \left(\frac{\pi }{4}+\frac{2}{5}\cdot latitud{e}_i\right)\right]\right) $$
(6)
This projection was selected on the basis of its simplicity and ease of understanding. It is one of the most popular projections in cartography, as it can depict the entirety of the sphere, including the poles. Latitude and longitude lines are parallel and straight. Projection-induced distortion is zero at the equator, increases gradually towards higher latitudes, and becomes maximal at the poles. This leads to significant overestimation of the distances between atoms in the upper and lower parts of the figure (Fig. 1), similar to the areal exaggeration of Greenland and Antarctica. Nevertheless, the Miller cylindrical projection introduces less polar distortion than the Mercator projection, on which it is based.
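The conversion of Eq. 5 and the projection of Eq. 6 can be sketched as below. One assumption is made explicit: the longitude is computed with the two-argument arctangent (`math.atan2`), which resolves the quadrant of (x, y) and therefore covers the full range from -π to π, whereas a literal single-argument arctangent of y/x would only return values between -π/2 and π/2. The function names `to_lat_lon` and `miller` are hypothetical.

```python
import math


def to_lat_lon(x, y, z):
    """Eq. 5: latitude and longitude (radians) of a point on the sphere."""
    latitude = math.atan2(z, math.hypot(x, y))
    # Assumption: quadrant-aware arctangent, so longitude spans [-pi, pi].
    longitude = math.atan2(y, x)
    return latitude, longitude


def miller(latitude, longitude):
    """Eq. 6: Miller cylindrical projection of (latitude, longitude)."""
    map_x = longitude
    map_y = 1.25 * math.log(math.tan(math.pi / 4 + 0.4 * latitude))
    return map_x, map_y


equator_point = miller(*to_lat_lon(1.0, 0.0, 0.0))
north_pole = miller(*to_lat_lon(0.0, 0.0, 1.0))
```

A point on the equator at zero longitude maps to the origin of the map, and the north pole maps to a finite vertical coordinate of roughly 2.30, which is why the whole sphere, poles included, fits on the figure.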
Map smoothing
The previous step resulted in a map of the protein surface with data points coloured by a property of choice. However, this ‘primary’ map is not suitable for detecting areas with an overall concentration of atoms with high or low property values, which is one of the main benefits of this cartographic approach. For instance, a small area with both negatively and positively charged residue atoms would not appear as almost neutrally charged, but as a tiny dipole. To prevent the appearance of small ‘hot spots’ and redistribute the property values among neighbouring data points, the algorithm includes a smoothing step. The map is iteratively divided into grid squares of varying dimensions, from 0.001° × 0.001° to 0.5° × 0.5°, with a step increase of 0.001°. In each iteration of this process, grid cells are assigned the average value of all data points within them. Finally, the value of every data point is defined as the average value of its corresponding grid cell across all iterations. This smoothing method ensures that areas with pronounced accumulation of high or low values are easily discernible from those with a mixed population.
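The smoothing procedure can be sketched as follows. This is a minimal illustration, not the code used by Structuprint: the function name `smooth` is hypothetical, each grid is assumed to be anchored at the map origin, and the toy example uses only two cell sizes to keep the arithmetic easy to follow. The full schedule described above would correspond to `[i / 1000 for i in range(1, 501)]`.

```python
import math
from collections import defaultdict


def smooth(points, cell_sizes):
    """Multi-resolution grid averaging of map values.

    points: list of (x, y, value) map coordinates with a property value.
    cell_sizes: edge lengths of the square grid cells, one per iteration.
    Returns the smoothed value of every point, in input order.
    """
    totals = [0.0] * len(points)
    for size in cell_sizes:
        sums = defaultdict(float)
        counts = defaultdict(int)
        keys = []
        # Assign each point to a grid cell (grid anchored at the origin).
        for x, y, value in points:
            key = (math.floor(x / size), math.floor(y / size))
            keys.append(key)
            sums[key] += value
            counts[key] += 1
        # Each point receives the mean value of its cell for this iteration.
        for index, key in enumerate(keys):
            totals[index] += sums[key] / counts[key]
    # Final value: mean of the per-iteration cell averages.
    return [total / len(cell_sizes) for total in totals]


toy_points = [(0.1, 0.1, 1.0), (0.2, 0.2, -1.0), (5.0, 5.0, 3.0)]
smoothed = smooth(toy_points, cell_sizes=[1.0, 0.05])
```

In the toy example the two oppositely charged points share a cell at the coarse resolution but not at the fine one, so their values are pulled halfway towards neutrality (0.5 and -0.5), whereas the isolated point keeps its value of 3.0.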
User interfaces
The default interface of Structuprint is a cross-platform, command-line interface (CLI). It consists of two executables: structuprint_frame and structuprint. The structuprint_frame executable produces a TIFF figure from a single input PDB file, using the R package ggplot2 [32] for plotting. The structuprint executable processes multiple superimposed PDB files (either serially or in parallel), generating a TIFF figure per input file and a final GIF animation, rendered with the Imager Perl module [33]. Most parameters of the underlying algorithms can be modified by the user, such as the delay between animation frames, the background colour, and the appearance of ID numbers on final figures. A full descriptive list of the available parameters for both executables can be found in Structuprint’s manual, distributed along with the application and also available from its website.
In addition to the CLI, Structuprint comes with a Graphical User Interface (GUI), available by default only on GNU/Linux systems. The GUI is built with the Gtk2 toolkit and offers a user-friendly interface to all the command line arguments and options. As an example of its capabilities, Fig. 2 shows Structuprint’s GUI producing an animation on a multiprocessor machine using 30 cores.
Parallelism
On Unix-like systems (e.g., GNU/Linux, OS X), Structuprint supports task parallelism when generating animations. Using the Parallel::ForkManager Perl module [34], Structuprint can take advantage of multiple CPU cores by assigning each input PDB file to a different processor. The simultaneous rendering of multiple individual frames considerably reduces the total execution time, allowing for visualization of entire molecular dynamics simulations within a reasonable time frame.
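The same one-file-per-worker pattern can be illustrated in Python. This is an analogue of the fork-based approach, not a transcription of Structuprint's Perl code. The functions `render_frame` and `render_all` are hypothetical, and the body of `render_frame` is a placeholder that only derives an output file name instead of running the actual pipeline. The explicit "fork" start method is available on Unix-like systems only.

```python
import multiprocessing as mp


def render_frame(pdb_path):
    """Placeholder for the per-frame work (mould, sphere, map, figure).

    Assumption: here it only derives the name of the output figure.
    """
    return pdb_path.rsplit(".", 1)[0] + ".tiff"


def render_all(pdb_paths, workers):
    """Process every input file in a pool of forked worker processes."""
    context = mp.get_context("fork")  # Unix-like systems only
    with context.Pool(processes=workers) as pool:
        # map() hands each file to a free worker and keeps the input order,
        # so the frames can be assembled into an animation afterwards.
        return pool.map(render_frame, pdb_paths)


frames = render_all(["frame1.pdb", "frame2.pdb", "frame3.pdb"], workers=2)
```

Because the frames are independent of one another, no communication between workers is needed beyond collecting the results.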