Molecular data analysis with NAMS

This is an extension of the first tutorial on NAMS. Here we will not use NAMS for similarity searching but instead use it to visualise and analyse the molecular space.

In this tutorial, we will use the already computed NAMS file produced in the first tutorial, but will require the call_nams() function defined previously and will be ommited in the current presentation.

The data for analysis is the hidrophobicity data at pH=2 and pH=7 from: http://www.sigmaaldrich.com/life-science/metabolomics/learning-center/amino-acid-reference-chart.html.

aa.txt<-read.table("aminoacids.txt", header=T)
head(aa.txt)

##   aa_id aa_code                     smiles HP7 HP2
## 1     1     Ala           O=C(O)[C@@H](N)C  41  47
## 2     2     Cys           SC[C@H](N)C(O)=O  49  52
## 3     3     Asp      OC(C[C@H](N)C(O)=O)=O -55 -18
## 4     4     Glu    O=C(O)[C@@H](N)CCC(O)=O -31   8
## 5     5     Phe N[C@H](C(O)=O)CC1=CC=CC=C1 100  92
## 6     6     Gly                   O=C(O)CN   0   0

aa_codes<-as.character(aa.txt$aa_code)

Thus, picking up the amino acids databases, we are going to examine their chemical space as provided by NAMS similarity analysis. For this we will be using NAMS -mode 2 which compares each molecule in a database to all the others, and does not require an input file. We are going to use ADM =3 as it tends to produce a more interpretable output, although for such a small and well contained example is probably not very relevant,

out<-call_nams("-mode 2 -opts JLS -ADM 3 -alpha 3.0", file_db="aminoacids.nams", file_out="aa_selfs.txt")
aa_selfs<-read.table("aa_selfs.txt", header=T)
head(aa_selfs)

##   molid1 molid2    ss1    ss2  sim12 Sscore
## 1      1      2 14.519 16.979 13.956 0.7956
## 2      1      3 14.519 21.939 14.309 0.6460
## 3      1      4 14.519 24.219 14.239 0.5812
## 4      1      5 14.519 31.689 14.029 0.4360
## 5      1      6 14.519  9.080  8.979 0.6142
## 6      1      7 14.519 31.909 14.029 0.4330

190 similarities were computed. To analyze this data, first let’s examine how is the distribution of the similarity scores, by doing a density plot of the Sscores

plot(density(aa_selfs$Sscore))
rug(aa_selfs$Sscore)

Typically Sscore values below 0.6 imply a very small discernible similarity between compounds while values above 0.75-0.80 are typically relevant.

However most tools that are used for processing this type of data use distances and not similarities. There are several ways of transforming similarities into distances, but a very common approach is to use the following expression: Dist = -log(Sim). From this transformation we can then proceed to make a distance matrix.

The task at hand is very simplified as the amino acids are numbered from 1 to 20, which facilitates matrix filling. in other circunstances, we would require translating molecule ID to row and column number.

NMols<-20
Daa<- -log(aa_selfs$Sscore)
dist_mat<-matrix(0, NMols, NMols)
for(i in 1:nrow(aa_selfs)) dist_mat[aa_selfs$molid1[i], aa_selfs$molid2[i]]<-Daa[i]
dist_mat<-dist_mat+t(dist_mat)
colnames(dist_mat)<-aa_codes
rownames(dist_mat)<-aa_codes
dist_mat[1:6, 1:6]

##           Ala       Cys       Asp       Glu       Phe       Gly
## Ala 0.0000000 0.2286587 0.4369558 0.5426603 0.8301130 0.4874347
## Cys 0.2286587 0.0000000 0.5252627 0.6226887 0.9303897 0.6514296
## Asp 0.4369558 0.5252627 0.0000000 0.1474565 0.8157669 0.9215546
## Glu 0.5426603 0.6226887 0.1474565 0.0000000 0.7789230 1.0269430
## Phe 0.8301130 0.9303897 0.8157669 0.7789230 0.0000000 1.3149044
## Gly 0.4874347 0.6514296 0.9215546 1.0269430 1.3149044 0.0000000

One of the first steps we may do is to make a hierarchical agglomerative clustering to verify how amino acids group together according to their structures

dists<-as.dist(dist_mat)
hc<-hclust(dists, method="ward.D2")
plot(hc, main="Amino acids clustering with NAMS distances")

This plot shows that using only structural distances it is possible to capture some of the most interesting amino acid characteristics. For instance, the aromatic amino acids (Phe, Tyr and Trp) appear in a separate cluster, and glutamic and aspartic acids (Glu and Asp, respectively) appear grouped, despite their backbone structre being more similar to their amine counterparts (Gln and Asn) . Also, Leucine and Isoleucine appear closely together as well as the smallest amino acids (Ser, Ala and Cys).

A different method of analysis based on the distance matrix is to use Principal Coordinates Analysis (PCooA) to get a visual display of how the aminoacids relate to each other structurally

library(MASS)
pcooa<-isoMDS(dists, trace=F)$points
plot(pcooa, type="n", main="PCooA projection of Amino acids")
text(pcooa, aa_codes, cex=0.8)

Running NAMS for chemical property analysis

Correlating chemical similarity and chemical properties

The main question we will try to answer in this section is: Is there any correlation between chemical similarity and chemical properties?

We can try to map the difference of hidrophobicy scores at pH=7 (HP7) for each amino acid pair and its similarity.

Sims<-aa_selfs$Sscore
hp7_1<-aa.txt$HP7[aa_selfs$molid1]
hp7_2<-aa.txt$HP7[aa_selfs$molid2]
hp7_absdif<-abs(hp7_1- hp7_2)
scatter.smooth(Sims, hp7_absdif, lpars=list(col=2),family="symmetric", main="NAMS similary vs hydrophobicity differences")

cor(data.frame(Sims, hp7_absdif))

##                 Sims hp7_absdif
## Sims        1.000000  -0.138259
## hp7_absdif -0.138259   1.000000

It can clearly be seen in the above map that the highest similarities have closer hidrophobicity scores. The plotted spline further shows the decreasing trend.Also verifying the correlation between both variables, the score is negative which is to be expected if similarity can be used for chemical property estimation and analysis

Using spatial correlation with chemical similarity for chemical propeties analysis

In the following examples we will use map the information from the Hidrophobicy potential at pH=7 (HP7) for each amino acid in the 2D projection computed.

First we will separate the data in two classes according to their HP7 score and the we will make a surface plot that will reflect the actual most likely hydrophobic and hydrophylic regions according to their structure. For that we will use the kde2d function from the MASS library.

First we need to define the bandwidths and common ranges for the whole dataset and will use the default bandwidth.nrd function for that purpose, creating one for the X and Y coordinates

X<-pcooa[,1]
Y<-pcooa[,2]
(bws<-c(bandwidth.nrd(X), bandwidth.nrd(Y)))

## [1] 0.6728713 0.4734853

(lims<-c(range(X), range(Y))*1.05)

## [1] -1.1992541  1.2361718 -0.6130914  0.8676031

Now we will isolate the positive and negative datasets according to their poperties

pos_ids<-which(aa.txt$HP7>0)
neg_ids<-which(aa.txt$HP7<=0)

We can produce the kernel density maps for both sets, preserving consistent bandwidths

img_pos<-kde2d(X[pos_ids], Y[pos_ids], h = bws, n = 256, lims=lims)
img_neg<-kde2d(X[neg_ids], Y[neg_ids], h = bws, n = 256, lims=lims)
par(mfrow=c(1,2))
image(img_pos, main="Hydrophobic 2D Kernel plot")
text(pcooa[pos_ids,], aa_codes[pos_ids], cex=0.8)
image(img_neg, main="Hydrophilic 2D Kernel plot")
text(pcooa[neg_ids,], aa_codes[neg_ids], cex=0.8)

Finally we can mix both maps producing a Maximum a Posteriori surface map indicating which are the most likely regions for both situations. This will require to account for the relative frequencies of each class as an estimation for the existing priors

prior_pos<-length(pos_ids)/NMols
prior_neg<-length(neg_ids)/NMols #it could be 1.0 - prior_pos

img_pos$z<-img_pos$z/sum(img_pos$z)*prior_pos
img_neg$z<-img_neg$z/sum(img_neg$z)*prior_neg

img_map<-img_pos
img_map$z<-sign(img_pos$z-img_neg$z)
image(img_map, main="MAP plot of amino acids HP7 (alpha=3.0")
text(pcooa[neg_ids,], aa_codes[neg_ids], cex=0.8, col=1)
text(pcooa[pos_ids,], aa_codes[pos_ids], cex=0.8, col=4)

Notice that this method is able to clearly define two distinct and separable regions for hydrophobic (in Yellow) and hydrophilic (red) aminoacids. Also, no amino acid is apparently misplaced, which would be a definite possibility with this method if the regions were not be so well defined. The only possible mismatch is Serine, but this is actually striking as Serine is the weakest hydrophilic amino acid (score at pH7= -5), so the method was able to identify the threshold results. Check as well Threonine and Histidine (weak hydrophobic)

It is noteworthy to stress out the importance of the alpha parameter in NAMS for getting meaningful results if the nature of the problem is known. For the present example, as we are dealing with hidrophobicity, typically the presence of certain groups (amine, alcohol, etc) is fundamental to explain this property. Therefore, for the above example alpha was defined as 3.0, thus emphasizing the presence or absence of any possible functional group, reducing the importance of plain structure mapping.

To make this point clear we will run again the relevant code portions to create the same final plot, yet using alpha=0.5, a very low value, that will sacrifice chemical function for structure matching. The results although still meaningfullare not as clear as the previous plot

out<-call_nams("-mode 2 -opts JLS -ADM 3 -alpha 0.5", file_db="aminoacids.nams", file_out="aa_selfs2.txt")
aa_selfs<-read.table("aa_selfs2.txt", header=T)
Daa<- -log(aa_selfs$Sscore)
dist_mat<-matrix(0, NMols, NMols)
for(i in 1:nrow(aa_selfs)) dist_mat[aa_selfs$molid1[i], aa_selfs$molid2[i]]<-Daa[i]
dist_mat<-dist_mat+t(dist_mat)
colnames(dist_mat)<-aa.txt$aa_code
rownames(dist_mat)<-aa.txt$aa_code
dists<-as.dist(dist_mat)
pcooa<-isoMDS(dists, trace=F)$points
X<-pcooa[,1]
Y<-pcooa[,2]
bws<-c(bandwidth.nrd(X), bandwidth.nrd(Y))
lims<-c(range(X), range(Y))*1.05

img_pos<-kde2d(X[pos_ids], Y[pos_ids], h = bws, n = 256, lims=lims)
img_neg<-kde2d(X[neg_ids], Y[neg_ids], h = bws, n = 256, lims=lims)

img_pos$z<-img_pos$z/sum(img_pos$z)*prior_pos
img_neg$z<-img_neg$z/sum(img_neg$z)*prior_neg

img_map<-img_pos
img_map$z<-sign(img_pos$z-img_neg$z)
image(img_map, main="MAP plot of amino acids HP7 (alpha=0.5)")
text(pcooa[neg_ids,], aa_codes[neg_ids], cex=0.8, col=1)
text(pcooa[pos_ids,], aa_codes[pos_ids], cex=0.8, col=4)

Now we see that despite the fact that the overal topological structure is maintained between the amino acids, the hydrophobic and hydrophilic regions are not so well defined and do not separate the amino acids so clearly (E.g. Ser and Met appear out of their functional regions).

Molecular data analysis with NAMS

Andre Falcao

July 14, 2016

Running NAMS for chemical property analysis

Correlating chemical similarity and chemical properties

Using spatial correlation with chemical similarity for chemical propeties analysis