There are 3 R functions available in the package serrsBayes for model-based, statistical analysis of spectroscopy:
fitSpectraMCMC
fitSpectraSMC
fitVoigtPeaksSMC
This vignette will provide recommendations for choosing which function to use for a given purpose. As an example, let’s say I wanted to fit a Raman spectrum of methanol. With a quick Google search, I find that there should be 4 peaks located at wavenumbers 1033 (C-O stretching), 1106 (CH3 rocking), 1149 (CH3 bending), and 1448 cm\(^{-1}\). However, we know that these peaks can shift slightly due to differences in temperature, pressure, spectrometer calibration, etc. My sample is in the liquid phase, so these peaks will not be exactly Gaussian or Lorentzian, but somewhere in the middle. On the other hand, I don’t want to wait for hours for SMC to finish running.
The first point to emphasise is that fitSpectraMCMC
should not be used for real data. It exists mainly to illustrate the differences between MCMC and SMC algorithms, so is only of interest to computational statisticians.
fitSpectraSMC
is the fastest method, but can also be difficult to tune. You need to choose either Lorentzian or Gaussian broadening functions. You also need to use fixed peak locations and the results are sensitive to the choice of priors$bl.smooth
, however if you manage to get it working for your data then there is no real reason to use fitVoigtPeaksSMC
.
The main differences in fitVoigtPeaksSMC
are as follows:
loc.mu
(mean) and loc.sd
(standard deviation), in wavenumbers \(\left(\mathrm{cm}^{-1}\right)\).priors$bl.smooth = 1
if you are using fitVoigtPeaksSMC
, instead of priors$bl.smooth=10^11
or something like that. The baseline estimation is more robust and easier to tune for “difficult” spectra.beta.mu
and beta.sd
). If you leave it out, then a uniform prior will be used by default. However, you can also set informative priors for the amplitude of each peak in the spectrum if that information is available.The first step is always to read in the spectrum and plot it. Normally I would use the R package hyperSpec for this, but the data that we are going to use is included in the R package serrsBayes:
library(serrsBayes)
data("methanol", package = "serrsBayes")
<- methanol$wavenumbers
wavenumbers <- methanol$spectra
spectra
<- c(1033, 1106, 1149, 1448)
peakLocations <- length(peakLocations)
nPK <- numeric(nPK)
pkIdx for (i in 1:nPK) {
<- which.min(wavenumbers < peakLocations[i])
pkIdx[i]
}<- length(wavenumbers)
nWL plot(wavenumbers, spectra[1,], type='l', col=4,
xlab=expression(paste("Raman shift (cm"^{-1}, ")")), ylab="Intensity (a.u.)", main="Observed Raman spectrum for methanol")
points(peakLocations, spectra[1,pkIdx], cex=2, col=2)
text(peakLocations + c(100,20,40,0), spectra[1,pkIdx] + c(0,700,400,700), labels=peakLocations)
We can use exactly the same priors as given in the package README:
<- list(scale.mu=log(11.6) - (0.4^2)/2, scale.sd=0.4, bl.smooth=10^11, bl.knots=50,
lPriors beta.mu=5000, beta.sd=5000, noise.sd=200, noise.nu=4)
<- system.time(result <- fitSpectraSMC(wavenumbers, spectra, peakLocations, lPriors)) tm
print(paste(tm["elapsed"]/60, "minutes"))
## [1] "3.40311666666667 minutes"
We can see there is a problem with the model fit:
<- sample.int(length(result$weights), 200, prob=result$weights)
samp.idx plot(wavenumbers, spectra[1,], type='l', lwd=3,
xlab=expression(paste("Raman shift (cm"^{-1}, ")")), ylab="Intensity (a.u.)", main="Fitted model with Lorentzian peaks")
for (pt in samp.idx) {
<- result$basis %*% result$alpha[,1,pt]
bl.est lines(wavenumbers, bl.est, col="#C3000009")
lines(wavenumbers, bl.est + result$expFn[pt,], col="#00C30009")
}
Adjusting the priors is not going to fix this. The underlying issue is that the peak location of 1448 cm\(^{-1}\) doesn’t match our observed data. Even a small difference in the location of one of the peaks can adversely effect the entire model fit. This is the kind of issue that the function fitVoigtPeaksSMC
is designed to handle.
The priors here are very similar to what we used for TAMRA in the other vignette. However, we don’t need as many bl.knots
because we have a narrower range of wavenumbers. Instead, we will use bl.knots = 50
, the same as what we used for fitSpectraSMC
above.
<- list(loc.mu=peakLocations, loc.sd=rep(50,nPK), scaG.mu=log(16.47) - (0.34^2)/2,
lPriors2 scaG.sd=0.34, scaL.mu=log(25.27) - (0.4^2)/2, scaL.sd=0.4, noise.nu=5,
noise.sd=50, bl.smooth=1, bl.knots=50)
data("result2", package = "serrsBayes")
if(!exists("result2")) {
<- Sys.time()
tm2 <- fitVoigtPeaksSMC(wavenumbers, spectra, lPriors2, npart=3000, rate=0.499,mcSteps=50)
result2 $time <- Sys.time() - tm2
result2save(result2, file="Figure 2/result2.rda")
}
fitVoigtPeaksSMC
takes more than twice as long to fit the model, due to estimating the additional parameters. However, we obtain a much better fit to the observed spectrum:
print(paste(format(result2$time), "for",length(result2$ess),"SMC iterations."))
## [1] "7.083923 mins for 19 SMC iterations."
<- 1:nrow(result2$location)
samp.idx plot(wavenumbers, spectra[1,], type='l', lwd=3,
xlab=expression(paste("Raman shift (cm"^{-1}, ")")), ylab="Intensity (a.u.)", main="Fitted model with Voigt peaks")
<- resid.mat <- matrix(0,nrow=length(samp.idx), ncol=nWL)
samp.mat <- samp.lambda <- numeric(length=nrow(samp.mat))
samp.sigi for (pt in 1:length(samp.idx)) {
<- samp.idx[pt]
k <- mixedVoigt(result2$location[k,], result2$scale_G[k,],
samp.mat[pt,] $scale_L[k,], result2$beta[k,], wavenumbers)
result2<- result2$sigma[k]
samp.sigi[pt] <- result2$lambda[k]
samp.lambda[pt]
<- spectra[1,] - samp.mat[pt,]
Obsi <- length(Obsi) * samp.lambda[pt] * result2$priors$bl.precision
g0_Cal <- crossprod(result2$priors$bl.basis) + g0_Cal
gi_Cal <- as.vector(solve(gi_Cal, crossprod(result2$priors$bl.basis, Obsi)))
mi_Cal
<- result2$priors$bl.basis %*% mi_Cal # smoothed residuals = estimated basline
bl.est lines(wavenumbers, bl.est, col="#C3000009")
lines(wavenumbers, bl.est + samp.mat[pt,], col="#00C30009")
<- Obsi - bl.est[,1]
resid.mat[pt,] }
The model fit could be further improved. For example, it appears that there is a shoulder peak around 1550 cm\(^{-1}\). However, we can see that the posterior distribution for the fourth peak location is very far away from the theoretical value of 1448:
for (j in 1:nPK) {
hist(result2$location[,j], main=paste("Peak",j),
xlab=expression(paste("Peak location (cm"^{-1}, ")")),
freq=FALSE, xlim=range(result2$location[,j], peakLocations[j]))
lines(density(result2$location[,j]), col=4, lty=2, lwd=3)
abline(v=peakLocations[j], col=2, lty=3, lwd=3)
}
This explains why the results from fitSpectraSMC
were so poor. You really need to have very precise information about all of the peak locations in order for that function to perform well. Otherwise, fitVoigtPeaksSMC
will usually give much better fit to the observed spectrum.
We can use the posterior from fitVoigtPeaksSMC
to obtain priors for running fitSpectraSMC
. This can be useful, for example, for fitting the model to multiple datasets that were obtained under the same experimental conditions.
The first question is whether to use Lorentzian or Gaussian broadening functions. We can transform the posteriors for \(\psi_G\) and \(\psi_L\) to obtain a number \(\eta\) between 0 and 1 for each peak, where 0 implies pure Gaussian and 1 implies Lorentzian. The Wikipedia page provides the equations (Thompson, Cox, and Hastings 1987): \[
f_G = 2\psi_G\sqrt{2\log\{2\}} \\
f_L = 2\psi_L \\
f_V \approx 0.5346 f_L + \sqrt{0.2166f_L^2 + f_G^2} \\
f_{total} \approx \left(f_G^5 + 2.69269f_G^4 f_L + 2.42843f_G^3 f_L^2 + 4.47163f_G^2 f_L^3 + 0.07842f_G f_L^4 + f_L^5 \right)^{1/5} \\
\eta \approx 1.33603 \frac{f_L}{f_{total}} - 0.47719 \left(\frac{f_L}{f_{total}} \right)^2 + 0.11116 \left(\frac{f_L}{f_{total}} \right)^3
\] The above equations are implemented by the function getVoigtParam
:
<- nrow(result2$beta)
nPart $voigt <- result2$FWHM <- matrix(nrow=nPart, ncol=nPK)
result2for (k in 1:nPart) {
$voigt[k,] <- getVoigtParam(result2$scale_G[k,], result2$scale_L[k,])
result2<- 2*result2$scale_G[k,]*sqrt(2*log(2))
f_G <- 2*result2$scale_L[k,]
f_L $FWHM[k,] <- 0.5346*f_L + sqrt(0.2166*f_L^2 + f_G^2)
result2
}for (j in 1:nPK) {
hist(result2$voigt[,j], main=paste("Peak",j),
xlab=expression(eta), freq=FALSE, xlim=c(0,1))
lines(density(result2$voigt[,j]), col=4, lty=2, lwd=3)
}
The two largest peaks definitely look Lorentzian, while there is more uncertainty about the line shape for the smaller peaks. This gives us confidence that fitSpectraSMC
should be able to give us a good fit to the data, as long as we get the peak locations closer to the correct values. We can also use moment matching to set informative priors for the amplitudes:
<- list(scale.mu=log(11.6) - (0.4^2)/2, scale.sd=0.4, bl.smooth=10^11, bl.knots=50, noise.sd=200, noise.nu=4)
lPriors3 $beta.mu <- mean(result2$beta)
lPriors3$beta.sd <- sd(result2$beta)
lPriors3<- c(colMeans(result2$location), 1550)
pkLocNew <- system.time(result3 <- fitSpectraSMC(wavenumbers, spectra, pkLocNew, lPriors3, npart=2000)) tm3
print(paste(tm3["elapsed"]/60, "minutes"))
## [1] "0.86755 minutes"
<- sample.int(length(result3$weights), 200, prob=result3$weights)
samp.idx plot(wavenumbers, spectra[1,], type='l', lwd=3,
xlab=expression(paste("Raman shift (cm"^{-1}, ")")), ylab="Intensity (a.u.)", main="Fitted model with informative priors")
for (pt in samp.idx) {
<- result3$basis %*% result3$alpha[,1,pt]
bl.est lines(wavenumbers, bl.est, col="#C3000009")
lines(wavenumbers, bl.est + result3$expFn[pt,], col="#00C30009")
}rug(pkLocNew, col=4, lwd=2)