Removing non-resonant background from CARS spectra via deep learning

Broadband Coherent Anti-Stokes Raman Scattering (B-CARS) is a powerful label-free nonlinear spectroscopy technique allowing one to measure the full vibrational spectrum of molecules and solids. B-CARS spectra, however, suffer from the presence of a spurious signal, called non-resonant background (NRB), which interferes with the resonant vibrational one, distorting the line shapes and degrading the chemical information. While several numerical techniques are available to remove this unwanted contribution and extract the resonant vibrational signal of interest, they all require the user’s intervention and sensitively depend on the spectral shape of the NRB, which needs to be measured independently. In this work, we present a novel approach to remove NRB from B-CARS spectra based on deep learning. Thanks to the high generalization capability offered by the deep architecture of the designed neural network, trained through realistic simulated spectra, our fully automated model is able to process B-CARS spectra in real time and independently of the detailed shape of the NRB spectrum. This results in fast extraction of vibrational spectra without requiring user intervention or the measurement of reference spectra. We expect that this model will signiﬁcantly simplify and speed-up the analysis of B-CARS spectra in spectroscopy and microscopy


I. INTRODUCTION
In the last two decades, Coherent Raman Scattering (CRS) 1 has emerged as a class of powerful third-order nonlinear spectroscopy techniques capable of measuring the vibrational response of molecules and solids.CRS has found applications in biomedical optics, for label-free imaging of cells and tissues, 2,3 as well as in materials science and nanotechnology. 4CRS uses two synchronized pulses, the pump (at frequency ωp) and the Stokes (at frequency ωS), detuned by ωp − ωS = Ω, to create a coherent superposition of vibrational oscillators at a frequency Ω.7][8] In CARS, a further interaction with the pump pulse generates a signal at the anti-Stokes frequency ωaS = ωp + Ω = 2ωp − ωS.In SRS, the nonlinear signal is read as amplification of the Stokes beam (stimulated Raman gain) or depletion of the Stokes beam (stimulated Raman loss).Both CARS and SRS work very well in the narrowband mode, where the signal at only one vibrational frequency is detected, reaching imaging speeds up to the video rate. 9,10However, the limited amount of spectroscopic information available with narrowband CRS has prompted efforts to extend it to broadband operation. 11roadband CARS (B-CARS) combines a narrow-band (10-20 cm −1 bandwidth) pump pulse with a broadband (≈3000 cm −1 bandwidth) Stokes pulse in order to generate a coherence in an ensemble of vibrational levels simultaneously.The two pulses, spatially and temporally overlapped, are focused onto the sample.The CARS signal is detected with a spectrometer after removal of pump and Stokes light, through steep-edge short-pass filters.Since CARS LETTER scitation.org/journal/app is a homodyne technique, 12 the measured signal is proportional to R (ω))χ (3) where χ (3) R (ω) is the complex vibrational susceptibility that contains the chemical information about the sample.This quantity can be modeled as where, for the ith resonance, the amplitude Ai ∝ σiCi is proportional to the cross section (σi) and to the concentration of scatterers (Ci), Ωi represents the vibrational frequency, and Γi represents the linewidth.On the other hand, χ NR (ω) can generally be assumed as a purely real contribution that mediates the nonlinear interaction of the excitation beams with the sample and the surrounding environment.One should note that, for certain pigments or metabolites of biological tissues, this assumption may not hold, since two-photon absorption may occur, bringing also an imaginary contribution to χ (3) NR .
To identify the vibrational frequencies, quantify their linewidths and amplitudes, and allow a direct comparison with spontaneous Raman spectra, one would like to retrieve the imaginary part of the resonant susceptibility, Im(χ , which consists of a series of Lorentzian peaks at frequencies Ωi.Unfortunately, χ (3) NR (ω) (from here onward assumed for simplicity to be purely real) gives rise to a broadband optical signal, the non-resonant background (NRB), which mixes with the resonant one such that the resulting CARS spectrum shows a rather complicated profile, depending on the relative ratio of the two components.This effect is particularly relevant in the so-called low wavenumber (LWN) or fingerprint region (600-1800 cm −1 ), which is the most important frequency interval for chemical identification.In the LWN, very often, one has χ NR (ω) so that the B-CARS spectrum becomes In this case, the presence of the NRB has two effects: (i) it heavily distorts the vibrational peaks, which now become proportional to Re(χ R (ω)) and thus assume a dispersive line shape, and (ii) it acts as a local oscillator, phase-locked to the resonant signal, which results in its heterodyne amplification through the term 2Re(χ NR (ω) and allows the detection of comparatively weak vibrational peaks. 13everal techniques have been adopted to deal with the NRB.For the narrowband operating mode, interferometric CARS, 14,15 frequency modulation CARS, 16 and polarization CARS 17 allow suppressing the NRB but at the cost of a significant experimental complication and/or signal reduction.For the broadband mode, timeresolved CARS 18 and Fourier transform CARS [19][20][21] allow NRB suppression but again at the cost of increased experimental complexity and reduced signal levels.An alternative approach is to use numerical techniques to extract Im(χ (3) R (ω)) from the measured B-CARS spectra.Among them, the maximum entropy method (MEM 22,23 ) and the Time-Domain Kramers-Kronig transform (TDKK 24 ) tackle the problem from different standpoints but were demonstrated to be functionally equivalent. 25he MEM method requires measuring the CARS spectrum of a reference sample without any vibrational resonances to obtain the normalized spectrum This spectrum is then approximated as where ν is a normalized frequency and the complex coefficients {β, ai, i = 1 . . .M} are retrieved by solving a Toeplitz set of linear equations with coefficients obtained by a discrete Fourier transform of the CARS spectrum at a set of frequencies.The spectral phase can be then retrieved as ψ(ν) = arg(AM(ν)).A polynomial fit of its baseline is then used to estimate the background phase ϕ(ν), calculate the true vibrational phase as θ(ν) = ϕ(ν) + ψ(ν), and finally retrieve Im(χ The main issue of this method is related to the compromise in the choice of M, i.e., the order of the approximation, which has to be balanced between the contrasting requirements of large information content (large M) and low noise (small M).Moreover, the assumption under which the phase correction is justified is rather restrictive, requiring Im(χ )), which is not always the case, especially for broad spectra, with a non-uniform χ The TDKK method is based on the Kramers-Kronig transform, which links the real and imaginary parts of a complex function.The method aims at estimating the spectral phase from log(ICARS(ω)), through a discrete Hilbert transform, taking into account the causality condition.The TDKK approach is more suited to describe the case of the non-uniform spectral density of the NRB.Even TDKK requires an independent measurement of the NRB to be used as a reference, thus some a priori knowledge about the investigated sample, as well as pre-processing steps to be applied to the measured spectra, which may affect the computation speed. 26tatistical Learning and related techniques (machine learning and deep learning) are experiencing an unprecedented growth in terms of application fields and accuracy of the results, thanks to the ever-increasing availability of computing capacities of modern hardware solutions and to their high generalization capabilities. 27Even in the broad field of spectroscopy, the applications of these techniques are numerous: molecular excitation spectroscopy, 28 calibration invariant spectroscopy, 29 vibrational spectroscopy, 30,31 and scanning probe microscopy. 32Deep Learning (DL), in particular, is attracting a significant interest from the scientific community, thanks to its ability to address heavily nonlinear problems by automatically extracting information from big amounts of data, 33 requiring minimal feature extraction procedures, or even analyzing directly the raw data.Deep learning has also been applied to CARS microscopy 34 to perform automated lung cancer diagnosis.
In this paper, we present a deep learning-based approach to extract Im(χ (3) R (ω)) directly from a measured B-CARS spectrum, without the need of external measurements or complex preprocessing.Our model, built as a convolutional neural network (CNN) with seven hidden layers, is tested on a series of solvents and accurately retrieves their Raman spectra.After suitable training, the program is able to autonomously process a B-CARS spectrum in a time as short as 0.1 ms.

II. EXPERIMENTAL SETUP
The scheme of our B-CARS setup is shown in Fig. 1.A fiberbased ytterbium laser system (Monaco, Coherent) provides ≈300 fs pulses at 1030 nm at 2 MHz repetition rate and an average power, for this experiment, of ≈5 W. A fraction of the fundamental beam is used to generate the narrow-band pump beam.In order to achieve enough spectral resolution, the beam is frequency filtered by passing it through an etalon (TecOptics), which narrows its bandwidth down to 10 cm −1 (≈1.1 nm), to match the typical linewidths of vibrational peaks.Another portion of the fundamental beam produces the broadband Stokes pulse via bulk White Light Continuum (WLC) generation 35 in a 6-mm-thick YAG crystal.The red shifted portion of the WLC, selected by a long-pass filter and covering the 1050-1500 nm wavelength range, acts as a broadband Stokes pulse.The pump-Stokes frequency detuning thus spans 450 cm −1 -3050 cm −1 and is sufficient to cover both the LWN (600 cm −1 -1800 cm −1 ) and the High Wavenumber (HWN) region (2500 cm −1 -3000 cm −1 ).Pump and Stokes pulses are matched in space and time by a delay line and a dichroic combiner and focused through a pair of objectives (20×, 0.3 NA, ≈1 cm working distance) on the sample; the power is ≈60 mW for the pump pulse and ≈30 mW for the Stokes pulse.After the interaction, the CARS signal is spectrally selected by a pair of short-pass filters and focused into a spectrometer (OceanInsight, USB2000) with a 50 mm lens.The acquisition time for the B-CARS spectra is 3 ms, limited by the minimum exposure time of the spectrometer.The detected signal is then processed by the neural network presented in Sec.III.

III. DEEP LEARNING CARS
A neural network (NN) is a nonlinear mathematical model able to approximate a map between a set of given inputs x and a set of given outputs y, which constitute a dataset D = {(xi, yi), i = 1 . . .M}.In general, NNs are composed by a group of nodes, or neurons, stacked together into layers.Each neuron takes as input a weighted sum of the output of the neurons of the previous layer and applies to it a nonlinear activation function to compute its output, which is then propagated to the following layers.Layers between the input and the output are called hidden layers, and their number determines the depth of the network.NNs learn from the data they are provided by a training process, in which a given functional cost (loss) of the network output is optimized, over the given dataset D, through the so-called backpropagation 36 algorithm.
Several specialized NN architectures and layers were proposed in the literature 37 to address different problems, ranging from timeseries prediction 38 to image recognition 39 and synthetic data generation. 40For our purposes, a Convolutional Neural Network 41 (CNN) has been developed, as illustrated schematically in Fig. 2, with the task of removing NRB from CARS spectra.
A CNN comprises, in its most basic form, two main types of layers: convolutional layers (CLs) followed by fully connected layers (FCs).CLs are generally used in image analysis to recognize patterns regardless of their positioning in the input (e.g., identify if a certain object is in any position of the input picture).In CLs, each neuron (also called filter in CNNs) is connected only to a limited subset of neighboring neurons from the previous layer and shares its weights with all the other neurons of the layer.The main advantage of such a kind of layer is the significant reduction in the number of network parameters, which in turn simplifies the back propagation.Another advantage, derived from the shared weights among filters, is the ability to extract information from the input in a spatially invariant way and is of particular interest for the problem presented in this paper as Raman peaks can appear in any location of the spectrum.CNNs were observed to be able to automatically generate different representations of the input data, 42 referred to as feature maps in the literature, whose complexity increases with the number of layers and the depth of the network.On the other hand, FCs have no restrictions on the connections to the previous layer and their corresponding weights; they are utilized to generalize the information discovered previously by CLs and to correlate the feature maps derived from the input.The importance of FCs is related to the corner stone of deep-learning modeling: the universal approximation theorem. 43ccording to such a theorem, a NN with a single hidden FC with an infinite number of neurons and a non-polynomial activation function is a universal approximator.To mimic such an infinitely large hidden layer, more finite layers can be stacked sequentially into deep architectures.The complexity of a model can be tuned by acting on its hyperparameters, 44 e.g., by choosing the amount of nodes and layers, to achieve a high prediction accuracy while avoiding overfitting, which is the tendency of the model to specialize too much its weights to the training dataset only and losing generalization capability.
We developed a DL model called SpecNet to extract vibrational information from B-CARS spectra.Its architecture is inspired from the most classical CNN architectures, such as LeNets, 45 and exploits both the richness of representations obtained by the CLs and the correlating capabilities of FCs.The related code is entirely available online. 46The training of the model (described in Appendix A) is performed exploiting a large dataset of simulated spectra, based on Eqs. ( 1) and (2).Each element of the training dataset is built by randomly sampling the number of Lorentzian components N (up to 15) and, for each of them, the corresponding amplitude, resonance frequency, and linewidth.The NRB is simulated by the combination of two sigmoid functions (see Appendix A) whose parameters are also randomly sampled.This allows us to produce a model with high generalization capability, by learning from a dataset that encompasses a rather large number of experimental scenarios, especially for what concerns the NRB spectral contribution.In fact, although it is often modeled as frequency independent, the NRB displays a far from trivial frequency dependence χ  two simulated spectra together with the corresponding imaginary components and model predictions.Note how the frequencydependent background contribution produces different levels of spectral distortion, depending on the local ratio χ

NR (ω).
Note also how the NRB contribution is different between the two examples, in terms of spectral shape.The capability of a model to generalize with respect to different background conditions is crucial for B-CARS experiments performed with WLC, generated either in bulk or in a fiber, [47][48][49] since the optimal condition for the generation of the broadband pulse and its detailed spectral content may vary with time according to the environmental conditions.Finally, note that the normalization of the spectrum does not fix its maximum value to be 1.This is crucial in order to be able to use the model to process a batch of spectra, as, for example, in a B-CARS image, whose intensities have to be compared, by normalizing all the curves to the global maximum of the batch such that all the other amplitudes are in the (0, 1) range.
Thanks to the capability of convolutional layers to handle spatial invariance, i.e., to recognize similar structures in different locations of the input, although the model was trained with a maximum number of features N = 15, it performs well even if the number of features is larger.In Fig. 4, a simulated B-CARS spectrum with 30 vibrational features is processed with our CNN.The predicted Im(χ shows a very good agreement with the true one.The final architecture was obtained after a model selection procedure based on a 10-fold cross-validation on a training dataset of 30 000 simulated spectra of 640 points each.SpecNet consists of five 1-dimensional CLs with 128, 64, 16, 16, and 16 filters of dimensions 32, 16, 8, 8, and 8, respectively, followed by three FCs of 32, 16, and 640 neurons (as the output is expected to have the same dimensions as the input).All layers had a ReLU 44 (rectified Linear Unit) activation function.The resulting total number of trainable parameters was approximately six × 10 6 .The loss function chosen was the mean squared error (MSE) between the true target vector y and the predicted one ŷ.To avoid overfitting and reduce the sensitivity of the model to noise, we utilized L2 weight regularization 44

LETTER scitation.org/journal/app
on the weights of the first fully connected layer with a weight of 5 ⋅ 10 −4 .The back propagation was performed using Adam 44 with a batch size of 256 examples.The entire training procedure required 10 epochs, each taking about 5 s [running on an RTX 2060 graphics processing unit (GPU)].The computing time required to process a spectrum, averaged over 100 000 examples, is about 0.1 ms, which is much shorter than the current state of the art for the acquisition time of a B-CARS spectrum, around 3.5 ms for a biological tissue sample. 26The NN was implemented by Keras, 50 and the reader is referred to the code available online, 46 together with the trained model, for additional details on the network implementation.

IV. EXPERIMENTAL RESULTS
In this section, we demonstrate the ability of our model to extract the vibrational spectrum, Im(χ R (ω)), from measured B-CARS spectra.We used pure solvents, held in a 1 mm-thick quartz cuvette, as test samples.Figure 5 reports the B-CARS measurement on acetone.From the spectrum, it is clear that the vibrational features in the LWN region are heavily distorted by the NRB contribution, while in the HWN region, around 2900 cm −1 , its contribution is reduced.These vibrations, in fact, are excited by the 1500 nm component of the Stokes beam, which lies in the tail of the WLC spectral lobe and therefore has a lower spectral density.The red curve in Fig. 5 corresponds to Im(χ (3) R (ω)), predicted by the Spec-Net.The model is able to accurately retrieve all the vibrational features present in the spectrum.Notably, some of the peaks in the fingerprint region are very small, even if their presence is clear by looking at the CARS spectrum.This effect is due to the interference term of Eq. ( 1) that results in a heterodyne amplification to the signal.All the retrieved peaks are in agreement with spontaneous Raman spectra in the literature. 51,52The correspondence for the most clear peaks is reported in Table I.Due to the limited spectral resolution, not all the features are clearly resolved, especially in congested regions (e.g., around 2900 cm −1 ).Nevertheless, it is possible to see the shoulder-like shapes that correspond to those vibrational modes.Some spurious contributions are retrieved in the so-called "silent-region" of the vibrational spectrum, between 2000 cm −1 and R (ω)) (red curves).
of the cuvette, excited together with the sample due to the long working distance (≈1 cm) of the employed objectives, resulting in a long Rayleigh range.This contribution is also present in the measurements performed on other solvents (shown in Fig. 6) with different amplitudes due to variations in the alignment of the B-CARS signal in the spectrometer.
To validate the capability of SpecNet to handle different NRB spectral shapes, the experimental conditions under which each solvent was measured were slightly modified (e.g., by longitudinal translation of the YAG crystal used for WLC generation).The results show that the model is able to correctly retrieve Im(χ independently of the details of the NRB.

V. CONCLUSIONS
We presented a deep learning model called SpecNet that enables one to remove the non-resonant background contribution from broadband CARS spectra.The model is built as a convolutional neural network with seven hidden layers.The training was performed on a simulated dataset that allows a high generalization capability to different spectral shapes of the non-resonant background.We presented an experimental setup for broadband CARS, covering both the fingerprint region and the high wavenumber region.The performances of SpecNet were assessed on real measured data, which the model was able to correctly process retrieving all the relevant vibrational peaks of different solvent specimens.Once trained, the model retrieves Im(χ R (ω)) from the CARS spectrum in a time of 0.1 ms, which is faster than the time required to record the spectrum, and does not need the independent measurement of a reference sample or any manual intervention by the operator.We believe that our approach will significantly speed-up B-CARS imaging, allowing for an on-line retrieval of the vibrational features.This could be implemented, for example, in parallel during the integration time of the next pixel so that no time is lost for the spectral processing at all, thus considerably reducing the processing time 26 of CARS hyperspectral images.

LETTER scitation.org/journal/app
The corresponding imaginary part, which represents the quantity to be output by the model, is given by y ≙ Im(r χR).(A3)

APPENDIX B: SNR ANALYSIS
The SpecNet model is able to retrieve small vibrational features, as shown in Figs. 5 and 6.There are two factors that limit the correct retrieval, namely, the amount of NRB and the random noise that affects the measurements.If χ factors on the retrieval capability of the deep learning model, we performed a study on a simulated spectrum with a single feature as a function of the ratio R ≙ χ NR , computed in resonance (ω = Ω), and the rms noise added to the spectrum.The goodness of the predicted outcome is evaluated as the mean square error (MSE).In Fig. 7 are reported the results.The Gaussian noise varies along the columns, while R varies along the rows.For each configuration, the corresponding MSE between the true and retrieved spectra is reported.This analysis shows that even for very noisy spectral traces, the retrieval works fine, provided that the vibrational signal is not completely overwhelmed by NRB.

FIG. 2 .
FIG. 2.Concept of the neural network used to remove NRB from a B-CARS spectrum (blue curve).CL: convolutional layers, with different kernel size and number of filters; FC: fully connected layers.The output is a clean spectrum (red curve).

( 3 )
NR (ω), which is determined by the specific characteristics of each experimental setup.Such a spectrum, to which also Gaussian noise is added to simulate a realistic B-CARS spectrum, is the input vector x for the neural network [see Eq. (A2)].The target vector, y, which the model aims to reproduce, is y ≡ Im(χ (3) R (ω)).Model's prediction is named as ŷ.

FIG. 5 .
FIG. 5. Measured B-CARS (blue) and retrieved (red) vibrational spectrum of acetone.Gray vertical lines represent the Raman resonances for acetone reported in Ref. 51.

RFIG. 7 .
FIG. 7.Analysis of the retrieval.In each box are reported the simulated spectrum (blue curve), the true corresponding spontaneous Raman spectrum (green curve), and the one retrieved by the model (red curve).The Mean Square Error (MSE) is reported in each panel to evaluate the performances of the model for each noise configuration.

TABLE I .
53,54tude, resonance, and linewidths of the retrieved vibrational features of acetone., due to some oscillatory features in the WLC spectrum that are translated to the NRB shape.The small peaks expected at 393 cm −1 , 493 cm −1 , and 530 cm −1 are overwhelmed by a strong contribution around 467 cm −1 , which is attributed to the quartz53,54