### Experimental setup

Figure 7 illustrates our experimental setup for PL spectroscopy measurements. The blue path highlights the incident beam which excited the sample and induced photoluminescence. The central wavelength of our laser (SF-AW210 with TTL driver, InsaneWare) depends on the laser power and varied between 402 nm and 404 nm. To narrow down the excitation bandwidth, the generated light passed through an excitation filter with a central wavelength and a bandwidth of 405 nm and 10 nm, respectively. A dichroic mirror directed the light to lens 1 which focused incident light on the sample’s surface. The path taken by the emitted photoluminescence light is highlighted in red. Starting from the sample’s surface, this light was collected and collimated by lens 1 and passed through the dichroic mirror. To ensure that the excitation light was completely removed from the emission path, we used a long-pass filter with a cut-on wavelength of 420 nm. Finally, lens 2 focused the light onto an optical fiber, which directed the light to our spectrometer (LR2, Lasertack GmbH).

Both the laser and the spectrometer were controlled with a microcontroller (Mega 2560, Arduino) which, in turn, was connected to a computer. This arrangement made it possible to control the laser power, exposure time, and the time between sample excitation and signal acquisition. The latter was set to 500 ms.

### Samples and measurement parameters

Samples from the environment show a large diversity since interactions with the environment can alternate the chemical composition. Therefore, spectral libraries can always be considered as unbalanced and incomplete as it is impossible to reflect the sample variety in a single data set. To account for these conditions in our study, we generated our spectral data set from 46 samples, which consisted of non-plastic materials from the marine environment and plastics from different manufacturers and retail products. A summary of the data set is presented in Table 3. For each sample, we adjusted the laser power (mathrm {P_{laser}}) and the exposure time (mathrm {t_{ex}}) to acquire a signal with a low background noise. A list of these measurement parameters is given in Table 4. To introduce additional inhomogeneities in the spectral library, we included an additional measurement for eight samples, where we readjusted the alignment of the optical components. For these samples, the two sets of spectra represent variations when the components are aligned or not. All spectra were measured over the full range of the spectrometer, i.e. between 200 nm and 1000 nm. For each sample and setup, we took 9 to 20 measurements to capture spectral variations due to sample inhomogeneity. In total, all samples were measured 29 times, with the exception of four non-plastic samples, which were measured 19 times. We also took measurements of the background noise, which was required in our ML model building process.

### Dimensional reduction and SDCM

Dimensional reduction (DR) aims to project high-dimensional data, e.g. spectra measured over a large number of wavelength bins, onto a lower-dimensional space. In this work, we used both a conventional method called Principal Component Analysis (PCA) and a novel method called Signal Dissection by Correlation Maximization (SDCM) to achieve a DR in our data.

SDCM is an unsupervised algorithm for detection of superposed correlations in high-dimensional data sets38. Conceptually, it can be thought of as an extension of PCA for non-orthogonal axes of correlation, where instead of projecting out detected dimensions, the discovered axes of correlation are iteratively subtracted (dissected) from the data. Initially developed for the application in bioinformatics for the clustering of gene expression data, it can be generically applied on any high-dimensional data containing (overlapping) subspaces of correlated measurements.

We denote by ({mathbb {M}}^{N_f,N_m}) the set of real valued (N_f times N_m) matrices, where (N_f) is the number of features in the data and (N_m) the number of measurements. The (N_f) row vectors and the (N_m) column vectors belong to different vector spaces referred to as feature space and measurement space, respectively.

The main assumption of SDCM is that the input data, ({mathcal {D}} in {mathbb {M}}^{N_f, N_m}), is a superposition ({mathcal {D}} = sum _{k=1}^n E_k + eta) of submatrices (E_k in {mathbb {M}}^{N_f, N_n}) (also called signatures) and residual noise (eta). We interpret (E_k) as a physically meaningful hypothesis in the data, e.g. a common physical or chemical property, due to which some samples and features are correlated. As superposing is a non-bijective operation, we need further conditions to dissect ({mathcal {D}}) into separate (E_k). We assume that each (E_k) is bimonotonic, i.e. that there exists an ordering (I_f) of the (N_f) indices and an ordering (I_m) of the (N_m) indices such that the reordered matrix (tilde{E}_k = E_k(I_f, I_m)) is monotonic along all rows and columns. Thus, after reordering, the correlations follow monotonic curves in both feature- and measurement space. While this bimonotonic requirement restricts the applicability of the algorithm, it allows an unambiguous dissection of ({mathcal {D}}) into the (E_k) components. In contrast to PCA, it also allows detection of non-linear (bi)monotonic correlations, whose axes are non-orthogonal.

SDCM dissects the data in four steps:

1. 1.

Detection of initial representatives for an axis of correlation. in both feature and measurement space.

2. 2.

Calculation of the signature axes by maximizing the correlation.

3. 3.

Estimation of the bimonotonic, possibly non-linear, correlation curves (eigensignal) in both feature and measurement space. For this purpose, a non-parametric regression is used.

4. 4.

Subtraction of the data points belonging to the eigensignal from the data set.

These four steps are performed iteratively until no more representatives of axes can be found. SDCM treats rows and columns completely symmetrically. Each feature and sample is given a strength value s and a weight value w for every signature. The strength value (in units of the input data) quantifies the position along the eigensignal. The weight (win [-1,1]) quantifies how strongly the feature or the sample participates in the signature, i.e. how close to the eigensignal it is. Typically, the number of signatures detected will be orders of magnitudes smaller than the number of input features, and in this way give rise to an effective DR of the data.

### ML model generation

To generate our ML models for PL-based sample identifications, we chose a combination of supervised and unsupervised learning methods. In the following sections, we describe all steps used to generate these models.

#### Data format

We saved the information of a spectrum in two different files: one file that contains the absolute intensity as a function of wavelength; and one that contains details about the sample and the measurement. The latter provides labels for all spectra, which are central for the evaluation of the classifier’s performance. We use the following categories:

• Type: material type of the sample.

• Origin: name of the manufacturer or location. All retail samples have the same label.

• Color: color of the sample.

• is plastic: specifies if the sample material is a plastic or not.

• Sample ID: unique ID identifying the sample from which multiple replicate measurements have been taken.

All categories are discrete and finite valued. In the following, i enumerates the set of features (spectral bins) (f_i in mathcal {F}), (N_f := |mathcal {F}|) and j enumerates the set of measurements (m^j in mathcal {M}), (N_m := |mathcal {M}|).

#### Prediction categories

We aggregated the 19 distinct material types from Table 3 by combining all non-plastics into the type nonplastic and LDPE, HDPE and PE into the type PE.

#### Preparing the spectral data

In the following, we describe the data preparations applied to the spectral data before it is passed to the classification pipeline. The data preparation pipeline is explained in Fig. 8. The references (P1) to (P5) refer to the respective nodes in the flowchart.

To treat all spectra equally in the ML process, we needed to preprocess our data first (P1). We started by interpolating the spectral data and the corresponding background measurement onto a common spectral axis. The number of spectral bins was kept equal to the mean of the number of bins in the overall set. Then, we subtracted the background measurement from the sample spectrum. Once all spectra were processed in this way, we concatenated the data into a single matrix.

Since we did not expect any signal below the laser peak, we estimated the offset from the baseline (O^j) for for the j-th measurement by calculating the median intensity in the range 294 nm 400 nm. Similarly, we estimated the noise level (eta ^j) by calculating the standard deviation in the same range. As we regarded any offset of the spectra as systematic, we subtracted it from the data.

To remove overexposed or noisy spectra, we applied a process that automatically filters out all data, which do not satisfy our conditions. We singled out measurements with experimental overexposure by determining for each spectrum the maximum (M^j) of the smoothed spectrum of (m^j). For the smoothing, we used a running median with a window size of 20 nm. The exposure level was then calculated as (E^j = frac{O^j}{M^j}). We then discarded overexposed measurements with (E^j < 0.5). To detect noisy spectra, we calculated the signal-to-noise-ratio, (textsf{SNR}^j), with the expression

begin{aligned} textsf{SNR}^j = frac{P^j}{eta ^j}text {.} end{aligned}

Here (P^j) is the power of the spectrum given by

begin{aligned} P^j = sqrt{frac{1}{N_f}sum _{j=1ldots N_f} bigg ({s}^i_jbigg )^2}text {,} end{aligned}

and ({s}^i_j) is the i-th spectral bin of (m^j). We considered a measurement to be noisy if the signal-to-noise ratio is less than 2. Such measurements were then discarded.

To generate the model, we only considered the spectral information in the range 410 nm 680 nm, which contains most information about the sample. Each spectrum was then normalized such that the integral over its absolute values is one. This is particularly important for SDCM to ensure that the regression steps converge within a reasonable time.

#### Cross-validation splits

In our classification model, we split the data at the (unsupervised) DR stage and at the (supervised) classification stage.

In a real world application, the trained classifier pipeline is applied onto the novel data, which was not part of the DR or learning process. To properly assess our model’s performance, the data needs to be split into batches on which the model is trained, and batches on which its performance is evaluated. As SDCM is computationally expensive, we applied a two-step process in which the data was first split several times into multiple dimensional reduction batches (DRB) and validation batches with the DR method being applied to DRB. Each DRB batch was again split into multiple training and testing batches. The model was then trained on each training batch, and its performance was evaluated on the corresponding testing and validation batches. Figure 9 illustrates the conceptual differences in the different splits. This has the additional benefit of providing a comparison of the classifier’s performance on measurements, which have been part of the DR (testing) and novel measurements (validation). We note that there is no meaningful difference between testing and training if no DR method is applied.

The data was cross-validated 25 times with an 80%-20% split into a DRB and a validation set (P2). We used type and sample ID as stratification variables, i.e. we aimed to keep the relative proportions of each type and sample ID equal in DRB and validation. For stratification, we used the cvpartition methods of the MATLAB Statistics and Machine Learning toolbox (The MathWorks, Inc., Natick, Massachusetts, United States).

We calculated the median of each spectral bin in DRB and subtracted it from each measurement in both DRB and validation set (P3a and P3b). This centers the data at the zero level of DRB in each spectral bin. We additionally performed classifications of spectra and PCA without median subtraction. An evaluation showed that this process does not significantly influence the performance of the classifier. The final dimensions (N_f times N_m) of the DRB and the validation set for each cross-validation were (1394 times 1036) and (1394 times 258), respectively.

#### Dimensional reduction methods

In our study, we used SDCM and PCA as DR methods and compared them to the baseline when no DR was performed. The three input types are referred to as SDCM and PCA and no DR. SDCM and PCA reduce the data into signatures and principal components (PCs). From the processed data, we could derive strengths and PC coefficients for input into the classification pipeline.

We applied the DR methods on the DRB data (P4). For no DR, the data was just passed through. SDCM terminates with a median of 130 identified signatures over all cross-validation splits. PCA does not terminate on its own but rather produces a number of PCs equal to the number of measurements. To achieve an effective DR, we chose the first 130 PCs for further analysis.

Once SDCM finds a set of signatures in DRB, the strengths and weights relative to these signatures for validation needed to be determined. This is a non-trivial task, since the signature axes can be non-orthogonal and the method is dissecting and not projecting. The standard procedure is to repeat the dissection on the new data, while fixing the signature axes to the previously detected values. This, however, might still regress to a different eigensignal, which skews the prediction results. To circumvent this, we performed a weighted projection of the data onto the DRB signature axes (P5b), where the projection weights were the signature weights for each spectral bin calculated on DRB. This removes some of the precision obtained by SDCM, as spectral features explained by a single signature can still produce significant projection values in other signatures, if the axes are not sufficiently orthogonal. However, it ensures that all signature strengths are obtained relative to the same axes. For consistency, DRB is also projected onto the signature axes (P5a).

#### Sample classification

Our classification pipeline consists of optimization of each classifier, followed by a cross-validated training and scoring. This pipeline is illustrated in Fig. 10. The individual steps are numerated from (C1) to (C4). To set up the classification pipeline, we used the python module scikit-learn51.

The DRB set and the validation set were fed into the classification pipeline. For each classifier in Table 1, the DRB set was used to optimize the classification parameters via a parameter grid search (C1). Here, multiple additional cross-validations were performed, which are not displayed in Fig. 10.

We performed a 144-fold 80%-20% cross-validation split on the DRB set to generate the training and testing sets (C2). The classifier was trained on training with the optimized parameters (C3) and the performance was evaluated on training, testing and validation (C4).

To analyze the model performance, we calculated the following four classification metrics:

• (text {accuracy} = frac{t_p + t_n}{t_p + t_n + f_p + f_n})

• (text {precision} = frac{t_p}{t_p + f_p})

• (text {recall} = frac{t_p}{t_p + f_n})

• (f_1 = 2 cdot frac{text {Precision} cdot text {Recall}}{text {Precision} + text {Recall}})

Here (t_p), (t_n), (f_p), (f_n) are the number of true positives, true negatives, false positives and false negatives in the classification. As precision and recall are binary metrics, they were calculated for each material type individually and then averaged.

The classification metrics and errors were calculated over all (25times 144) evaluations. For each classification, we generated a confusion matrix normalized along the rows based on the predictions of the classifier.

### Detecting identifying characteristics in PL spectra

We are interested in finding spectral fingerprints of certain sample properties in the data. The properties are supplied as labels for each measurement in the metadata.

We attempt to link individual SDCM signatures or PCs to specific sample properties. As both DR methods are unsupervised, no validation set is necessary, nor do we need to re-project the data onto the discovered cluster axes. The data preparation workflow hence only consists of the steps (P1), (P2), (P3a) and (P4) in Fig. 8. PCA and SDCM are applied only once to the data. In contrast to the previous section, we do not aggregate all non-plastic material types into a single label.

In the following, we refer to both SDCM signatures and PCA PCs as clusters. To be able to interpret the discovered clusters, we imposed the restriction that for each measurement all data labels must be provided. The dimensions of the data considered are (N_f times N_m = 1394 times 1243).

#### Cluster weights

To match clusters to properties, we needed to determine if a measurement is part of cluster. For each cluster k and measurement (m^j), SDCM provides weights (w^{k,j} in big [-1,1big ]), which can be used to quantify how strongly a measurement is associated with a cluster k. We consider a measurement to be part of a cluster if (|w^{k,j}|approx 1), whereas if (|w^{k,j}|approx 0) there is no link present. Each cluster has an axis, along which the measurements are clustered, whose median (after median subtraction) is located at 0. This separates the axis into one part with negative weights and one with positive weights.

Since the implementation of PCA does not provide a comparable metric, we needed to define such a quantity for PCs. Let (C^{k, j}_{text {PCA}}) be the coefficient of the j-th measurement in the k-th PC. We defined the PC weight as:

begin{aligned} w^{k,j}_{text {PCA}} = frac{C^{k, j}_{text {PCA}}}{max _{jin mathcal {M}} |C^{k, j}_{text {PCA}}|} end{aligned}

where (mathcal {M}) is the set of all measurements. The interpretations of positive and negative weights are identical to the one previously described for SDCM.

We increased the number of clusters to test for associations to sample properties by dividing each cluster k into three subclusters: one which includes all measurements with (w>0), one with measurements that fulfill (w<0) and one that is identical to k. For SDCM, the weights are closely distributed around either (pm 1) or 0. This motivates determining cluster membership of a sample by a threshold (tau in big [0,1big ]). We say

begin{aligned} k^- ,text {contains}, m^j&iff w^{k,j} < 0 quad text {and} quad |w^{k,j}| ge tau \ k^+, text {contains}, m^j&iff w^{k,j} > 0 quad text {and} quad |w^{k,j}| ge tau \ k, text {contains}, m^j&iff |w^{k,j}| ge tau end{aligned}

For the remaining part of this chapter, we refer to all subclusters as (k^*).

The optimal (tau) for both PCA and SDCM can be found empirically by testing the association between subclusters and labels for multiple values. SDCM is fairly robust under different choices of the threshold. Here, (tau) can be reliably set to 1. PCA, in contrast, is more sensitive to this choice and performs best when (tau) is between 0.05 and 0.4 (see SI for a discussion). The best value for (tau) can be determined by optimizing the number of matches at (f_1 ge 90) (see below for an explanation for the calculation of (f_1)).

#### Quantifying the association between a subcluster and a set of labels

Let l be a label and (k^*) a subcluster. l is associated to (k^*) if it is likely for a measurement belonging to (k^*) to have carry the label l and vice versa. We can describe this relationship in a contingency table T (see Table 5). A strong association should lead to a large number of true positives/negatives relative to the false positives/negatives. Mathematically, we are interested in the recall (fraction of measurements carrying l that belong to (k^*)) and precision (fraction of measurements belonging to (k^*) that carry l) of the contingency table. The (f_1) score is the harmonic mean of recall and precision and therefore a suitable score to summarize both values. We always have (0 le f_1 le 100) and (f_1 = 100) for a perfect association with no false positives/negatives. We interpret (f_1) as a measure of how well L matches to (k^*).

#### Associating one cluster with one label

We searched for subcluster–label associations by exhaustively calculating (f_1) for every subcluster and every available label l. To do so, we first constructed the set of all theoretically possible labels ({mathcal {L}}), which is given by all Cartesian products of all category sets. As the number of all labels generated in this way is much larger than the number of labels that can be realistically recorded experimentally, real labels are often related. For example, if all samples of type “PVC” have the color “red” and all “red” samples are of type “PVC”, then the labels “PVC”, “red” and “PVC, red” are equivalent descriptions of the underlying set of measurements.

We say that two labels, (l_1, l_2 in {mathcal {L}}), are equivalent, (l_1 sim l_2), if they describe the same set of measurements. We define ({mathcal {L}}/sim) as the set of equivalence classes (ECs) induced by this equivalence relation. As every label of an EC belongs to the same set of measurements, it is sufficient to calculate the (f_1) score for just one representative of each class.

For every subcluster and every (text {EC} in {mathcal {L}}/sim), the (f_1) score was calculated via the contingency table presented in Table 5. When the number of measurements associated to a label is large relative to total number of measurements, the (f_1) score may grow large by random association. Hence a p-value was calculated with a hypergeometric test. Only matches with (p<0.005) were kept for further analysis. For every EC, the highest scoring subcluster was chosen.

If an EC contain more than one label, the interpretation of the subcluster–label match is ambiguous. To recover interpretable subcluster–label matches, we chose label representatives from the ECs via the selection rules (A) and (B) as defined in the results. If the selection is not unique, the EC was dismissed.