Introduction
Metagenomic analysis was conducted on our oral biofilm model to
assess the viability of the model as a proxy for dental calculus.
Materials
A total of 35 biofilm model samples were taken throughout the course
of the experiment and includes saliva inoculate, media from the wells,
and the biofilm end-product. DNA extraction was performed at the
archaeogenetic facility at the Max Planck Institute for the Science of
Human History (Jena, Germany).
Extractions were performed in duplicates. A total of
DNA extracts.
DNeasy PowerSoil Kit from QIAGEN. C2 inhibitor removal step skipped,
going directly to C3 step.
were paired-end sequenced on a NextSeq (2 color chemistry) to
150bp
Methods
Preprocessing
Processing of the raw DNA reads was conducted using the
nf-core/eager, v2.4.4 pipeline [@yatesEAGER2020]. Adapter removal and read
merging was performed using AdapterRemoval, v2.3.2 [@AdapterRemovalv2]. Merged reads were mapped to
the human reference genome, GRCh38, using BWA, v0.7.17-r1188 [@BWA] with default settings (-n 0.01; -l 32),
and unmapped reads were extracted using Samtools, v1.12.
Metagenomic classification was conducted in kraken, v2.1.2 using the
Standard database (https://benlangmead.github.io/aws-indexes/k2).
Kraken output reports were combined and converted to OTU tables, and
only species-level assignments were selected for downstream analysis 02-scripts/00-comb-kraken-reports.R.
The OTU tables were further filtered by removing species with relative
abundance lower than 0.001%.
Authentication
SourceTracker [@knightsSourceTracker2011] was used to estimate
source composition of the oral biofilm model samples using a Bayesian
framework. Samples were compared with oral and environmental controls to
detect potential external contamination.
Potential contaminants were identified using the frequency and
prevalence method in the decontam v1.16.0 [@R-decontam] R package. 02-scripts/02-authentication.R.
Samples from indoor_air, skin, and sediment
were used as negative controls for the prevalence method with a
probability threshold of 0.01. DNA
concentrations were used for the ‘frequency’ method with a
probability threshold of 0.99 and negative controls.
Putative contaminants were filtered out of the OTU tables for all
downstream analyses.
See 06-reports/metagen-authentication.Rmd
for more details.
Differential abundance analysis
Results
Authentication
SourceTracker
Results from SourceTracker indicated a large portion of species from
most samples are from an oral origin (plaque, saliva, or calculus). Some
samples contained a large proportion of species from potential
contaminants (indoor air) and of unknown origin. Potential contaminants
were compared to a database of oral bacteria to see the proportion of
known oral species were present in the samples. Many of the species
assigned to indoor_air
are known oral species (Figure
@ref(fig:st-plot)).
Based on these results, samples SYN015.F0101, SYN015.G0101,
SYN015.H0101, SYN017.F0101, SYN017.G0101, SYN018.H0101, SYN013.I0101,
SYN016.I0101 were removed from further analysis. The removed samples
were all sampled late in the experiment (day 18+).
decontam
A total of 5424 potential contaminants were removed. Remaining counts
of species in the biofilm samples ranged between 88 and 284 with a mean
of 181.59.
---
title: "Summary of metagenomic analysis"
output: html_notebook
---

```{r}
#| label: setup
#| include: false
library(here)
library(tidyverse)
library(cuperdec)
library(patchwork)
library(mixOmics)
# set document chunk options
knitr::opts_chunk$set(
  echo = FALSE,
  warning = FALSE,
  message = FALSE
)
# metadata
metadata <- read_tsv(here("01-documentation/metadata.tsv"))
experiment_metadata <- read_tsv(here("01-documentation/experiment-metadata.tsv"))
analysis_metadata <- read_tsv(here("01-documentation/analysis-metadata.tsv"))
software <- read_tsv(here("01-documentation/software_versions.csv"), col_names = c("software", "version"))
conda_env <- read_tsv(here("01-documentation/conda_versions.tsv"), skip = 2)
lib_conc <- read_tsv(here("01-documentation/SYN_DNA_concentrations.tsv"))
# analysis results
sourcetracker2 <- read_tsv(here("04-analysis/sourcetracker/sourcetracker2_output/mixing_proportions.txt"))
all_data_long <- read_tsv(here("04-analysis/sourcetracker/source-comb_long.tsv"))
list_of_contaminants <- read_tsv(here("04-analysis/decontam/list-of-contaminants.txt"), col_names = F)
oral_taxa <- cuperdec_database_ex %>%
  filter(isolation_source == "oral")
bac_properties <- read_tsv(here("01-documentation/species-properties.tsv"))
otu_table <- read_tsv(here("05-results/post-decontam_taxatable.tsv"))
alpha_div <- read_tsv(here("05-results/alpha-diversity.tsv"))
load(here("05-results/spca_byoc.rda"))
load(here("05-results/spca_species.rda"))

# convert post-decontam OTU table to long format for easier wrangling
species_counts_long <- otu_table %>%
  pivot_longer(cols = where(is.numeric), names_to = "sample", values_to = "count")

# Helper objects
day_order <- experiment_metadata %>%
  mutate(Env = factor(
    Env, 
    levels = c("saliva", "medium", "byoc_calculus") # force level order so it doesn't order alphabetically
    ) 
  ) %>% 
  group_by(Env) %>%
  arrange(day, .by_group = T) %>%
  filter(`#SampleID` %in% colnames(sourcetracker2)) %>%
  mutate(rm = if_else(`#SampleID` %in% analysis_metadata$`#SampleID`, F, T),
         col = case_when(rm == T ~ "red", # excluded samples coloured red
                         rm == F ~ "black"))
treatment_colours <- viridisLite::inferno(n = length(unique(experiment_metadata$treatment))) # define pallette and number of colours
env_colours <- viridisLite::magma(n = length(unique(analysis_metadata$Env))) # define pallette and number of colours

experiment_colours <- experiment_metadata %>%
  arrange(desc(day)) %>% 
  mutate(treat_col = case_when(
           treatment == "potato" ~ treatment_colours[1],
           treatment == "wheat" ~ treatment_colours[2],
           treatment == "mix" ~ treatment_colours[3],
           treatment == "control" ~ treatment_colours[4],
           is.na(treatment) ~ "grey"),
         env_col = case_when(
           Env == "saliva" ~ env_colours[5],
           Env == "byoc_calculus" ~ env_colours[1],
           Env == "medium" ~ env_colours[8]
         ))
analysis_colours <- analysis_metadata %>%
  mutate(env_col = case_when(
    Env == "saliva" ~ env_colours[5],
    Env == "byoc_calculus" ~ env_colours[1],
    Env == "medium" ~ env_colours[8],
    Env == "buccal_mucosa" ~ env_colours[6],
    Env == "subgingival_plaque" ~ env_colours[2],
    Env == "supragingival_plaque" ~ env_colours[3],
    Env == "modern_calculus" ~ env_colours[4],
    Env == "vitro_biofilm" ~ env_colours[7]
         ))
```


## Introduction

Metagenomic analysis was conducted on our oral biofilm model to assess the
viability of the model as a proxy for dental calculus.

## Materials

```{r}
metadata %>%
  filter(Study == "this_study")

byoc_samples <- metadata %>% 
  filter(str_detect(`#SampleID`, "SYN"))
```

A total of `r nrow(byoc_samples)` biofilm model samples were taken throughout the
course of the
experiment and includes saliva inoculate, media from the wells, and the biofilm
end-product. DNA extraction was performed at the archaeogenetic facility at the
Max Planck Institute for the Science of Human History (Jena, Germany). 

Extractions were
performed in duplicates<!-- ?? -->. A total of <!-- sample size --> DNA extracts.

DNeasy PowerSoil Kit from QIAGEN. C2 inhibitor removal step skipped, going directly
to C3 step.

were paired-end sequenced on a NextSeq (2 color chemistry) to 150bp

### Comparative samples

```{r}
comp_samples <- metadata %>%
  filter(Project != "this_study")

comp_samples
```

## Methods

### Preprocessing

Processing of the raw DNA reads was conducted using the
`r paste(filter(software, str_detect(software, "eager")))`
pipeline [@yatesEAGER2020].
Adapter removal and read merging was performed using
`r paste(filter(software, str_detect(software, regex("adapterremoval", ignore_case = T))))`
[@AdapterRemovalv2]. Merged reads were mapped to the human reference genome,
GRCh38, using
`r paste(filter(software, str_detect(software, regex("bwa", ignore_case = T))))`
[@BWA] with default settings (-n 0.01; -l 32), and unmapped reads were extracted
using
`r paste(filter(software, str_detect(software, regex("samtools", ignore_case = T))))`.

Metagenomic classification was conducted in
`r paste(filter(software, str_detect(software, regex("kraken", ignore_case = T))))`
using the Standard database (https://benlangmead.github.io/aws-indexes/k2).

Kraken output reports were combined and converted to OTU tables, and only
species-level assignments were selected for downstream analysis
[02-scripts/00-comb-kraken-reports.R](02-scripts/00-comb-kraken-reports.R).
The OTU tables were further filtered by removing species with relative abundance
lower than 0.001%.

### Authentication

```{r}
#| label: lib-conc-table
lib_conc %>%
  knitr::kable()
```

SourceTracker [@knightsSourceTracker2011] was used to estimate source composition
of the oral biofilm model samples using a Bayesian framework. Samples were compared
with oral and environmental controls to detect potential external contamination.

Potential contaminants were identified using the frequency and prevalence method
in the **decontam** v`r packageVersion("decontam")` [@R-decontam] R package.
[02-scripts/02-authentication.R](../02-scripts/02-authentication.R). Samples
from *indoor_air*, *skin*, and *sediment* were used as negative controls for
the prevalence method with a probability threshold of 0.01. 
[DNA concentrations](../01-documentation/SYN_DNA_concentrations.tsv) were used
for the 'frequency' method with a probability threshold of 0.99
and negative controls. 

Putative contaminants were filtered out of the
[OTU tables](../05-results/post-decontam_taxatable.tsv) for all downstream analyses.

See [06-reports/metagen-authentication.Rmd](../06-reports/metagen-authentication.Rmd)
for more details.
<!-- cuperdec [@yatesOralMicrobiome2021]-->

### Community composition

Genus- and species-level OTU tables were prepared from the decontaminated OTU
table, and relative abundance of species in a sample was calculated as recommended
for compositional data [@gloorMicrobiomeDatasets2017].

Information on the oxygen tolerance of bacterial species was downloaded from 
[BacDive](https://bacdive.dsmz.de) on
`r str_extract(list.files("../03-data/", pattern = "bacdive"), "\\d{4}-\\d{2}-\\d{2}")`.
A list of amylase-binding streptococci (ABS) was created based on
[@nikitkovaStarchBiofilms2013].

Alpha-diversity, specifically the Shannon Index, was calculated to compare species
richness and diversity across experimental and comparative oral samples. Shannon
Index was calculated using the R package **vegan** v`r packageVersion("vegan")`
[@Rvegan].

Sparse principal components analysis (sPCA) was conducted on centered-log-ratio
(CLR) transformed species-level counts using the R package **mixOmics**
v`r packageVersion("mixOmics")` [@RmixOmics]. The **mixOmics** implementation of
sPCA uses a LASSO penalisation to eliminate unimportant variables.
Two separate analyses were conducted: 1) on experiment samples to assess the
difference in sample types and biofilm age; and 2) on oral comparative samples
and biofilm model end-products to explore community differences between *in vitro*
and *in vivo* biofilms.

Loadings obtained from the sPCA analyses were used in combination with
CLR-transformed counts to create heatmaps of species- and genus-level counts
within the experiment and across comparative samples.

Beta-diversity, Bray Curtis dissimilarity.

### Differential abundance analysis

## Results

```{r}
#| label: results-setup
sourcetracker2_long <- sourcetracker2 %>%
  pivot_longer(cols = where(is.numeric), 
               values_to = "proportion", 
               names_to = "SampleID") %>%
  rename(source = ...1)

removed_samples <- metadata %>%
  anti_join(analysis_metadata) %>%
  rename(samples = `#SampleID`)
```

### Authentication

#### SourceTracker

Results from SourceTracker indicated a large portion of species from most samples
are from an oral origin (plaque, saliva, or calculus). Some samples contained a
large proportion of species from potential contaminants (indoor air) and of
unknown origin. Potential contaminants were compared to a database of oral
bacteria to see the proportion of known oral species were present in the samples.
Many of the species assigned to `indoor_air` are known oral species
(Figure \@ref(fig:st-plot)).

```{r}
#| label: st-plot
#| fig-cap: "Plot of estimated contributions of various sources to the artificial calculus and artificial saliva samples. Samples are arranged from top to bottom by how late in the experiment they were sampled, with bottom being the earliest samples. Sample names in red indicate samples that were removed from further analysis due to contamination."
st_plot <- sourcetracker2_long %>% 
  ggplot(aes(y = SampleID, x= proportion, fill = source)) +
    geom_col() +
    theme_minimal() +
    scale_y_discrete(limits = day_order$`#SampleID`) +
    scale_fill_viridis_d(option = "C") +
    theme(
      panel.grid.major.y = element_blank(),
      axis.title = element_blank(),
      axis.text.y = element_text(vjust = 0.5, hjust = 0.5, colour = day_order$col),
      legend.position = "top"
      )

oral_plot <- all_data_long %>% 
  filter(count > 0) %>% 
  mutate(oral_source = if_else(taxon %in% oral_taxa$species, "oral", "other")) %>%
  ggplot(aes(y = SampleID, x = count, fill = oral_source)) +
    geom_col(position = "fill") +
    scale_y_discrete(limits = day_order$`#SampleID`) +  
    scale_fill_viridis_d(option = "C") +
    theme_minimal() +
    theme(
      panel.grid.major.y = element_blank(),
      axis.text.y = element_blank(),
      #axis.text.y = element_text(colour = day_order$col),
      axis.title = element_blank(),
      legend.position = "top"
      )

st_plot + oral_plot + plot_annotation(tag_levels = "A")
```

Based on these results, samples
`r paste(removed_samples$samples, collapse = ", ")`
were removed from further analysis. The removed samples were all sampled late in
the experiment (day 18+).

```{r}
removed_samples %>%
  left_join(experiment_metadata, by = c("samples" = "#SampleID", "Env")) %>%
  dplyr::select(samples, Env, day) %>%
  arrange(day)
```


#### decontam

```{r}
#| label: decontam-byoc-summ
#| include: false
byoc_summ <- species_counts_long %>%
  filter(str_detect(sample, "SYN"),
         count > 0) %>%
  group_by(sample) %>%
  count(`#OTU ID`) %>%
  summarise(count = sum(n))
```

A total of `r nrow(list_of_contaminants)` potential contaminants were removed.
Remaining counts of species in the biofilm samples ranged between
`r paste(range(byoc_summ$count), collapse = " and ")`
with a mean of
`r paste(round(mean(byoc_summ$count), 2))`.

```{r}
#| label: decontam-plot
#| fig-cap: "Remaining number of species per sample following removal of contaminants."
byoc_summ %>%
  ggplot(aes(x = count, y = sample)) +
    geom_col() +
    theme_minimal() +
    theme(axis.title.y = element_blank())
```

### Community composition

Alpha-diversity. Days are grouped to increase sample sizes:

- Inoculation (`inoc`) = days 0,3,5
- Treatment (`treatm`) = days 7,9,12,15
- End-product (`final`) = day 24

```{r}
#| label: comp-setup
alpha_div_long <- alpha_div %>%
  pivot_longer(cols = where(is.numeric), names_to = "index") %>%
  left_join(metadata, by = c("sample" = "#SampleID")) %>%
  left_join(experiment_metadata, by = c("sample" = "#SampleID", "Env")) %>%
  mutate(day_grouped = case_when( # group days to increase sample size
    day < 6 ~ "inoc",
    day > 6 & day < 24 ~ "treatm",
    day == 24 ~ "final"),
    day_grouped = factor(day_grouped, levels = c("inoc", "treatm", "final"))
    )
```
  
```{r}
#| label: alpha-experiment-plot
#| fig-cap: "Plot of alpha-diversity indices across experiment samples grouped by sampling time."

alpha_div_long %>%
  filter(str_detect(sample, "SYN")) %>%
  #mutate(day = as.factor(day)) %>%
  ggplot(aes(x = day_grouped, y = value)) +
    geom_violin(aes(col = day_grouped)) +
    geom_boxplot(width = 0.12) +
    facet_wrap(~ index, scales = "free_y") +
    theme_bw() +
    theme(
      axis.title = element_blank(),
      axis.text.x = element_blank(),
      axis.ticks.x = element_blank()
    )
```

```{r}
#| label: alpha-compar-plot
alpha_div_long %>%
  filter(Env != "skin",
         Env != "sediment",
         Env != "stool",
         Env != "indoor_air") %>%
  mutate(Env = case_when(str_detect(Env, "plaque") ~ "plaque",
                         TRUE ~ Env)) %>%
  ggplot(aes(x = Env, y = value)) +
    geom_violin(aes(col = Env)) +
    geom_boxplot(width = 0.1) +
    facet_wrap(~ index, scales = "free_y") +
    theme_bw() +
    theme(
      axis.title = element_blank(),
      axis.text.x = element_blank(),
      axis.ticks.x = element_blank()
    )
```

```{r}
#| label: alpha-div-tbl
alpha_div_long %>%
  filter(
    Env != "skin",
    Env != "sediment",
    Env != "stool",
    Env != "indoor_air") %>%
  group_by(index, Env) %>%
  summarise(mean = mean(value),
            sd = sd(value)) %>%
  knitr::kable()
```


Alpha-diversity sees a slight decrease over the course of the experiment. The
model calculus is less variable than the initial and middle biofilm samples.
Compared to other oral samples, there is lower species diversity and richness
in the medium and model calculus.

```{r}
#| label: spca-setup
#| include: false
# explained variance
byoc_explain_var <- spca_byoc$prop_expl_var$X
# loadings of samples
byoc_princomp <- spca_byoc$x %>%
  as_tibble(rownames = "sample") %>%
  left_join(experiment_metadata, by = c("sample" = "#SampleID"))
# loadings of species
exp_pca_loadings <- spca_byoc$loadings$X %>%
  as_tibble(rownames = "species") %>%
  left_join(bac_properties, by = "species") %>% 
  dplyr::select(species,PC1,PC2, `Oxygen tolerance`) %>%
  arrange(desc(abs(PC1)))
exp_pca_loadings
# set common ggplot theme for loading plots
loading_theme <-  theme_minimal() +
    theme(
      panel.grid.major.x = element_blank(),
      panel.grid.minor.x = element_blank(),
      panel.grid.major.y = element_blank(),
      axis.ticks.x.bottom = element_line(size = 1),
      axis.text.y = element_text(),
      axis.text.x = element_text(vjust = -0.2),
      axis.line.x = element_line(),
      legend.position = "bottom"
          )
```


```{r}
#| label: scree-byoc
#| fig-cap: "Scree plot of principal components from the sPCA on experiment samples."
plot(spca_byoc, type = "l")
```

```{r}
#| label: byoc-spca-plot
#| fig-cap: "sPCA on species-level counts from experiment samples only."
byoc_spca_plot <- byoc_princomp %>%
  ggplot(aes(
    x = PC1, y = PC2, 
    col = as_factor(day),
    shape = Env
    )
  ) +
    geom_point(size = 4, stroke = 1) +
    labs(
      x = paste(
        "PC1", 
        scales::percent(byoc_explain_var[[1]], accuracy = 0.1)),
      y = paste(
        "PC2", 
        scales::percent(byoc_explain_var[[2]], accuracy = 0.1)),
      col = "Sampling day",
      shape = "Sample type"
      ) +
    theme_bw() +
    scale_shape_discrete(solid = F) +
    scale_colour_viridis_d()

exp_pca_loadings %>%
  mutate(species = fct_reorder(species, desc(abs(PC1)))) %>%
  slice_head(n = 20) %>%
  ggplot(aes(x = PC1, y = species, fill = `Oxygen tolerance`)) +
    geom_bar(stat = "identity", width = 0.8) +
    scale_x_continuous(limits = c(-0.2,0.2)) +
    scale_fill_manual(
      values = viridisLite::turbo(n = length(levels(as.factor(bac_properties$`Oxygen tolerance`))))) +
    labs(title = "component 1", x = "") +
    loading_theme

exp_pca_loadings %>%
  arrange(desc(abs(PC2))) %>%
  mutate(species = fct_reorder(species, desc(abs(PC2)))) %>%
  slice_head(n = 20) %>%
  ggplot(aes(x = PC2, y = species, fill = `Oxygen tolerance`)) +
    geom_bar(stat = "identity", width = 0.8) +
    scale_x_continuous(limits = c(-0.18,0.18)) +
  scale_fill_manual(
      values = viridisLite::turbo(
        n = length(levels(as.factor(bac_properties$`Oxygen tolerance`)))
        )
      ) +
    labs(title = "component 2", x = "") +
    loading_theme
```








