Case Study: Cheese Communities • MolPad

library(MolPad)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

Introduction

This vignette provides a comprehensive guide on using MolPad for a case study, including data pre-processing, network generation, and result analysis.

Washed-Rind Cheese Microbial Communities

Cheese making is an ancient craft that involves the coagulation of milk proteins to form curds, which are then separated from the liquid whey. The curds are processed, shaped, and aged to develop the desired texture and flavor. Various factors, such as the type of milk, specific cultures of bacteria, and aging conditions, contribute to the unique characteristics of each cheese. One crucial step in this process is the aging practice, where regular washing with a brine solution plays a significant role.

In the process of producing cheese, regular washing with a brine solution is an aging practice that can homogenize the microbial communities on the cheese’ surface and facilitate intermicrobial interactions. In the following parts, we will analyze a longitudinal data set of three washed-rind cheese communities collected during cheese ripening.

Data

For the analysis, we use a data set contained in the study by Smith et al. (2022) on microbial communities in cheese. The original research investigated the successional dynamics that occur within cheese rind microbial communities using a combination of 16S rRNA amplicon, Illumina, and PacBio sequencing. We functionally and taxonomically annotate (using eggNOG (21) and MMseqs2 (22)) the contigs they have generated from the Illumina reads, to demonstrate the utility of MolPad using a single-omic. Specifically, we focus on Cheese Sample A and Cheese Sample C. For more detailed information on our attached data, please check this documentation.

data("cheese")
str(cheese)
#> tibble [106,239 × 18] (S3: tbl_df/tbl/data.frame)
#>  $ ID     : chr [1:106239] "1" "2" "3" "4" ...
#>  $ A_1    : int [1:106239] 38 23 24 3 58 12 1 14 1 3 ...
#>  $ A_2    : int [1:106239] 23 6 5 2 14 9 1 7 1 1 ...
#>  $ A_3    : int [1:106239] 27 4 37 4 45 14 0 14 3 5 ...
#>  $ A_4    : int [1:106239] 5 0 10 2 13 4 1 4 4 0 ...
#>  $ A_5    : int [1:106239] 11 9 19 16 32 13 0 4 1 1 ...
#>  $ C_1    : int [1:106239] 13 21 3 56 82 2 4 17 7 2 ...
#>  $ C_3    : int [1:106239] 1 1 0 1 3 0 0 0 0 1 ...
#>  $ C_4    : int [1:106239] 3 0 1 7 8 1 0 1 2 2 ...
#>  $ C_5    : int [1:106239] 0 2 4 17 3 4 0 1 8 6 ...
#>  $ GO_ID  : chr [1:106239] NA NA NA NA ...
#>  $ KEGG_ID: chr [1:106239] NA NA NA NA ...
#>  $ domain : chr [1:106239] NA NA NA NA ...
#>  $ phylum : chr [1:106239] NA NA NA NA ...
#>  $ class  : chr [1:106239] NA NA NA NA ...
#>  $ order  : chr [1:106239] NA NA NA NA ...
#>  $ family : chr [1:106239] NA NA NA NA ...
#>  $ genus  : chr [1:106239] NA NA NA NA ...
str(annotations)
#> tibble [86,156 × 9] (S3: tbl_df/tbl/data.frame)
#>  $ ID     : chr [1:86156] "9" "10" "11" "12" ...
#>  $ GO_ID  : chr [1:86156] "-" "-" "-" "-" ...
#>  $ KEGG_ID: chr [1:86156] "-" "-" "-" "-" ...
#>  $ domain : chr [1:86156] "Bacteria" "Bacteria" "Bacteria" "Bacteria" ...
#>  $ phylum : chr [1:86156] "Pseudomonadota" "Pseudomonadota" "Pseudomonadota" "Pseudomonadota" ...
#>  $ class  : chr [1:86156] "Alphaproteobacteria" "Alphaproteobacteria" "Alphaproteobacteria" "Alphaproteobacteria" ...
#>  $ order  : chr [1:86156] "Caulobacterales" "Caulobacterales" "Hyphomicrobiales" "Hyphomicrobiales" ...
#>  $ family : chr [1:86156] "Caulobacteraceae" "Caulobacteraceae" "Bartonellaceae" "Brucellaceae;-_Brucella/Ochrobactrum group" ...
#>  $ genus  : chr [1:86156] "Caulobacter" "Caulobacter;-_unclassified Caulobacter" "Bartonella" "Brucella" ...

Pre-process

Data and Annotations

Here we select a ‘type’ column with phylum to describe the characteristic of cheese data. Also, the columns phylum and class are taken as tags for elemental composition.

In this section, we will introduce data preparing steps for analysis. The annotations dataset contains various columns that describe the characteristics and properties of samples. First, we select the ‘type’ column with phylum to provide a broad categorization of the microbial communities present on the cheese surface. This categorization helps in understanding the overall composition and diversity of microbes at a high taxonomic level. Then we run the pre_process() function to clean and standardize the data.

cheesedata <- cheese |> 
         select(ID, A_1:C_5, phylum) |> 
         rename(type=phylum) |> 
            pre_process()

To further annotate the dataset, we also use the columns phylum and class as tags for the elemental composition of the microbial communities. The phylum column represents a major taxonomic rank, giving us a broad overview of the microbial distribution. The class column provides more detailed information, allowing us to delve deeper into the specific types of microbes present.

pathchee <- gAnnotation(annotations,"phylum","class")

After pre-processing, here is what the two datasets that we will put into the dashboard look like:

# data
cheesedata[112:115,]
#> # A tibble: 4 × 11
#>   ID      A_1    A_2    A_3    A_4    A_5    C_1    C_3    C_4    C_5 type 
#>   <chr> <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl> <chr>
#> 1 112   0.943 -0.471 -0.471 -0.471 -0.471  2.36  -0.471 -0.471 -0.471 Other
#> 2 113   0.786 -0.124  0.126 -0.623 -0.637  2.33  -0.667 -0.593 -0.593 Other
#> 3 114   1.22  -1.09  -0.430 -0.829 -0.170  1.43  -1.30   0.455  0.715 Other
#> 4 115   2.67  -0.333 -0.333 -0.333 -0.333 -0.333 -0.333 -0.333 -0.333 Other
# annotation
pathchee[112:115,]
#> # A tibble: 4 × 9
#>   ID    GO_ID KEGG_ID          domain Pathway taxonomic.scope order family genus
#>   <chr> <chr> <chr>            <chr>  <chr>   <chr>           <chr> <chr>  <chr>
#> 1 145   -     ko:K00004        Bacte… Pseudo… Unknown         NA    NA     NA   
#> 2 147   -     ko:K00004        Bacte… Actino… Actinomycetes   NA    NA     NA   
#> 3 148   -     ko:K00004        Bacte… Bacill… Bacilli         Baci… Bacil… Lysi…
#> 4 149   -     ko:K00004,ko:K0… Bacte… Actino… Actinomycetes   NA    NA     NA

Cluster Input

In this section, we generate clusters from the first dataset above using the gClusters function. This function takes the cheese dataset (cheesedata) as input and generates clusters based on specified parameters. Here, we set the number of clusters to 10 (ncluster = 10) and specify the maximum number of clusters to consider for determining the optimal number of clusters (elbow.max=15).

cluschee <- gClusters(cheesedata,ncluster = 10,elbow.max=15)

network input

After generating the clusters for major patterns, we proceed to obtain network results from the clusters. Taking the cluster centroids as the nodes, the prediction process for edges is divided into individual regression tasks, where each cluster centroid is independentally predicted from the expression of all the other cluster centroids, using random forests. We pick the top 3 related predictors for each cluster centroid and save the network output for future use. This is achieved using the gNetwork() function.

networkchee <- gNetwork(cluschee,ntop = 3)

To gain more insight into the network results, we can visualize the details using the gNetwork_view() function, as shown below.

gNetwork_view(networkchee)

Run Dashboard

Once the clusters and network results are obtained, we can proceed to run the dashboard. This involves declaring the annotations and then executing the dashboard using the gDashboard() function. Here, we pass the cheese dataset (cheesedata), cluster results (cluschee), network results (networkchee), and specify column names and types for annotation identifiers.

gDashboard(cheesedata,
           cluschee,
           pathchee,
           networkchee,
           id_colname = c("GO_ID","KEGG_ID"),
           id_type = c("GO","KEGG"))

Results

For example, when we filter for Actinomycetota (Actinobacteria) as the functional group, we see that there are no edges connecting to group 10 and group 3- the clusters that have the most features associated with Actinomycetoa for Cheese sample A (Fig 1.3.A).

Fig 1 Fig 1: Dashboard showing Actinomycetota filtered network (A) with enrichment pattern for Cheese Sample-A (B) and Cheese Sample-C (C); Cluster pattern for Group 9, which also is enriched for Type IV secretion genes (D).

Looking at the pattern traces of these groups, (Fig 1.3.B), there is a peak in samples A4 (week 9) and A5 (week 13), which mirrors the 16S rRNA results of Saak et al. Since these two clusters do not have edges connecting them to other groups, this suggests that the Actinomycetoa features found in these groups follow distinct longitudinal succession patterns that are independent. When looking at Actinomycetoa within Cheese Sample C we see a different pattern. Groups 2 and 5, have the most features associated with Actinomycetoa, but they are highly connected to the other groups (Fig 1.3.A). From these results, we can hypothesize that though Actinomycetoa features are more abundant in later time points for both cheese samples, their dynamics are differentially influenced.

The authors found that Type VI secretion was enriched in Pseudomonadota bacteria (specifically, Psychrobacter), and hypothesized this enrichment was due to the importance of physical species interactions that occur with this habitat. Using MolPad, we searched for other secretion systems associated genes, to understand their dynamics within the community. Focusing on KEGG annotated Type IV secretion genes, we found that Group 9 contained 12/13 of these genes. Within this group, features that cluster are ones that peak in Cheese sample C5 (week 13, Fig 1.3.D). This is also the most taxonomically diverse sample. From this, we can hypothesize that increased taxonomic diversity is also associated with increases in genes that are related to competitive species interactions.