Working With Data Hosted on Google Cloud Storage

Google Bucket Setup

Output files are contained in the private Google Cloud Bucket gs://metagenomics-mac at the moment, the user must authenticate with a service account keyfile in the following way:

googleCloudStorageR::gcs_auth("full/path/to/keyfile.json")

To obtain this keyfile, the owner of a service account affiliated with the Google Cloud Project containing gs://metagenomics-mac must create it according to the process detailed in the Google Cloud Guide “Create and delete service account keys”.

Additionally, the default bucket for GoogleCloudStorageR must be set to the Google Bucket name, in this case metagenomics-mac. This is done automatically upon loading the package, but if an error occurs can be manually done through the following:

googleCloudStorageR::gcs_global_bucket("metagenomics-mac")

Data Retrieval

All objects stored in gs://metagenomics-mac can then be viewed with the following. This operation can take several minutes due to the number of objects stored in gs://metagenomics-mac.

googleCloudStorageR::gcs_list_objects()

Alternatively, listMetagenomicData will provide a table of objects that are compatible with the cacheMetagenomicData streamlined downloading and caching function.

file_tbl <- listMetagenomicData()

To access output files, use cacheMetagenomicData. This function takes UUID and data type arguments, downloads the corresponding output files, and stores them in a local cache. If the same files are requested again through this function, they will not be re-downloaded unless explicitly specified, in order to reduce excessive downloads. cacheMetagenomicData returns a tibble of cached file paths and cache IDs for each requested file.

We will first select samples from a particular study that we want to investigate:

selected_samples <- sampleMetadata |>
    filter(study_name == "ZhangM_2023") |>
    select(where(~ !any(is.na(.x))))

Then we will supply these to cacheMetagenomicData.

cache_tbl <- cacheMetagenomicData(uuids = selected_samples$uuid,
                                  data_type = "relative_abundance")
#> Resource with rname = 'results/cMDv4/0807eb2a-a15e-4647-8e19-2600d8fda378/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/e0fbb54f-0249-4917-a4d7-bd68acb89c62/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/25172837-2849-4db3-be91-d54d6a815d00/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/39ddb5e7-97f6-4d3c-812b-9653b03f99b3/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/7b152a7d-e244-4e2b-b924-7195c7ecfb10/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/dd30f93b-7999-47a4-93fb-21971b899939/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/1406666f-04a8-43c9-983b-4ed62fd6da4a/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/fe3de3ca-3a14-4bd8-ae1c-0dad69edc9cd/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/8707e374-5ddb-4220-8cbf-364b8b0e7be1/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/22848a9c-66a6-4993-9058-cb6464edb42f/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/08e2b754-78e2-4cb4-8ff2-95fd7b0ff44a/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/9baef0b2-93d2-4a40-8082-d357c7f8156a/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/09a9303d-d87d-4556-9672-04cbbcaf3d37/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/ac9f3532-90d8-412c-9c80-491037f0bcc2/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/eda61949-02dc-40ae-8dbe-bea2add85a52/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/1f007260-be6c-4a21-800a-ad9c36129a0d/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/e47a59bb-443a-405f-9c5d-02659d80e9e5/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/b3eaf3ab-43ef-4830-ab6d-12bafed3c61e/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/28f7352f-fe23-4003-93e1-41f4fedc6232/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/b07e2362-5851-4181-ba9a-15d9109ee4dd/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/677be4e3-722b-4e43-bd5a-36d8fbed6f86/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/0c817272-f873-475f-a401-dfe46a679a9f/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/7a3945d9-21bb-434a-9a4e-bfcdeb6194de/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.
#> Resource with rname = 'results/cMDv4/56aa2ad5-007d-407c-a644-48aac1e9a8f0/metaphlan_lists/metaphlan_unknown_list.tsv.gz' found in cache, proceeding with most recent version.

Data Handling

Automatic Experiment Setup (MetaPhlAn and HUMAnN output files only)

The above table can then be supplied to loadMetagenomicData and the cached files will be parsed into a single SummarizedExperiment object with sample metadata. At this point, only the metaphlan_lists data types, viral_clusters and relative_abundance, as well as all humann data_types, are compatible with this function.

merged_experiment <- loadMetagenomicData(cache_tbl)

#> class: SummarizedExperiment 
#> dim: 2915 24 
#> metadata(0):
#> assays(1): relative_abundance
#> rownames(2915): UNCLASSIFIED k__Bacteria ...
#>   k__Bacteria|p__Firmicutes|c__CFGB3009|o__OFGB3009|f__FGB3009|g__GGB31234|s__GGB31234_SGB14869|t__SGB14869
#>   k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium_CHKCI006|t__SGB7261
#> rowData names(2): ncbi_tax_id additional_species
#> colnames(24): 0807eb2a-a15e-4647-8e19-2600d8fda378
#>   e0fbb54f-0249-4917-a4d7-bd68acb89c62 ...
#>   7a3945d9-21bb-434a-9a4e-bfcdeb6194de
#>   56aa2ad5-007d-407c-a644-48aac1e9a8f0
#> colData names(445): study_name BioProject ...
#>   uncurated_Day_of_stool_collection_digestion_issue
#>   uncurated_Day_of_stool_collection_constipation

Stepwise Experiment Setup

Alternatively, we can parse the files, add metadata, and merge the SummarizedExperiment objects in separate steps.

parse_metaphlan_list completes the first step for files with ‘data_type’ equal to “relative_abundance” or “viral_clusters”. parse_humann is also available for HUMAnN output files.

parsed_rel_ab_list <- vector("list", nrow(cache_tbl))
names(parsed_rel_ab_list) <- cache_tbl$uuid

for (i in 1:nrow(cache_tbl)) {
    parsed_rel_ab_list[[i]] <- parse_metaphlan_list(sample_id = cache_tbl$uuid[i],
                                                file_path = cache_tbl$cache_path[i],
                                                data_type = cache_tbl$data_type[i])
    
}

Once the files have been loaded as SummarizedExperiment objects, matching assays from multiple samples can be merged together with mergeExperiments.

merged_rel_abs <- mergeExperiments(parsed_rel_ab_list)

#> class: SummarizedExperiment 
#> dim: 2915 24 
#> metadata(0):
#> assays(1): relative_abundance
#> rownames(2915): UNCLASSIFIED k__Bacteria ...
#>   k__Bacteria|p__Firmicutes|c__CFGB3009|o__OFGB3009|f__FGB3009|g__GGB31234|s__GGB31234_SGB14869|t__SGB14869
#>   k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium_CHKCI006|t__SGB7261
#> rowData names(2): ncbi_tax_id additional_species
#> colnames(24): 0807eb2a-a15e-4647-8e19-2600d8fda378
#>   e0fbb54f-0249-4917-a4d7-bd68acb89c62 ...
#>   7a3945d9-21bb-434a-9a4e-bfcdeb6194de
#>   56aa2ad5-007d-407c-a644-48aac1e9a8f0
#> colData names(0):

The corresponding metadata from sampleMetadata is then added as colData with the function add_metadata.

rel_abs_with_metadata <- add_metadata(sample_ids = colnames(merged_rel_abs),
                                      id_col = "uuid",
                                      experiment = merged_rel_abs)

#> class: SummarizedExperiment 
#> dim: 2915 24 
#> metadata(0):
#> assays(1): relative_abundance
#> rownames(2915): UNCLASSIFIED k__Bacteria ...
#>   k__Bacteria|p__Firmicutes|c__CFGB3009|o__OFGB3009|f__FGB3009|g__GGB31234|s__GGB31234_SGB14869|t__SGB14869
#>   k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Eubacteriales_unclassified|g__Eubacteriales_unclassified|s__Clostridiales_bacterium_CHKCI006|t__SGB7261
#> rowData names(2): ncbi_tax_id additional_species
#> colnames(24): 0807eb2a-a15e-4647-8e19-2600d8fda378
#>   e0fbb54f-0249-4917-a4d7-bd68acb89c62 ...
#>   7a3945d9-21bb-434a-9a4e-bfcdeb6194de
#>   56aa2ad5-007d-407c-a644-48aac1e9a8f0
#> colData names(445): study_name BioProject ...
#>   uncurated_Day_of_stool_collection_digestion_issue
#>   uncurated_Day_of_stool_collection_constipation