First 15 Minutes

This vignette is a fast path to retrieving microbiome data as a TreeSummarizedExperiment.

Vignette guide

Vignette	Audience	Typical time	Use when
`Data Codebook`	All users	Reference	You need to understand variable definitions, data types, and column structures
`Available File Types`	All users	Reference	You want to understand the structure, columns, and examples of raw pipeline output files
`Full Data Preparation`	Advanced users/Developers	Reference	You want to understand how raw sequences were processed, transformed into Parquet files, and how metadata was curated
`First 15 Minutes`	New users, biology-first workflows	10-15 min	You want your first successful data retrieval quickly
`Full Workflow`	Intermediate/advanced users	30-45 min	You want full control over sample, feature, and query setup
`Piecewise Workflow`	Advanced users	30-60 min	You need direct control of `accessParquetData()` and `loadParquetData()`
`Working with Large Parquet Files`	Users working with large data types	10-20 min	You need to query genefamilies_stratified or other large files efficiently

What you’ll do

Explore available data types and their descriptions.
Pick samples from sampleMetadata.
Pick features from a reference table.
Retrieve data with returnSamples().

What you’ll get

returnSamples() returns a TreeSummarizedExperiment:

assay data (microbial measurements)
sample metadata in colData
feature metadata in rowData

Quick start

You can view metadata about all samples in sampleMetaData. Later in example 1, you’ll see how to filter rows of this table to choose which samples you want to download data for.

data("sampleMetadata", package = "parkinsonsMetagenomicData")
dim(sampleMetadata)
#> [1] 3535 1177
sampleMetadata[1:5, 1:10]
#>                       curation_id    study_name    sample_id        subject_id
#> 1 AsnicarF_2021:predict1_MTG_0001 AsnicarF_2021 SAMEA7041133 predict1_MTG_0001
#> 2 AsnicarF_2021:predict1_MTG_0002 AsnicarF_2021 SAMEA7041134 predict1_MTG_0002
#> 3 AsnicarF_2021:predict1_MTG_0003 AsnicarF_2021 SAMEA7041135 predict1_MTG_0003
#> 4 AsnicarF_2021:predict1_MTG_0004 AsnicarF_2021 SAMEA7041136 predict1_MTG_0004
#> 5 AsnicarF_2021:predict1_MTG_0005 AsnicarF_2021 SAMEA7041137 predict1_MTG_0005
#>   target_condition target_condition_ontology_term_id body_site
#> 1             <NA>                              <NA>     feces
#> 2             <NA>                              <NA>     feces
#> 3             <NA>                              <NA>     feces
#> 4             <NA>                              <NA>     feces
#> 5             <NA>                              <NA>     feces
#>   body_site_ontology_term_id host_species host_species_ontology_term_id
#> 1             UBERON:0001988 Homo sapiens                NCBITaxon:9606
#> 2             UBERON:0001988 Homo sapiens                NCBITaxon:9606
#> 3             UBERON:0001988 Homo sapiens                NCBITaxon:9606
#> 4             UBERON:0001988 Homo sapiens                NCBITaxon:9606
#> 5             UBERON:0001988 Homo sapiens                NCBITaxon:9606

Exploring Available Data Types

Before retrieving data, you can explore what data types are available. Each data type represents different measurements or processing levels from various bioinformatics tools:

available_data <- get_hf_parquet_urls()

Use the search boxes above each column to filter by data type, tool, or keywords in the description. For example, search for “pathway” to find all pathway-related data types, or “CPM” to find normalized abundance measures.

Quick start examples

Example 1: Taxonomic relative abundance

Fetch MetaPhlAn4 relative abundance data for all Faecalibacterium species across all samples in the ZhangM_2023 study, directly from remote storage:

# In any subset of samples there will be many rows of all NA,
# so the select function below is usually wanted.
sample_table <- sampleMetadata |>
    filter(study_name == "ZhangM_2023") |>
    select(where(~ !all(is.na(.x))))

# All terminal taxa from the Faecalibacterium genus.
feature_table <- load_ref("clade_name_ref") |>
    filter(grepl("Faecalibacterium", clade_name_genus)) |>
    filter(!is.na(clade_name_species)) |>
    filter(!is.na(clade_name_terminal))

tse <- returnSamples(
    data_type = "relative_abundance",
    sample_data = sample_table,
    feature_data = feature_table,
    repo = "waldronlab/metagenomics_mac",
    include_empty_samples = TRUE,
    dry_run = FALSE
)

tse
#> class: TreeSummarizedExperiment 
#> dim: 9 24 
#> metadata(0):
#> assays(1): relative_abundance
#> rownames(9):
#>   k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_SGB15346|t__SGB15346
#>   k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15318
#>   ...
#>   k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15323
#>   k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_sp_CLA_AA_H233|t__SGB15315
#> rowData names(19): clade_name clade_name_kingdom ...
#>   NCBI_tax_id_terminal additional_species
#> colnames(24): fe3de3ca-3a14-4bd8-ae1c-0dad69edc9cd
#>   39ddb5e7-97f6-4d3c-812b-9653b03f99b3 ...
#>   1406666f-04a8-43c9-983b-4ed62fd6da4a
#>   677be4e3-722b-4e43-bd5a-36d8fbed6f86
#> colData names(56): uuid db_version ...
#>   ZhangM_2023_uncurated_Sample.Name ZhangM_2023_uncurated_SRA.Study
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL

This just retrieved all Faecalibacterium species across all samples in the ZhangM_2023 study, downloading from a remote repo.

Example 2: Functional pathway data

You can also retrieve functional data like metabolic pathways. First, discover which reference tables are available:

# See what reference tables are available
ref_info <- get_ref_info()

Now load the pathway reference and browse for pathways of interest:

# Load the pathway reference
pathway_ref <- load_ref("pathway_ref")

# Search for butyrate-related pathways (relevant to gut-brain health in PD)
butyrate_pathways <- pathway_ref |>
    filter(grepl("butanoate|butyrat", pathway, ignore.case = TRUE)) |>
    filter(!grepl("\\|", pathway)) |>  # Exclude stratified versions
    select(pathway) |>
    distinct()

butyrate_pathways
#> # A tibble: 10 × 1
#>    pathway                                                                      
#>    <chr>                                                                        
#>  1 ARGDEG-PWY: superpathway of L-arginine, putrescine, and 4-aminobutanoate deg…
#>  2 CENTFERM-PWY: pyruvate fermentation to butanoate                             
#>  3 GLUDEG-II-PWY: L-glutamate degradation VII (to butanoate)                    
#>  4 P163-PWY: L-lysine fermentation to acetate and butanoate                     
#>  5 PWY-5022: 4-aminobutanoate degradation V                                     
#>  6 PWY-5109: propanoate fermentation to 2-methylbutanoate                       
#>  7 PWY-5130: 2-oxobutanoate degradation I                                       
#>  8 PWY-5676: acetyl-CoA fermentation to butanoate II                            
#>  9 PWY-5677: succinate fermentation to butanoate                                
#> 10 PWY-7218: photosynthetic 3-hydroxybutanoate biosynthesis (engineered)

Now fetch the abundance of a specific butyrate biosynthesis pathway across all 3535 samples in parkinsonsMetagenomicData:

# Use all samples (no filtering by study)
# Select the acetyl-CoA fermentation to butanoate pathway
feature_table_pathway <- pathway_ref |>
    filter(pathway == "PWY-5676: acetyl-CoA fermentation to butanoate II") |>
    select(pathway)

tse_pathway <- returnSamples(
    data_type = "pathabundance_unstratified",
    sample_data = sampleMetadata,
    feature_data = feature_table_pathway,
    repo = "waldronlab/metagenomics_mac",
    include_empty_samples = TRUE,
    dry_run = FALSE
)

tse_pathway
#> class: TreeSummarizedExperiment 
#> dim: 1 3535 
#> metadata(0):
#> assays(1): abundance
#> rownames(1): PWY-5676: acetyl-CoA fermentation to butanoate II
#> rowData names(1): pathway
#> colnames(3535): 0872fa8c-5d4d-445a-abda-07acfb968bf8
#>   238f9773-0ea8-459b-bb44-f1140c437ed8 ...
#>   d8136243-5acc-407a-a4ec-613a97fd764f
#>   e3607760-0767-41e8-9d3b-293714e1d84d
#> colData names(141): uuid humann_header ...
#>   WallenZD_2022_uncurated_Day_of_stool_collection_digestion_issue
#>   WallenZD_2022_uncurated_Day_of_stool_collection_constipation
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL

You can select samples by phenotype instead of study, and select a few features across many samples, or many features across a few samples. If you want many features across many samples, you will need to download the data and query locally, which is covered in the Piecewise Workflow vignette.

Inspect the result

Both examples return a TreeSummarizedExperiment object with assays, sample metadata, and feature metadata:

# Inspect the taxonomic data from Example 1
assayNames(tse)
head(colData(tse)) # Rownames are sample unique identifiers (UUIDs)
head(rowData(tse)) # Rownames are taxonomic lineages

# Inspect the pathway data from Example 2
assayNames(tse_pathway)
head(rowData(tse_pathway)) # Rownames are pathway IDs

If this fails

If you ?returnSamples fails:

HTTP 429, slow internet access, or queries that will return a lot of data: download then use local parquet files with local_files.
To inspect SQL before executing: set dry_run = TRUE and run dplyr::show_query().

Next step

See Full Workflow for detailed configuration and discovery tooling.
See Piecewise Workflow for manual control over connection and loading steps.
See Working with Large Parquet Files for strategies on querying large data types like genefamilies_stratified.

Reference

Metadata curation: https://github.com/ASAP-MAC/parkinsonsManualCuration
Nextflow pipeline for processing metagenomic data: https://github.com/ASAP-MAC/metagenomicsNextflowMAC
Datasets found in the literature and efforts to obtain them: https://github.com/ASAP-MAC/parkinsons_data_search

sessionInfo()
#> R Under development (unstable) (2026-04-12 r89873)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats4    stats     graphics  grDevices utils     datasets  methods  
#> [8] base     
#> 
#> other attached packages:
#>  [1] DT_0.34.0                        dplyr_1.2.1                     
#>  [3] TreeSummarizedExperiment_2.19.0  Biostrings_2.79.5               
#>  [5] XVector_0.51.0                   SingleCellExperiment_1.33.2     
#>  [7] SummarizedExperiment_1.41.1      Biobase_2.71.0                  
#>  [9] GenomicRanges_1.63.2             Seqinfo_1.1.0                   
#> [11] IRanges_2.45.0                   S4Vectors_0.49.1-1              
#> [13] BiocGenerics_0.57.0              generics_0.1.4                  
#> [15] MatrixGenerics_1.23.0            matrixStats_1.5.0               
#> [17] parkinsonsMetagenomicData_0.99.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] httr2_1.2.2         xfun_0.57           bslib_0.10.0       
#>  [4] htmlwidgets_1.6.4   lattice_0.22-9      tzdb_0.5.0         
#>  [7] crosstalk_1.2.2     yulab.utils_0.2.4   vctrs_0.7.3        
#> [10] tools_4.7.0         curl_7.0.0          parallel_4.7.0     
#> [13] tibble_3.3.1        blob_1.3.0          pkgconfig_2.0.3    
#> [16] Matrix_1.7-5        dbplyr_2.5.2        desc_1.4.3         
#> [19] assertthat_0.2.1    lifecycle_1.0.5     stringr_1.6.0      
#> [22] compiler_4.7.0      treeio_1.35.0       textshaping_1.0.5  
#> [25] codetools_0.2-20    htmltools_0.5.9     sass_0.4.10        
#> [28] yaml_2.3.12         lazyeval_0.2.3      pkgdown_2.2.0      
#> [31] pillar_1.11.1       crayon_1.5.3        jquerylib_0.1.4    
#> [34] tidyr_1.3.2         BiocParallel_1.45.0 DelayedArray_0.37.1
#> [37] cachem_1.1.0        abind_1.4-8         nlme_3.1-169       
#> [40] tidyselect_1.2.1    digest_0.6.39       stringi_1.8.7      
#> [43] duckdb_1.5.2        purrr_1.2.2         arrow_23.0.1.2     
#> [46] fastmap_1.2.0       grid_4.7.0          cli_3.6.6          
#> [49] SparseArray_1.11.13 magrittr_2.0.5      S4Arrays_1.11.1    
#> [52] utf8_1.2.6          ape_5.8-1           withr_3.0.2        
#> [55] readr_2.2.0         rappdirs_0.3.4      bit64_4.6.0-1      
#> [58] rmarkdown_2.31      bit_4.6.0           otel_0.2.0         
#> [61] hms_1.1.4           ragg_1.5.2          evaluate_1.0.5     
#> [64] knitr_1.51          rlang_1.2.0         Rcpp_1.1.1-1       
#> [67] glue_1.8.0          tidytree_0.4.7      DBI_1.3.0          
#> [70] vroom_1.7.1         jsonlite_2.0.0      R6_2.6.1           
#> [73] systemfonts_1.3.2   fs_2.0.1