First 15 Minutes
This vignette is a fast path to retrieving microbiome data as a
TreeSummarizedExperiment.
Vignette guide
| Vignette | Audience | Typical time | Use when |
|---|---|---|---|
Data Codebook |
All users | Reference | You need to understand variable definitions, data types, and column structures |
First 15 Minutes |
New users, biology-first workflows | 10-15 min | You want your first successful data retrieval quickly |
Full Workflow |
Intermediate/advanced users | 30-45 min | You want full control over sample, feature, and query setup |
Piecewise Workflow |
Advanced users | 30-60 min | You need direct control of accessParquetData() and
loadParquetData()
|
Working with Large Parquet Files |
Users working with large data types | 10-20 min | You need to query genefamilies_stratified or other large files efficiently |
What you’ll do
- Explore available data types and their descriptions.
- Pick samples from
sampleMetadata. - Pick features from a reference table.
- Retrieve data with
returnSamples().
What you’ll get
returnSamples() returns a
TreeSummarizedExperiment:
- assay data (microbial measurements)
- sample metadata in
colData - feature metadata in
rowData
Quick start
You can view metadata about all samples in
sampleMetaData. Later in example 1, you’ll see how to
filter rows of this table to choose which samples you want to download
data for.
data("sampleMetadata", package = "parkinsonsMetagenomicData")
dim(sampleMetadata)
#> [1] 3535 1177
sampleMetadata[1:5, 1:10]
#> curation_id study_name sample_id subject_id
#> 1 AsnicarF_2021:predict1_MTG_0001 AsnicarF_2021 SAMEA7041133 predict1_MTG_0001
#> 2 AsnicarF_2021:predict1_MTG_0002 AsnicarF_2021 SAMEA7041134 predict1_MTG_0002
#> 3 AsnicarF_2021:predict1_MTG_0003 AsnicarF_2021 SAMEA7041135 predict1_MTG_0003
#> 4 AsnicarF_2021:predict1_MTG_0004 AsnicarF_2021 SAMEA7041136 predict1_MTG_0004
#> 5 AsnicarF_2021:predict1_MTG_0005 AsnicarF_2021 SAMEA7041137 predict1_MTG_0005
#> target_condition target_condition_ontology_term_id body_site
#> 1 <NA> <NA> feces
#> 2 <NA> <NA> feces
#> 3 <NA> <NA> feces
#> 4 <NA> <NA> feces
#> 5 <NA> <NA> feces
#> body_site_ontology_term_id host_species host_species_ontology_term_id
#> 1 UBERON:0001988 Homo sapiens NCBITaxon:9606
#> 2 UBERON:0001988 Homo sapiens NCBITaxon:9606
#> 3 UBERON:0001988 Homo sapiens NCBITaxon:9606
#> 4 UBERON:0001988 Homo sapiens NCBITaxon:9606
#> 5 UBERON:0001988 Homo sapiens NCBITaxon:9606Exploring Available Data Types
Before retrieving data, you can explore what data types are available. Each data type represents different measurements or processing levels from various bioinformatics tools:
available_data <- get_hf_parquet_urls()Use the search boxes above each column to filter by data type, tool, or keywords in the description. For example, search for “pathway” to find all pathway-related data types, or “CPM” to find normalized abundance measures.
Quick start examples
Example 1: Taxonomic relative abundance
Fetch MetaPhlAn4 relative abundance data for all Faecalibacterium species across all samples in the ZhangM_2023 study, directly from remote storage:
# In any subset of samples there will be many rows of all NA,
# so the select function below is usually wanted.
sample_table <- sampleMetadata |>
filter(study_name == "ZhangM_2023") |>
select(where(~ !all(is.na(.x))))
# All terminal taxa from the Faecalibacterium genus.
feature_table <- load_ref("clade_name_ref") |>
filter(grepl("Faecalibacterium", clade_name_genus)) |>
filter(!is.na(clade_name_species)) |>
filter(!is.na(clade_name_terminal))
tse <- returnSamples(
data_type = "relative_abundance",
sample_data = sample_table,
feature_data = feature_table,
repo = "waldronlab/metagenomics_mac",
include_empty_samples = TRUE,
dry_run = FALSE
)
tse
#> class: TreeSummarizedExperiment
#> dim: 9 24
#> metadata(0):
#> assays(1): relative_abundance
#> rownames(9):
#> k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_SGB15346|t__SGB15346
#> k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15318
#> ...
#> k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15323
#> k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_sp_CLA_AA_H233|t__SGB15315
#> rowData names(19): clade_name clade_name_kingdom ...
#> NCBI_tax_id_terminal additional_species
#> colnames(24): fe3de3ca-3a14-4bd8-ae1c-0dad69edc9cd
#> 39ddb5e7-97f6-4d3c-812b-9653b03f99b3 ...
#> 1406666f-04a8-43c9-983b-4ed62fd6da4a
#> 677be4e3-722b-4e43-bd5a-36d8fbed6f86
#> colData names(56): uuid db_version ...
#> ZhangM_2023_uncurated_Sample.Name ZhangM_2023_uncurated_SRA.Study
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULLThis just retrieved all Faecalibacterium species across all samples in the ZhangM_2023 study, downloading from a remote repo.
Example 2: Functional pathway data
You can also retrieve functional data like metabolic pathways. First, discover which reference tables are available:
# See what reference tables are available
ref_info <- get_ref_info()Now load the pathway reference and browse for pathways of interest:
# Load the pathway reference
pathway_ref <- load_ref("pathway_ref")
# Search for butyrate-related pathways (relevant to gut-brain health in PD)
butyrate_pathways <- pathway_ref |>
filter(grepl("butanoate|butyrat", pathway, ignore.case = TRUE)) |>
filter(!grepl("\\|", pathway)) |> # Exclude stratified versions
select(pathway) |>
distinct()
butyrate_pathways
#> # A tibble: 10 × 1
#> pathway
#> <chr>
#> 1 ARGDEG-PWY: superpathway of L-arginine, putrescine, and 4-aminobutanoate deg…
#> 2 CENTFERM-PWY: pyruvate fermentation to butanoate
#> 3 GLUDEG-II-PWY: L-glutamate degradation VII (to butanoate)
#> 4 P163-PWY: L-lysine fermentation to acetate and butanoate
#> 5 PWY-5022: 4-aminobutanoate degradation V
#> 6 PWY-5109: propanoate fermentation to 2-methylbutanoate
#> 7 PWY-5130: 2-oxobutanoate degradation I
#> 8 PWY-5676: acetyl-CoA fermentation to butanoate II
#> 9 PWY-5677: succinate fermentation to butanoate
#> 10 PWY-7218: photosynthetic 3-hydroxybutanoate biosynthesis (engineered)Now fetch the abundance of a specific butyrate biosynthesis pathway across all 3535 samples in parkinsonsMetagenomicData:
# Use all samples (no filtering by study)
# Select the acetyl-CoA fermentation to butanoate pathway
feature_table_pathway <- pathway_ref |>
filter(pathway == "PWY-5676: acetyl-CoA fermentation to butanoate II") |>
select(pathway)
tse_pathway <- returnSamples(
data_type = "pathabundance_unstratified",
sample_data = sampleMetadata,
feature_data = feature_table_pathway,
repo = "waldronlab/metagenomics_mac",
include_empty_samples = TRUE,
dry_run = FALSE
)
tse_pathway
#> class: TreeSummarizedExperiment
#> dim: 1 3535
#> metadata(0):
#> assays(1): abundance
#> rownames(1): PWY-5676: acetyl-CoA fermentation to butanoate II
#> rowData names(1): pathway
#> colnames(3535): 0872fa8c-5d4d-445a-abda-07acfb968bf8
#> 238f9773-0ea8-459b-bb44-f1140c437ed8 ...
#> d8136243-5acc-407a-a4ec-613a97fd764f
#> e3607760-0767-41e8-9d3b-293714e1d84d
#> colData names(141): uuid humann_header ...
#> WallenZD_2022_uncurated_Day_of_stool_collection_digestion_issue
#> WallenZD_2022_uncurated_Day_of_stool_collection_constipation
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULLYou can select samples by phenotype instead of study, and select a
few features across many samples, or many features across a few samples.
If you want many features across many samples, you will need to download
the data and query locally, which is covered in the
Piecewise Workflow vignette.
Inspect the result
Both examples return a TreeSummarizedExperiment object
with assays, sample metadata, and feature metadata:
# Inspect the taxonomic data from Example 1
assayNames(tse)
head(colData(tse)) # Rownames are sample unique identifiers (UUIDs)
head(rowData(tse)) # Rownames are taxonomic lineages
# Inspect the pathway data from Example 2
assayNames(tse_pathway)
head(rowData(tse_pathway)) # Rownames are pathway IDsIf this fails
If you ?returnSamples fails:
-
HTTP 429, slow internet access, or queries that will return a lot of data: download then use local parquet files withlocal_files. - To inspect SQL before executing: set
dry_run = TRUEand rundplyr::show_query().
Next step
- See
Full Workflowfor detailed configuration and discovery tooling. - See
Piecewise Workflowfor manual control over connection and loading steps. - See
Working with Large Parquet Filesfor strategies on querying large data types like genefamilies_stratified.
Reference
- Metadata curation: https://github.com/ASAP-MAC/parkinsonsManualCuration
- Nextflow pipeline for processing metagenomic data: https://github.com/ASAP-MAC/metagenomicsNextflowMAC
- Datasets found in the literature and efforts to obtain them: https://github.com/ASAP-MAC/parkinsons_data_search
sessionInfo()
#> R Under development (unstable) (2026-03-28 r89738)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] DT_0.34.0 dplyr_1.2.0
#> [3] TreeSummarizedExperiment_2.19.0 Biostrings_2.79.5
#> [5] XVector_0.51.0 SingleCellExperiment_1.33.2
#> [7] SummarizedExperiment_1.41.1 Biobase_2.71.0
#> [9] GenomicRanges_1.63.1 Seqinfo_1.1.0
#> [11] IRanges_2.45.0 S4Vectors_0.49.0
#> [13] BiocGenerics_0.57.0 generics_0.1.4
#> [15] MatrixGenerics_1.23.0 matrixStats_1.5.0
#> [17] parkinsonsMetagenomicData_0.99.0
#>
#> loaded via a namespace (and not attached):
#> [1] httr2_1.2.2 xfun_0.57 bslib_0.10.0
#> [4] htmlwidgets_1.6.4 lattice_0.22-9 tzdb_0.5.0
#> [7] crosstalk_1.2.2 yulab.utils_0.2.4 vctrs_0.7.2
#> [10] tools_4.7.0 curl_7.0.0 parallel_4.7.0
#> [13] tibble_3.3.1 blob_1.3.0 pkgconfig_2.0.3
#> [16] Matrix_1.7-5 dbplyr_2.5.2 desc_1.4.3
#> [19] assertthat_0.2.1 lifecycle_1.0.5 stringr_1.6.0
#> [22] compiler_4.7.0 treeio_1.35.0 textshaping_1.0.5
#> [25] codetools_0.2-20 htmltools_0.5.9 sass_0.4.10
#> [28] yaml_2.3.12 lazyeval_0.2.2 pkgdown_2.2.0
#> [31] pillar_1.11.1 crayon_1.5.3 jquerylib_0.1.4
#> [34] tidyr_1.3.2 BiocParallel_1.45.0 DelayedArray_0.37.0
#> [37] cachem_1.1.0 abind_1.4-8 nlme_3.1-169
#> [40] tidyselect_1.2.1 digest_0.6.39 stringi_1.8.7
#> [43] duckdb_1.5.1 purrr_1.2.1 arrow_23.0.1.2
#> [46] fastmap_1.2.0 grid_4.7.0 cli_3.6.5
#> [49] SparseArray_1.11.11 magrittr_2.0.4 S4Arrays_1.11.1
#> [52] utf8_1.2.6 ape_5.8-1 withr_3.0.2
#> [55] readr_2.2.0 rappdirs_0.3.4 bit64_4.6.0-1
#> [58] rmarkdown_2.31 bit_4.6.0 otel_0.2.0
#> [61] hms_1.1.4 ragg_1.5.2 evaluate_1.0.5
#> [64] knitr_1.51 rlang_1.1.7 Rcpp_1.1.1
#> [67] glue_1.8.0 tidytree_0.4.7 DBI_1.3.0
#> [70] vroom_1.7.0 jsonlite_2.0.0 R6_2.6.1
#> [73] systemfonts_1.3.2 fs_2.0.1