Parquet File Overview and Setup
Hugging Face
Individual pipeline output files have been combined into parquet files and hosted in the metagenomics_mac repo on Hugging Face. Smaller versions of those same files, comprising only 10 samples per file, are available in the Hugging Face repo metagenomics_mac_examples. Both repos are publicly accessible and can be read easily with DuckDB.
Standard Workflow
In the standard workflow, we use the main wrapper function
returnSamples. This function takes tables with sample and
feature information as well as a remote repo name or a vector of paths
to locally stored parquet files and retrieves the relevant data as a
TreeSummarizedExperiment.
Here is a brief overview of each argument to
returnSamples. Call ?returnSamples for
additional info.
- data_type: the output file type of interest
- sample_data: a table of sample metadata, used to specify which samples to retrieve data for
- feature_data: a table of feature data, used to specify how to filter the raw data
- repo: the identifier of a remote repo where the raw parquet files are stored
- local_files: paths of locally stored parquet files, as an alternative to retrieval from a remote repo
- include_empty_samples: sometimes none of the features specified in feature_data are found in one or more samples specified in sample_data. Should the samples still be included in the result with NA values for each feature (TRUE) or should they be omitted (FALSE)? Omitting these samples may increase response speed when accessing remote repos.
- dry_run: a dry run returns a tbl_duckdb_connection object that contains all of the SQL code necessary to return the requested data. This SQL can be viewed by passing the object into dplyr::show_query(). This is useful for evaluating query efficiency prior to actually executing.
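The effect of include_empty_samples can be illustrated on toy data. The sketch below does not call the package. The helper to_matrix() and all values are invented for illustration, and it only mimics the behaviour described above: a requested sample with no matching features is either kept as a column of NA values or dropped.

```r
# Toy long-format data: one row per (sample, feature) hit. All values are invented.
hits <- data.frame(
  uuid    = c("s1", "s1", "s2"),
  feature = c("f1", "f2", "f1"),
  value   = c(0.2, 0.5, 0.1)
)
requested_samples <- c("s1", "s2", "s3")  # "s3" has none of the requested features

# Hypothetical helper (not part of the package) that mimics the documented
# effect of include_empty_samples on the resulting feature-by-sample matrix.
to_matrix <- function(hits, samples, include_empty_samples = TRUE) {
  if (!include_empty_samples) {
    # Drop requested samples that have no hits at all
    samples <- intersect(samples, unique(hits$uuid))
  }
  features <- sort(unique(hits$feature))
  m <- matrix(NA_real_, nrow = length(features), ncol = length(samples),
              dimnames = list(features, samples))
  keep <- hits$uuid %in% samples
  # Fill observed values by (feature, sample) name; everything else stays NA
  m[cbind(hits$feature[keep], hits$uuid[keep])] <- hits$value[keep]
  m
}

with_empty    <- to_matrix(hits, requested_samples, include_empty_samples = TRUE)
without_empty <- to_matrix(hits, requested_samples, include_empty_samples = FALSE)
```

Here with_empty has three columns, with "s3" entirely NA, while without_empty has only the "s1" and "s2" columns.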
File Selection
Both remote and local parquet files can be queried, though not at the
same time. get_repo_info() gives the names and URLs of the
available repos, and get_hf_parquet_urls() gives
information on the files contained in those repos. Those files can also
be downloaded separately and then provided to the
local_files argument of returnSamples(). This
may be desirable if internet connection is unstable or limited, or if a
particular query runs into the repo’s rate limits.
At this point, we also want to make a note of which
data_type values we are interested in. The
data_type associated with each parquet file is listed in
the output of get_hf_parquet_urls(). For this example, we
will be looking at the ‘relative_abundance’ data type, which corresponds
to the files “relative_abundance_uuid.parquet” and
“relative_abundance_clade_name_species.parquet”. The difference between
these two files is the column by which they are sorted:
“relative_abundance_uuid.parquet” is sorted by the ‘uuid’ column, while
“relative_abundance_clade_name_species.parquet” is sorted by the
‘clade_name_species’ column. Internal functions will choose which one to
use, so we simply provide “relative_abundance” as our chosen data
type. If you are downloading the files locally to avoid rate limits,
download all of the files associated with your chosen data
type.
get_hf_parquet_urls(repo_name = "waldronlab/metagenomics_mac")
Exploring Data Types
For easier browsing, we can view just the unique data types with their descriptions:
available_data <- get_hf_parquet_urls(repo_name = "waldronlab/metagenomics_mac")
Use the search boxes to filter by keywords. For example:
- Search “pathway” to find pathway abundance and coverage data
- Search “stratified” to find species-level breakdowns
- Search “viral” to find viral cluster data
- Search “genefamilies” to find gene family data
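Outside the interactive table, the same browsing can be done with ordinary data frame operations. The sketch below uses a toy stand-in for the table returned by get_hf_parquet_urls(). Only the data_type column is named in the text above; the filename and description columns and all descriptions are assumptions made for illustration.

```r
# Toy stand-in for the output of get_hf_parquet_urls().
# Only `data_type` is named in the vignette; `filename` and `description`
# are assumed column names, and the descriptions are invented.
available_toy <- data.frame(
  filename = c("relative_abundance_uuid.parquet",
               "relative_abundance_clade_name_species.parquet",
               "pathcoverage_unstratified_uuid.parquet"),
  data_type = c("relative_abundance", "relative_abundance",
                "pathcoverage_unstratified"),
  description = c("Taxonomic relative abundance",
                  "Taxonomic relative abundance",
                  "Unstratified pathway coverage")
)

# One row per data type, regardless of how many files back it
unique_types <- unique(available_toy[, c("data_type", "description")])

# Keyword search, equivalent to typing "path" into a search box
pathway_types <- unique_types[grepl("path", unique_types$data_type), ]
```

With the real table, the column names should be checked with names(available_data) before adapting this.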
Sample Table
We can then examine the available samples and optionally select a
subset to query. This is done by browsing the
sampleMetadata object. Here is an example of how a set of
data might be selected for a meta-analysis with some sampling
parameters. Alternatively, this step can be left off altogether. If the
sample table is not provided when calling returnSamples(),
data for all samples will be returned.
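A minimal sketch of such a selection follows, run on a toy stand-in for sampleMetadata so that it works without the package. Only the uuid and study_name columns and the study name "ZhangM_2023" come from this vignette; the other columns and all values are invented. It mirrors the two steps used in the later examples: keep the samples from one study, then drop columns that contain missing values.

```r
# Toy stand-in for sampleMetadata. Values are invented; `age` and
# `other_study_field` are hypothetical columns used only for illustration.
sample_metadata_toy <- data.frame(
  uuid              = c("a1", "a2", "b1"),
  study_name        = c("ZhangM_2023", "ZhangM_2023", "OtherStudy_2020"),
  age               = c(61, 58, NA),
  other_study_field = c(NA, NA, "x")
)

# Step 1: keep samples from a single study
sample_table_toy <- sample_metadata_toy[
  sample_metadata_toy$study_name == "ZhangM_2023", ]

# Step 2: drop columns that contain any missing value in the remaining rows
complete_cols <- colSums(is.na(sample_table_toy)) == 0
sample_table_toy <- sample_table_toy[, complete_cols, drop = FALSE]
```

With the package loaded, the same two steps applied to sampleMetadata would give a table suitable for the sample_data argument.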
Feature Table
For the feature table, we first have to find the reference file that
goes with the data type we are interested in. To do this, call
get_ref_info(). Search the ‘general_data_type’ column for
your data type and identify the reference file of interest. For a more
granular view of what is in each file, you can call
parquet_colinfo() with your data type and see which columns
will be included. Here, we can see that “clade_name_ref” is associated
with “relative_abundance”.
Understanding Reference Table Structure
To see exactly which columns are available in the reference table and
understand how you can filter your features, use
parquet_colinfo(). This shows all columns in both the data
files and their associated reference table:
parquet_colinfo("relative_abundance")
This information is useful for:
- Filtering features: Identify which columns you can use to filter your feature table (e.g., clade_name_genus, clade_name_species, clade_name_terminal)
- Understanding data structure: See what metadata is available for each feature
- Verifying your reference: Confirm you’re using the correct reference table for your data type
Once we have found the correct reference file and understand its
structure, we can filter it to only contain features we are interested
in. Here, we are interested in values for all taxonomic nodes in the
genus “Faecalibacterium”. We load the reference file with
load_ref(), and use a basic grepl() filtering
method.
clade_name_ref <- load_ref("clade_name_ref")
feature_table <- clade_name_ref %>%
  filter(grepl("Faecalibacterium", clade_name_genus))
As an extra consideration, MetaPhlAn relative abundance output
includes aggregate values for each taxonomic level. We can see above
that the first row returned has a value of NA for the
“clade_name_species” column, and a few other rows have NA
in the “clade_name_terminal” column. You may want to remove these rows
based on your analysis. To do so, simply re-filter the
table on those columns.
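One way to do this re-filtering is sketched below on a toy table, so it runs without the package. The column names are taken from the reference table described above; the rows are illustrative only.

```r
# Toy stand-in for the filtered reference table; rows are illustrative only.
feature_table_toy <- data.frame(
  clade_name_genus    = rep("g__Faecalibacterium", 3),
  clade_name_species  = c(NA,
                          "s__Faecalibacterium_prausnitzii",
                          "s__Faecalibacterium_prausnitzii"),
  clade_name_terminal = c(NA, NA, "t__SGB15318")
)

# Keep only rows resolved down to the terminal level, dropping the
# genus-level and species-level aggregate rows
terminal_only <- feature_table_toy[
  !is.na(feature_table_toy$clade_name_species) &
    !is.na(feature_table_toy$clade_name_terminal), ]
```

On the real feature_table, the dplyr equivalent would be filter(!is.na(clade_name_species), !is.na(clade_name_terminal)).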
returnSamples()
We are now ready to retrieve our data. We simply pass our arguments
into returnSamples(). It may take a minute, depending on
the amount of data requested or your available resources.
experiment <- returnSamples(data_type = "relative_abundance",
sample_data = sample_table,
feature_data = feature_table,
repo = "waldronlab/metagenomics_mac",
local_files = NULL,
include_empty_samples = TRUE,
dry_run = FALSE)
experiment
#> class: TreeSummarizedExperiment
#> dim: 9 24
#> metadata(0):
#> assays(1): relative_abundance
#> rownames(9):
#> k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_SGB15346|t__SGB15346
#> k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15318
#> ...
#> k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15323
#> k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_sp_CLA_AA_H233|t__SGB15315
#> rowData names(19): clade_name clade_name_kingdom ...
#> NCBI_tax_id_terminal additional_species
#> colnames(24): fe3de3ca-3a14-4bd8-ae1c-0dad69edc9cd
#> 39ddb5e7-97f6-4d3c-812b-9653b03f99b3 ...
#> 1406666f-04a8-43c9-983b-4ed62fd6da4a
#> 677be4e3-722b-4e43-bd5a-36d8fbed6f86
#> colData names(56): uuid db_version ...
#> ZhangM_2023_uncurated_Sample.Name ZhangM_2023_uncurated_SRA.Study
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL
If this function fails due to rate limiting
(HTTP 429 error), hangs, or simply takes longer than you would like,
you can download the files associated with the data type you are
interested in and supply their paths to the “local_files” argument in
lieu of specifying the “repo” argument. Recall that you can use
get_hf_parquet_urls(repo_name = "waldronlab/metagenomics_mac")
to find URLs to download full files.
Here’s a working example using the small example parquet files included in the package:
# Locate the example parquet files included in the package
local_files <- c(
file.path(system.file("extdata", package = "parkinsonsMetagenomicData"),
"pathcoverage_unstratified_pathway.parquet"),
file.path(system.file("extdata", package = "parkinsonsMetagenomicData"),
"pathcoverage_unstratified_uuid.parquet")
)
# Load the pathway reference for pathcoverage data
pathway_ref_local <- load_ref("pathway_ref",
file_path = file.path(system.file("extdata",
package = "parkinsonsMetagenomicData"),
"pathway_ref.parquet"))
# Select a few pathways
feature_table_local <- pathway_ref_local |>
head(5) |>
select(pathway)
# The example files contain data for 10 samples
# For this example, we can query all samples by omitting sample_data,
# or specify a subset using UUIDs
sample_uuids <- c("8793b1dc-3ba1-4591-82b8-4297adcfa1d7",
"cc1f30a0-45d9-41b1-b592-7d0892919ee7",
"fb7e8210-002a-4554-b265-873c4003e25f")
sample_table_local <- data.frame(uuid = sample_uuids)
# Query the local files
experiment_local <- returnSamples(
data_type = "pathcoverage_unstratified",
sample_data = sample_table_local,
feature_data = feature_table_local,
repo = NULL,
local_files = local_files,
include_empty_samples = TRUE,
dry_run = FALSE
)
experiment_local
#> class: TreeSummarizedExperiment
#> dim: 1 3
#> metadata(0):
#> assays(1): coverage
#> rownames(1): 3-HYDROXYPHENYLACETATE-DEGRADATION-PWY:
#> 4-hydroxyphenylacetate degradation
#> rowData names(1): pathway
#> colnames(3): 8793b1dc-3ba1-4591-82b8-4297adcfa1d7
#> fb7e8210-002a-4554-b265-873c4003e25f
#> cc1f30a0-45d9-41b1-b592-7d0892919ee7
#> colData names(263): uuid humann_header ...
#> WallenZD_2022_uncurated_Day_of_stool_collection_digestion_issue
#> WallenZD_2022_uncurated_Day_of_stool_collection_constipation
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL
This demonstrates the complete local file workflow using the small example files included with the package. The example files contain 10 samples and can be queried just like full-sized parquet files. For larger datasets, you would download the full parquet files from Hugging Face and provide their paths in the same way.
Finally, you can check exactly which query is being called on the raw
parquet data by passing “TRUE” to the “dry_run” argument. This returns a
tbl_duckdb_connection object that can be passed to
dplyr::show_query(). To demonstrate:
query_only <- returnSamples(data_type = "relative_abundance",
sample_data = sample_table,
feature_data = feature_table,
repo = "waldronlab/metagenomics_mac",
local_files = NULL,
include_empty_samples = FALSE,
dry_run = TRUE)
dplyr::show_query(query_only)
#> <SQL>
#> SELECT q01.*
#> FROM (
#> SELECT relative_abundance_clade_name_species.*
#> FROM relative_abundance_clade_name_species
#> WHERE (clade_name_species = 's__Faecalibacterium_SGB15346')
#>
#> UNION ALL
#>
#> SELECT relative_abundance_clade_name_species.*
#> FROM relative_abundance_clade_name_species
#> WHERE (clade_name_species = 's__Faecalibacterium_prausnitzii')
#>
#> UNION ALL
#>
#> SELECT relative_abundance_clade_name_species.*
#> FROM relative_abundance_clade_name_species
#> WHERE (clade_name_species = 's__Faecalibacterium_sp_An122')
#>
#> UNION ALL
#>
#> SELECT relative_abundance_clade_name_species.*
#> FROM relative_abundance_clade_name_species
#> WHERE (clade_name_species = 's__Faecalibacterium_sp_CLA_AA_H233')
#>
#> UNION ALL
#>
#> SELECT relative_abundance_clade_name_species.*
#> FROM relative_abundance_clade_name_species
#> WHERE (clade_name_species = 's__Faecalibacterium_sp_HTFF')
#> ) q01
#> WHERE
#> (clade_name_kingdom = 'k__Bacteria') AND
#> (clade_name_phylum = 'p__Firmicutes') AND
#> (clade_name_class = 'c__Clostridia') AND
#> (clade_name_order = 'o__Eubacteriales') AND
#> (clade_name_family = 'f__Oscillospiraceae') AND
#> (clade_name_genus = 'g__Faecalibacterium') AND
#> (NCBI_tax_id_kingdom = '2') AND
#> (NCBI_tax_id_phylum = '1239') AND
#> (NCBI_tax_id_class = '186801') AND
#> (NCBI_tax_id_order = '186802') AND
#> (NCBI_tax_id_family = '216572') AND
#> (NCBI_tax_id_genus = '216851') AND
#> (NCBI_tax_id_terminal = '') AND
#> (NCBI_tax_id IN ('2|1239|186801|186802|216572|216851||', '2|1239|186801|186802|216572|216851|853|', '2|1239|186801|186802|216572|216851|1965551|', '2|1239|186801|186802|216572|216851|2881266|', '2|1239|186801|186802|216572|216851|2929491|')) AND
#> (NCBI_tax_id_species IN ('', '853', '1965551', '2881266', '2929491')) AND
#> (clade_name IN ('k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_SGB15346|t__SGB15346', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15316', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15317', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15318', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15322', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15323', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15332', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15339', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15342', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_sp_An122|t__SGB15312', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_sp_CLA_AA_H233|t__SGB15315', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_sp_HTFF|t__SGB15340')) AND
#> (clade_name_terminal IN ('t__SGB15346', 't__SGB15316', 't__SGB15317', 't__SGB15318', 't__SGB15322', 't__SGB15323', 't__SGB15332', 't__SGB15339', 't__SGB15342', 't__SGB15312', 't__SGB15315', 't__SGB15340')) AND
#> (uuid IN ('0807eb2a-a15e-4647-8e19-2600d8fda378', 'e0fbb54f-0249-4917-a4d7-bd68acb89c62', '25172837-2849-4db3-be91-d54d6a815d00', '39ddb5e7-97f6-4d3c-812b-9653b03f99b3', '7b152a7d-e244-4e2b-b924-7195c7ecfb10', 'dd30f93b-7999-47a4-93fb-21971b899939', '1406666f-04a8-43c9-983b-4ed62fd6da4a', 'fe3de3ca-3a14-4bd8-ae1c-0dad69edc9cd', '8707e374-5ddb-4220-8cbf-364b8b0e7be1', '22848a9c-66a6-4993-9058-cb6464edb42f', '08e2b754-78e2-4cb4-8ff2-95fd7b0ff44a', '9baef0b2-93d2-4a40-8082-d357c7f8156a', '09a9303d-d87d-4556-9672-04cbbcaf3d37', 'ac9f3532-90d8-412c-9c80-491037f0bcc2', 'eda61949-02dc-40ae-8dbe-bea2add85a52', '1f007260-be6c-4a21-800a-ad9c36129a0d', 'e47a59bb-443a-405f-9c5d-02659d80e9e5', 'b3eaf3ab-43ef-4830-ab6d-12bafed3c61e', '28f7352f-fe23-4003-93e1-41f4fedc6232', 'b07e2362-5851-4181-ba9a-15d9109ee4dd', '677be4e3-722b-4e43-bd5a-36d8fbed6f86', '0c817272-f873-475f-a401-dfe46a679a9f', '7a3945d9-21bb-434a-9a4e-bfcdeb6194de', '56aa2ad5-007d-407c-a644-48aac1e9a8f0'))
Additional Data Type Examples
The following examples demonstrate retrieving different types of metagenomic data beyond taxonomic relative abundance. Each showcases unique features of the package and different bioinformatics tools.
Example 2: Pathway Abundance (Stratified vs Unstratified)
HUMAnN pathway data comes in two forms:
- Unstratified: Total pathway abundance across all contributing organisms
- Stratified: Pathway abundance broken down by individual species
This example shows how to retrieve unstratified pathway abundance for a butyrate biosynthesis pathway, which is relevant to gut-brain health in Parkinson’s disease.
Retrieve Unstratified Pathway Data
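# Load the pathway reference (skip if pathway_ref is already loaded)
pathway_ref <- load_ref("pathway_ref")
# Select a single pathway
feature_table_pathway <- pathway_ref |>
  filter(pathway == "PWY-5676: acetyl-CoA fermentation to butanoate II") |>
  select(pathway)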
# Use samples from a single study
sample_table_pathway <- sampleMetadata |>
filter(study_name == "ZhangM_2023") |>
select(where(~ !any(is.na(.x))))
# Retrieve unstratified pathway abundance
tse_pathway <- returnSamples(
data_type = "pathabundance_unstratified",
sample_data = sample_table_pathway,
feature_data = feature_table_pathway,
repo = "waldronlab/metagenomics_mac",
include_empty_samples = TRUE,
dry_run = FALSE
)
tse_pathway
#> class: TreeSummarizedExperiment
#> dim: 1 24
#> metadata(0):
#> assays(1): abundance
#> rownames(1): PWY-5676: acetyl-CoA fermentation to butanoate II
#> rowData names(1): pathway
#> colnames(24): 56aa2ad5-007d-407c-a644-48aac1e9a8f0
#> 677be4e3-722b-4e43-bd5a-36d8fbed6f86 ...
#> eda61949-02dc-40ae-8dbe-bea2add85a52
#> 25172837-2849-4db3-be91-d54d6a815d00
#> colData names(52): uuid humann_header ...
#> ZhangM_2023_uncurated_Sample.Name ZhangM_2023_uncurated_SRA.Study
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL
Understanding Stratified Data
To see which bacterial species contribute to this
pathway’s abundance, use
data_type = "pathabundance_stratified" instead. Stratified
pathways have the format PATHWAY|SPECIES, for example:
PWY-5676: acetyl-CoA fermentation to butanoate II|g__Faecalibacterium.s__Faecalibacterium_prausnitzii
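As a small illustration of this format, the example string above can be split into its pathway and organism parts with base R. This sketch assumes that the pathway name itself contains no "|" and that the organism label joins genus and species with a ".", as in the example.

```r
stratified <- paste0(
  "PWY-5676: acetyl-CoA fermentation to butanoate II",
  "|g__Faecalibacterium.s__Faecalibacterium_prausnitzii"
)

# Split on the literal "|" that separates pathway from contributing organism
parts   <- strsplit(stratified, "|", fixed = TRUE)[[1]]
pathway <- parts[1]
taxon   <- parts[2]

# The organism label joins genus and species with a literal "."
taxon_parts <- strsplit(taxon, ".", fixed = TRUE)[[1]]
genus   <- taxon_parts[1]
species <- taxon_parts[2]
```

Labels without a "." (for instance an "unclassified" contributor) would leave species as NA under this approach.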
# Example: Get stratified version (species-level contributions)
# Note: This may return more rows as each pathway-species pair is a feature
feature_table_stratified <- pathway_ref |>
filter(grepl("^PWY-5676:", pathway)) |>
filter(grepl("Faecalibacterium", pathway)) |> # Just Faecalibacterium contributions
select(pathway)
tse_pathway_stratified <- returnSamples(
data_type = "pathabundance_stratified",
sample_data = sample_table_pathway,
feature_data = feature_table_stratified,
repo = "waldronlab/metagenomics_mac",
include_empty_samples = TRUE,
dry_run = FALSE
)
Example 3: Viral Clusters
MetaPhlAn can identify viral sequences in metagenomic samples. This example retrieves viral cluster data.
Retrieve Viral Cluster Data
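# Select a few viral genomes to query
# (genome_name_ref is assumed to be loaded already with load_ref();
# see get_ref_info() for the matching reference name)
feature_table_viral <- genome_name_ref |>
  head(5) |> # Select first 5 viral genomes
  select(genome_name)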
# Use the same sample set
tse_viral <- returnSamples(
data_type = "viral_clusters",
sample_data = sample_table,
feature_data = feature_table_viral,
repo = "waldronlab/metagenomics_mac",
include_empty_samples = TRUE,
dry_run = FALSE
)
#> 0 rows returned but empty samples exist. TreeSummarizedExperiment will include colData as applicable.
tse_viral
#> class: TreeSummarizedExperiment
#> dim: 0 24
#> metadata(0):
#> assays(3): breadth_of_coverage depth_of_coverage_mean
#> depth_of_coverage_median
#> rownames(0):
#> rowData names(6): genome_name m_group_cluster ...
#> first_genome_in_cluster other_genomes
#> colnames(24): 0807eb2a-a15e-4647-8e19-2600d8fda378
#> 09a9303d-d87d-4556-9672-04cbbcaf3d37 ...
#> e47a59bb-443a-405f-9c5d-02659d80e9e5
#> fe3de3ca-3a14-4bd8-ae1c-0dad69edc9cd
#> colData names(55): uuid db_version ...
#> ZhangM_2023_uncurated_Sample.Name ZhangM_2023_uncurated_SRA.Study
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL
Example 4: Gene Family Abundance
HUMAnN quantifies gene families (groups of homologous genes) using the UniRef database. This example shows gene family abundance data.
Note: Gene family queries can be large. For demonstration, we’ll query a small number of gene families.
Discover Gene Family Reference
gene_family_ref <- load_ref("gene_family_ref")
# Browse the structure
gene_family_ref |>
select(gene_family) |>
head(20)
#> # A tibble: 20 × 1
#> gene_family
#> <chr>
#> 1 UNMAPPED
#> 2 UniRef90_A0A009EC87
#> 3 UniRef90_A0A009EC87|unclassified
#> 4 UniRef90_A0A009EHH0
#> 5 UniRef90_A0A009EHH0|unclassified
#> 6 UniRef90_A0A009EMH9
#> 7 UniRef90_A0A009EMH9|unclassified
#> 8 UniRef90_A0A009EQY8
#> 9 UniRef90_A0A009EQY8|unclassified
#> 10 UniRef90_A0A009ES08
#> 11 UniRef90_A0A009ES08|unclassified
#> 12 UniRef90_A0A009EU90
#> 13 UniRef90_A0A009EU90|unclassified
#> 14 UniRef90_A0A009EY40
#> 15 UniRef90_A0A009EY40|unclassified
#> 16 UniRef90_A0A009EY59
#> 17 UniRef90_A0A009EY59|unclassified
#> 18 UniRef90_A0A009F206
#> 19 UniRef90_A0A009F206|unclassified
#> 20 UniRef90_A0A009F5F7
Retrieve Gene Family Data
# Select specific gene families (using a small set for performance)
feature_table_genes <- gene_family_ref |>
head(10) |> # Select first 10 gene families
select(gene_family)
# Use fewer samples for gene family queries (they can be large)
sample_table_small <- sampleMetadata |>
filter(study_name == "ZhangM_2023") |>
head(5) |> # Use only 5 samples
select(where(~ !any(is.na(.x))))
tse_genes <- returnSamples(
data_type = "genefamilies_unstratified",
sample_data = sample_table_small,
feature_data = feature_table_genes,
repo = "waldronlab/metagenomics_mac",
include_empty_samples = TRUE,
dry_run = FALSE
)
#> 'genefamilies_unstratified' is a large data type, and collecting the query can take a while. To avoid going through the Hugging Face API, download the source file hf://datasets/waldronlab/metagenomics_mac/genefamilies_unstratified_uuid.parquet and provide it to accessParquetData() in the 'local files' argument.
tse_genes
#> class: TreeSummarizedExperiment
#> dim: 1 5
#> metadata(0):
#> assays(1): rpk_abundance
#> rownames(1): UNMAPPED
#> rowData names(1): gene_family
#> colnames(5): 0807eb2a-a15e-4647-8e19-2600d8fda378
#> e0fbb54f-0249-4917-a4d7-bd68acb89c62
#> 25172837-2849-4db3-be91-d54d6a815d00
#> 39ddb5e7-97f6-4d3c-812b-9653b03f99b3
#> 7b152a7d-e244-4e2b-b924-7195c7ecfb10
#> colData names(52): uuid humann_header ...
#> ZhangM_2023_uncurated_Sample.Name ZhangM_2023_uncurated_SRA.Study
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL
Large Query Considerations: Gene family stratified
data (genefamilies_stratified) can be very large. For large
queries, consider:
1. Download parquet files locally and use them via the local_files argument
2. Use very selective filters (specific gene families + specific samples)
3. Query in batches
4. See the Working with Large Parquet Files vignette for more strategies
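Querying in batches could look roughly like the sketch below. The helper make_batches() and the placeholder IDs are hypothetical and not part of the package; the sketch shows only the chunking step, and the call to returnSamples() for each batch is left as a comment because it requires the package and the data files.

```r
# Hypothetical helper (not part of the package): split a vector of feature
# names into consecutive batches of at most `batch_size` elements.
make_batches <- function(x, batch_size) {
  split(x, ceiling(seq_along(x) / batch_size))
}

gene_families <- paste0("UniRef90_example_", 1:7)  # placeholder IDs, not real
batches <- make_batches(gene_families, batch_size = 3)

# Each batch would then become its own feature table and query, e.g.
#   returnSamples(data_type = "genefamilies_stratified",
#                 feature_data = data.frame(gene_family = batch), ...)
# with the per-batch results combined afterwards.
batch_sizes <- lengths(batches)
```

Smaller batches mean more requests but less data per request, so the batch size is a trade-off against the repo's rate limits.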
Comparing Data Types
Here’s a summary of what we retrieved in these examples:
| Example | Data Type | Tool | Features | What It Measures |
|---|---|---|---|---|
| 1 | relative_abundance | MetaPhlAn | Bacterial taxa | Which bacteria are present and their relative abundance |
| 2 | pathabundance_unstratified | HUMAnN | Metabolic pathways | Total abundance of metabolic pathways |
| 3 | viral_clusters | MetaPhlAn | Viral genomes | Presence/abundance of viral sequences |
| 4 | genefamilies_unstratified | HUMAnN | Gene families | Abundance of functional gene groups |
All return TreeSummarizedExperiment objects with the
same structure, making it easy to apply consistent analysis workflows
across different data types.
sessionInfo()
#> R Under development (unstable) (2026-03-28 r89738)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] DT_0.34.0 DBI_1.3.0
#> [3] dplyr_1.2.0 parkinsonsMetagenomicData_0.99.0
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.41.1 httr2_1.2.2
#> [3] xfun_0.57 bslib_0.10.0
#> [5] htmlwidgets_1.6.4 Biobase_2.71.0
#> [7] lattice_0.22-9 tzdb_0.5.0
#> [9] crosstalk_1.2.2 yulab.utils_0.2.4
#> [11] vctrs_0.7.2 tools_4.7.0
#> [13] generics_0.1.4 curl_7.0.0
#> [15] stats4_4.7.0 parallel_4.7.0
#> [17] tibble_3.3.1 blob_1.3.0
#> [19] pkgconfig_2.0.3 Matrix_1.7-5
#> [21] dbplyr_2.5.2 desc_1.4.3
#> [23] S4Vectors_0.49.0 assertthat_0.2.1
#> [25] lifecycle_1.0.5 stringr_1.6.0
#> [27] compiler_4.7.0 treeio_1.35.0
#> [29] textshaping_1.0.5 Biostrings_2.79.5
#> [31] Seqinfo_1.1.0 codetools_0.2-20
#> [33] htmltools_0.5.9 sass_0.4.10
#> [35] yaml_2.3.12 lazyeval_0.2.2
#> [37] pkgdown_2.2.0 pillar_1.11.1
#> [39] crayon_1.5.3 jquerylib_0.1.4
#> [41] tidyr_1.3.2 BiocParallel_1.45.0
#> [43] SingleCellExperiment_1.33.2 DelayedArray_0.37.0
#> [45] cachem_1.1.0 abind_1.4-8
#> [47] nlme_3.1-169 tidyselect_1.2.1
#> [49] digest_0.6.39 stringi_1.8.7
#> [51] duckdb_1.5.1 purrr_1.2.1
#> [53] arrow_23.0.1.2 TreeSummarizedExperiment_2.19.0
#> [55] fastmap_1.2.0 grid_4.7.0
#> [57] cli_3.6.5 SparseArray_1.11.11
#> [59] magrittr_2.0.4 S4Arrays_1.11.1
#> [61] utf8_1.2.6 ape_5.8-1
#> [63] withr_3.0.2 readr_2.2.0
#> [65] rappdirs_0.3.4 bit64_4.6.0-1
#> [67] rmarkdown_2.31 XVector_0.51.0
#> [69] matrixStats_1.5.0 bit_4.6.0
#> [71] otel_0.2.0 hms_1.1.4
#> [73] ragg_1.5.2 evaluate_1.0.5
#> [75] knitr_1.51 GenomicRanges_1.63.1
#> [77] IRanges_2.45.0 rlang_1.1.7
#> [79] Rcpp_1.1.1 glue_1.8.0
#> [81] tidytree_0.4.7 BiocGenerics_0.57.0
#> [83] vroom_1.7.0 jsonlite_2.0.0
#> [85] R6_2.6.1 MatrixGenerics_1.23.0
#> [87] systemfonts_1.3.2 fs_2.0.1