Parquet File Overview and Setup

Hugging Face

Individual pipeline output files have been combined into parquet files and hosted in the metagenomics_mac repo on Hugging Face. Smaller versions of those same files, each comprising only 10 samples, are available in the Hugging Face repo metagenomics_mac_examples. Both repos are publicly accessible and can be read directly with DuckDB.

DuckDB in R

DuckDB is very easy to use in R through the duckdb R package. The DBI and dplyr/dbplyr packages combine with it to provide a streamlined way to work with remote data by selectively querying it before bringing it into your R session.
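
As a minimal sketch of this pattern (not the package's internal code), the snippet below writes a toy parquet file and then lets DuckDB apply the filter, so that only the matching rows reach the R session. The table contents are made up for illustration, and the DBI and duckdb packages are assumed to be installed.

```r
library(DBI)

# In-memory DuckDB connection; no database file is created
con <- dbConnect(duckdb::duckdb())

# Toy stand-in for a parquet file (made-up values, not real package data)
toy <- data.frame(
    uuid = c("s1", "s1", "s2", "s2"),
    clade_name_species = c("sp_A", "sp_B", "sp_A", "sp_B"),
    relative_abundance = c(10.5, 0, 3.2, 7.1)
)
toy_path <- tempfile(fileext = ".parquet")
dbWriteTable(con, "toy", toy)
dbExecute(con, sprintf("COPY (SELECT * FROM toy) TO '%s' (FORMAT PARQUET)",
                       toy_path))

# The WHERE clause runs inside DuckDB; only matching rows are returned to R
hits <- dbGetQuery(con, sprintf(paste(
    "SELECT uuid, relative_abundance",
    "FROM read_parquet('%s')",
    "WHERE clade_name_species = 'sp_A'",
    "ORDER BY uuid"), toy_path))

dbDisconnect(con, shutdown = TRUE)
unlink(toy_path)
```

The same idea applies to remote files: the query is pushed to the data, and only the result is collected.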

Standard Workflow

In the standard workflow, we use the main wrapper function returnSamples. This function takes tables with sample and feature information as well as a remote repo name or a vector of paths to locally stored parquet files and retrieves the relevant data as a TreeSummarizedExperiment.

Here is a brief overview of each argument to returnSamples(). Call ?returnSamples for additional information.

  • data_type: the output file type of interest
  • sample_data: a table of sample metadata, used to specify which samples to retrieve data for
  • feature_data: a table of feature data, used to specify how to filter the raw data
  • repo: the identifier of a remote repo where the raw parquet files are stored
  • local_files: paths of locally stored parquet files, as an alternative to retrieval from a remote repo
  • include_empty_samples: sometimes none of the features specified in feature_data are found in one or more samples specified in sample_data. Should the samples still be included in the result with NA values for each feature (TRUE) or should they be omitted (FALSE)? Omitting these samples may increase response speed when accessing remote repos.
  • dry_run: a dry run returns a tbl_duckdb_connection object that contains all of the SQL code necessary to return the requested data. This SQL can be viewed by passing the object into dplyr::show_query(). This is useful for evaluating query efficiency prior to actually executing.
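
To make the effect of include_empty_samples concrete, here is a toy sketch in base R. The helper to_matrix() is hypothetical and is not part of the package; it only mimics the described behavior on made-up data, where sample "s3" has no rows for the requested feature.

```r
# Toy long-format query result: sample "s3" has no matching rows
long <- data.frame(
    uuid = c("s1", "s2"),
    feature = c("f1", "f1"),
    value = c(0.4, 1.2)
)
requested_samples <- c("s1", "s2", "s3")

# Hypothetical helper (not a package function): pivot to features x samples
to_matrix <- function(long, samples, include_empty_samples = TRUE) {
    found <- intersect(samples, unique(long$uuid))
    keep <- if (include_empty_samples) samples else found
    long <- long[long$uuid %in% keep, , drop = FALSE]
    features <- unique(long$feature)
    m <- matrix(NA_real_, nrow = length(features), ncol = length(keep),
                dimnames = list(features, keep))
    # Fill in the observed (feature, sample) pairs; the rest stay NA
    m[cbind(match(long$feature, features), match(long$uuid, keep))] <- long$value
    m
}

with_empty <- to_matrix(long, requested_samples, include_empty_samples = TRUE)
without_empty <- to_matrix(long, requested_samples, include_empty_samples = FALSE)
```

With TRUE, "s3" is kept as a column filled with NA; with FALSE, it is dropped from the result.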

File Selection

Both remote and local parquet files can be queried, though not at the same time. get_repo_info() gives the names and URLs of the available repos, and get_hf_parquet_urls() gives information on the files contained in those repos. Those files can also be downloaded separately and then provided to the local_files argument of returnSamples(). This may be desirable if your internet connection is unstable or limited, or if a particular query runs into the repo’s rate limits.

At this point, we also want to note which data_type values we are interested in. The data_type associated with each parquet file is listed in the output of get_hf_parquet_urls(). For this example, we will be looking at the ‘relative_abundance’ data type, which corresponds to the files “relative_abundance_uuid.parquet” and “relative_abundance_clade_name_species.parquet”. The difference between these two files is the column by which they are sorted: “relative_abundance_uuid.parquet” is sorted by the ‘uuid’ column, while “relative_abundance_clade_name_species.parquet” is sorted by the ‘clade_name_species’ column. Internal functions choose which file to use, so we simply provide “relative_abundance” as our data type. If you are downloading the files locally to avoid rate limits, download all of the files associated with your chosen data type.
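
If you do want local copies, the download step might look like the sketch below. It assumes the files sit at the root of the repo's main branch and uses the usual Hugging Face "resolve" URL pattern; in practice, prefer the URLs reported by get_hf_parquet_urls(). The destination directory is an arbitrary choice, and the download itself is switched off by default because the full files may be large.

```r
repo <- "waldronlab/metagenomics_mac"
files <- c("relative_abundance_uuid.parquet",
           "relative_abundance_clade_name_species.parquet")

# Assumed URL pattern: https://huggingface.co/datasets/<repo>/resolve/main/<file>
urls <- sprintf("https://huggingface.co/datasets/%s/resolve/main/%s",
                repo, files)

# Arbitrary destination for this sketch
dest_dir <- file.path(tempdir(), "parquet_files")
local_files <- file.path(dest_dir, files)

# Set to TRUE to actually download
do_download <- FALSE
if (do_download) {
    dir.create(dest_dir, showWarnings = FALSE, recursive = TRUE)
    for (i in seq_along(urls)) {
        # mode = "wb" keeps the binary parquet files intact on Windows
        download.file(urls[i], destfile = local_files[i], mode = "wb")
    }
}
```

The resulting local_files vector is what would be passed to the local_files argument of returnSamples().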

get_hf_parquet_urls(repo_name = "waldronlab/metagenomics_mac")

Exploring Data Types

For easier browsing, we can view just the unique data types with their descriptions:

available_data <- get_hf_parquet_urls(repo_name = "waldronlab/metagenomics_mac")

Use the search boxes to filter by keywords. For example:

  • Search “pathway” to find pathway abundance and coverage data
  • Search “stratified” to find species-level breakdowns
  • Search “viral” to find viral cluster data
  • Search “genefamilies” to find gene family data
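
The same keyword search can be done without the interactive table. The sketch below runs on a toy data frame: the column names data_type and description, and the description text, are assumptions made for illustration, so check the actual columns of available_data before adapting it.

```r
# Toy stand-in for the file listing (assumed columns, made-up descriptions)
available_toy <- data.frame(
    filename = c("relative_abundance_uuid.parquet",
                 "relative_abundance_clade_name_species.parquet",
                 "pathabundance_unstratified_uuid.parquet",
                 "viral_clusters_uuid.parquet"),
    data_type = c("relative_abundance", "relative_abundance",
                  "pathabundance_unstratified", "viral_clusters"),
    description = c("toy: taxonomic relative abundance",
                    "toy: taxonomic relative abundance",
                    "toy: unstratified pathway abundance",
                    "toy: viral cluster coverage")
)

# One row per data type, regardless of how many files back it
unique_types <- unique(available_toy[, c("data_type", "description")])

# Case-insensitive keyword search over the descriptions
pathway_types <- unique_types[grepl("pathway", unique_types$description,
                                    ignore.case = TRUE), ]
```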

Sample Table

We can then examine the available samples and optionally select a subset to query. This is done by browsing the sampleMetadata object. Here is an example of how a set of data might be selected for a meta-analysis with some sampling parameters. Alternatively, this step can be left off altogether. If the sample table is not provided when calling returnSamples(), data for all samples will be returned.

library(parkinsonsMetagenomicData)
library(dplyr)  # provides filter(), select(), and where()
data("sampleMetadata", package = "parkinsonsMetagenomicData")
sample_table <- sampleMetadata |>
    filter(study_name == "ZhangM_2023") |>
    select(where(~ !any(is.na(.x))))
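
The final select() step keeps only columns with no missing values among the selected samples. A base R sketch on made-up data shows the effect; the study names and columns here are toy values, not real metadata.

```r
# Toy metadata: "age" is missing for one StudyX sample, "sex" only for StudyY
toy_samples <- data.frame(
    uuid = c("a", "b", "c"),
    study_name = c("StudyX", "StudyX", "StudyY"),
    age = c(61, NA, 70),
    sex = c("F", "M", NA)
)

# Filter to one study first, then drop columns that contain any NA
subset_x <- toy_samples[toy_samples$study_name == "StudyX", ]
complete_cols <- colSums(is.na(subset_x)) == 0
sample_table_toy <- subset_x[, complete_cols, drop = FALSE]

names(sample_table_toy)
```

Because the study filter runs first, a column is dropped only if it has missing values within that study: "age" is removed, while "sex" is kept.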

Feature Table

For the feature table, we first have to find the reference file that goes with the data type we are interested in. To do this, call get_ref_info(). Search the ‘general_data_type’ column for your data type and identify the reference file of interest. For a more granular view of what is in each file, you can call parquet_colinfo() with your data type and see which columns will be included. In this case, “clade_name_ref” is the reference file associated with “relative_abundance”.

Understanding Reference Table Structure

To see exactly which columns are available in the reference table and understand how you can filter your features, use parquet_colinfo(). This shows all columns in both the data files and their associated reference table:

parquet_colinfo("relative_abundance")

This information is useful for:

  • Filtering features: Identify which columns you can use to filter your feature table (e.g., clade_name_genus, clade_name_species, clade_name_terminal)
  • Understanding data structure: See what metadata is available for each feature
  • Verifying your reference: Confirm you’re using the correct reference table for your data type

Once we have found the correct reference file and understand its structure, we can filter it to only contain features we are interested in. Here, we are interested in values for all taxonomic nodes in the genus “Faecalibacterium”. We load the reference file with load_ref(), and use a basic grepl() filtering method.

clade_name_ref <- load_ref("clade_name_ref")
feature_table <- clade_name_ref %>%
    filter(grepl("Faecalibacterium", clade_name_genus))

As an extra consideration, MetaPhlAn relative abundance output includes aggregate values for each taxonomic level. In the filtered table, the first row has NA in the “clade_name_species” column, and a few other rows have NA in the “clade_name_terminal” column. Depending on your analysis, you may want to remove these rows. To do so, we simply re-filter the reference table:

feature_table <- clade_name_ref %>%
    filter(grepl("Faecalibacterium", clade_name_genus)) %>%
    filter(!is.na(clade_name_species)) %>%
    filter(!is.na(clade_name_terminal))
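
The sketch below reproduces this filtering on a toy reference table in base R. It uses an exact match on the genus value (the "g__" prefix follows the values visible in the package's query output) as a stricter alternative to grepl(). The rows, including "g__OtherGenus" and "t__SGB00000", are made up for illustration.

```r
# Toy reference table: one genus-level aggregate row (NA species and terminal),
# one species-level aggregate row (NA terminal), and terminal rows
toy_ref <- data.frame(
    clade_name_genus = c("g__Faecalibacterium", "g__Faecalibacterium",
                         "g__Faecalibacterium", "g__OtherGenus"),
    clade_name_species = c(NA, "s__Faecalibacterium_prausnitzii",
                           "s__Faecalibacterium_prausnitzii",
                           "s__Other_species"),
    clade_name_terminal = c(NA, NA, "t__SGB15318", "t__SGB00000")
)

# Exact match; %in% treats NA genus values as non-matching
in_genus <- toy_ref$clade_name_genus %in% "g__Faecalibacterium"
genus_rows <- toy_ref[in_genus, ]

# Drop the aggregate rows, keeping only fully resolved terminal nodes
terminal_rows <- genus_rows[!is.na(genus_rows$clade_name_species) &
                            !is.na(genus_rows$clade_name_terminal), ]
```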

returnSamples()

We are now ready to retrieve our data. We simply pass our arguments into returnSamples(). It may take a minute, depending on the amount of data requested or your available resources.

experiment <- returnSamples(data_type = "relative_abundance",
                            sample_data = sample_table,
                            feature_data = feature_table,
                            repo = "waldronlab/metagenomics_mac",
                            local_files = NULL,
                            include_empty_samples = TRUE,
                            dry_run = FALSE)
experiment
#> class: TreeSummarizedExperiment 
#> dim: 9 24 
#> metadata(0):
#> assays(1): relative_abundance
#> rownames(9):
#>   k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_SGB15346|t__SGB15346
#>   k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15318
#>   ...
#>   k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15323
#>   k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_sp_CLA_AA_H233|t__SGB15315
#> rowData names(19): clade_name clade_name_kingdom ...
#>   NCBI_tax_id_terminal additional_species
#> colnames(24): fe3de3ca-3a14-4bd8-ae1c-0dad69edc9cd
#>   39ddb5e7-97f6-4d3c-812b-9653b03f99b3 ...
#>   1406666f-04a8-43c9-983b-4ed62fd6da4a
#>   677be4e3-722b-4e43-bd5a-36d8fbed6f86
#> colData names(56): uuid db_version ...
#>   ZhangM_2023_uncurated_Sample.Name ZhangM_2023_uncurated_SRA.Study
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL

If you are finding that this function is failing due to rate limiting (HTTP 429 error), hanging, or simply taking longer than you would like, you can download the files associated with the data type you are interested in and supply their paths to the “local_files” argument in lieu of specifying the “repo” argument. Recall that you can use get_hf_parquet_urls(repo_name = "waldronlab/metagenomics_mac") to find URLs to download full files.

Here’s a working example using the small example parquet files included in the package:

# Locate the example parquet files included in the package
local_files <- c(
    file.path(system.file("extdata", package = "parkinsonsMetagenomicData"),
              "pathcoverage_unstratified_pathway.parquet"),
    file.path(system.file("extdata", package = "parkinsonsMetagenomicData"),
              "pathcoverage_unstratified_uuid.parquet")
)

# Load the pathway reference for pathcoverage data
pathway_ref_local <- load_ref("pathway_ref",
                               file_path = file.path(system.file("extdata",
                                                     package = "parkinsonsMetagenomicData"),
                                                     "pathway_ref.parquet"))

# Select a few pathways
feature_table_local <- pathway_ref_local |>
    head(5) |>
    select(pathway)

# The example files contain data for 10 samples
# For this example, we can query all samples by omitting sample_data,
# or specify a subset using UUIDs
sample_uuids <- c("8793b1dc-3ba1-4591-82b8-4297adcfa1d7",
                  "cc1f30a0-45d9-41b1-b592-7d0892919ee7",
                  "fb7e8210-002a-4554-b265-873c4003e25f")
sample_table_local <- data.frame(uuid = sample_uuids)

# Query the local files
experiment_local <- returnSamples(
    data_type = "pathcoverage_unstratified",
    sample_data = sample_table_local,
    feature_data = feature_table_local,
    repo = NULL,
    local_files = local_files,
    include_empty_samples = TRUE,
    dry_run = FALSE
)

experiment_local
#> class: TreeSummarizedExperiment 
#> dim: 1 3 
#> metadata(0):
#> assays(1): coverage
#> rownames(1): 3-HYDROXYPHENYLACETATE-DEGRADATION-PWY:
#>   4-hydroxyphenylacetate degradation
#> rowData names(1): pathway
#> colnames(3): 8793b1dc-3ba1-4591-82b8-4297adcfa1d7
#>   fb7e8210-002a-4554-b265-873c4003e25f
#>   cc1f30a0-45d9-41b1-b592-7d0892919ee7
#> colData names(263): uuid humann_header ...
#>   WallenZD_2022_uncurated_Day_of_stool_collection_digestion_issue
#>   WallenZD_2022_uncurated_Day_of_stool_collection_constipation
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL

This demonstrates the complete local file workflow using the small example files included with the package. The example files contain 10 samples and can be queried just like full-sized parquet files. For larger datasets, you would download the full parquet files from Hugging Face and provide their paths in the same way.

Finally, you can check exactly which query will be run against the raw parquet data by setting dry_run = TRUE. This returns a tbl_duckdb_connection object that can be passed to dplyr::show_query(). To demonstrate:

query_only <- returnSamples(data_type = "relative_abundance",
                            sample_data = sample_table,
                            feature_data = feature_table,
                            repo = "waldronlab/metagenomics_mac",
                            local_files = NULL,
                            include_empty_samples = FALSE,
                            dry_run = TRUE)
dplyr::show_query(query_only)
#> <SQL>
#> SELECT q01.*
#> FROM (
#>   SELECT relative_abundance_clade_name_species.*
#>   FROM relative_abundance_clade_name_species
#>   WHERE (clade_name_species = 's__Faecalibacterium_SGB15346')
#> 
#>   UNION ALL
#> 
#>   SELECT relative_abundance_clade_name_species.*
#>   FROM relative_abundance_clade_name_species
#>   WHERE (clade_name_species = 's__Faecalibacterium_prausnitzii')
#> 
#>   UNION ALL
#> 
#>   SELECT relative_abundance_clade_name_species.*
#>   FROM relative_abundance_clade_name_species
#>   WHERE (clade_name_species = 's__Faecalibacterium_sp_An122')
#> 
#>   UNION ALL
#> 
#>   SELECT relative_abundance_clade_name_species.*
#>   FROM relative_abundance_clade_name_species
#>   WHERE (clade_name_species = 's__Faecalibacterium_sp_CLA_AA_H233')
#> 
#>   UNION ALL
#> 
#>   SELECT relative_abundance_clade_name_species.*
#>   FROM relative_abundance_clade_name_species
#>   WHERE (clade_name_species = 's__Faecalibacterium_sp_HTFF')
#> ) q01
#> WHERE
#>   (clade_name_kingdom = 'k__Bacteria') AND
#>   (clade_name_phylum = 'p__Firmicutes') AND
#>   (clade_name_class = 'c__Clostridia') AND
#>   (clade_name_order = 'o__Eubacteriales') AND
#>   (clade_name_family = 'f__Oscillospiraceae') AND
#>   (clade_name_genus = 'g__Faecalibacterium') AND
#>   (NCBI_tax_id_kingdom = '2') AND
#>   (NCBI_tax_id_phylum = '1239') AND
#>   (NCBI_tax_id_class = '186801') AND
#>   (NCBI_tax_id_order = '186802') AND
#>   (NCBI_tax_id_family = '216572') AND
#>   (NCBI_tax_id_genus = '216851') AND
#>   (NCBI_tax_id_terminal = '') AND
#>   (NCBI_tax_id IN ('2|1239|186801|186802|216572|216851||', '2|1239|186801|186802|216572|216851|853|', '2|1239|186801|186802|216572|216851|1965551|', '2|1239|186801|186802|216572|216851|2881266|', '2|1239|186801|186802|216572|216851|2929491|')) AND
#>   (NCBI_tax_id_species IN ('', '853', '1965551', '2881266', '2929491')) AND
#>   (clade_name IN ('k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_SGB15346|t__SGB15346', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15316', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15317', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15318', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15322', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15323', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15332', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15339', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_prausnitzii|t__SGB15342', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_sp_An122|t__SGB15312', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_sp_CLA_AA_H233|t__SGB15315', 'k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Oscillospiraceae|g__Faecalibacterium|s__Faecalibacterium_sp_HTFF|t__SGB15340')) AND
#>   (clade_name_terminal IN ('t__SGB15346', 't__SGB15316', 't__SGB15317', 't__SGB15318', 't__SGB15322', 't__SGB15323', 't__SGB15332', 't__SGB15339', 't__SGB15342', 't__SGB15312', 't__SGB15315', 't__SGB15340')) AND
#>   (uuid IN ('0807eb2a-a15e-4647-8e19-2600d8fda378', 'e0fbb54f-0249-4917-a4d7-bd68acb89c62', '25172837-2849-4db3-be91-d54d6a815d00', '39ddb5e7-97f6-4d3c-812b-9653b03f99b3', '7b152a7d-e244-4e2b-b924-7195c7ecfb10', 'dd30f93b-7999-47a4-93fb-21971b899939', '1406666f-04a8-43c9-983b-4ed62fd6da4a', 'fe3de3ca-3a14-4bd8-ae1c-0dad69edc9cd', '8707e374-5ddb-4220-8cbf-364b8b0e7be1', '22848a9c-66a6-4993-9058-cb6464edb42f', '08e2b754-78e2-4cb4-8ff2-95fd7b0ff44a', '9baef0b2-93d2-4a40-8082-d357c7f8156a', '09a9303d-d87d-4556-9672-04cbbcaf3d37', 'ac9f3532-90d8-412c-9c80-491037f0bcc2', 'eda61949-02dc-40ae-8dbe-bea2add85a52', '1f007260-be6c-4a21-800a-ad9c36129a0d', 'e47a59bb-443a-405f-9c5d-02659d80e9e5', 'b3eaf3ab-43ef-4830-ab6d-12bafed3c61e', '28f7352f-fe23-4003-93e1-41f4fedc6232', 'b07e2362-5851-4181-ba9a-15d9109ee4dd', '677be4e3-722b-4e43-bd5a-36d8fbed6f86', '0c817272-f873-475f-a401-dfe46a679a9f', '7a3945d9-21bb-434a-9a4e-bfcdeb6194de', '56aa2ad5-007d-407c-a644-48aac1e9a8f0'))
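
To get a feel for how dplyr verbs turn into SQL like this, you can experiment without any data at all. The sketch below uses dbplyr::lazy_frame(), which simulates a database table; the table name, columns, and filter values are toy choices for illustration, and the generated SQL will not match the package's queries exactly.

```r
library(dplyr)
library(dbplyr)

# Simulated remote table: no connection and no data, only column definitions
lazy_tbl <- lazy_frame(
    uuid = character(),
    clade_name_species = character(),
    relative_abundance = numeric(),
    .name = "relative_abundance_uuid"
)

wanted_uuids <- c("sample-1", "sample-2")

# Build the query lazily; nothing is executed
query <- lazy_tbl |>
    filter(uuid %in% !!wanted_uuids,
           clade_name_species == "s__toy_species")

# Print the SQL that would be sent to the database
show_query(query)

# The same SQL as a plain string, e.g. for logging
sql_text <- as.character(sql_render(query))
```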

Additional Data Type Examples

The following examples demonstrate retrieving different types of metagenomic data beyond taxonomic relative abundance. Each showcases unique features of the package and different bioinformatics tools.

Example 2: Pathway Abundance (Stratified vs Unstratified)

HUMAnN pathway data comes in two forms:

  • Unstratified: Total pathway abundance across all contributing organisms
  • Stratified: Pathway abundance broken down by individual species

This example shows how to retrieve unstratified pathway abundance for butyrate biosynthesis pathways, which are relevant to gut-brain health in Parkinson’s Disease.

Discover Reference Tables

Browse Available Pathways

pathway_ref <- load_ref("pathway_ref")

# Search for butyrate-related pathways
butyrate_pathways <- pathway_ref |>
    filter(grepl("butanoate|butyrat", pathway, ignore.case = TRUE)) |>
    filter(!grepl("\\|", pathway)) |>  # Exclude stratified versions (contains "|")
    select(pathway) |>
    distinct()

Retrieve Unstratified Pathway Data

# Select a single pathway
feature_table_pathway <- pathway_ref |>
    filter(pathway == "PWY-5676: acetyl-CoA fermentation to butanoate II") |>
    select(pathway)

# Use samples from a single study
sample_table_pathway <- sampleMetadata |>
    filter(study_name == "ZhangM_2023") |>
    select(where(~ !any(is.na(.x))))

# Retrieve unstratified pathway abundance
tse_pathway <- returnSamples(
    data_type = "pathabundance_unstratified",
    sample_data = sample_table_pathway,
    feature_data = feature_table_pathway,
    repo = "waldronlab/metagenomics_mac",
    include_empty_samples = TRUE,
    dry_run = FALSE
)

tse_pathway
#> class: TreeSummarizedExperiment 
#> dim: 1 24 
#> metadata(0):
#> assays(1): abundance
#> rownames(1): PWY-5676: acetyl-CoA fermentation to butanoate II
#> rowData names(1): pathway
#> colnames(24): 56aa2ad5-007d-407c-a644-48aac1e9a8f0
#>   677be4e3-722b-4e43-bd5a-36d8fbed6f86 ...
#>   eda61949-02dc-40ae-8dbe-bea2add85a52
#>   25172837-2849-4db3-be91-d54d6a815d00
#> colData names(52): uuid humann_header ...
#>   ZhangM_2023_uncurated_Sample.Name ZhangM_2023_uncurated_SRA.Study
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL

Understanding Stratified Data

If you wanted to see WHICH bacterial species contribute to this pathway’s abundance, you would use data_type = "pathabundance_stratified" instead. Stratified pathways have the format PATHWAY|SPECIES, for example:

PWY-5676: acetyl-CoA fermentation to butanoate II|g__Faecalibacterium.s__Faecalibacterium_prausnitzii

# Example: Get stratified version (species-level contributions)
# Note: This may return more rows as each pathway-species pair is a feature
feature_table_stratified <- pathway_ref |>
    filter(grepl("^PWY-5676:", pathway)) |>
    filter(grepl("Faecalibacterium", pathway)) |>  # Just Faecalibacterium contributions
    select(pathway)

tse_pathway_stratified <- returnSamples(
    data_type = "pathabundance_stratified",
    sample_data = sample_table_pathway,
    feature_data = feature_table_stratified,
    repo = "waldronlab/metagenomics_mac",
    include_empty_samples = TRUE,
    dry_run = FALSE
)
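
Once stratified features are in hand, it is often convenient to split each name into its pathway and taxon parts. The base R sketch below does this for the example string above; it assumes the taxon part follows the "g__Genus.s__Species" pattern seen in that single example.

```r
stratified <- c(
    "PWY-5676: acetyl-CoA fermentation to butanoate II|g__Faecalibacterium.s__Faecalibacterium_prausnitzii",
    "PWY-5676: acetyl-CoA fermentation to butanoate II"
)

# Stratified names contain a literal "|"
is_stratified <- grepl("|", stratified, fixed = TRUE)

# Everything before the first "|" is the pathway
pathway_id <- sub("\\|.*$", "", stratified)

# Everything after the first "|" is the contributing taxon (NA if unstratified)
taxon <- ifelse(is_stratified, sub("^[^|]*\\|", "", stratified), NA_character_)

# Species label without the "s__" prefix
species <- sub("^.*s__", "", taxon)
```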

Example 3: Viral Clusters

MetaPhlAn can identify viral sequences in metagenomic samples. This example retrieves viral cluster data.

Discover Viral Reference

genome_name_ref <- load_ref("genome_name_ref")

# Browse available viral genomes
viral_genomes <- genome_name_ref |>
    select(genome_name) |>
    distinct() |>
    head(20)  # Show first 20 for brevity

Retrieve Viral Cluster Data

# Select a few viral genomes to query
feature_table_viral <- genome_name_ref |>
    head(5) |>  # Select first 5 viral genomes
    select(genome_name)

# Use the same sample set
tse_viral <- returnSamples(
    data_type = "viral_clusters",
    sample_data = sample_table,
    feature_data = feature_table_viral,
    repo = "waldronlab/metagenomics_mac",
    include_empty_samples = TRUE,
    dry_run = FALSE
)
#> 0 rows returned but empty samples exist. TreeSummarizedExperiment will include colData as applicable.

tse_viral
#> class: TreeSummarizedExperiment 
#> dim: 0 24 
#> metadata(0):
#> assays(3): breadth_of_coverage depth_of_coverage_mean
#>   depth_of_coverage_median
#> rownames(0):
#> rowData names(6): genome_name m_group_cluster ...
#>   first_genome_in_cluster other_genomes
#> colnames(24): 0807eb2a-a15e-4647-8e19-2600d8fda378
#>   09a9303d-d87d-4556-9672-04cbbcaf3d37 ...
#>   e47a59bb-443a-405f-9c5d-02659d80e9e5
#>   fe3de3ca-3a14-4bd8-ae1c-0dad69edc9cd
#> colData names(55): uuid db_version ...
#>   ZhangM_2023_uncurated_Sample.Name ZhangM_2023_uncurated_SRA.Study
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL

Example 4: Gene Family Abundance

HUMAnN quantifies gene families (groups of homologous genes) using the UniRef database. This example shows gene family abundance data.

Note: Gene family queries can be large. For demonstration, we’ll query a small number of gene families.

Discover Gene Family Reference

gene_family_ref <- load_ref("gene_family_ref")

# Browse the structure
gene_family_ref |>
    select(gene_family) |>
    head(20)
#> # A tibble: 20 × 1
#>    gene_family                     
#>    <chr>                           
#>  1 UNMAPPED                        
#>  2 UniRef90_A0A009EC87             
#>  3 UniRef90_A0A009EC87|unclassified
#>  4 UniRef90_A0A009EHH0             
#>  5 UniRef90_A0A009EHH0|unclassified
#>  6 UniRef90_A0A009EMH9             
#>  7 UniRef90_A0A009EMH9|unclassified
#>  8 UniRef90_A0A009EQY8             
#>  9 UniRef90_A0A009EQY8|unclassified
#> 10 UniRef90_A0A009ES08             
#> 11 UniRef90_A0A009ES08|unclassified
#> 12 UniRef90_A0A009EU90             
#> 13 UniRef90_A0A009EU90|unclassified
#> 14 UniRef90_A0A009EY40             
#> 15 UniRef90_A0A009EY40|unclassified
#> 16 UniRef90_A0A009EY59             
#> 17 UniRef90_A0A009EY59|unclassified
#> 18 UniRef90_A0A009F206             
#> 19 UniRef90_A0A009F206|unclassified
#> 20 UniRef90_A0A009F5F7

Retrieve Gene Family Data

# Select specific gene families (using a small set for performance)
feature_table_genes <- gene_family_ref |>
    head(10) |>  # Select first 10 gene families
    select(gene_family)

# Use fewer samples for gene family queries (they can be large)
sample_table_small <- sampleMetadata |>
    filter(study_name == "ZhangM_2023") |>
    head(5) |>  # Use only 5 samples
    select(where(~ !any(is.na(.x))))

tse_genes <- returnSamples(
    data_type = "genefamilies_unstratified",
    sample_data = sample_table_small,
    feature_data = feature_table_genes,
    repo = "waldronlab/metagenomics_mac",
    include_empty_samples = TRUE,
    dry_run = FALSE
)
#> 'genefamilies_unstratified' is a large data type, and collecting the query can take a while. To avoid going through the Hugging Face API, download the source file hf://datasets/waldronlab/metagenomics_mac/genefamilies_unstratified_uuid.parquet and provide it to accessParquetData() in the 'local files' argument.

tse_genes
#> class: TreeSummarizedExperiment 
#> dim: 1 5 
#> metadata(0):
#> assays(1): rpk_abundance
#> rownames(1): UNMAPPED
#> rowData names(1): gene_family
#> colnames(5): 0807eb2a-a15e-4647-8e19-2600d8fda378
#>   e0fbb54f-0249-4917-a4d7-bd68acb89c62
#>   25172837-2849-4db3-be91-d54d6a815d00
#>   39ddb5e7-97f6-4d3c-812b-9653b03f99b3
#>   7b152a7d-e244-4e2b-b924-7195c7ecfb10
#> colData names(52): uuid humann_header ...
#>   ZhangM_2023_uncurated_Sample.Name ZhangM_2023_uncurated_SRA.Study
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL

Large Query Considerations: Gene family stratified data (genefamilies_stratified) can be very large. For large queries, consider:

  1. Download parquet files locally and use them via the local_files argument
  2. Use very selective filters (specific gene families + specific samples)
  3. Query in batches
  4. See the Working with Large Parquet Files vignette for more strategies
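
As a rough sketch of the batching idea, sample identifiers can be split into fixed-size groups and queried one group at a time. The identifiers and batch size below are made up, and the returnSamples() call is shown only as a comment because it needs network access; how best to combine the per-batch results is left open here.

```r
# Toy sample identifiers; in practice these would come from your sample table
uuids <- sprintf("sample-%02d", 1:23)
batch_size <- 10

# Split into consecutive batches of at most batch_size identifiers
batches <- split(uuids, ceiling(seq_along(uuids) / batch_size))
lengths(batches)

# Each batch would then be queried separately, for example:
# results <- lapply(batches, function(ids) {
#     returnSamples(data_type = "genefamilies_stratified",
#                   sample_data = data.frame(uuid = ids),
#                   feature_data = feature_table_genes,
#                   repo = "waldronlab/metagenomics_mac",
#                   include_empty_samples = TRUE,
#                   dry_run = FALSE)
# })
```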

Comparing Data Types

Here’s a summary of what we retrieved in these examples:

| Example | Data Type | Tool | Features | What It Measures |
|---|---|---|---|---|
| 1 | relative_abundance | MetaPhlAn | Bacterial taxa | Which bacteria are present and their relative abundance |
| 2 | pathabundance_unstratified | HUMAnN | Metabolic pathways | Total abundance of metabolic pathways |
| 3 | viral_clusters | MetaPhlAn | Viral genomes | Presence/abundance of viral sequences |
| 4 | genefamilies_unstratified | HUMAnN | Gene families | Abundance of functional gene groups |

All return TreeSummarizedExperiment objects with the same structure, making it easy to apply consistent analysis workflows across different data types.

sessionInfo()
#> R Under development (unstable) (2026-03-28 r89738)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] DT_0.34.0                        DBI_1.3.0                       
#> [3] dplyr_1.2.0                      parkinsonsMetagenomicData_0.99.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.41.1     httr2_1.2.2                    
#>  [3] xfun_0.57                       bslib_0.10.0                   
#>  [5] htmlwidgets_1.6.4               Biobase_2.71.0                 
#>  [7] lattice_0.22-9                  tzdb_0.5.0                     
#>  [9] crosstalk_1.2.2                 yulab.utils_0.2.4              
#> [11] vctrs_0.7.2                     tools_4.7.0                    
#> [13] generics_0.1.4                  curl_7.0.0                     
#> [15] stats4_4.7.0                    parallel_4.7.0                 
#> [17] tibble_3.3.1                    blob_1.3.0                     
#> [19] pkgconfig_2.0.3                 Matrix_1.7-5                   
#> [21] dbplyr_2.5.2                    desc_1.4.3                     
#> [23] S4Vectors_0.49.0                assertthat_0.2.1               
#> [25] lifecycle_1.0.5                 stringr_1.6.0                  
#> [27] compiler_4.7.0                  treeio_1.35.0                  
#> [29] textshaping_1.0.5               Biostrings_2.79.5              
#> [31] Seqinfo_1.1.0                   codetools_0.2-20               
#> [33] htmltools_0.5.9                 sass_0.4.10                    
#> [35] yaml_2.3.12                     lazyeval_0.2.2                 
#> [37] pkgdown_2.2.0                   pillar_1.11.1                  
#> [39] crayon_1.5.3                    jquerylib_0.1.4                
#> [41] tidyr_1.3.2                     BiocParallel_1.45.0            
#> [43] SingleCellExperiment_1.33.2     DelayedArray_0.37.0            
#> [45] cachem_1.1.0                    abind_1.4-8                    
#> [47] nlme_3.1-169                    tidyselect_1.2.1               
#> [49] digest_0.6.39                   stringi_1.8.7                  
#> [51] duckdb_1.5.1                    purrr_1.2.1                    
#> [53] arrow_23.0.1.2                  TreeSummarizedExperiment_2.19.0
#> [55] fastmap_1.2.0                   grid_4.7.0                     
#> [57] cli_3.6.5                       SparseArray_1.11.11            
#> [59] magrittr_2.0.4                  S4Arrays_1.11.1                
#> [61] utf8_1.2.6                      ape_5.8-1                      
#> [63] withr_3.0.2                     readr_2.2.0                    
#> [65] rappdirs_0.3.4                  bit64_4.6.0-1                  
#> [67] rmarkdown_2.31                  XVector_0.51.0                 
#> [69] matrixStats_1.5.0               bit_4.6.0                      
#> [71] otel_0.2.0                      hms_1.1.4                      
#> [73] ragg_1.5.2                      evaluate_1.0.5                 
#> [75] knitr_1.51                      GenomicRanges_1.63.1           
#> [77] IRanges_2.45.0                  rlang_1.1.7                    
#> [79] Rcpp_1.1.1                      glue_1.8.0                     
#> [81] tidytree_0.4.7                  BiocGenerics_0.57.0            
#> [83] vroom_1.7.0                     jsonlite_2.0.0                 
#> [85] R6_2.6.1                        MatrixGenerics_1.23.0          
#> [87] systemfonts_1.3.2               fs_2.0.1