Data Codebook

This vignette documents all variables and data structures in the parkinsonsMetagenomicData package, including sample metadata fields, microbiome data types, and column definitions for parquet files.

Overview

The package provides three main data components:

  1. Sample Metadata (sampleMetadata) - Clinical and demographic information for each sample
  2. Microbiome Data - Multiple data types from MetaPhlAn, HUMAnN, and QC tools
  3. Reference Files - Lookup tables for taxonomic names, gene families, and pathways

Sample Metadata

The sampleMetadata data frame contains curated metadata for all samples. Both curated and uncurated features are included, with uncurated features prefixed by “uncurated_”.

Loading Sample Metadata

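library(parkinsonsMetagenomicData)
library(dplyr)   # provides %>%, filter(), and select() used below
library(knitr)   # provides kable() used below
data("sampleMetadata", package = "parkinsonsMetagenomicData")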
dim(sampleMetadata)
#> [1] 3535 1177

Core Metadata Fields

Identifiers

| Field | Type | Description |
|---|---|---|
| curation_id | character | Dataset x subject identifier (format: study_name:subject_id) |
| study_name | character | Dataset name |
| subject_id | character | Subject identifier within study |
| sample_id | character | Unique sample identifier |
| uuid | character | Universal unique identifier for the sample |
| BioProject | character | SRA BioProject accession (format: PRJ[DEN][BA][0-9]+) |
| BioSample | character | SRA BioSample accession (format: SAM[DNEA]+?[0-9]+) |
| NCBI_accession | character | Semicolon-separated vector of NCBI accessions |

Study Design

| Field | Type | Description | Allowed Values |
|---|---|---|---|
| target_condition | character | Primary phenotype/condition of interest (multiple values separated by ;) | Ontology terms (NCIT:C7057, EFO:0000408 descendants) |
| control | character | Sample classification in study | “Study Control”, “Case”, “Not Used” |

Demographics

| Field | Type | Description | Allowed Values |
|---|---|---|---|
| age | integer | Age of subject | Numeric value |
| age_unit | character | Unit for age | “Day”, “Week”, “Month”, “Year” |
| age_group | character | Age category | “Infant” (0-2), “Children 2-11 Years Old” (2-11), “Adolescent” (11-18), “Adult” (18-65), “Elderly” (≥65) |
| sex | character | Biological sex | “Female”, “Male” |
| host_species | character | Species of subject | “Homo sapiens”, “Mus musculus” |

Clinical Information

| Field | Type | Description | Allowed Values |
|---|---|---|---|
| disease | character | Reported disease/condition(s); multiple values are separated by “;”, and “Healthy” is used if none | Ontology terms (NCIT:C7057, EFO:0000408 descendants) |
| body_site | character | Anatomical location | “feces”, “milk”, “nasal cavity”, “oral cavity”, “skin epidermis”, “vagina” |

Curation

| Field | Type | Description |
|---|---|---|
| curator | character | Curator name(s) (multiple values separated by ;) |
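
The accession formats listed under Identifiers can be used as a quick sanity check on the BioProject and BioSample fields. The sketch below is a minimal base R illustration: the accession values are made up, and anchoring the patterns with ^ and $ so that the whole string must match is an assumption rather than something the package documents.

```r
# Format patterns copied from the Identifiers table, anchored to the full string
# (the anchoring is an assumption made for this sketch).
bioproject_pattern <- "^PRJ[DEN][BA][0-9]+$"
biosample_pattern  <- "^SAM[DNEA]+?[0-9]+$"

# Hypothetical accession values, for illustration only.
bioproject <- c("PRJNA000001", "PRJEB00002", "not_an_accession")
biosample  <- c("SAMN00000001", "SAMEA0000002", "SAMX0000003")

# perl = TRUE so the lazy quantifier `+?` is interpreted as written.
bioproject_ok <- grepl(bioproject_pattern, bioproject, perl = TRUE)
biosample_ok  <- grepl(biosample_pattern, biosample, perl = TRUE)

# Values that do not follow the documented format.
bioproject[!bioproject_ok]
biosample[!biosample_ok]
```

Applied to the real table, the same checks could be run on sampleMetadata$BioProject and sampleMetadata$BioSample after dropping missing values.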

Accessing Metadata Information Programmatically

# View the data dictionary
metadata_info <- data_dict()
kable(head(metadata_info, 10))
ColName ColClass Unique Required MultipleValues Description AllowedValues Delimiter Separater DynamicEnum DynamicEnumProperty
study_name character non-unique optional FALSE Dataset name. [a-zA-Z-]+[0-9]{4}|[a-zA-Z-]+[0-9]{4}[a-zA-Z-]+|[a-zA-Z-]+[0-9]{4}[a-zA-Z-]+|[a-zA-Z-]+[0-9]{4}[a-zA-Z0-9]+ NA NA NA NA
subject_id character non-unique required FALSE Subject identifier. [0-9a-zA-Z]+ NA NA NA NA
sample_id character unique required FALSE Sample identifier. [0-9a-zA-Z]+ NA NA NA NA
target_condition character non-unique required TRUE The primary phenotype/condition of interest in the study from which the sample is derived NA ; NA NCIT:C7057;EFO:0000408 descendant
control character non-unique required FALSE Whether the sample is control, case, or not used in the study Study Control;Case;Not Used NA NA NA NA
body_site character non-unique required FALSE Named locations of or within the body. The anatomical location(s) affected by the patient’s disease/condition/cancer, often the site from which the sample was derived feces;milk;nasal cavity;oral cavity;skin epidermis;vagina NA NA NA NA
age integer non-unique optional FALSE Age of the subject using the unit specified under ‘age_unit’ column [0-9]+ NA NA NA NA
age_group character non-unique optional FALSE 11 <= Adolescent < 18|18 <= Adult < 65|2 <= Children 2-11 Years Old < 11|65 <= Elderly < 130|0 <= Infant < 2 Adolescent;Adult;Children 2-11 Years Old;Elderly;Infant NA NA NA NA
age_unit character non-unique optional FALSE Unit of the subject’s age specified under ‘age’ column Day;Week;Month;Year NA NA NA NA
curator character non-unique required TRUE Curator name. NA ; NA NA NA
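
Fields flagged with MultipleValues = TRUE in the data dictionary (for example target_condition and curator) pack several values into one string using the ; delimiter. A minimal sketch of how such a field might be unpacked in base R follows; the values and the has_term() helper are hypothetical and not part of the package.

```r
# Hypothetical values; real entries hold ontology terms or curator names.
target_condition <- c("Condition A;Condition B", "Condition A", NA)

# Split on the literal ";" delimiter and trim stray whitespace.
split_values <- lapply(strsplit(target_condition, ";", fixed = TRUE), trimws)

# Hypothetical helper: flag samples whose field contains a given term.
# Missing values are reported as FALSE.
has_term <- function(values, term) {
    vapply(values, function(v) term %in% v, logical(1))
}

has_b <- has_term(split_values, "Condition B")
has_b
```

Matching on the split values avoids the false positives that a plain substring search can produce when one term is a prefix of another.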

Example: Exploring Sample Characteristics

# Summary of control status
table(sampleMetadata$control, useNA = "ifany")
#> 
#>                      Case External Comparison Group Internal Comparison Group 
#>                      1311                        90                        59 
#>   Multiple System Atrophy             Study Control                      <NA> 
#>                         8                      2052                        15

# Age distribution
summary(sampleMetadata$age)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.     NAs 
#>    1.00   34.00   55.00   48.89   66.00   91.00     608

# Studies included
head(unique(sampleMetadata$study_name))
#> [1] "AsnicarF_2021" "BedarfJR_2017" "BoktorJC_2023" "DuruIC_2024"  
#> [5] "JoS_2022"      "LeeEJ_2024"

# Body sites sampled
table(sampleMetadata$body_site, useNA = "ifany")
#> 
#> feces  <NA> 
#>  3318   217
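
The age_group categories in the data dictionary are defined by half-open intervals (for example 18 <= Adult < 65), so they can be reproduced from age with cut(). The sketch below is a minimal illustration under the assumption that ages are expressed in years; the ages are made up and the variable names are not part of the package.

```r
# Bin edges from the age_group description: lower bound inclusive, upper exclusive.
age_breaks <- c(0, 2, 11, 18, 65, 130)
age_labels <- c("Infant", "Children 2-11 Years Old", "Adolescent", "Adult", "Elderly")

# Hypothetical ages, assumed to be in years.
age_years <- c(1, 2, 11, 40, 65, NA)

# right = FALSE makes each interval closed on the left and open on the right.
age_group <- as.character(
    cut(age_years, breaks = age_breaks, labels = age_labels, right = FALSE)
)
age_group
```

Ages recorded in other units (see age_unit) would need converting to years before binning.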

Microbiome Data Types

The package provides multiple types of microbiome profiling data, organized by biological content and normalization method.

Available Data Types

# Get information about all data types
data_types <- biobakery_files()
kable(data_types)

| data_type | tool | description | units_normalization |
|---|---|---|---|
| genefamilies | HUMAnN | Abundance of gene families, typically identified by UniRef90 IDs. This is the raw, unnormalized output. | Reads Per Kilobase (RPK) |
| genefamilies_cpm | HUMAnN | Gene family abundances normalized to Copies Per Million. This accounts for sequencing depth, making samples more comparable. | Copies Per Million (CPM) |
| genefamilies_cpm_stratified | HUMAnN | Gene family abundances (in CPM) that are taxonomically stratified, showing the contribution of each species to the total abundance of each gene family. | Copies Per Million (CPM) |
| genefamilies_cpm_unstratified | HUMAnN | Total community-level gene family abundances (in CPM), without taxonomic stratification. This is the sum of the stratified values for each gene family. | Copies Per Million (CPM) |
| genefamilies_relab | HUMAnN | Gene family abundances converted to relative abundance. The abundances in each sample are scaled to sum to 100%. | Relative Abundance (%) |
| genefamilies_relab_stratified | HUMAnN | Taxonomically stratified gene family abundances, expressed as relative abundance within each sample. | Relative Abundance (%) |
| genefamilies_relab_unstratified | HUMAnN | Total community-level gene family abundances, expressed as relative abundance. | Relative Abundance (%) |
| genefamilies_stratified | HUMAnN | Raw gene family abundances (in RPK) that are taxonomically stratified, showing the contribution of each species. | Reads Per Kilobase (RPK) |
| genefamilies_unstratified | HUMAnN | Total community-level gene family abundances (in RPK), without taxonomic stratification. This is equivalent to the main ‘genefamilies’ file. | Reads Per Kilobase (RPK) |
| marker_abundance | MetaPhlAn | Abundance of clade-specific marker genes. This is an intermediate file used by MetaPhlAn to calculate taxonomic relative abundances. | Mean coverage of marker genes |
| marker_presence | MetaPhlAn | A binary table indicating the presence (1) or absence (0) of specific marker genes for each taxon in a sample. | Binary (0 or 1) |
| pathabundance | HUMAnN | Abundance of metabolic pathways (e.g., MetaCyc pathways). This is the raw, unnormalized output. | Reads Per Kilobase (RPK) |
| pathabundance_cpm | HUMAnN | Pathway abundances normalized to Copies Per Million to account for sequencing depth. | Copies Per Million (CPM) |
| pathabundance_cpm_stratified | HUMAnN | Pathway abundances (in CPM) that are taxonomically stratified, showing the contribution of each species. | Copies Per Million (CPM) |
| pathabundance_cpm_unstratified | HUMAnN | Total community-level pathway abundances (in CPM), without taxonomic stratification. | Copies Per Million (CPM) |
| pathabundance_relab | HUMAnN | Pathway abundances converted to relative abundance. The abundances in each sample are scaled to sum to 100%. | Relative Abundance (%) |
| pathabundance_relab_stratified | HUMAnN | Taxonomically stratified pathway abundances, expressed as relative abundance. | Relative Abundance (%) |
| pathabundance_relab_unstratified | HUMAnN | Total community-level pathway abundances, expressed as relative abundance. | Relative Abundance (%) |
| pathabundance_stratified | HUMAnN | Raw pathway abundances (in RPK) that are taxonomically stratified, showing the contribution of each species. | Reads Per Kilobase (RPK) |
| pathabundance_unstratified | HUMAnN | Total community-level pathway abundances (in RPK). This is equivalent to the main ‘pathabundance’ file. | Reads Per Kilobase (RPK) |
| pathcoverage | HUMAnN | The proportion of genes within a pathway that were detected in the sample. A value of 1.0 means all genes in the pathway were found. | Proportion (0.0 to 1.0) |
| pathcoverage_stratified | HUMAnN | Taxonomically stratified pathway coverage, showing the coverage of a pathway within the genome of a specific contributing species. | Proportion (0.0 to 1.0) |
| pathcoverage_unstratified | HUMAnN | Total community-level pathway coverage. This is equivalent to the main ‘pathcoverage’ file. | Proportion (0.0 to 1.0) |
| relative_abundance | MetaPhlAn | The primary output for taxonomic profiling, showing the relative abundance of each microbial taxon (from kingdom to species and strain level). | Relative Abundance (%) |
| viral_clusters | MetaPhlAn/Custom | Represents clusters of viral sequences, often used for viral strain or species-level analysis. The values typically represent the abundance of these viral clusters. | Varies (often Relative Abundance or CPM) |
| strainphlan_markers | StrainPhlAn | Consensus sequences of clade-specific marker genes for each sample, used for strain-level reconstruction in StrainPhlAn. | Fraction of marker covered / Mean coverage |
| fastqc | FastQC | Quality control metrics for raw sequencing reads, including quality scores, adapter contamination, and sequence content, length, and duplication. | Varies by metric |
| kneaddata_log | KneadData | Log file reporting preprocessing steps such as quality trimming, contaminant removal (e.g., host reads), and overall read counts retained. | None (log file) |
| reference | NA | Reference file reporting all unique values of non-UUID identifiers in a parquet file. | NA |

Data Type Categories

Taxonomic Composition (MetaPhlAn)

relative_abundance - Primary taxonomic profiling output

  • Description: Relative abundance of bacterial, archaeal, viral, and eukaryotic taxa
  • Units: Relative Abundance (%)
  • Levels: Kingdom through species and strain
  • Tool: MetaPhlAn

viral_clusters - Viral community profiling

  • Description: Clusters of viral sequences for strain/species-level analysis
  • Units: Varies (often Relative Abundance or CPM)
  • Tool: MetaPhlAn/Custom

marker_abundance - Marker gene quantification

  • Description: Abundance of clade-specific marker genes
  • Units: Mean coverage of marker genes
  • Tool: MetaPhlAn

marker_presence - Marker gene detection

  • Description: Binary presence/absence of marker genes
  • Units: Binary (0 or 1)
  • Tool: MetaPhlAn

Strain-Level Profiling (StrainPhlAn)

strainphlan_markers - Strain-specific markers

  • Description: Consensus sequences for strain-level reconstruction
  • Units: Fraction of marker covered / Mean coverage
  • Tool: StrainPhlAn

Functional Profiling (HUMAnN)

HUMAnN data types come in multiple variants based on:

  • Content: Gene families (genefamilies), pathway abundance (pathabundance), or pathway coverage (pathcoverage)
  • Stratification: Taxonomically stratified (by species) or unstratified (community total)
  • Normalization: Raw (RPK), relative abundance (relab), or copies per million (cpm)
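
Because the HUMAnN data type names follow a content[_normalization][_stratification] pattern, a name can be broken back into its three components. The sketch below is a minimal base R illustration; parse_data_type() is a hypothetical helper, not a package function, and the labels "raw" and "mixed" for names without a suffix are assumptions borrowed from the tables in this section. It is only meant for HUMAnN names (a name such as relative_abundance would not parse meaningfully).

```r
# Hypothetical helper: split a HUMAnN data type name into its components.
parse_data_type <- function(data_type) {
    parts <- strsplit(data_type, "_", fixed = TRUE)[[1]]
    normalization <- intersect(parts, c("cpm", "relab"))
    stratification <- intersect(parts, c("stratified", "unstratified"))
    c(
        content = parts[1],
        # No normalization suffix: labeled "raw" here (an assumption of this sketch).
        normalization = if (length(normalization) == 0) "raw" else normalization,
        # No stratification suffix: labeled "mixed" here, as in the tables below.
        stratification = if (length(stratification) == 0) "mixed" else stratification
    )
}

# One row per data type, one column per component.
types <- c("genefamilies", "genefamilies_cpm_stratified", "pathcoverage_unstratified")
parsed <- t(sapply(types, parse_data_type))
parsed
```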

Gene Families

| Data Type | Stratification | Normalization | Description |
|---|---|---|---|
| genefamilies | Mixed | RPK | Raw output (unnormalized) |
| genefamilies_unstratified | Community total | RPK | Total gene family abundance |
| genefamilies_stratified | By species | RPK | Species-specific contributions |
| genefamilies_relab_unstratified | Community total | Relative % | Normalized community totals |
| genefamilies_relab_stratified | By species | Relative % | Normalized species contributions |
| genefamilies_cpm_unstratified | Community total | CPM | Depth-corrected totals |
| genefamilies_cpm_stratified | By species | CPM | Depth-corrected species contributions |
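
The relationship between the RPK, relative abundance, and CPM variants can be illustrated with a small worked example. This is a simplified sketch with made-up values: it rescales one sample's community-level RPK values and does not model how HUMAnN itself treats special rows or stratified features. The 0-100 scale for relative abundance follows the data type descriptions above and is worth checking against the actual files.

```r
# Hypothetical RPK values for three gene families in one sample.
rpk <- c(UniRef90_A = 50, UniRef90_B = 30, UniRef90_C = 20)

# Relative abundance: rescale so the sample sums to 100 (%).
relab <- rpk / sum(rpk) * 100

# Copies per million: rescale so the sample sums to 1e6.
cpm <- rpk / sum(rpk) * 1e6

rbind(rpk = rpk, relab = relab, cpm = cpm)
```

Both normalizations divide by the per-sample total, which is why they remove differences in sequencing depth; they differ only in the constant they multiply by.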

Metabolic Pathways - Abundance

| Data Type | Stratification | Normalization | Description |
|---|---|---|---|
| pathabundance | Mixed | RPK | Raw pathway abundance |
| pathabundance_unstratified | Community total | RPK | Total pathway abundance |
| pathabundance_stratified | By species | RPK | Species-specific pathway contributions |
| pathabundance_relab_unstratified | Community total | Relative % | Normalized community pathways |
| pathabundance_relab_stratified | By species | Relative % | Normalized species pathways |
| pathabundance_cpm_unstratified | Community total | CPM | Depth-corrected pathway totals |
| pathabundance_cpm_stratified | By species | CPM | Depth-corrected species pathways |

Metabolic Pathways - Coverage

| Data Type | Stratification | Units | Description |
|---|---|---|---|
| pathcoverage | Mixed | Proportion (0-1) | Pathway completeness |
| pathcoverage_unstratified | Community total | Proportion (0-1) | Total pathway coverage |
| pathcoverage_stratified | By species | Proportion (0-1) | Species-specific pathway coverage |

Quality Control

fastqc - Sequencing quality metrics

  • Description: Quality scores, adapter contamination, sequence content
  • Units: Varies by metric
  • Tool: FastQC

kneaddata_log - Preprocessing statistics

  • Description: Quality trimming, host read removal, read counts
  • Units: None (log file)
  • Tool: KneadData

Choosing Data Types

For taxonomic analysis:

  • Use relative_abundance for bacteria/archaea
  • Add viral_clusters for viruses

For functional analysis:

  • Use genefamilies_relab or pathabundance_relab for relative comparisons
  • Use genefamilies_cpm or pathabundance_cpm when comparing across sequencing depths
  • Use pathcoverage to assess pathway completeness

Stratified vs. Unstratified:

  • Unstratified: Total community-level measurements (smaller files)
  • Stratified: See which species contribute to each function (larger files, richer information)
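
The data type descriptions state that an unstratified gene family value is the sum of its stratified values. The sketch below illustrates that relationship in base R with made-up rows for a single sample; the feature names and the "feature|taxon" layout are assumptions used for illustration.

```r
# Hypothetical stratified rows for one sample, written as "<gene family>|<taxon>".
stratified <- data.frame(
    gene_family = c("UniRef90_A|g__Genus1.s__Species1",
                    "UniRef90_A|g__Genus2.s__Species2",
                    "UniRef90_B|g__Genus1.s__Species1"),
    rpk_abundance = c(10, 5, 7)
)

# Drop the taxon suffix to recover the community-level feature name.
stratified$family <- sub("\\|.*$", "", stratified$gene_family)

# Sum the species contributions within each gene family.
unstratified <- tapply(stratified$rpk_abundance, stratified$family, sum)
unstratified
```

This is also why stratified files are larger: each community-level feature is expanded into one row per contributing taxon.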

Parquet File Column Definitions

Each data type has specific columns with defined roles in the TreeSummarizedExperiment structure.

Column Roles

Columns are assigned to specific components:

  • cname: Column names (sample identifiers, typically uuid)
  • cdata: Column data (sample metadata: processing parameters, versions)
  • rname: Row names (feature identifiers)
  • rdata: Row data (feature metadata)
  • assay: Assay data (measurement values)
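
To make the roles concrete, the sketch below pivots a small long-format table into a feature-by-sample matrix in base R. It assumes the parquet rows are stored one measurement per feature and sample; the data are made up, and treating missing feature/sample pairs as zero is a choice made for this illustration rather than documented package behavior.

```r
# Hypothetical long-format rows: one measurement per feature (rname) and sample (cname).
long <- data.frame(
    clade_name = c("taxon_1", "taxon_2", "taxon_1"),
    uuid = c("sample_a", "sample_a", "sample_b"),
    relative_abundance = c(60, 40, 100)
)

# rname values become rows, cname values become columns, assay values fill the cells.
assay <- tapply(long$relative_abundance, list(long$clade_name, long$uuid), sum)

# Feature/sample pairs absent from the long table are set to zero here (an assumption).
assay[is.na(assay)] <- 0
assay
```

Columns with the rdata and cdata roles would then describe the rows and columns of this matrix, respectively.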

Accessing Column Information

# Get column information for a specific data type
rel_abund_cols <- parquet_colinfo("relative_abundance")
kable(rel_abund_cols)

| general_data_type | col_name | col_class | description | se_role | ref_file |
|---|---|---|---|---|---|
| relative_abundance | clade_name | character | The taxonomic lineage of the detected microbial clade | rname | clade_name_ref |
| relative_abundance | clade_name_kingdom | character | The taxonomic kingdom of the detected microbial clade | rdata | clade_name_ref |
| relative_abundance | clade_name_phylum | character | The taxonomic phylum of the detected microbial clade | rdata | clade_name_ref |
| relative_abundance | clade_name_class | character | The taxonomic class of the detected microbial clade | rdata | clade_name_ref |
| relative_abundance | clade_name_order | character | The taxonomic order of the detected microbial clade | rdata | clade_name_ref |
| relative_abundance | clade_name_family | character | The taxonomic family of the detected microbial clade | rdata | clade_name_ref |
| relative_abundance | clade_name_genus | character | The taxonomic genus of the detected microbial clade | rdata | clade_name_ref |
| relative_abundance | clade_name_species | character | The taxonomic species of the detected microbial clade | rdata | clade_name_ref |
| relative_abundance | clade_name_terminal | character | The taxonomic terminal (strain, subspecies, etc.) of the detected microbial clade | rdata | clade_name_ref |
| relative_abundance | NCBI_tax_id | character | The NCBI Taxonomy identifier for the clade in clade_name | rdata | clade_name_ref |
| relative_abundance | NCBI_tax_id_kingdom | character | The NCBI Taxonomy identifier for the kingdom in clade_name_kingdom | rdata | clade_name_ref |
| relative_abundance | NCBI_tax_id_phylum | character | The NCBI Taxonomy identifier for the phylum in clade_name_phylum | rdata | clade_name_ref |
| relative_abundance | NCBI_tax_id_class | character | The NCBI Taxonomy identifier for the class in clade_name_class | rdata | clade_name_ref |
| relative_abundance | NCBI_tax_id_order | character | The NCBI Taxonomy identifier for the order in clade_name_order | rdata | clade_name_ref |
| relative_abundance | NCBI_tax_id_family | character | The NCBI Taxonomy identifier for the family in clade_name_family | rdata | clade_name_ref |
| relative_abundance | NCBI_tax_id_genus | character | The NCBI Taxonomy identifier for the genus in clade_name_genus | rdata | clade_name_ref |
| relative_abundance | NCBI_tax_id_species | character | The NCBI Taxonomy identifier for the species in clade_name_species | rdata | clade_name_ref |
| relative_abundance | NCBI_tax_id_terminal | character | The NCBI Taxonomy identifier for the terminal (strain, subspecies, etc.) in clade_name_terminal | rdata | clade_name_ref |
| relative_abundance | relative_abundance | float | The proportion of the total microbial community represented by the clade | assay | NA |
| relative_abundance | additional_species | character | Any other species represented by the same set of detected markers | rdata | NA |
| relative_abundance | uuid | character | Sample UUID | cname | NA |
| relative_abundance | db_version | character | MetaPhlAn database version(s) referenced | cdata | NA |
| relative_abundance | command | character | MetaPhlAn command given | cdata | NA |
| relative_abundance | reads_processed | character | Number of reads processed | cdata | NA |
| relative_abundance | metaphlan_header | character | MetaPhlAn’s custom header row | cdata | NA |
| relative_abundance | original_columns | character | Original MetaPhlAn column names | cdata | NA |
Relative Abundance Columns

# Show relative abundance columns grouped by role
rel_abund_cols %>%
    select(col_name, col_class, se_role, description) %>%
    kable()

| col_name | col_class | se_role | description |
|---|---|---|---|
| clade_name | character | rname | The taxonomic lineage of the detected microbial clade |
| clade_name_kingdom | character | rdata | The taxonomic kingdom of the detected microbial clade |
| clade_name_phylum | character | rdata | The taxonomic phylum of the detected microbial clade |
| clade_name_class | character | rdata | The taxonomic class of the detected microbial clade |
| clade_name_order | character | rdata | The taxonomic order of the detected microbial clade |
| clade_name_family | character | rdata | The taxonomic family of the detected microbial clade |
| clade_name_genus | character | rdata | The taxonomic genus of the detected microbial clade |
| clade_name_species | character | rdata | The taxonomic species of the detected microbial clade |
| clade_name_terminal | character | rdata | The taxonomic terminal (strain, subspecies, etc.) of the detected microbial clade |
| NCBI_tax_id | character | rdata | The NCBI Taxonomy identifier for the clade in clade_name |
| NCBI_tax_id_kingdom | character | rdata | The NCBI Taxonomy identifier for the kingdom in clade_name_kingdom |
| NCBI_tax_id_phylum | character | rdata | The NCBI Taxonomy identifier for the phylum in clade_name_phylum |
| NCBI_tax_id_class | character | rdata | The NCBI Taxonomy identifier for the class in clade_name_class |
| NCBI_tax_id_order | character | rdata | The NCBI Taxonomy identifier for the order in clade_name_order |
| NCBI_tax_id_family | character | rdata | The NCBI Taxonomy identifier for the family in clade_name_family |
| NCBI_tax_id_genus | character | rdata | The NCBI Taxonomy identifier for the genus in clade_name_genus |
| NCBI_tax_id_species | character | rdata | The NCBI Taxonomy identifier for the species in clade_name_species |
| NCBI_tax_id_terminal | character | rdata | The NCBI Taxonomy identifier for the terminal (strain, subspecies, etc.) in clade_name_terminal |
| relative_abundance | float | assay | The proportion of the total microbial community represented by the clade |
| additional_species | character | rdata | Any other species represented by the same set of detected markers |
| uuid | character | cname | Sample UUID |
| db_version | character | cdata | MetaPhlAn database version(s) referenced |
| command | character | cdata | MetaPhlAn command given |
| reads_processed | character | cdata | Number of reads processed |
| metaphlan_header | character | cdata | MetaPhlAn’s custom header row |
| original_columns | character | cdata | Original MetaPhlAn column names |

Sample Identifiers (cname)

  • uuid: Sample UUID - links to sampleMetadata

Sample Metadata (cdata)

  • db_version: MetaPhlAn database version
  • command: MetaPhlAn command executed
  • reads_processed: Number of reads processed
  • metaphlan_header: Original MetaPhlAn header
  • original_columns: Original column names

Feature Identifiers (rname)

  • clade_name: Full taxonomic lineage (e.g., “k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Streptococcaceae|g__Streptococcus|s__Streptococcus_mutans”)

Feature Metadata (rdata)

  • clade_name_kingdom through clade_name_terminal: Taxonomic ranks parsed from clade_name
  • NCBI_tax_id through NCBI_tax_id_terminal: NCBI Taxonomy IDs for each rank
  • additional_species: Other species with the same marker set

Assay Data

  • relative_abundance: Percentage of the community represented by the clade (0-100%)
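
The per-rank columns are described as parsed from clade_name, and the lineage string itself is straightforward to take apart. The sketch below is a minimal base R illustration using the example lineage given above; it is not the package's own parsing code.

```r
clade_name <- paste0(
    "k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|",
    "f__Streptococcaceae|g__Streptococcus|s__Streptococcus_mutans"
)

# Ranks are separated by a literal "|".
ranks <- strsplit(clade_name, "|", fixed = TRUE)[[1]]

# Name each element by its one-letter rank prefix (k, p, c, o, f, g, s).
names(ranks) <- sub("__.*$", "", ranks)

ranks[["g"]]
ranks[["s"]]

# The last element is the most specific (terminal) rank in the lineage.
terminal <- unname(ranks[length(ranks)])
```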

Gene Families Columns

# Gene families column structure
gf_cols <- parquet_colinfo("genefamilies")
gf_cols %>%
    select(col_name, col_class, se_role, description) %>%
    kable()

| col_name | col_class | se_role | description |
|---|---|---|---|
| gene_family | character | rname | The detected gene family |
| gene_family_uniref | character | rdata | The UniRef identifier of the detected gene family |
| gene_family_genus | character | rdata | The taxonomic genus of the detected gene family |
| gene_family_species | character | rdata | The taxonomic species of the detected gene family |
| rpk_abundance | float | assay | Gene family abundance in reads per kilobase |
| uuid | character | cname | Sample UUID |
| humann_header | character | cdata | HUMAnN’s custom header row |

Key columns:

  • gene_family (rname): Full gene family identifier
  • gene_family_uniref (rdata): UniRef90 identifier
  • gene_family_genus (rdata): Taxonomic genus (if stratified)
  • gene_family_species (rdata): Taxonomic species (if stratified)
  • rpk_abundance (assay): Reads per kilobase
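
The genus and species columns are only populated for stratified rows. The sketch below shows one way such columns could be derived from a full gene family identifier in base R. The identifiers are made up, the "UniRef|g__Genus.s__Species" layout is assumed, and whether the package's own columns keep the g__ and s__ prefixes is not confirmed here.

```r
# Hypothetical identifiers: one stratified row and one community-level row.
gene_family <- c("UniRef90_A|g__Genus1.s__Species1", "UniRef90_B")

# Split the UniRef part from the optional taxon part.
pieces <- strsplit(gene_family, "|", fixed = TRUE)
uniref <- vapply(pieces, `[`, character(1), 1)
taxon <- vapply(
    pieces,
    function(p) if (length(p) > 1) p[2] else NA_character_,
    character(1)
)

# Within the taxon part, genus and species are separated by ".s__".
# Rows without a taxon stay NA; a taxon without ".s__" passes through unchanged.
genus <- sub("\\.s__.*$", "", taxon)
species <- sub("^.*\\.(s__.*)$", "\\1", taxon)

data.frame(gene_family, uniref, genus, species)
```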

Pathway Columns

# Pathway abundance/coverage column structure
pa_cols <- parquet_colinfo("pathabundance")
pa_cols %>%
    select(col_name, col_class, se_role, description) %>%
    kable()

| col_name | col_class | se_role | description |
|---|---|---|---|
| pathway | character | rname | The detected pathway |
| pathway_uniref | character | rdata | The UniRef identifier of the detected pathway |
| pathway_genus | character | rdata | The taxonomic genus of the detected pathway |
| pathway_species | character | rdata | The taxonomic species of the detected microbial pathway |
| abundance | float | assay | The abundance of the detected pathway |
| uuid | character | cname | Sample UUID |
| humann_header | character | cdata | HUMAnN’s custom header row |

Key columns:

  • pathway (rname): MetaCyc pathway identifier
  • pathway_uniref (rdata): UniRef identifier (if applicable)
  • pathway_genus (rdata): Taxonomic genus (if stratified)
  • pathway_species (rdata): Taxonomic species (if stratified)
  • abundance or coverage (assay): Measurement value

Reference Files

Reference files provide lookup tables for non-UUID identifiers found in the parquet files.

Available References

# Get information about reference files
ref_info <- get_ref_info()
kable(ref_info)

| ref_file | general_data_type | tool | description |
|---|---|---|---|
| clade_name_ref | relative_abundance | MetaPhlAn | All unique values of clade_name and NCBI_tax_id found in the MetaPhlAn relative_abundance files in the same repo |
| gene_family_ref | genefamilies | HUMAnN | All unique values of gene_family found in the HUMAnN genefamilies files in the same repo |
| genome_name_ref | viral_clusters | MetaPhlAn | All unique values of genome_name found in the MetaPhlAn viral_clusters files in the same repo |
| pathway_ref | pathabundance;pathcoverage | HUMAnN | All unique values of pathway found in the HUMAnN pathabundance and pathcoverage files in the same repo |
| uniref_marker_ref | marker_abundance;marker_presence | MetaPhlAn | All unique values of uniref marker found in the MetaPhlAn marker_abundance and marker_presence files in the same repo |

Loading Reference Files

Reference files can be loaded from remote repositories or local files.

# From remote repository (requires network)
clade_ref <- load_ref("clade_name_ref")

# From local file
refpath <- file.path(system.file("extdata",
                                 package = "parkinsonsMetagenomicData"),
                     "pathway_ref.parquet")
pathway_ref <- load_ref(ref_file = refpath)
head(pathway_ref)

Reference File Contents

clade_name_ref - Taxonomic lineages

  • All unique clade_name values from relative_abundance files
  • Includes NCBI Taxonomy IDs at each rank
  • Use for: Taxonomic lookups, lineage parsing

gene_family_ref - Gene family identifiers

  • All unique gene_family values from genefamilies files
  • Use for: Gene family annotation lookups

pathway_ref - Pathway identifiers

  • All unique pathway values from pathabundance and pathcoverage files
  • Use for: Pathway annotation, MetaCyc lookups

genome_name_ref - Viral genome identifiers

  • All unique genome_name values from viral_clusters files
  • Use for: Viral genome annotation

uniref_marker_ref - Marker gene identifiers

  • All unique uniref marker values from marker_abundance and marker_presence files
  • Use for: Marker gene lookups
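
One practical use of a reference file is to discover exact identifier values before filtering the much larger data files. The sketch below stands in a small made-up data frame for a loaded clade_name_ref table; the column name clade_name_species is assumed from the column definitions above and may differ in the real reference file.

```r
# Hypothetical stand-in for a loaded clade_name_ref table.
clade_ref <- data.frame(
    clade_name_species = c("s__Streptococcus_mutans",
                           "s__Streptococcus_salivarius",
                           "s__Escherichia_coli")
)

# Find every species in the reference that belongs to a genus of interest.
wanted <- grep("^s__Streptococcus_", clade_ref$clade_name_species, value = TRUE)

# Shape the result like the filter_values argument of loadParquetData().
filter_values <- list(clade_name_species = unique(wanted))
filter_values
```

Looking values up first avoids filtering on identifiers that do not occur in the data.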

Data Repositories

Data are hosted on Hugging Face in parquet format for efficient access.

Repository Information

# Get repository information
repos <- get_repo_info()
kable(repos)

| repo_name | repo_url | default |
|---|---|---|
| waldronlab/metagenomics_mac | https://huggingface.co/datasets/waldronlab/metagenomics_mac/tree/main | Y |
| waldronlab/metagenomics_mac_examples | https://huggingface.co/datasets/waldronlab/metagenomics_mac_examples/tree/main | N |

Default Repository

The default repository (waldronlab/metagenomics_mac) contains the full dataset with all available samples.

Examples Repository

The examples repository (waldronlab/metagenomics_mac_examples) contains small example files with 10 samples each, useful for testing and learning.

Data Access Patterns

Quick Access with returnSamples()

For most users, returnSamples() provides the simplest interface:

# Load sample metadata
data("sampleMetadata", package = "parkinsonsMetagenomicData")

# Filter to samples of interest
my_samples <- sampleMetadata %>%
    filter(control %in% c("Case", "Study Control"),
           age >= 18,
           !is.na(sex))

# Retrieve relative abundance data
tse <- returnSamples(sample_data = my_samples[1:10, ],
                     data_type = "relative_abundance")

Advanced Access with accessParquetData() and loadParquetData()

For more control over filtering and data loading:

# Connect to database
con <- accessParquetData(data_types = "relative_abundance")

# Apply filters and load
tse <- loadParquetData(con = con,
                       data_type = "relative_abundance",
                       filter_values = list(
                           clade_name_species = c("s__Streptococcus_mutans",
                                                  "s__Escherichia_coli")
                       ))

Using Local Files

When working with downloaded parquet files:

# Point to local directory
con <- accessParquetData(file_paths = "path/to/parquet/files/",
                         data_types = "relative_abundance")

# Load data
tse <- loadParquetData(con = con,
                       data_type = "relative_abundance")
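
Whichever access route is used, samples are identified by uuid, which links back to sampleMetadata. If metadata ever needs to be lined up with loaded data by hand, match() gives an order-preserving lookup. The sketch below uses made-up identifiers and a small stand-in for sampleMetadata; it does not describe what returnSamples() or loadParquetData() do internally.

```r
# Hypothetical sample identifiers, in the column order of a loaded assay.
assay_uuids <- c("uuid_3", "uuid_1")

# Hypothetical stand-in for sampleMetadata.
metadata <- data.frame(
    uuid = c("uuid_1", "uuid_2", "uuid_3"),
    control = c("Case", "Study Control", "Case")
)

# Reorder metadata rows to follow the assay columns.
# A uuid missing from the metadata would produce a row of NA values.
aligned <- metadata[match(assay_uuids, metadata$uuid), ]

# Confirm that rows and assay columns now line up one to one.
stopifnot(identical(aligned$uuid, assay_uuids))
```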

Additional Resources

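Package Functions for Data Discovery

The discovery functions used in this vignette are:

  • data_dict() - data dictionary for the sampleMetadata fields
  • biobakery_files() - available microbiome data types
  • parquet_colinfo() - column definitions for a given data type
  • get_ref_info() - available reference files
  • get_repo_info() - available data repositories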

Vignettes

  • First 15 Minutes - Quick start guide
  • Full Workflow - Comprehensive data retrieval tutorial
  • Piecewise Workflow - Advanced direct database access
  • Working with Large Parquet Files - Strategies for large data types

Session Info

sessionInfo()
#> R Under development (unstable) (2026-03-28 r89738)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] knitr_1.51                       dplyr_1.2.0                     
#> [3] parkinsonsMetagenomicData_0.99.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.41.1     httr2_1.2.2                    
#>  [3] xfun_0.57                       bslib_0.10.0                   
#>  [5] htmlwidgets_1.6.4               Biobase_2.71.0                 
#>  [7] lattice_0.22-9                  tzdb_0.5.0                     
#>  [9] yulab.utils_0.2.4               vctrs_0.7.2                    
#> [11] tools_4.7.0                     generics_0.1.4                 
#> [13] stats4_4.7.0                    parallel_4.7.0                 
#> [15] tibble_3.3.1                    pkgconfig_2.0.3                
#> [17] Matrix_1.7-5                    dbplyr_2.5.2                   
#> [19] desc_1.4.3                      S4Vectors_0.49.0               
#> [21] assertthat_0.2.1                lifecycle_1.0.5                
#> [23] stringr_1.6.0                   compiler_4.7.0                 
#> [25] treeio_1.35.0                   textshaping_1.0.5              
#> [27] Biostrings_2.79.5               Seqinfo_1.1.0                  
#> [29] codetools_0.2-20                htmltools_0.5.9                
#> [31] sass_0.4.10                     yaml_2.3.12                    
#> [33] lazyeval_0.2.2                  pkgdown_2.2.0                  
#> [35] pillar_1.11.1                   crayon_1.5.3                   
#> [37] jquerylib_0.1.4                 tidyr_1.3.2                    
#> [39] BiocParallel_1.45.0             SingleCellExperiment_1.33.2    
#> [41] DelayedArray_0.37.0             cachem_1.1.0                   
#> [43] abind_1.4-8                     nlme_3.1-169                   
#> [45] tidyselect_1.2.1                digest_0.6.39                  
#> [47] stringi_1.8.7                   duckdb_1.5.1                   
#> [49] purrr_1.2.1                     arrow_23.0.1.2                 
#> [51] TreeSummarizedExperiment_2.19.0 fastmap_1.2.0                  
#> [53] grid_4.7.0                      cli_3.6.5                      
#> [55] SparseArray_1.11.11             magrittr_2.0.4                 
#> [57] S4Arrays_1.11.1                 ape_5.8-1                      
#> [59] withr_3.0.2                     readr_2.2.0                    
#> [61] rappdirs_0.3.4                  bit64_4.6.0-1                  
#> [63] rmarkdown_2.31                  XVector_0.51.0                 
#> [65] matrixStats_1.5.0               bit_4.6.0                      
#> [67] otel_0.2.0                      hms_1.1.4                      
#> [69] ragg_1.5.2                      evaluate_1.0.5                 
#> [71] GenomicRanges_1.63.1            IRanges_2.45.0                 
#> [73] rlang_1.1.7                     Rcpp_1.1.1                     
#> [75] glue_1.8.0                      tidytree_0.4.7                 
#> [77] DBI_1.3.0                       BiocGenerics_0.57.0            
#> [79] vroom_1.7.0                     jsonlite_2.0.0                 
#> [81] R6_2.6.1                        MatrixGenerics_1.23.0          
#> [83] systemfonts_1.3.2               fs_2.0.1