Data Codebook

This vignette documents all variables and data structures in the parkinsonsMetagenomicData package, including sample metadata fields, microbiome data types, and column definitions for parquet files.

Overview

The package provides three main data components:

Sample Metadata (sampleMetadata) - Clinical and demographic information for each sample
Microbiome Data - Multiple data types from MetaPhlAn, HUMAnN, and QC tools
Reference Files - Lookup tables for taxonomic names, gene families, and pathways

Sample Metadata

The sampleMetadata data frame contains curated metadata for all samples. Both curated and uncurated features are included, with uncurated features prefixed by “uncurated_”.

Loading Sample Metadata

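library(parkinsonsMetagenomicData)
library(dplyr)  # provides %>%, filter(), and select() used in later chunks
library(knitr)  # provides kable() used in later chunks
data("sampleMetadata", package = "parkinsonsMetagenomicData")
dim(sampleMetadata)
#> [1] 3535 1177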

Core Metadata Fields

Identifiers

Field	Type	Description
`curation_id`	character	Dataset x subject identifier (format: `study_name:subject_id`)
`study_name`	character	Dataset name
`subject_id`	character	Subject identifier within study
`sample_id`	character	Unique sample identifier
`uuid`	character	Universal unique identifier for the sample
`BioProject`	character	SRA BioProject accession (format: PRJ[DEN][BA][0-9]+)
`BioSample`	character	SRA BioSample accession (format: SAM[DNEA]+?[0-9]+)
`NCBI_accession`	character	Semicolon-separated vector of NCBI accessions
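
The accession formats in the table are regular expressions, so they can be used to sanity-check identifiers. The sketch below uses base R only; the identifiers and study names are made up for illustration and are not taken from sampleMetadata.

```r
# Hypothetical identifiers, invented for this example
ids <- data.frame(
    study_name = c("ExampleA_2020", "ExampleB_2021"),
    subject_id = c("S01", "S02"),
    BioProject = c("PRJNA000001", "PRJ000002"),
    BioSample  = c("SAMN00000001", "SAMEA0000002"),
    stringsAsFactors = FALSE
)

# curation_id is documented as study_name:subject_id
ids$curation_id <- paste(ids$study_name, ids$subject_id, sep = ":")

# Anchor the documented patterns so the whole string has to match
bioproject_ok <- grepl("^PRJ[DEN][BA][0-9]+$", ids$BioProject)
biosample_ok  <- grepl("^SAM[DNEA]+?[0-9]+$", ids$BioSample, perl = TRUE)

data.frame(curation_id = ids$curation_id, bioproject_ok, biosample_ok)
```

The second BioProject value fails because it lacks the two-letter database code after the PRJ prefix.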

Study Design

Field	Type	Description	Allowed Values
`target_condition`	character	Primary phenotype/condition of interest (multiple values separated by `;`)	Ontology terms (NCIT:C7057, EFO:0000408 descendants)
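`control`	character	Sample classification in study	“Study Control”, “Case”, “Not Used” per the data dictionary; the packaged data also contain “External Comparison Group”, “Internal Comparison Group”, and “Multiple System Atrophy”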

Demographics

Field	Type	Description	Allowed Values
`age`	integer	Age of subject	Numeric value
`age_unit`	character	Unit for age	“Day”, “Week”, “Month”, “Year”
`age_group`	character	Age category	“Infant” (0-2), “Children 2-11 Years Old” (2-11), “Adolescent” (11-18), “Adult” (18-65), “Elderly” (≥65)
`sex`	character	Biological sex	“Female”, “Male”
`host_species`	character	Species of subject	“Homo sapiens”, “Mus musculus”
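
The age_group boundaries are half-open intervals (for example, 18 <= Adult < 65 in the data dictionary). As a rough sketch, assuming age is expressed in years, the same binning can be reproduced with base R's cut(). Whether the package derives age_group this way is not stated here, and the ages below are invented.

```r
# Hypothetical ages in years; NA stays NA
age <- c(0, 1, 2, 10, 11, 17, 18, 64, 65, 91, NA)

age_group <- cut(
    age,
    breaks = c(0, 2, 11, 18, 65, 130),
    labels = c("Infant", "Children 2-11 Years Old", "Adolescent",
               "Adult", "Elderly"),
    right = FALSE  # intervals are [lower, upper), as in the data dictionary
)

data.frame(age, age_group = as.character(age_group))
```

Setting right = FALSE is what makes an 11-year-old "Adolescent" rather than "Children 2-11 Years Old".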

Clinical Information

Field	Type	Description	Allowed Values
`disease`	character	Reported disease/condition(s) (multiple values separated by `;`; “Healthy” if none)	Ontology terms (NCIT:C7057, EFO:0000408 descendants)
`body_site`	character	Anatomical location	“feces”, “milk”, “nasal cavity”, “oral cavity”, “skin epidermis”, “vagina”
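
Fields such as disease and target_condition can hold several values separated by `;`. A minimal base R sketch of splitting them is shown below; the condition labels are placeholders, not real ontology terms from the data.

```r
# Hypothetical values; real entries are ontology-backed terms
disease <- c("Healthy", "Condition A;Condition B", NA)

# One character vector of conditions per sample
disease_list <- strsplit(disease, ";", fixed = TRUE)

# Number of reported conditions per sample (NA when the field is missing)
n_conditions <- ifelse(is.na(disease), NA_integer_, lengths(disease_list))

# Flag samples that report a given condition
has_b <- vapply(disease_list, function(x) "Condition B" %in% x, logical(1))

data.frame(disease, n_conditions, has_b)
```

Matching on the split values avoids the false positives that a substring search such as grepl() could produce when one term is contained in another.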

Curation

Field	Type	Description
`curator`	character	Curator name(s) (multiple values separated by `;`)

Accessing Metadata Information Programmatically

# View the data dictionary
metadata_info <- data_dict()
kable(head(metadata_info, 10))

ColName	ColClass	Unique	Required	MultipleValues	Description	AllowedValues	Delimiter	Separater	DynamicEnum	DynamicEnumProperty
study_name	character	non-unique	optional	FALSE	Dataset name.	[a-zA-Z-]+[0-9]{4}\|[a-zA-Z-]+[0-9]{4}[a-zA-Z-]+\|[a-zA-Z-]+[0-9]{4}[a-zA-Z-]+\|[a-zA-Z-]+[0-9]{4}[a-zA-Z0-9]+	NA	NA	NA	NA
subject_id	character	non-unique	required	FALSE	Subject identifier.	[0-9a-zA-Z]+	NA	NA	NA	NA
sample_id	character	unique	required	FALSE	Sample identifier.	[0-9a-zA-Z]+	NA	NA	NA	NA
target_condition	character	non-unique	required	TRUE	The primary phenotype/condition of interest in the study from which the sample is derived	NA	;	NA	NCIT:C7057;EFO:0000408	descendant
control	character	non-unique	required	FALSE	Whether the sample is control, case, or not used in the study	Study Control;Case;Not Used	NA	NA	NA	NA
body_site	character	non-unique	required	FALSE	Named locations of or within the body. The anatomical location(s) affected by the patient’s disease/condition/cancer, often the site from which the sample was derived	feces;milk;nasal cavity;oral cavity;skin epidermis;vagina	NA	NA	NA	NA
age	integer	non-unique	optional	FALSE	Age of the subject using the unit specified under ‘age_unit’ column	[0-9]+	NA	NA	NA	NA
age_group	character	non-unique	optional	FALSE	11 <= Adolescent < 18\|18 <= Adult < 65\|2 <= Children 2-11 Years Old < 11\|65 <= Elderly < 130\|0 <= Infant < 2	Adolescent;Adult;Children 2-11 Years Old;Elderly;Infant	NA	NA	NA	NA
age_unit	character	non-unique	optional	FALSE	Unit of the subject’s age specified under ‘age’ column	Day;Week;Month;Year	NA	NA	NA	NA
curator	character	non-unique	required	TRUE	Curator name.	NA	;	NA	NA	NA
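
For columns whose AllowedValues entry is a semicolon-separated enumeration, the dictionary can drive a simple validity check. The sketch below copies two rows from the output above into a small data frame; check_allowed() is a hypothetical helper written for this example, not a package function, and it does not handle columns whose AllowedValues entry is a regular expression.

```r
# Two rows copied from the dictionary output above
dict <- data.frame(
    ColName = c("control", "age_unit"),
    AllowedValues = c("Study Control;Case;Not Used", "Day;Week;Month;Year"),
    stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE where a value is missing or listed as allowed
check_allowed <- function(values, col, dict) {
    allowed <- strsplit(dict$AllowedValues[dict$ColName == col],
                        ";", fixed = TRUE)[[1]]
    is.na(values) | values %in% allowed
}

# Example values, including one that appears in the control summary
# in the next section
control <- c("Case", "Study Control", "External Comparison Group", NA)
ok <- check_allowed(control, "control", dict)
unique(control[!ok])
```

Treating NA as acceptable is a choice made for this sketch; a stricter check could also consult the Required column.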

Example: Exploring Sample Characteristics

# Summary of control status
table(sampleMetadata$control, useNA = "ifany")
#> 
#>                      Case External Comparison Group Internal Comparison Group 
#>                      1311                        90                        59 
#>   Multiple System Atrophy             Study Control                      <NA> 
#>                         8                      2052                        15

# Age distribution
summary(sampleMetadata$age)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.     NAs 
#>    1.00   34.00   55.00   48.89   66.00   91.00     608

# Studies included
head(unique(sampleMetadata$study_name))
#> [1] "AsnicarF_2021" "BedarfJR_2017" "BoktorJC_2023" "DuruIC_2024"  
#> [5] "JoS_2022"      "LeeEJ_2024"

# Body sites sampled
table(sampleMetadata$body_site, useNA = "ifany")
#> 
#> feces  <NA> 
#>  3318   217

Microbiome Data Types

The package provides multiple types of microbiome profiling data, organized by biological content and normalization method.

Available Data Types

# Get information about all data types
data_types <- biobakery_files()
kable(data_types)

data_type	tool	description	units_normalization
genefamilies	HUMAnN	Abundance of gene families, typically identified by UniRef90 IDs. This is the raw, unnormalized output.	Reads Per Kilobase (RPK)
genefamilies_cpm	HUMAnN	Gene family abundances normalized to Copies Per Million. This accounts for sequencing depth, making samples more comparable.	Copies Per Million (CPM)
genefamilies_cpm_stratified	HUMAnN	Gene family abundances (in CPM) that are taxonomically stratified, showing the contribution of each species to the total abundance of each gene family.	Copies Per Million (CPM)
genefamilies_cpm_unstratified	HUMAnN	Total community-level gene family abundances (in CPM), without taxonomic stratification. This is the sum of the stratified values for each gene family.	Copies Per Million (CPM)
genefamilies_relab	HUMAnN	Gene family abundances converted to relative abundance. The abundances in each sample are scaled to sum to 100%.	Relative Abundance (%)
genefamilies_relab_stratified	HUMAnN	Taxonomically stratified gene family abundances, expressed as relative abundance within each sample.	Relative Abundance (%)
genefamilies_relab_unstratified	HUMAnN	Total community-level gene family abundances, expressed as relative abundance.	Relative Abundance (%)
genefamilies_stratified	HUMAnN	Raw gene family abundances (in RPK) that are taxonomically stratified, showing the contribution of each species.	Reads Per Kilobase (RPK)
genefamilies_unstratified	HUMAnN	Total community-level gene family abundances (in RPK), without taxonomic stratification. This is equivalent to the main ‘genefamilies’ file.	Reads Per Kilobase (RPK)
marker_abundance	MetaPhlAn	Abundance of clade-specific marker genes. This is an intermediate file used by MetaPhlAn to calculate taxonomic relative abundances.	Mean coverage of marker genes
marker_presence	MetaPhlAn	A binary table indicating the presence (1) or absence (0) of specific marker genes for each taxon in a sample.	Binary (0 or 1)
pathabundance	HUMAnN	Abundance of metabolic pathways (e.g., MetaCyc pathways). This is the raw, unnormalized output.	Reads Per Kilobase (RPK)
pathabundance_cpm	HUMAnN	Pathway abundances normalized to Copies Per Million to account for sequencing depth.	Copies Per Million (CPM)
pathabundance_cpm_stratified	HUMAnN	Pathway abundances (in CPM) that are taxonomically stratified, showing the contribution of each species.	Copies Per Million (CPM)
pathabundance_cpm_unstratified	HUMAnN	Total community-level pathway abundances (in CPM), without taxonomic stratification.	Copies Per Million (CPM)
pathabundance_relab	HUMAnN	Pathway abundances converted to relative abundance. The abundances in each sample are scaled to sum to 100%.	Relative Abundance (%)
pathabundance_relab_stratified	HUMAnN	Taxonomically stratified pathway abundances, expressed as relative abundance.	Relative Abundance (%)
pathabundance_relab_unstratified	HUMAnN	Total community-level pathway abundances, expressed as relative abundance.	Relative Abundance (%)
pathabundance_stratified	HUMAnN	Raw pathway abundances (in RPK) that are taxonomically stratified, showing the contribution of each species.	Reads Per Kilobase (RPK)
pathabundance_unstratified	HUMAnN	Total community-level pathway abundances (in RPK). This is equivalent to the main ‘pathabundance’ file.	Reads Per Kilobase (RPK)
pathcoverage	HUMAnN	The proportion of genes within a pathway that were detected in the sample. A value of 1.0 means all genes in the pathway were found.	Proportion (0.0 to 1.0)
pathcoverage_stratified	HUMAnN	Taxonomically stratified pathway coverage, showing the coverage of a pathway within the genome of a specific contributing species.	Proportion (0.0 to 1.0)
pathcoverage_unstratified	HUMAnN	Total community-level pathway coverage. This is equivalent to the main ‘pathcoverage’ file.	Proportion (0.0 to 1.0)
relative_abundance	MetaPhlAn	The primary output for taxonomic profiling, showing the relative abundance of each microbial taxon (from kingdom to species and strain level).	Relative Abundance (%)
viral_clusters	MetaPhlAn/Custom	Represents clusters of viral sequences, often used for viral strain or species-level analysis. The values typically represent the abundance of these viral clusters.	Varies (often Relative Abundance or CPM)
strainphlan_markers	StrainPhlAn	Consensus sequences of clade-specific marker genes for each sample, used for strain-level reconstruction in StrainPhlAn.	Fraction of marker covered / Mean coverage
fastqc	FastQC	Quality control metrics for raw sequencing reads, including quality scores, adapter contamination, and sequence content, length, and duplication.	Varies by metric
kneaddata_log	KneadData	Log file reporting preprocessing steps such as quality trimming, contaminant removal (e.g., host reads), and overall read counts retained.	None (log file)
reference	NA	Reference file reporting all unique values of non-UUID identifiers in a parquet file.	NA

Data Type Categories

Taxonomic Composition (MetaPhlAn)

relative_abundance - Primary taxonomic profiling output

Description: Relative abundance of bacterial, archaeal, viral, and eukaryotic taxa
Units: Relative Abundance (%)
Levels: Kingdom through species and strain
Tool: MetaPhlAn

viral_clusters - Viral community profiling

Description: Clusters of viral sequences for strain/species-level analysis
Units: Varies (often Relative Abundance or CPM)
Tool: MetaPhlAn/Custom

marker_abundance - Marker gene quantification

Description: Abundance of clade-specific marker genes
Units: Mean coverage of marker genes
Tool: MetaPhlAn

marker_presence - Marker gene detection

Description: Binary presence/absence of marker genes
Units: Binary (0 or 1)
Tool: MetaPhlAn

Strain-Level Profiling (StrainPhlAn)

strainphlan_markers - Strain-specific markers

Description: Consensus sequences for strain-level reconstruction
Units: Fraction of marker covered / Mean coverage
Tool: StrainPhlAn

Functional Profiling (HUMAnN)

HUMAnN data types come in multiple variants based on:

Content: Gene families (genefamilies), pathway abundance (pathabundance), or pathway coverage (pathcoverage)
Stratification: Taxonomically stratified (by species) or unstratified (community total)
Normalization: Raw (RPK), relative abundance (relab), or copies per million (cpm)
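
The data type names in the table above follow a content, normalization, stratification pattern. The helper below is a hypothetical illustration of that naming scheme (it is not part of the package) and can be handy when looping over variants.

```r
# Hypothetical helper: build a data type name from its three components
humann_data_type <- function(content = c("genefamilies", "pathabundance",
                                         "pathcoverage"),
                             normalization = c("raw", "relab", "cpm"),
                             stratification = c("mixed", "stratified",
                                                "unstratified")) {
    content <- match.arg(content)
    normalization <- match.arg(normalization)
    stratification <- match.arg(stratification)
    if (content == "pathcoverage" && normalization != "raw") {
        stop("pathcoverage has no relab or cpm variants in the table above")
    }
    # Raw and mixed variants carry no suffix
    parts <- c(content,
               if (normalization != "raw") normalization,
               if (stratification != "mixed") stratification)
    paste(parts, collapse = "_")
}

humann_data_type("genefamilies", "cpm", "stratified")
humann_data_type("pathcoverage", stratification = "unstratified")
```

The resulting strings can be compared against the data_type column of the table above before use.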

Gene Families

Data Type	Stratification	Normalization	Description
`genefamilies`	Mixed	RPK	Raw output (unnormalized)
`genefamilies_unstratified`	Community total	RPK	Total gene family abundance
`genefamilies_stratified`	By species	RPK	Species-specific contributions
`genefamilies_relab_unstratified`	Community total	Relative %	Normalized community totals
`genefamilies_relab_stratified`	By species	Relative %	Normalized species contributions
`genefamilies_cpm_unstratified`	Community total	CPM	Depth-corrected totals
`genefamilies_cpm_stratified`	By species	CPM	Depth-corrected species contributions

Metabolic Pathways - Abundance

Data Type	Stratification	Normalization	Description
`pathabundance`	Mixed	RPK	Raw pathway abundance
`pathabundance_unstratified`	Community total	RPK	Total pathway abundance
`pathabundance_stratified`	By species	RPK	Species-specific pathway contributions
`pathabundance_relab_unstratified`	Community total	Relative %	Normalized community pathways
`pathabundance_relab_stratified`	By species	Relative %	Normalized species pathways
`pathabundance_cpm_unstratified`	Community total	CPM	Depth-corrected pathway totals
`pathabundance_cpm_stratified`	By species	CPM	Depth-corrected species pathways

Metabolic Pathways - Coverage

Data Type	Stratification	Units	Description
`pathcoverage`	Mixed	Proportion (0-1)	Pathway completeness
`pathcoverage_unstratified`	Community total	Proportion (0-1)	Total pathway coverage
`pathcoverage_stratified`	By species	Proportion (0-1)	Species-specific pathway coverage

Quality Control

fastqc - Sequencing quality metrics

Description: Quality scores, adapter contamination, sequence content
Units: Varies by metric
Tool: FastQC

kneaddata_log - Preprocessing statistics

Description: Quality trimming, host read removal, read counts
Units: None (log file)
Tool: KneadData

Choosing Data Types

For taxonomic analysis:

Use relative_abundance for bacteria/archaea
Add viral_clusters for viruses

For functional analysis:

Use genefamilies_relab or pathabundance_relab for relative comparisons
Use genefamilies_cpm or pathabundance_cpm when comparing across sequencing depths
Use pathcoverage to assess pathway completeness

Stratified vs. Unstratified:

Unstratified: Total community-level measurements (smaller files)
Stratified: See which species contribute to each function (larger files, richer information)
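
To make the relationship between these variants concrete, here is a toy calculation in base R. The gene families, species, and RPK values are invented, and real HUMAnN tables also carry special rows (such as UNMAPPED) that this sketch ignores.

```r
# Hypothetical RPK values for one sample: two gene families split by species
stratified <- data.frame(
    gene_family = c("GF1", "GF1", "GF2", "GF2"),
    species     = c("sp_A", "sp_B", "sp_A", "sp_B"),
    rpk         = c(10, 30, 40, 20)
)

# Unstratified value = sum of the stratified values for each gene family
unstratified <- aggregate(rpk ~ gene_family, data = stratified, FUN = sum)

total <- sum(unstratified$rpk)
unstratified$relab <- unstratified$rpk / total * 100  # sums to 100 (%)
unstratified$cpm   <- unstratified$rpk / total * 1e6  # sums to one million
unstratified
```

Both normalizations rescale by the sample total, so they differ only in the constant (100 versus one million) and lead to the same between-sample comparisons.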

Parquet File Column Definitions

Each data type has specific columns with defined roles in the TreeSummarizedExperiment structure.

Column Roles

Columns are assigned to specific components:

cname: Column names (sample identifiers, typically uuid)
cdata: Column data (sample metadata: processing parameters, versions)
rname: Row names (feature identifiers)
rdata: Row data (feature metadata)
assay: Assay data (measurement values)
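
As a rough illustration of how these roles fit together, the sketch below reshapes a small long-format table into the pieces listed above using base R only. The rows are invented, and this is not a description of how the package itself builds the TreeSummarizedExperiment.

```r
# Hypothetical long-format rows shaped like relative_abundance parquet columns
long <- data.frame(
    uuid = c("u1", "u1", "u2", "u2"),
    clade_name = c("k__Bacteria", "k__Archaea", "k__Bacteria", "k__Archaea"),
    relative_abundance = c(99, 1, 97.5, 2.5),
    db_version = c("v1", "v1", "v1", "v1"),
    stringsAsFactors = FALSE
)

# assay: features (rname) in rows, samples (cname) in columns
assay <- tapply(long$relative_abundance,
                list(long$clade_name, long$uuid), sum)

# cdata: one row per sample, keyed by uuid
cdata <- unique(long[, c("uuid", "db_version")])
rownames(cdata) <- cdata$uuid

# rdata: one row per feature (only the identifier in this toy example)
rdata <- data.frame(clade_name = rownames(assay), row.names = rownames(assay))

assay
```

Feature and sample combinations that are absent from the long table would appear as NA in the matrix produced by tapply().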

Accessing Column Information

# Get column information for a specific data type
rel_abund_cols <- parquet_colinfo("relative_abundance")
kable(rel_abund_cols)

general_data_type	col_name	col_class	description	se_role	ref_file
relative_abundance	clade_name	character	The taxonomic lineage of the detected microbial clade	rname	clade_name_ref
relative_abundance	clade_name_kingdom	character	The taxonomic kingdom of the detected microbial clade	rdata	clade_name_ref
relative_abundance	clade_name_phylum	character	The taxonomic phylum of the detected microbial clade	rdata	clade_name_ref
relative_abundance	clade_name_class	character	The taxonomic class of the detected microbial clade	rdata	clade_name_ref
relative_abundance	clade_name_order	character	The taxonomic order of the detected microbial clade	rdata	clade_name_ref
relative_abundance	clade_name_family	character	The taxonomic family of the detected microbial clade	rdata	clade_name_ref
relative_abundance	clade_name_genus	character	The taxonomic genus of the detected microbial clade	rdata	clade_name_ref
relative_abundance	clade_name_species	character	The taxonomic species of the detected microbial clade	rdata	clade_name_ref
relative_abundance	clade_name_terminal	character	The taxonomic terminal (strain, subspecies, etc.) of the detected microbial clade	rdata	clade_name_ref
relative_abundance	NCBI_tax_id	character	The NCBI Taxonomy identifier for the clade in clade_name	rdata	clade_name_ref
relative_abundance	NCBI_tax_id_kingdom	character	The NCBI Taxonomy identifier for the kingdom in clade_name_kingdom	rdata	clade_name_ref
relative_abundance	NCBI_tax_id_phylum	character	The NCBI Taxonomy identifier for the phylum in clade_name_phylum	rdata	clade_name_ref
relative_abundance	NCBI_tax_id_class	character	The NCBI Taxonomy identifier for the class in clade_name_class	rdata	clade_name_ref
relative_abundance	NCBI_tax_id_order	character	The NCBI Taxonomy identifier for the order in clade_name_order	rdata	clade_name_ref
relative_abundance	NCBI_tax_id_family	character	The NCBI Taxonomy identifier for the family in clade_name_family	rdata	clade_name_ref
relative_abundance	NCBI_tax_id_genus	character	The NCBI Taxonomy identifier for the genus in clade_name_genus	rdata	clade_name_ref
relative_abundance	NCBI_tax_id_species	character	The NCBI Taxonomy identifier for the species in clade_name_species	rdata	clade_name_ref
relative_abundance	NCBI_tax_id_terminal	character	The NCBI Taxonomy identifier for the terminal (strain, subspecies, etc.) in clade_name_terminal	rdata	clade_name_ref
relative_abundance	relative_abundance	float	The proportion of the total microbial community represented by the clade	assay	NA
relative_abundance	additional_species	character	Any other species represented by the same set of detected markers	rdata	NA
relative_abundance	uuid	character	Sample UUID	cname	NA
relative_abundance	db_version	character	MetaPhlAn database version(s) referenced	cdata	NA
relative_abundance	command	character	MetaPhlAn command given	cdata	NA
relative_abundance	reads_processed	character	Number of reads processed	cdata	NA
relative_abundance	metaphlan_header	character	MetaPhlAn’s custom header row	cdata	NA
relative_abundance	original_columns	character	Original MetaPhlAn column names	cdata	NA

Relative Abundance Columns

# Show relative abundance columns grouped by role
rel_abund_cols %>%
    select(col_name, col_class, se_role, description) %>%
    kable()

col_name	col_class	se_role	description
clade_name	character	rname	The taxonomic lineage of the detected microbial clade
clade_name_kingdom	character	rdata	The taxonomic kingdom of the detected microbial clade
clade_name_phylum	character	rdata	The taxonomic phylum of the detected microbial clade
clade_name_class	character	rdata	The taxonomic class of the detected microbial clade
clade_name_order	character	rdata	The taxonomic order of the detected microbial clade
clade_name_family	character	rdata	The taxonomic family of the detected microbial clade
clade_name_genus	character	rdata	The taxonomic genus of the detected microbial clade
clade_name_species	character	rdata	The taxonomic species of the detected microbial clade
clade_name_terminal	character	rdata	The taxonomic terminal (strain, subspecies, etc.) of the detected microbial clade
NCBI_tax_id	character	rdata	The NCBI Taxonomy identifier for the clade in clade_name
NCBI_tax_id_kingdom	character	rdata	The NCBI Taxonomy identifier for the kingdom in clade_name_kingdom
NCBI_tax_id_phylum	character	rdata	The NCBI Taxonomy identifier for the phylum in clade_name_phylum
NCBI_tax_id_class	character	rdata	The NCBI Taxonomy identifier for the class in clade_name_class
NCBI_tax_id_order	character	rdata	The NCBI Taxonomy identifier for the order in clade_name_order
NCBI_tax_id_family	character	rdata	The NCBI Taxonomy identifier for the family in clade_name_family
NCBI_tax_id_genus	character	rdata	The NCBI Taxonomy identifier for the genus in clade_name_genus
NCBI_tax_id_species	character	rdata	The NCBI Taxonomy identifier for the species in clade_name_species
NCBI_tax_id_terminal	character	rdata	The NCBI Taxonomy identifier for the terminal (strain, subspecies, etc.) in clade_name_terminal
relative_abundance	float	assay	The proportion of the total microbial community represented by the clade
additional_species	character	rdata	Any other species represented by the same set of detected markers
uuid	character	cname	Sample UUID
db_version	character	cdata	MetaPhlAn database version(s) referenced
command	character	cdata	MetaPhlAn command given
reads_processed	character	cdata	Number of reads processed
metaphlan_header	character	cdata	MetaPhlAn’s custom header row
original_columns	character	cdata	Original MetaPhlAn column names

Sample Identifiers (cname)

uuid: Sample UUID - links to sampleMetadata

Sample Metadata (cdata)

db_version: MetaPhlAn database version
command: MetaPhlAn command executed
reads_processed: Number of reads processed
metaphlan_header: Original MetaPhlAn header
original_columns: Original column names

Feature Identifiers (rname)

clade_name: Full taxonomic lineage (e.g., “k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Streptococcaceae|g__Streptococcus|s__Streptococcus_mutans”)

Feature Metadata (rdata)

clade_name_kingdom through clade_name_terminal: Taxonomic ranks parsed from clade_name
NCBI_tax_id through NCBI_tax_id_terminal: NCBI Taxonomy IDs for each rank
additional_species: Other species with the same marker set
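
The rank columns can be understood as the pieces of clade_name split on the pipe character. A minimal base R sketch using the example lineage above follows; mapping the "t" prefix to "terminal" is an assumption made for this example.

```r
clade_name <- paste0("k__Bacteria|p__Firmicutes|c__Bacilli|",
                     "o__Lactobacillales|f__Streptococcaceae|",
                     "g__Streptococcus|s__Streptococcus_mutans")

# Split on the literal pipe character
ranks <- strsplit(clade_name, "|", fixed = TRUE)[[1]]

# Map one-letter prefixes to rank names ("t" = terminal is an assumption)
rank_names <- c(k = "kingdom", p = "phylum", c = "class", o = "order",
                f = "family", g = "genus", s = "species", t = "terminal")
names(ranks) <- unname(rank_names[substr(ranks, 1, 1)])

ranks[["species"]]
#> [1] "s__Streptococcus_mutans"
```

The rank prefix is kept in each piece, which matches the "s__Streptococcus_mutans" style used for clade_name_species in the filtering example later in this vignette.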

Assay Data

relative_abundance: Percentage of the community represented by the clade (0-100)

Gene Families Columns

# Gene families column structure
gf_cols <- parquet_colinfo("genefamilies")
gf_cols %>%
    select(col_name, col_class, se_role, description) %>%
    kable()

col_name	col_class	se_role	description
gene_family	character	rname	The detected gene family
gene_family_uniref	character	rdata	The UniRef identifier of the detected gene family
gene_family_genus	character	rdata	The taxonomic genus of the detected gene family
gene_family_species	character	rdata	The taxonomic species of the detected gene family
rpk_abundance	float	assay	Gene family abundance in reads per kilobase
uuid	character	cname	Sample UUID
humann_header	character	cdata	HUMAnN’s custom header row

Key columns:

gene_family (rname): Full gene family identifier
gene_family_uniref (rdata): UniRef90 identifier
gene_family_genus (rdata): Taxonomic genus (if stratified)
gene_family_species (rdata): Taxonomic species (if stratified)
rpk_abundance (assay): Reads per kilobase
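
HUMAnN writes stratified features as an identifier followed by a pipe and a genus.species label. The sketch below shows one way such strings could be split into the rdata-style fields. The identifiers are placeholders, and whether the package keeps the g__ and s__ prefixes in its columns is an assumption.

```r
# Hypothetical identifiers in HUMAnN's stratified naming style
gene_family <- c("UniRef90_EXAMPLE1",
                 "UniRef90_EXAMPLE1|g__Genus_a.s__Species_a",
                 "UniRef90_EXAMPLE1|unclassified")

parts  <- strsplit(gene_family, "|", fixed = TRUE)
uniref <- vapply(parts, `[`, character(1), 1)
taxon  <- vapply(parts,
                 function(p) if (length(p) > 1) p[2] else NA_character_,
                 character(1))

# Genus and species are only defined for stratified, classified rows
genus   <- ifelse(grepl("^g__", taxon),
                  sub("\\.s__.*$", "", taxon), NA_character_)
species <- ifelse(grepl("\\.s__", taxon),
                  sub("^.*\\.(s__.*)$", "\\1", taxon), NA_character_)

data.frame(gene_family, uniref, genus, species)
```

Unstratified rows and "unclassified" contributions end up with NA in both taxonomic fields.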

Pathway Columns

# Pathway abundance/coverage column structure
pa_cols <- parquet_colinfo("pathabundance")
pa_cols %>%
    select(col_name, col_class, se_role, description) %>%
    kable()

col_name	col_class	se_role	description
pathway	character	rname	The detected pathway
pathway_uniref	character	rdata	The UniRef identifier of the detected pathway
pathway_genus	character	rdata	The taxonomic genus of the detected pathway
pathway_species	character	rdata	The taxonomic species of the detected microbial pathway
abundance	float	assay	The abundance of the detected pathway
uuid	character	cname	Sample UUID
humann_header	character	cdata	HUMAnN’s custom header row

Key columns:

pathway (rname): MetaCyc pathway identifier
pathway_uniref (rdata): UniRef identifier (if applicable)
pathway_genus (rdata): Taxonomic genus (if stratified)
pathway_species (rdata): Taxonomic species (if stratified)
abundance or coverage (assay): Measurement value

Reference Files

Reference files provide lookup tables for non-UUID identifiers found in the parquet files.

Available References

# Get information about reference files
ref_info <- get_ref_info()
kable(ref_info)

ref_file	general_data_type	tool	description
clade_name_ref	relative_abundance	MetaPhlAn	All unique values of clade_name and NCBI_tax_id found in the MetaPhlAn relative_abundance files in the same repo
gene_family_ref	genefamilies	HUMAnN	All unique values of gene_family found in the HUMAnN genefamilies files in the same repo
genome_name_ref	viral_clusters	MetaPhlAn	All unique values of genome_name found in the MetaPhlAn viral_clusters files in the same repo
pathway_ref	pathabundance;pathcoverage	HUMAnN	All unique values of pathway found in the HUMAnN pathabundance and pathcoverage files in the same repo
uniref_marker_ref	marker_abundance;marker_presence	MetaPhlAn	All unique values of uniref marker found in the MetaPhlAn marker_abundance and marker_presence files in the same repo

Loading Reference Files

Reference files can be loaded from remote repositories or local files.

# From remote repository (requires network)
clade_ref <- load_ref("clade_name_ref")

# From local file
refpath <- file.path(system.file("extdata",
                                 package = "parkinsonsMetagenomicData"),
                     "pathway_ref.parquet")
pathway_ref <- load_ref(ref_file = refpath)
head(pathway_ref)

Reference File Contents

clade_name_ref - Taxonomic lineages

All unique clade_name values from relative_abundance files
Includes NCBI Taxonomy IDs at each rank
Use for: Taxonomic lookups, lineage parsing

gene_family_ref - Gene family identifiers

All unique gene_family values from genefamilies files
Use for: Gene family annotation lookups

pathway_ref - Pathway identifiers

All unique pathway values from pathabundance and pathcoverage files
Use for: Pathway annotation, MetaCyc lookups

genome_name_ref - Viral genome identifiers

All unique genome_name values from viral_clusters files
Use for: Viral genome annotation

uniref_marker_ref - Marker gene identifiers

All unique uniref marker values from marker_abundance and marker_presence files
Use for: Marker gene lookups
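
One use of a reference table is to check identifiers before requesting data. The sketch below uses a small stand-in data frame; the single pathway column and the identifiers in it are assumptions for illustration, not the real contents of pathway_ref.

```r
# Hypothetical stand-in for a loaded reference table
pathway_ref <- data.frame(
    pathway = c("PWY-0001: example pathway one",
                "PWY-0002: example pathway two",
                "PWY-0003: example pathway three"),
    stringsAsFactors = FALSE
)

# Find candidate pathways by keyword
hits <- pathway_ref$pathway[grepl("two|three", pathway_ref$pathway)]

# Check requested identifiers against the reference before filtering on them
requested <- c("PWY-0002: example pathway two", "PWY-9999: not present")
found <- requested %in% pathway_ref$pathway
requested[!found]
```

Identifiers that are not found in the reference would match no rows in the corresponding parquet files, so checking first can save a slow query.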

Data Repositories

Data are hosted on Hugging Face in parquet format for efficient access.

Repository Information

# Get repository information
repos <- get_repo_info()
kable(repos)

repo_name	repo_url	default
waldronlab/metagenomics_mac	https://huggingface.co/datasets/waldronlab/metagenomics_mac/tree/main	Y
waldronlab/metagenomics_mac_examples	https://huggingface.co/datasets/waldronlab/metagenomics_mac_examples/tree/main	N

Default Repository

The default repository (waldronlab/metagenomics_mac) contains the full dataset with all available samples.

Examples Repository

The examples repository (waldronlab/metagenomics_mac_examples) contains small example files with 10 samples each, useful for testing and learning.

Data Access Patterns

Quick Access with returnSamples()

For most users, returnSamples() provides the simplest interface:

# Load sample metadata
data("sampleMetadata", package = "parkinsonsMetagenomicData")

# Filter to samples of interest
my_samples <- sampleMetadata %>%
    filter(control %in% c("Case", "Study Control"),
           age >= 18,
           !is.na(sex))

# Retrieve relative abundance data
tse <- returnSamples(sample_data = my_samples[1:10, ],
                     data_type = "relative_abundance")
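
Note that dplyr's filter() drops rows where a condition evaluates to NA, so age >= 18 also removes samples with a missing age (608 in the age summary earlier in this vignette). The base R sketch below makes that behavior explicit on a small invented table that stands in for sampleMetadata.

```r
# Hypothetical stand-in for sampleMetadata
meta <- data.frame(
    uuid    = c("u1", "u2", "u3", "u4", "u5"),
    control = c("Case", "Study Control", "Case", NA, "Case"),
    age     = c(70, 45, NA, 60, 12),
    sex     = c("Female", "Male", "Male", "Female", NA),
    stringsAsFactors = FALSE
)

# Same conditions as the filter() call above, with NA handling spelled out
keep <- meta$control %in% c("Case", "Study Control") &
    !is.na(meta$age) & meta$age >= 18 &
    !is.na(meta$sex)

my_samples <- meta[keep, ]

# How many rows are lost to missing age versus age below 18
c(missing_age = sum(is.na(meta$age)),
  under_18    = sum(meta$age < 18, na.rm = TRUE))
```

If samples with unknown age should be retained, the age condition can be written as is.na(age) | age >= 18 instead.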

Advanced Access with accessParquetData() and loadParquetData()

For more control over filtering and data loading:

# Connect to database
con <- accessParquetData(data_types = "relative_abundance")

# Apply filters and load
tse <- loadParquetData(con = con,
                       data_type = "relative_abundance",
                       filter_values = list(
                           clade_name_species = c("s__Streptococcus_mutans",
                                                  "s__Escherichia_coli")
                       ))

Using Local Files

When working with downloaded parquet files:

# Point to local directory
con <- accessParquetData(file_paths = "path/to/parquet/files/",
                         data_types = "relative_abundance")

# Load data
tse <- loadParquetData(con = con,
                       data_type = "relative_abundance")

Additional Resources

Package Functions for Data Discovery

output_file_types() - List available output file types and their properties
parquet_colinfo() - Get column definitions for a specific data type
get_repo_info() - List available Hugging Face repositories
get_ref_info() - List available reference files
get_hf_parquet_urls() - Get direct URLs to parquet files

Vignettes

First 15 Minutes - Quick start guide
Full Workflow - Comprehensive data retrieval tutorial
Piecewise Workflow - Advanced direct database access
Working with Large Parquet Files - Strategies for large data types

Session Info

sessionInfo()
#> R Under development (unstable) (2026-04-12 r89873)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] knitr_1.51                       dplyr_1.2.1                     
#> [3] parkinsonsMetagenomicData_0.99.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.41.1     httr2_1.2.2                    
#>  [3] xfun_0.57                       bslib_0.10.0                   
#>  [5] htmlwidgets_1.6.4               Biobase_2.71.0                 
#>  [7] lattice_0.22-9                  tzdb_0.5.0                     
#>  [9] yulab.utils_0.2.4               vctrs_0.7.3                    
#> [11] tools_4.7.0                     generics_0.1.4                 
#> [13] stats4_4.7.0                    parallel_4.7.0                 
#> [15] tibble_3.3.1                    pkgconfig_2.0.3                
#> [17] Matrix_1.7-5                    dbplyr_2.5.2                   
#> [19] desc_1.4.3                      S4Vectors_0.49.1-1             
#> [21] assertthat_0.2.1                lifecycle_1.0.5                
#> [23] stringr_1.6.0                   compiler_4.7.0                 
#> [25] treeio_1.35.0                   textshaping_1.0.5              
#> [27] Biostrings_2.79.5               Seqinfo_1.1.0                  
#> [29] codetools_0.2-20                htmltools_0.5.9                
#> [31] sass_0.4.10                     yaml_2.3.12                    
#> [33] lazyeval_0.2.3                  pkgdown_2.2.0                  
#> [35] pillar_1.11.1                   crayon_1.5.3                   
#> [37] jquerylib_0.1.4                 tidyr_1.3.2                    
#> [39] BiocParallel_1.45.0             SingleCellExperiment_1.33.2    
#> [41] DelayedArray_0.37.1             cachem_1.1.0                   
#> [43] abind_1.4-8                     nlme_3.1-169                   
#> [45] tidyselect_1.2.1                digest_0.6.39                  
#> [47] stringi_1.8.7                   duckdb_1.5.2                   
#> [49] purrr_1.2.2                     arrow_23.0.1.2                 
#> [51] TreeSummarizedExperiment_2.19.0 fastmap_1.2.0                  
#> [53] grid_4.7.0                      cli_3.6.6                      
#> [55] SparseArray_1.11.13             magrittr_2.0.5                 
#> [57] S4Arrays_1.11.1                 ape_5.8-1                      
#> [59] withr_3.0.2                     readr_2.2.0                    
#> [61] rappdirs_0.3.4                  bit64_4.6.0-1                  
#> [63] rmarkdown_2.31                  XVector_0.51.0                 
#> [65] matrixStats_1.5.0               bit_4.6.0                      
#> [67] otel_0.2.0                      hms_1.1.4                      
#> [69] ragg_1.5.2                      evaluate_1.0.5                 
#> [71] GenomicRanges_1.63.2            IRanges_2.45.0                 
#> [73] rlang_1.2.0                     Rcpp_1.1.1-1                   
#> [75] glue_1.8.0                      tidytree_0.4.7                 
#> [77] DBI_1.3.0                       BiocGenerics_0.57.0            
#> [79] vroom_1.7.1                     jsonlite_2.0.0                 
#> [81] R6_2.6.1                        MatrixGenerics_1.23.0          
#> [83] systemfonts_1.3.2               fs_2.0.1