Data Codebook
This vignette documents all variables and data structures in the
parkinsonsMetagenomicData package, including sample
metadata fields, microbiome data types, and column definitions for
parquet files.
Overview
The package provides three main data components:
- Sample Metadata (sampleMetadata) - Clinical and demographic information for each sample
- Microbiome Data - Multiple data types from MetaPhlAn, HUMAnN, and QC tools
- Reference Files - Lookup tables for taxonomic names, gene families, and pathways
Sample Metadata
The sampleMetadata data frame contains curated metadata
for all samples. Both curated and uncurated features are included, with
uncurated features prefixed by “uncurated_”.
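The prefix convention makes it easy to separate the two groups of columns. The sketch below uses a small made-up data frame (toy_metadata and its columns are illustrative, not the real sampleMetadata contents) to show one way of doing this in base R.

```r
# Toy stand-in for sampleMetadata; these column names are illustrative only
toy_metadata <- data.frame(
  sample_id = c("s1", "s2"),
  age = c(54L, 61L),
  uncurated_age = c("54 years", "61"),
  uncurated_sex = c("F", "male"),
  stringsAsFactors = FALSE
)

# Flag columns that carry the "uncurated_" prefix
is_uncurated <- startsWith(names(toy_metadata), "uncurated_")

# Split the column names into curated and uncurated sets
curated_cols <- names(toy_metadata)[!is_uncurated]
uncurated_cols <- names(toy_metadata)[is_uncurated]

curated_cols
uncurated_cols
```

The same two lines should work on the real data frame by swapping toy_metadata for sampleMetadata.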
Core Metadata Fields
Identifiers
| Field | Type | Description |
|---|---|---|
| curation_id | character | Dataset x subject identifier (format: study_name:subject_id) |
| study_name | character | Dataset name |
| subject_id | character | Subject identifier within study |
| sample_id | character | Unique sample identifier |
| uuid | character | Universal unique identifier for the sample |
| BioProject | character | SRA BioProject accession (format: PRJ[DEN][BA][0-9]+) |
| BioSample | character | SRA BioSample accession (format: SAM[DNEA]+?[0-9]+) |
| NCBI_accession | character | Semicolon-separated vector of NCBI accessions |
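The identifier formats above can be checked or unpacked with base R string functions. This is a minimal sketch: the accession values are made up for illustration, and anchoring the patterns with ^ and $ is an assumption about how the formats are meant to be applied.

```r
# Patterns copied from the table, anchored to match the whole string (assumption)
bioproject_pattern <- "^PRJ[DEN][BA][0-9]+$"
biosample_pattern <- "^SAM[DNEA]+?[0-9]+$"

# Made-up accessions; the last one in each vector is deliberately malformed
bioprojects <- c("PRJNA000001", "PRJEB00002", "PRJXX1")
biosamples <- c("SAMN00000001", "SAMEA0000002", "SAM123")

bioproject_ok <- grepl(bioproject_pattern, bioprojects, perl = TRUE)
biosample_ok <- grepl(biosample_pattern, biosamples, perl = TRUE)

# NCBI_accession holds several accessions in one semicolon-separated string
ncbi_accession <- "SRR0000001;SRR0000002"
accessions <- strsplit(ncbi_accession, ";", fixed = TRUE)[[1]]

# curation_id combines study_name and subject_id with a colon
curation_id <- paste("BedarfJR_2017", "subject01", sep = ":")
```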
Study Design
| Field | Type | Description | Allowed Values |
|---|---|---|---|
| target_condition | character | Primary phenotype/condition of interest (multiple values separated by ;) | Ontology terms (NCIT:C7057, EFO:0000408 descendants) |
| control | character | Sample classification in study | “Study Control”, “Case”, “Not Used” |
Demographics
| Field | Type | Description | Allowed Values |
|---|---|---|---|
| age | integer | Age of subject | Numeric value |
| age_unit | character | Unit for age | “Day”, “Week”, “Month”, “Year” |
| age_group | character | Age category | “Infant” (0-2), “Children 2-11 Years Old” (2-11), “Adolescent” (11-18), “Adult” (18-65), “Elderly” (≥65) |
| sex | character | Biological sex | “Female”, “Male” |
| host_species | character | Species of subject | “Homo sapiens”, “Mus musculus” |
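The age_group boundaries can be reproduced from age with cut(). The sketch below follows the half-open ranges given in the data dictionary (for example 11 <= Adolescent < 18) and assumes all ages are expressed in years; it is an illustration, not the package's curation code.

```r
# Example ages, assumed to be in years (age_unit == "Year")
age_years <- c(1, 2, 10, 11, 17, 18, 64, 65, 90)

# right = FALSE gives left-closed intervals: [0,2), [2,11), [11,18), [18,65), [65,130)
age_group <- cut(
  age_years,
  breaks = c(0, 2, 11, 18, 65, 130),
  labels = c("Infant", "Children 2-11 Years Old", "Adolescent", "Adult", "Elderly"),
  right = FALSE
)

data.frame(age_years, age_group)
```

Ages recorded in days, weeks, or months would need converting to years before applying these breaks.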
Clinical Information
| Field | Type | Description | Allowed Values |
|---|---|---|---|
| disease | character | Reported disease/condition(s) (multiple values separated by “;”; “Healthy” if none) | Ontology terms (NCIT:C7057, EFO:0000408 descendants) |
| body_site | character | Anatomical location | “feces”, “milk”, “nasal cavity”, “oral cavity”, “skin epidermis”, “vagina” |
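Because disease (like target_condition) can hold several values in one string, exact matching with == can miss samples. A minimal sketch of splitting on the delimiter first, using made-up disease labels rather than the real ontology terms:

```r
# Illustrative values only; real entries are ontology-derived terms
disease <- c("Parkinson disease;Constipation", "Healthy", NA)

# One character vector per sample; NA stays NA
disease_list <- strsplit(disease, ";", fixed = TRUE)

# Test membership per sample instead of comparing whole strings
has_pd <- vapply(disease_list, function(x) "Parkinson disease" %in% x, logical(1))
is_healthy <- vapply(disease_list, function(x) "Healthy" %in% x, logical(1))
```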
Accessing Metadata Information Programmatically
| ColName | ColClass | Unique | Required | MultipleValues | Description | AllowedValues | Delimiter | Separater | DynamicEnum | DynamicEnumProperty |
|---|---|---|---|---|---|---|---|---|---|---|
| study_name | character | non-unique | optional | FALSE | Dataset name. | [a-zA-Z-]+[0-9]{4}\|[a-zA-Z-]+[0-9]{4}[a-zA-Z-]+\|[a-zA-Z-]+[0-9]{4}[a-zA-Z-]+\|[a-zA-Z-]+[0-9]{4}[a-zA-Z0-9]+ | NA | NA | NA | NA |
| subject_id | character | non-unique | required | FALSE | Subject identifier. | [0-9a-zA-Z]+ | NA | NA | NA | NA |
| sample_id | character | unique | required | FALSE | Sample identifier. | [0-9a-zA-Z]+ | NA | NA | NA | NA |
| target_condition | character | non-unique | required | TRUE | The primary phenotype/condition of interest in the study from which the sample is derived | NA | ; | NA | NCIT:C7057;EFO:0000408 | descendant |
| control | character | non-unique | required | FALSE | Whether the sample is control, case, or not used in the study | Study Control;Case;Not Used | NA | NA | NA | NA |
| body_site | character | non-unique | required | FALSE | Named locations of or within the body. The anatomical location(s) affected by the patient’s disease/condition/cancer, often the site from which the sample was derived | feces;milk;nasal cavity;oral cavity;skin epidermis;vagina | NA | NA | NA | NA |
| age | integer | non-unique | optional | FALSE | Age of the subject using the unit specified under ‘age_unit’ column | [0-9]+ | NA | NA | NA | NA |
| age_group | character | non-unique | optional | FALSE | 11 <= Adolescent < 18\|18 <= Adult < 65\|2 <= Children 2-11 Years Old < 11\|65 <= Elderly < 130\|0 <= Infant < 2 | Adolescent;Adult;Children 2-11 Years Old;Elderly;Infant | NA | NA | NA | NA |
| age_unit | character | non-unique | optional | FALSE | Unit of the subject’s age specified under ‘age’ column | Day;Week;Month;Year | NA | NA | NA | NA |
| curator | character | non-unique | required | TRUE | Curator name. | NA | ; | NA | NA | NA |
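One use of the dictionary is to check observed values against the AllowedValues column. The sketch below builds a two-row toy dictionary with the same column names as the table above; allowed_for() is a hypothetical helper, not a package function, and it only makes sense for enumerated fields (entries such as [0-9]+ are regular expressions and would need grepl() instead).

```r
# Toy dictionary with the same ColName / AllowedValues layout as the table
toy_dictionary <- data.frame(
  ColName = c("control", "age_unit"),
  AllowedValues = c("Study Control;Case;Not Used", "Day;Week;Month;Year"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split the semicolon-delimited enumeration for one column
allowed_for <- function(dictionary, col) {
  entry <- dictionary$AllowedValues[dictionary$ColName == col]
  strsplit(entry, ";", fixed = TRUE)[[1]]
}

# Observed values to check (made up for this example)
observed <- c("Case", "Study Control", "Multiple System Atrophy", NA)

# Values present in the data but absent from the enumeration
unexpected <- setdiff(unique(observed[!is.na(observed)]),
                      allowed_for(toy_dictionary, "control"))
unexpected
```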
Example: Exploring Sample Characteristics
# Summary of control status
table(sampleMetadata$control, useNA = "ifany")
#>
#>                      Case External Comparison Group Internal Comparison Group
#>                      1311                        90                        59
#>   Multiple System Atrophy             Study Control                      <NA>
#>                         8                      2052                        15
# Age distribution
summary(sampleMetadata$age)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NAs
#> 1.00 34.00 55.00 48.89 66.00 91.00 608
# Studies included
head(unique(sampleMetadata$study_name))
#> [1] "AsnicarF_2021" "BedarfJR_2017" "BoktorJC_2023" "DuruIC_2024"
#> [5] "JoS_2022" "LeeEJ_2024"
# Body sites sampled
table(sampleMetadata$body_site, useNA = "ifany")
#>
#> feces  <NA>
#>  3318   217
Microbiome Data Types
The package provides multiple types of microbiome profiling data, organized by biological content and normalization method.
Available Data Types
# Get information about all data types
data_types <- biobakery_files()
kable(data_types)
| data_type | tool | description | units_normalization |
|---|---|---|---|
| genefamilies | HUMAnN | Abundance of gene families, typically identified by UniRef90 IDs. This is the raw, unnormalized output. | Reads Per Kilobase (RPK) |
| genefamilies_cpm | HUMAnN | Gene family abundances normalized to Copies Per Million. This accounts for sequencing depth, making samples more comparable. | Copies Per Million (CPM) |
| genefamilies_cpm_stratified | HUMAnN | Gene family abundances (in CPM) that are taxonomically stratified, showing the contribution of each species to the total abundance of each gene family. | Copies Per Million (CPM) |
| genefamilies_cpm_unstratified | HUMAnN | Total community-level gene family abundances (in CPM), without taxonomic stratification. This is the sum of the stratified values for each gene family. | Copies Per Million (CPM) |
| genefamilies_relab | HUMAnN | Gene family abundances converted to relative abundance. The abundances in each sample are scaled to sum to 100%. | Relative Abundance (%) |
| genefamilies_relab_stratified | HUMAnN | Taxonomically stratified gene family abundances, expressed as relative abundance within each sample. | Relative Abundance (%) |
| genefamilies_relab_unstratified | HUMAnN | Total community-level gene family abundances, expressed as relative abundance. | Relative Abundance (%) |
| genefamilies_stratified | HUMAnN | Raw gene family abundances (in RPK) that are taxonomically stratified, showing the contribution of each species. | Reads Per Kilobase (RPK) |
| genefamilies_unstratified | HUMAnN | Total community-level gene family abundances (in RPK), without taxonomic stratification. This is equivalent to the main ‘genefamilies’ file. | Reads Per Kilobase (RPK) |
| marker_abundance | MetaPhlAn | Abundance of clade-specific marker genes. This is an intermediate file used by MetaPhlAn to calculate taxonomic relative abundances. | Mean coverage of marker genes |
| marker_presence | MetaPhlAn | A binary table indicating the presence (1) or absence (0) of specific marker genes for each taxon in a sample. | Binary (0 or 1) |
| pathabundance | HUMAnN | Abundance of metabolic pathways (e.g., MetaCyc pathways). This is the raw, unnormalized output. | Reads Per Kilobase (RPK) |
| pathabundance_cpm | HUMAnN | Pathway abundances normalized to Copies Per Million to account for sequencing depth. | Copies Per Million (CPM) |
| pathabundance_cpm_stratified | HUMAnN | Pathway abundances (in CPM) that are taxonomically stratified, showing the contribution of each species. | Copies Per Million (CPM) |
| pathabundance_cpm_unstratified | HUMAnN | Total community-level pathway abundances (in CPM), without taxonomic stratification. | Copies Per Million (CPM) |
| pathabundance_relab | HUMAnN | Pathway abundances converted to relative abundance. The abundances in each sample are scaled to sum to 100%. | Relative Abundance (%) |
| pathabundance_relab_stratified | HUMAnN | Taxonomically stratified pathway abundances, expressed as relative abundance. | Relative Abundance (%) |
| pathabundance_relab_unstratified | HUMAnN | Total community-level pathway abundances, expressed as relative abundance. | Relative Abundance (%) |
| pathabundance_stratified | HUMAnN | Raw pathway abundances (in RPK) that are taxonomically stratified, showing the contribution of each species. | Reads Per Kilobase (RPK) |
| pathabundance_unstratified | HUMAnN | Total community-level pathway abundances (in RPK). This is equivalent to the main ‘pathabundance’ file. | Reads Per Kilobase (RPK) |
| pathcoverage | HUMAnN | The proportion of genes within a pathway that were detected in the sample. A value of 1.0 means all genes in the pathway were found. | Proportion (0.0 to 1.0) |
| pathcoverage_stratified | HUMAnN | Taxonomically stratified pathway coverage, showing the coverage of a pathway within the genome of a specific contributing species. | Proportion (0.0 to 1.0) |
| pathcoverage_unstratified | HUMAnN | Total community-level pathway coverage. This is equivalent to the main ‘pathcoverage’ file. | Proportion (0.0 to 1.0) |
| relative_abundance | MetaPhlAn | The primary output for taxonomic profiling, showing the relative abundance of each microbial taxon (from kingdom to species and strain level). | Relative Abundance (%) |
| viral_clusters | MetaPhlAn/Custom | Represents clusters of viral sequences, often used for viral strain or species-level analysis. The values typically represent the abundance of these viral clusters. | Varies (often Relative Abundance or CPM) |
| strainphlan_markers | StrainPhlAn | Consensus sequences of clade-specific marker genes for each sample, used for strain-level reconstruction in StrainPhlAn. | Fraction of marker covered / Mean coverage |
| fastqc | FastQC | Quality control metrics for raw sequencing reads, including quality scores, adapter contamination, and sequence content, length, and duplication. | Varies by metric |
| kneaddata_log | KneadData | Log file reporting preprocessing steps such as quality trimming, contaminant removal (e.g., host reads), and overall read counts retained. | None (log file) |
| reference | NA | Reference file reporting all unique values of non-UUID identifiers in a parquet file. | NA |
Data Type Categories
Taxonomic Composition (MetaPhlAn)
relative_abundance - Primary taxonomic
profiling output
- Description: Relative abundance of bacterial, archaeal, viral, and eukaryotic taxa
- Units: Relative Abundance (%)
- Levels: Kingdom through species and strain
- Tool: MetaPhlAn
viral_clusters - Viral community
profiling
- Description: Clusters of viral sequences for strain/species-level analysis
- Units: Varies (often Relative Abundance or CPM)
- Tool: MetaPhlAn/Custom
marker_abundance - Marker gene
quantification
- Description: Abundance of clade-specific marker genes
- Units: Mean coverage of marker genes
- Tool: MetaPhlAn
marker_presence - Marker gene
detection
- Description: Binary presence/absence of marker genes
- Units: Binary (0 or 1)
- Tool: MetaPhlAn
Strain-Level Profiling (StrainPhlAn)
strainphlan_markers - Strain-specific
markers
- Description: Consensus sequences for strain-level reconstruction
- Units: Fraction of marker covered / Mean coverage
- Tool: StrainPhlAn
Functional Profiling (HUMAnN)
HUMAnN data types come in multiple variants based on:
- Content: Gene families (genefamilies), pathway abundance (pathabundance), or pathway coverage (pathcoverage)
- Stratification: Taxonomically stratified (by species) or unstratified (community total)
- Normalization: Raw (RPK), relative abundance (relab), or copies per million (cpm)
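The three dimensions combine into data type names in a regular way, which can be handy when looping over variants. The sketch below rebuilds the stratified and unstratified names for genefamilies and pathabundance; the suffix vectors are inferred from the names listed in this codebook and are not taken from the package internals.

```r
# Content types that have raw, relab, and cpm variants in this codebook
contents <- c("genefamilies", "pathabundance")

# Raw RPK has no suffix; the other normalizations add one
normalizations <- c("", "_relab", "_cpm")
stratifications <- c("_stratified", "_unstratified")

# All combinations of content x normalization x stratification
grid <- expand.grid(content = contents,
                    norm = normalizations,
                    strat = stratifications,
                    stringsAsFactors = FALSE)

data_type_names <- sort(paste0(grid$content, grid$norm, grid$strat))
data_type_names
```

pathcoverage is left out because it is listed only in stratified and unstratified forms, and the combined base names such as genefamilies or pathabundance_cpm are additional entries not generated here.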
Gene Families
| Data Type | Stratification | Normalization | Description |
|---|---|---|---|
| genefamilies | Mixed | RPK | Raw output (unnormalized) |
| genefamilies_unstratified | Community total | RPK | Total gene family abundance |
| genefamilies_stratified | By species | RPK | Species-specific contributions |
| genefamilies_relab_unstratified | Community total | Relative % | Normalized community totals |
| genefamilies_relab_stratified | By species | Relative % | Normalized species contributions |
| genefamilies_cpm_unstratified | Community total | CPM | Depth-corrected totals |
| genefamilies_cpm_stratified | By species | CPM | Depth-corrected species contributions |
Metabolic Pathways - Abundance
| Data Type | Stratification | Normalization | Description |
|---|---|---|---|
| pathabundance | Mixed | RPK | Raw pathway abundance |
| pathabundance_unstratified | Community total | RPK | Total pathway abundance |
| pathabundance_stratified | By species | RPK | Species-specific pathway contributions |
| pathabundance_relab_unstratified | Community total | Relative % | Normalized community pathways |
| pathabundance_relab_stratified | By species | Relative % | Normalized species pathways |
| pathabundance_cpm_unstratified | Community total | CPM | Depth-corrected pathway totals |
| pathabundance_cpm_stratified | By species | CPM | Depth-corrected species pathways |
Metabolic Pathways - Coverage
| Data Type | Stratification | Units | Description |
|---|---|---|---|
| pathcoverage | Mixed | Proportion (0-1) | Pathway completeness |
| pathcoverage_unstratified | Community total | Proportion (0-1) | Total pathway coverage |
| pathcoverage_stratified | By species | Proportion (0-1) | Species-specific pathway coverage |
Choosing Data Types
For taxonomic analysis:
- Use relative_abundance for bacteria/archaea
- Add viral_clusters for viruses
For functional analysis:
- Use genefamilies_relab or pathabundance_relab for relative comparisons
- Use genefamilies_cpm or pathabundance_cpm when comparing across sequencing depths
- Use pathcoverage to assess pathway completeness
Stratified vs. Unstratified:
- Unstratified: Total community-level measurements (smaller files)
- Stratified: See which species contribute to each function (larger files, richer information)
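To make the relationship between these variants concrete, the sketch below starts from a toy stratified RPK table for one sample, sums it to community totals, and then rescales those totals. The values and names are made up, and the CPM step assumes the usual convention that a sample's values are scaled to sum to one million; the relative abundance step follows the "sum to 100%" description in the data type table.

```r
# Toy stratified RPK values for a single sample (illustrative numbers)
stratified <- data.frame(
  gene_family = c("GF_A", "GF_A", "GF_B", "GF_B"),
  species = c("sp1", "sp2", "sp1", "sp2"),
  rpk = c(30, 10, 40, 20),
  stringsAsFactors = FALSE
)

# Unstratified total = sum of the species-level contributions per gene family
unstratified <- aggregate(rpk ~ gene_family, data = stratified, FUN = sum)

# Copies per million: scale the sample so its values sum to 1e6 (assumed convention)
unstratified$cpm <- unstratified$rpk / sum(unstratified$rpk) * 1e6

# Relative abundance in percent: scale the sample so its values sum to 100
unstratified$relab_pct <- unstratified$rpk / sum(unstratified$rpk) * 100

unstratified
```

Both normalizations remove the dependence on sequencing depth; they differ only in the constant they scale to.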
Parquet File Column Definitions
Each data type has specific columns with defined roles in the TreeSummarizedExperiment structure.
Column Roles
Columns are assigned to specific components:
- cname: Column names (sample identifiers, typically uuid)
- cdata: Column data (sample metadata: processing parameters, versions)
- rname: Row names (feature identifiers)
- rdata: Row data (feature metadata)
- assay: Assay data (measurement values)
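As a rough illustration of what these roles mean, the sketch below pivots a few long-format rows into a feature-by-sample matrix, using the rname column for rows, the cname column for columns, and the assay column for values. The rows are made up, and filling absent combinations with zero is an assumption of this example; it is not a description of how the package builds its TreeSummarizedExperiment.

```r
# Toy long-format rows, one per (sample, feature) pair
long_rows <- data.frame(
  uuid = c("u1", "u1", "u2"),
  clade_name = c("k__Bacteria|p__Firmicutes",
                 "k__Bacteria|p__Bacteroidetes",
                 "k__Bacteria|p__Firmicutes"),
  relative_abundance = c(60, 40, 100),
  stringsAsFactors = FALSE
)

# rname -> rows, cname -> columns, assay -> cell values
assay_matrix <- tapply(long_rows$relative_abundance,
                       list(long_rows$clade_name, long_rows$uuid),
                       sum)

# Combinations absent from the long table come back as NA; treat them as 0 here
assay_matrix[is.na(assay_matrix)] <- 0

assay_matrix
```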
Accessing Column Information
# Get column information for a specific data type
rel_abund_cols <- parquet_colinfo("relative_abundance")
kable(rel_abund_cols)
| general_data_type | col_name | col_class | description | se_role | ref_file |
|---|---|---|---|---|---|
| relative_abundance | clade_name | character | The taxonomic lineage of the detected microbial clade | rname | clade_name_ref |
| relative_abundance | clade_name_kingdom | character | The taxonomic kingdom of the detected microbial clade | rdata | clade_name_ref |
| relative_abundance | clade_name_phylum | character | The taxonomic phylum of the detected microbial clade | rdata | clade_name_ref |
| relative_abundance | clade_name_class | character | The taxonomic class of the detected microbial clade | rdata | clade_name_ref |
| relative_abundance | clade_name_order | character | The taxonomic order of the detected microbial clade | rdata | clade_name_ref |
| relative_abundance | clade_name_family | character | The taxonomic family of the detected microbial clade | rdata | clade_name_ref |
| relative_abundance | clade_name_genus | character | The taxonomic genus of the detected microbial clade | rdata | clade_name_ref |
| relative_abundance | clade_name_species | character | The taxonomic species of the detected microbial clade | rdata | clade_name_ref |
| relative_abundance | clade_name_terminal | character | The taxonomic terminal (strain, subspecies, etc.) of the detected microbial clade | rdata | clade_name_ref |
| relative_abundance | NCBI_tax_id | character | The NCBI Taxonomy identifier for the clade in clade_name | rdata | clade_name_ref |
| relative_abundance | NCBI_tax_id_kingdom | character | The NCBI Taxonomy identifier for the kingdom in clade_name_kingdom | rdata | clade_name_ref |
| relative_abundance | NCBI_tax_id_phylum | character | The NCBI Taxonomy identifier for the phylum in clade_name_phylum | rdata | clade_name_ref |
| relative_abundance | NCBI_tax_id_class | character | The NCBI Taxonomy identifier for the class in clade_name_class | rdata | clade_name_ref |
| relative_abundance | NCBI_tax_id_order | character | The NCBI Taxonomy identifier for the order in clade_name_order | rdata | clade_name_ref |
| relative_abundance | NCBI_tax_id_family | character | The NCBI Taxonomy identifier for the family in clade_name_family | rdata | clade_name_ref |
| relative_abundance | NCBI_tax_id_genus | character | The NCBI Taxonomy identifier for the genus in clade_name_genus | rdata | clade_name_ref |
| relative_abundance | NCBI_tax_id_species | character | The NCBI Taxonomy identifier for the species in clade_name_species | rdata | clade_name_ref |
| relative_abundance | NCBI_tax_id_terminal | character | The NCBI Taxonomy identifier for the terminal (strain, subspecies, etc.) in clade_name_terminal | rdata | clade_name_ref |
| relative_abundance | relative_abundance | float | The proportion of the total microbial community represented by the clade | assay | NA |
| relative_abundance | additional_species | character | Any other species represented by the same set of detected markers | rdata | NA |
| relative_abundance | uuid | character | Sample UUID | cname | NA |
| relative_abundance | db_version | character | MetaPhlAn database version(s) referenced | cdata | NA |
| relative_abundance | command | character | MetaPhlAn command given | cdata | NA |
| relative_abundance | reads_processed | character | Number of reads processed | cdata | NA |
| relative_abundance | metaphlan_header | character | MetaPhlAn’s custom header row | cdata | NA |
| relative_abundance | original_columns | character | Original MetaPhlAn column names | cdata | NA |
Relative Abundance Columns
# Show relative abundance columns grouped by role
rel_abund_cols %>%
  select(col_name, col_class, se_role, description) %>%
  kable()
| col_name | col_class | se_role | description |
|---|---|---|---|
| clade_name | character | rname | The taxonomic lineage of the detected microbial clade |
| clade_name_kingdom | character | rdata | The taxonomic kingdom of the detected microbial clade |
| clade_name_phylum | character | rdata | The taxonomic phylum of the detected microbial clade |
| clade_name_class | character | rdata | The taxonomic class of the detected microbial clade |
| clade_name_order | character | rdata | The taxonomic order of the detected microbial clade |
| clade_name_family | character | rdata | The taxonomic family of the detected microbial clade |
| clade_name_genus | character | rdata | The taxonomic genus of the detected microbial clade |
| clade_name_species | character | rdata | The taxonomic species of the detected microbial clade |
| clade_name_terminal | character | rdata | The taxonomic terminal (strain, subspecies, etc.) of the detected microbial clade |
| NCBI_tax_id | character | rdata | The NCBI Taxonomy identifier for the clade in clade_name |
| NCBI_tax_id_kingdom | character | rdata | The NCBI Taxonomy identifier for the kingdom in clade_name_kingdom |
| NCBI_tax_id_phylum | character | rdata | The NCBI Taxonomy identifier for the phylum in clade_name_phylum |
| NCBI_tax_id_class | character | rdata | The NCBI Taxonomy identifier for the class in clade_name_class |
| NCBI_tax_id_order | character | rdata | The NCBI Taxonomy identifier for the order in clade_name_order |
| NCBI_tax_id_family | character | rdata | The NCBI Taxonomy identifier for the family in clade_name_family |
| NCBI_tax_id_genus | character | rdata | The NCBI Taxonomy identifier for the genus in clade_name_genus |
| NCBI_tax_id_species | character | rdata | The NCBI Taxonomy identifier for the species in clade_name_species |
| NCBI_tax_id_terminal | character | rdata | The NCBI Taxonomy identifier for the terminal (strain, subspecies, etc.) in clade_name_terminal |
| relative_abundance | float | assay | The proportion of the total microbial community represented by the clade |
| additional_species | character | rdata | Any other species represented by the same set of detected markers |
| uuid | character | cname | Sample UUID |
| db_version | character | cdata | MetaPhlAn database version(s) referenced |
| command | character | cdata | MetaPhlAn command given |
| reads_processed | character | cdata | Number of reads processed |
| metaphlan_header | character | cdata | MetaPhlAn’s custom header row |
| original_columns | character | cdata | Original MetaPhlAn column names |
Sample Metadata (cdata)
- db_version: MetaPhlAn database version
- command: MetaPhlAn command executed
- reads_processed: Number of reads processed
- metaphlan_header: Original MetaPhlAn header
- original_columns: Original column names
Feature Identifiers (rname)
- clade_name: Full taxonomic lineage (e.g., “k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Streptococcaceae|g__Streptococcus|s__Streptococcus_mutans”)
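A lineage string in this format can be split into its ranks with base R. The sketch below uses the example lineage above; the single-letter prefixes (k, p, c, o, f, g, s) are read directly from that string.

```r
clade_name <- "k__Bacteria|p__Firmicutes|c__Bacilli|o__Lactobacillales|f__Streptococcaceae|g__Streptococcus|s__Streptococcus_mutans"

# Split on the literal "|" (fixed = TRUE avoids regex alternation)
ranks <- strsplit(clade_name, "|", fixed = TRUE)[[1]]

# Separate the rank prefix ("k", "p", ...) from the taxon name
rank_prefix <- sub("__.*$", "", ranks)
taxon <- sub("^[a-z]__", "", ranks)

# Named vector: look up a rank by its prefix
lineage <- setNames(taxon, rank_prefix)
lineage[["g"]]
lineage[["s"]]
```

In the parquet files the per-rank values are already available in the clade_name_kingdom through clade_name_terminal columns, so this kind of parsing is mainly useful for lineage strings obtained elsewhere.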
Gene Families Columns
# Gene families column structure
gf_cols <- parquet_colinfo("genefamilies")
gf_cols %>%
  select(col_name, col_class, se_role, description) %>%
  kable()
| col_name | col_class | se_role | description |
|---|---|---|---|
| gene_family | character | rname | The detected gene family |
| gene_family_uniref | character | rdata | The UniRef identifier of the detected gene family |
| gene_family_genus | character | rdata | The taxonomic genus of the detected gene family |
| gene_family_species | character | rdata | The taxonomic species of the detected gene family |
| rpk_abundance | float | assay | Gene family abundance in reads per kilobase |
| uuid | character | cname | Sample UUID |
| humann_header | character | cdata | HUMAnN’s custom header row |
Key columns:
- gene_family (rname): Full gene family identifier
- gene_family_uniref (rdata): UniRef90 identifier
- gene_family_genus (rdata): Taxonomic genus (if stratified)
- gene_family_species (rdata): Taxonomic species (if stratified)
- rpk_abundance (assay): Reads per kilobase
Pathway Columns
# Pathway abundance/coverage column structure
pa_cols <- parquet_colinfo("pathabundance")
pa_cols %>%
  select(col_name, col_class, se_role, description) %>%
  kable()
| col_name | col_class | se_role | description |
|---|---|---|---|
| pathway | character | rname | The detected pathway |
| pathway_uniref | character | rdata | The UniRef identifier of the detected pathway |
| pathway_genus | character | rdata | The taxonomic genus of the detected pathway |
| pathway_species | character | rdata | The taxonomic species of the detected microbial pathway |
| abundance | float | assay | The abundance of the detected pathway |
| uuid | character | cname | Sample UUID |
| humann_header | character | cdata | HUMAnN’s custom header row |
Key columns:
- pathway (rname): MetaCyc pathway identifier
- pathway_uniref (rdata): UniRef identifier (if applicable)
- pathway_genus (rdata): Taxonomic genus (if stratified)
- pathway_species (rdata): Taxonomic species (if stratified)
- abundance or coverage (assay): Measurement value
Reference Files
Reference files provide lookup tables for non-UUID identifiers found in the parquet files.
Available References
# Get information about reference files
ref_info <- get_ref_info()
kable(ref_info)
| ref_file | general_data_type | tool | description |
|---|---|---|---|
| clade_name_ref | relative_abundance | MetaPhlAn | All unique values of clade_name and NCBI_tax_id found in the MetaPhlAn relative_abundance files in the same repo |
| gene_family_ref | genefamilies | HUMAnN | All unique values of gene_family found in the HUMAnN genefamilies files in the same repo |
| genome_name_ref | viral_clusters | MetaPhlAn | All unique values of genome_name found in the MetaPhlAn viral_clusters files in the same repo |
| pathway_ref | pathabundance;pathcoverage | HUMAnN | All unique values of pathway found in the HUMAnN pathabundance and pathcoverage files in the same repo |
| uniref_marker_ref | marker_abundance;marker_presence | MetaPhlAn | All unique values of uniref marker found in the MetaPhlAn marker_abundance and marker_presence files in the same repo |
Loading Reference Files
Reference files can be loaded from remote repositories or local files.
# From remote repository (requires network)
clade_ref <- load_ref("clade_name_ref")
# From local file
refpath <- file.path(system.file("extdata",
                                 package = "parkinsonsMetagenomicData"),
                     "pathway_ref.parquet")
pathway_ref <- load_ref(ref_file = refpath)
head(pathway_ref)
Reference File Contents
clade_name_ref - Taxonomic lineages
- All unique clade_name values from relative_abundance files
- Includes NCBI Taxonomy IDs at each rank
- Use for: Taxonomic lookups, lineage parsing
gene_family_ref - Gene family identifiers
- All unique gene_family values from genefamilies files
- Use for: Gene family annotation lookups
pathway_ref - Pathway identifiers
- All unique pathway values from pathabundance and pathcoverage files
- Use for: Pathway annotation, MetaCyc lookups
genome_name_ref - Viral genome identifiers
- All unique genome_name values from viral_clusters files
- Use for: Viral genome annotation
uniref_marker_ref - Marker gene identifiers
- All unique uniref marker values from marker_abundance and marker_presence files
- Use for: Marker gene lookups
Data Repositories
Data are hosted on Hugging Face in parquet format for efficient access.
Repository Information
# Get repository information
repos <- get_repo_info()
kable(repos)
| repo_name | repo_url | default |
|---|---|---|
| waldronlab/metagenomics_mac | https://huggingface.co/datasets/waldronlab/metagenomics_mac/tree/main | Y |
| waldronlab/metagenomics_mac_examples | https://huggingface.co/datasets/waldronlab/metagenomics_mac_examples/tree/main | N |
Data Access Patterns
Quick Access with returnSamples()
For most users, returnSamples() provides the simplest
interface:
# Load sample metadata
data("sampleMetadata", package = "parkinsonsMetagenomicData")
# Filter to samples of interest
my_samples <- sampleMetadata %>%
  filter(control %in% c("Case", "Study Control"),
         age >= 18,
         !is.na(sex))
# Retrieve relative abundance data
tse <- returnSamples(sample_data = my_samples[1:10, ],
                     data_type = "relative_abundance")
Advanced Access with accessParquetData() and loadParquetData()
For more control over filtering and data loading:
# Connect to database
con <- accessParquetData(data_types = "relative_abundance")
# Apply filters and load
tse <- loadParquetData(con = con,
                       data_type = "relative_abundance",
                       filter_values = list(
                         clade_name_species = c("s__Streptococcus_mutans",
                                                "s__Escherichia_coli")
                       ))
Using Local Files
When working with downloaded parquet files:
# Point to local directory
con <- accessParquetData(file_paths = "path/to/parquet/files/",
                         data_types = "relative_abundance")
# Load data
tse <- loadParquetData(con = con,
                       data_type = "relative_abundance")
Additional Resources
Package Functions for Data Discovery
- output_file_types() - List available output file types and their properties
- parquet_colinfo() - Get column definitions for a specific data type
- get_repo_info() - List available Hugging Face repositories
- get_ref_info() - List available reference files
- get_hf_parquet_urls() - Get direct URLs to parquet files
Vignettes
- First 15 Minutes - Quick start guide
- Full Workflow - Comprehensive data retrieval tutorial
- Piecewise Workflow - Advanced direct database access
- Working with Large Parquet Files - Strategies for large data types
Session Info
sessionInfo()
#> R Under development (unstable) (2026-03-28 r89738)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] knitr_1.51 dplyr_1.2.0
#> [3] parkinsonsMetagenomicData_0.99.0
#>
#> loaded via a namespace (and not attached):
#> [1] SummarizedExperiment_1.41.1 httr2_1.2.2
#> [3] xfun_0.57 bslib_0.10.0
#> [5] htmlwidgets_1.6.4 Biobase_2.71.0
#> [7] lattice_0.22-9 tzdb_0.5.0
#> [9] yulab.utils_0.2.4 vctrs_0.7.2
#> [11] tools_4.7.0 generics_0.1.4
#> [13] stats4_4.7.0 parallel_4.7.0
#> [15] tibble_3.3.1 pkgconfig_2.0.3
#> [17] Matrix_1.7-5 dbplyr_2.5.2
#> [19] desc_1.4.3 S4Vectors_0.49.0
#> [21] assertthat_0.2.1 lifecycle_1.0.5
#> [23] stringr_1.6.0 compiler_4.7.0
#> [25] treeio_1.35.0 textshaping_1.0.5
#> [27] Biostrings_2.79.5 Seqinfo_1.1.0
#> [29] codetools_0.2-20 htmltools_0.5.9
#> [31] sass_0.4.10 yaml_2.3.12
#> [33] lazyeval_0.2.2 pkgdown_2.2.0
#> [35] pillar_1.11.1 crayon_1.5.3
#> [37] jquerylib_0.1.4 tidyr_1.3.2
#> [39] BiocParallel_1.45.0 SingleCellExperiment_1.33.2
#> [41] DelayedArray_0.37.0 cachem_1.1.0
#> [43] abind_1.4-8 nlme_3.1-169
#> [45] tidyselect_1.2.1 digest_0.6.39
#> [47] stringi_1.8.7 duckdb_1.5.1
#> [49] purrr_1.2.1 arrow_23.0.1.2
#> [51] TreeSummarizedExperiment_2.19.0 fastmap_1.2.0
#> [53] grid_4.7.0 cli_3.6.5
#> [55] SparseArray_1.11.11 magrittr_2.0.4
#> [57] S4Arrays_1.11.1 ape_5.8-1
#> [59] withr_3.0.2 readr_2.2.0
#> [61] rappdirs_0.3.4 bit64_4.6.0-1
#> [63] rmarkdown_2.31 XVector_0.51.0
#> [65] matrixStats_1.5.0 bit_4.6.0
#> [67] otel_0.2.0 hms_1.1.4
#> [69] ragg_1.5.2 evaluate_1.0.5
#> [71] GenomicRanges_1.63.1 IRanges_2.45.0
#> [73] rlang_1.1.7 Rcpp_1.1.1
#> [75] glue_1.8.0 tidytree_0.4.7
#> [77] DBI_1.3.0 BiocGenerics_0.57.0
#> [79] vroom_1.7.0 jsonlite_2.0.0
#> [81] R6_2.6.1 MatrixGenerics_1.23.0
#> [83] systemfonts_1.3.2 fs_2.0.1