
This function queries the Hugging Face Hub API to find all Parquet files in a specified dataset repository. It constructs a direct download URL for each file and then joins this information with a local file containing definitions of bioBakery data types.
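
The URL-construction step can be sketched in base R as follows. This is an illustration only, not the package's implementation: the helper name build_parquet_urls and its revision argument are hypothetical, and the resolve/main URL pattern is inferred from the example output on this page. The file names are passed in directly, so the sketch does not contact the Hub API.

```r
# Hypothetical helper: build direct download URLs for the Parquet files in a
# Hugging Face dataset repository. The URL pattern is inferred from the
# example output on this page, not taken from the package source.
build_parquet_urls <- function(repo_name, filenames, revision = "main") {
  # Keep only files with a .parquet extension
  parquet_files <- filenames[grepl("\\.parquet$", filenames)]
  data.frame(
    filename = parquet_files,
    url = sprintf(
      "https://huggingface.co/datasets/%s/resolve/%s/%s",
      repo_name, revision, parquet_files
    ),
    stringsAsFactors = FALSE
  )
}

# Non-Parquet files such as README.md are dropped
urls <- build_parquet_urls(
  repo_name = "waldronlab/metagenomics_mac",
  filenames = c("clade_name_ref.parquet", "README.md")
)
urls$url
```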

Usage

get_hf_parquet_urls(repo_name = NULL, verbose = FALSE)

Arguments

repo_name

A character string specifying the Hugging Face dataset repository name in the format "user/repo" or "org/repo". If NULL, the default repository listed in get_repo_info() is used. Default: NULL

verbose

Logical. If TRUE, print verbose output. Default: FALSE

Value

A data.frame with the following columns:

filename

The name of the Parquet file.

url

The full download URL for the file.

data_type

The data type derived from the file name, used for joining with metadata.

tool

The bioBakery tool that typically produces the data type.

description

A brief description of the data type.

units_normalization

The units or normalization method used.

Details

The metadata is sourced from the "biobakery-file-definitions.csv" file, which is expected to be in the inst/extdata directory of the parkinsonsMetagenomicData package. If this package is not available, the metadata columns will be populated with NA.
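
The join and its NA fallback can be sketched as below. This is a sketch under stated assumptions, not the package's code: the helper add_definitions is hypothetical, the definitions data frame stands in for the contents of "biobakery-file-definitions.csv", and the column names follow the example output on this page.

```r
# Hypothetical helper: attach data-type definitions to a table of files.
# `definitions` stands in for the contents of biobakery-file-definitions.csv;
# pass NULL to mimic the case where the definitions are unavailable.
add_definitions <- function(files, definitions = NULL) {
  meta_cols <- c("tool", "description", "units_normalization")
  if (is.null(definitions)) {
    # No metadata available: fill the metadata columns with NA
    files[meta_cols] <- NA_character_
    return(files)
  }
  # Left join: keep every file, even those without a matching definition
  merge(files, definitions, by = "data_type", all.x = TRUE, sort = FALSE)
}

# Illustrative rows based on the example output on this page
files <- data.frame(
  filename = c(
    "clade_name_ref.parquet",
    "genefamilies_cpm_gene_family_uniref.parquet"
  ),
  data_type = c("reference", "genefamilies_cpm"),
  stringsAsFactors = FALSE
)
definitions <- data.frame(
  data_type = "genefamilies_cpm",
  tool = "HUMAnN",
  description = "Gene family abundances normalized to Copies Per Million.",
  units_normalization = "Copies Per Million (CPM)",
  stringsAsFactors = FALSE
)

joined <- add_definitions(files, definitions)  # unmatched rows get NA
no_meta <- add_definitions(files)              # all metadata columns are NA
```

A left join is used here so that files without a matching definition are kept rather than dropped, which matches the NA entries shown for the reference files in the example output.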

Examples

# \donttest{
 file_info <- get_hf_parquet_urls()
 head(file_info)
#>                                                 filename
#> 1                                 clade_name_ref.parquet
#> 2                                gene_family_ref.parquet
#> 3            genefamilies_cpm_gene_family_uniref.parquet
#> 4 genefamilies_cpm_stratified_gene_family_uniref.parquet
#> 5               genefamilies_cpm_stratified_uuid.parquet
#> 6      genefamilies_cpm_unstratified_gene_family.parquet
#>                                                                                                                               url
#> 1                                 https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/clade_name_ref.parquet
#> 2                                https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/gene_family_ref.parquet
#> 3            https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/genefamilies_cpm_gene_family_uniref.parquet
#> 4 https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/genefamilies_cpm_stratified_gene_family_uniref.parquet
#> 5               https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/genefamilies_cpm_stratified_uuid.parquet
#> 6      https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/genefamilies_cpm_unstratified_gene_family.parquet
#>                       data_type   tool
#> 1                     reference   <NA>
#> 2                     reference   <NA>
#> 3              genefamilies_cpm HUMAnN
#> 4   genefamilies_cpm_stratified HUMAnN
#> 5   genefamilies_cpm_stratified HUMAnN
#> 6 genefamilies_cpm_unstratified HUMAnN
#>                                                                                                                                               description
#> 1                                                                   Reference file reporting all unique values of non-UUID identifiers in a parquet file.
#> 2                                                                   Reference file reporting all unique values of non-UUID identifiers in a parquet file.
#> 3                            Gene family abundances normalized to Copies Per Million. This accounts for sequencing depth, making samples more comparable.
#> 4 Gene family abundances (in CPM) that are taxonomically stratified, showing the contribution of each species to the total abundance of each gene family.
#> 5 Gene family abundances (in CPM) that are taxonomically stratified, showing the contribution of each species to the total abundance of each gene family.
#> 6 Total community-level gene family abundances (in CPM), without taxonomic stratification. This is the sum of the stratified values for each gene family.
#>        units_normalization
#> 1                     <NA>
#> 2                     <NA>
#> 3 Copies Per Million (CPM)
#> 4 Copies Per Million (CPM)
#> 5 Copies Per Million (CPM)
#> 6 Copies Per Million (CPM)
# }