Get Parquet File URLs and Metadata from a Hugging Face Repository — get_hf_parquet_urls

This function queries the Hugging Face Hub API to find all Parquet files in a specified dataset repository. It constructs a direct download URL for each file and joins the result with a local file containing definitions of bioBakery data types.
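The URL pattern visible in the example output further down can be illustrated with a minimal sketch. The helper name build_parquet_urls and its revision argument are hypothetical and not part of the package; the sketch also assumes the file names have already been retrieved and that the files sit at the repository root on the "main" branch. The real function may work differently.

```r
# Hypothetical helper (not part of the package): build direct download URLs
# for Parquet files in a Hugging Face dataset repository.
# Assumption: files live at the repository root on the given revision.
build_parquet_urls <- function(repo_name, filenames, revision = "main") {
  stopifnot(is.character(repo_name), length(repo_name) == 1L)
  # Keep only files with a .parquet extension
  parquet <- filenames[grepl("\\.parquet$", filenames)]
  data.frame(
    filename = parquet,
    url = sprintf(
      "https://huggingface.co/datasets/%s/resolve/%s/%s",
      repo_name, revision, parquet
    ),
    stringsAsFactors = FALSE
  )
}

urls <- build_parquet_urls(
  "waldronlab/metagenomics_mac",
  c("README.md", "clade_name_ref.parquet", "gene_family_ref.parquet")
)
print(urls$url)
```

Non-Parquet entries such as README.md are dropped before the URLs are assembled, so the result has one row per Parquet file.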

Usage

get_hf_parquet_urls(repo_name = NULL, verbose = FALSE)

Arguments

repo_name: A character string specifying the Hugging Face dataset repository name in the format "user/repo" or "org/repo". If NULL, the default repository listed in get_repo_info() is used. Default: NULL
verbose: Logical. Whether to print verbose output. Default: FALSE

Value

A data.frame with the following columns:

filename: The name of the Parquet file.
url: The full download URL for the file.
data_type: The data type derived from the file name, used for joining with metadata.
tool: The bioBakery tool that typically produces the data type.
description: A brief description of the data type.
units_normalization: The units or normalization method used.

Details

The metadata is sourced from the "biobakery-file-definitions.csv" file, which is expected to be in the inst/extdata directory of the parkinsonsMetagenomicData package. If this package is not available, the metadata columns will be populated with NA.
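The NA fallback described above can be sketched as follows. This is an illustration only: the helper name add_definitions, the join key "data_type", and the assumption that the CSV uses the same column names as the returned data.frame are all hypothetical, and the package's actual implementation may differ.

```r
# Hypothetical helper (not part of the package): attach metadata columns,
# or fill them with NA when the definitions file cannot be found.
# Assumptions: the CSV has a "data_type" key column and uses the same
# metadata column names as the returned data.frame.
add_definitions <- function(files, package = "parkinsonsMetagenomicData") {
  meta_cols <- c("tool", "description", "units_normalization")
  # system.file() returns "" when the package or file is not available
  path <- system.file("extdata", "biobakery-file-definitions.csv",
                      package = package)
  if (!nzchar(path)) {
    files[meta_cols] <- NA_character_
    return(files)
  }
  defs <- read.csv(path, stringsAsFactors = FALSE)
  # Left join: keep every file, even one without a matching definition
  merge(files, defs, by = "data_type", all.x = TRUE)
}

files <- data.frame(
  filename = "genefamilies_cpm_stratified_uuid.parquet",
  data_type = "genefamilies_cpm_stratified",
  stringsAsFactors = FALSE
)
# A deliberately nonexistent package name triggers the NA fallback
res <- add_definitions(files, package = "notARealPackage123")
print(names(res))
```

A left join is used so that files without a definition still appear in the result with NA metadata, which is consistent with the reference files in the example output below.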

Examples

# \donttest{
 file_info <- get_hf_parquet_urls()
 head(file_info)
#>                                                 filename
#> 1                                 clade_name_ref.parquet
#> 2                                gene_family_ref.parquet
#> 3            genefamilies_cpm_gene_family_uniref.parquet
#> 4 genefamilies_cpm_stratified_gene_family_uniref.parquet
#> 5               genefamilies_cpm_stratified_uuid.parquet
#> 6      genefamilies_cpm_unstratified_gene_family.parquet
#>                                                                                                                               url
#> 1                                 https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/clade_name_ref.parquet
#> 2                                https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/gene_family_ref.parquet
#> 3            https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/genefamilies_cpm_gene_family_uniref.parquet
#> 4 https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/genefamilies_cpm_stratified_gene_family_uniref.parquet
#> 5               https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/genefamilies_cpm_stratified_uuid.parquet
#> 6      https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/genefamilies_cpm_unstratified_gene_family.parquet
#>                       data_type   tool
#> 1                     reference   <NA>
#> 2                     reference   <NA>
#> 3              genefamilies_cpm HUMAnN
#> 4   genefamilies_cpm_stratified HUMAnN
#> 5   genefamilies_cpm_stratified HUMAnN
#> 6 genefamilies_cpm_unstratified HUMAnN
#>                                                                                                                                               description
#> 1                                                                   Reference file reporting all unique values of non-UUID identifiers in a parquet file.
#> 2                                                                   Reference file reporting all unique values of non-UUID identifiers in a parquet file.
#> 3                            Gene family abundances normalized to Copies Per Million. This accounts for sequencing depth, making samples more comparable.
#> 4 Gene family abundances (in CPM) that are taxonomically stratified, showing the contribution of each species to the total abundance of each gene family.
#> 5 Gene family abundances (in CPM) that are taxonomically stratified, showing the contribution of each species to the total abundance of each gene family.
#> 6 Total community-level gene family abundances (in CPM), without taxonomic stratification. This is the sum of the stratified values for each gene family.
#>        units_normalization
#> 1                     <NA>
#> 2                     <NA>
#> 3 Copies Per Million (CPM)
#> 4 Copies Per Million (CPM)
#> 5 Copies Per Million (CPM)
#> 6 Copies Per Million (CPM)
# }