Get Parquet File URLs and Metadata from a Hugging Face Repository
Source: R/readParquet.R
get_hf_parquet_urls() queries the Hugging Face Hub API to find all Parquet files within a specified dataset repository. It constructs the direct download URL for each file and then joins this information with a local file containing definitions for bioBakery data types.
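The URL construction step can be sketched as follows. This is a minimal illustration, not the package's implementation: build_hf_resolve_url() is a hypothetical helper, the list of repository files is assumed to have been retrieved from the Hub already, and the URL pattern is inferred from the example output on this page.

```r
# Hypothetical helper (not part of the package): build direct download URLs
# following the pattern seen in the example output,
# https://huggingface.co/datasets/<repo>/resolve/<revision>/<file>
build_hf_resolve_url <- function(repo_id, filename, revision = "main") {
  sprintf("https://huggingface.co/datasets/%s/resolve/%s/%s",
          repo_id, revision, filename)
}

# Assumed input: file names already listed from the repository.
# README.md is included only to show the Parquet filter at work.
repo_files <- c("README.md",
                "clade_name_ref.parquet",
                "gene_family_ref.parquet")

# Keep only Parquet files, then build one URL per file
parquet_files <- repo_files[grepl("\\.parquet$", repo_files)]
urls <- build_hf_resolve_url("waldronlab/metagenomics_mac", parquet_files)

file_urls <- data.frame(filename = parquet_files, url = urls,
                        stringsAsFactors = FALSE)
print(file_urls)
```

sprintf() is vectorized, so a single call yields one URL per file name.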
Value
A data.frame with the following columns:
- filename
The name of the Parquet file.
- url
The full download URL for the file.
- data_type
The data type derived from the file name, used for joining with the metadata.
- tool
The bioBakery tool that typically produces the data type.
- description
A brief description of the data type.
- units_normalization
The units or normalization method used.
Details
The metadata is sourced from the "biobakery-file-definitions.csv"
file, which is expected to be in the inst/extdata directory of the
parkinsonsMetagenomicData package. If this package is not available,
the metadata columns will be populated with NA.
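The fallback behaviour described above can be sketched as a left join that fills the metadata columns with NA when no definitions are available. This is an illustration under assumptions, not the package's code: add_file_definitions() is a hypothetical helper, the column names follow the example output on this page, and the small definitions table is a deliberately incomplete stand-in for the CSV, whose actual layout is not shown here.

```r
# Hypothetical helper: attach metadata columns to a table of files.
# `definitions` is NULL when the definitions file could not be found.
add_file_definitions <- function(files, definitions = NULL) {
  meta_cols <- c("tool", "description", "units_normalization")
  if (is.null(definitions)) {
    files[meta_cols] <- NA_character_   # no metadata: fill with NA
    return(files)
  }
  # Left join: keep every file, even those without a matching definition
  merged <- merge(files, definitions, by = "data_type",
                  all.x = TRUE, sort = FALSE)
  merged[c(names(files), meta_cols)]
}

# system.file() returns "" when the package or the file is not installed
csv_path <- system.file("extdata", "biobakery-file-definitions.csv",
                        package = "parkinsonsMetagenomicData")
definitions_found <- nzchar(csv_path)

files <- data.frame(
  filename  = c("clade_name_ref.parquet",
                "genefamilies_cpm_gene_family_uniref.parquet"),
  url       = c("https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/clade_name_ref.parquet",
                "https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/genefamilies_cpm_gene_family_uniref.parquet"),
  data_type = c("reference", "genefamilies_cpm"),
  stringsAsFactors = FALSE
)

# Toy definitions table with a single entry (assumed column layout)
toy_definitions <- data.frame(
  data_type           = "genefamilies_cpm",
  tool                = "HUMAnN",
  description         = "Gene family abundances normalized to Copies Per Million.",
  units_normalization = "Copies Per Million (CPM)",
  stringsAsFactors    = FALSE
)

with_meta    <- add_file_definitions(files, toy_definitions)
without_meta <- add_file_definitions(files)   # metadata columns are all NA
```

Using all.x = TRUE keeps files that have no matching definition, so the result always has one row per Parquet file.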
Examples
# \donttest{
file_info <- get_hf_parquet_urls()
head(file_info)
#> filename
#> 1 clade_name_ref.parquet
#> 2 gene_family_ref.parquet
#> 3 genefamilies_cpm_gene_family_uniref.parquet
#> 4 genefamilies_cpm_stratified_gene_family_uniref.parquet
#> 5 genefamilies_cpm_stratified_uuid.parquet
#> 6 genefamilies_cpm_unstratified_gene_family.parquet
#> url
#> 1 https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/clade_name_ref.parquet
#> 2 https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/gene_family_ref.parquet
#> 3 https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/genefamilies_cpm_gene_family_uniref.parquet
#> 4 https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/genefamilies_cpm_stratified_gene_family_uniref.parquet
#> 5 https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/genefamilies_cpm_stratified_uuid.parquet
#> 6 https://huggingface.co/datasets/waldronlab/metagenomics_mac/resolve/main/genefamilies_cpm_unstratified_gene_family.parquet
#> data_type tool
#> 1 reference <NA>
#> 2 reference <NA>
#> 3 genefamilies_cpm HUMAnN
#> 4 genefamilies_cpm_stratified HUMAnN
#> 5 genefamilies_cpm_stratified HUMAnN
#> 6 genefamilies_cpm_unstratified HUMAnN
#> description
#> 1 Reference file reporting all unique values of non-UUID identifiers in a parquet file.
#> 2 Reference file reporting all unique values of non-UUID identifiers in a parquet file.
#> 3 Gene family abundances normalized to Copies Per Million. This accounts for sequencing depth, making samples more comparable.
#> 4 Gene family abundances (in CPM) that are taxonomically stratified, showing the contribution of each species to the total abundance of each gene family.
#> 5 Gene family abundances (in CPM) that are taxonomically stratified, showing the contribution of each species to the total abundance of each gene family.
#> 6 Total community-level gene family abundances (in CPM), without taxonomic stratification. This is the sum of the stratified values for each gene family.
#> units_normalization
#> 1 <NA>
#> 2 <NA>
#> 3 Copies Per Million (CPM)
#> 4 Copies Per Million (CPM)
#> 5 Copies Per Million (CPM)
#> 6 Copies Per Million (CPM)
# }