
Working with Large Parquet Files

This vignette focuses on strategies for working with the largest data types in the package, particularly genefamilies_stratified and similar large files. While these files work the same way as smaller data types, they require careful query strategies to avoid rate limiting and performance issues.

Understanding Parquet File Sorting

Each data type has parquet files sorted by specific columns. This sorting is crucial for query performance when accessing remote files.

For example, genefamilies_stratified has two parquet files:

  - genefamilies_stratified_uuid.parquet - sorted by the uuid column
  - genefamilies_stratified_gene_family_uniref.parquet - sorted by the gene_family_uniref column

Key principle: When querying remote parquet files, always filter on sorted columns first to minimize data transfer and avoid rate limiting.

The Sorted Column Strategy

Why This Matters

When you filter on a sorted column, DuckDB can efficiently skip large portions of the file without reading them. When you filter on a non-sorted column, DuckDB must scan the entire file remotely, which:

  1. Takes much longer
  2. Transfers more data
  3. Can trigger Hugging Face rate limits (HTTP 429 errors)
  4. May cause the query to hang or fail
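The difference can be seen directly with DuckDB's parquet reader. The sketch below is illustrative only: the file path is a placeholder, and querying the real remote files additionally requires DuckDB's httpfs extension. The point is that a predicate on the file's sort column lets DuckDB prune row groups from parquet metadata, while a predicate on any other column forces a full scan.

```r
# Illustrative sketch -- the file path is a placeholder, not a real local file.
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())

# Fast: predicate on the sort column (uuid) is pushed down, so DuckDB can
# skip row groups whose min/max uuid range excludes the target value.
dbGetQuery(con, "
  SELECT * FROM read_parquet('genefamilies_stratified_uuid.parquet')
  WHERE uuid = '1d949e2f-8bdb-48b7-bcf5-171c37d9ad66'
")

# Slow on remote files: a predicate on a non-sorted column (here,
# gene_family_species) cannot prune row groups, so every row group is read.
dbGetQuery(con, "
  SELECT * FROM read_parquet('genefamilies_stratified_uuid.parquet')
  WHERE gene_family_species = 's__Bacteroides_vulgatus'
")

dbDisconnect(con, shutdown = TRUE)
```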

Two-Stage Filtering Approach

For large files, use this two-stage strategy:

Stage 1: Remote filtering on sorted columns

  - Filter by sorted columns (e.g., uuid, gene_family_uniref) when querying the remote parquet files
  - This creates a smaller TreeSummarizedExperiment containing only the relevant data

Stage 2: Local filtering on non-sorted columns

  - Once the data is in R, apply additional filters on non-sorted columns (e.g., gene_family_species)
  - Use standard R/dplyr operations on the TreeSummarizedExperiment object
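Stage 2 might look like the following sketch, where `tse` stands in for a TreeSummarizedExperiment returned by loadParquetData() after Stage 1 filtering:

```r
# Hypothetical Stage 2 sketch: subset locally on a non-sorted column.
# 'tse' is assumed to be a TreeSummarizedExperiment from loadParquetData().
library(TreeSummarizedExperiment)

# Filter rows by gene_family_species, a non-sorted column stored in rowData
species_hits <- grepl("Bacteroides_vulgatus",
                      rowData(tse)$gene_family_species)
tse_sub <- tse[species_hits, ]
```

Because this step runs entirely in R on data already transferred, it incurs no further remote reads and cannot trigger rate limits.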

Alternative: Download Files Locally

For repeated queries or very large extractions, download the parquet files locally. This completely eliminates rate limits and allows filtering on any column. See the Full Workflow vignette for a working example of local file usage with system.file() and the inst/extdata example files.
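A minimal download sketch follows. It assumes the standard Hugging Face "resolve" URL layout for the hf:// path shown in the warning messages below; verify the URL against the dataset page, and see the Full Workflow vignette for the canonical way to pass local files to accessParquetData().

```r
# Hedged sketch: download one large parquet file for repeated local querying.
# The URL is an assumption based on the usual Hugging Face resolve layout.
url <- paste0("https://huggingface.co/datasets/waldronlab/metagenomics_mac/",
              "resolve/main/genefamilies_stratified_gene_family_uniref.parquet")
local_path <- file.path(tempdir(),
                        "genefamilies_stratified_gene_family_uniref.parquet")
download.file(url, local_path, mode = "wb")
```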

Practical Example: Querying genefamilies_stratified

The examples below demonstrate the sorted column strategy with genefamilies_stratified, one of the largest data types. These queries use accessParquetData() and loadParquetData() for fine-grained control.

Resource Considerations

Large file operations may require:

  - Adequate disk space if using a persistent DuckDB database file (specify with the dbdir parameter)
  - Sufficient RAM for in-memory operations (the default :memory:)
  - For very large extractions, a high-memory environment or cloud instance
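For example, a persistent database file might be requested like this; the sketch assumes dbdir is accepted directly by accessParquetData(), so check ?accessParquetData for the exact signature:

```r
# Hedged sketch: back the DuckDB connection with a file on disk instead of
# the in-memory default, so large intermediate results spill to disk.
con_gs <- accessParquetData(data_types = "genefamilies_stratified",
                            dbdir = "large_queries.duckdb")
```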

Example Scenario: Mouse Microbiome Gene Families

This example retrieves specific gene families from a mouse study, demonstrating the sorted column strategy:

# Establish connection to genefamilies_stratified remote files
con_gs <- accessParquetData(data_types = "genefamilies_stratified")

# Target: Gene families specific to different mouse groups
# - UniRef90_T4BVE4: present only in SPF mice
# - UniRef90_A0A1B1SA57: present only in WildR mice

# Load study metadata
data("sampleMetadata", package = "parkinsonsMetagenomicData")
selected_samples <- sampleMetadata |>
    filter(study_name == "MazmanianS_DumitrescuDG") |>
    select(where(~ !any(is.na(.x))))

# STRATEGY: Pre-filter sample IDs using metadata before querying parquet files
# This minimizes the data transfer from remote files

wildr_ids <- selected_samples |>
    filter(MazmanianS_DumitrescuDG_uncurated_donor_microbiome_type == "WildR") |>
    pull(uuid)

spf_ids <- selected_samples |>
    filter(MazmanianS_DumitrescuDG_uncurated_donor_microbiome_type == "SPF") |>
    pull(uuid)

# Query 1: SPF-specific gene family with subset of samples
# Filters on SORTED columns: gene_family_uniref and uuid
spf_ex <- loadParquetData(con_gs, "genefamilies_stratified",
        filter_values = list(gene_family_uniref = "UniRef90_T4BVE4",
                            uuid = spf_ids))
#> 'genefamilies_stratified' is a large data type, and collecting the query can take a while. To avoid going through the Hugging Face API, download the source file hf://datasets/waldronlab/metagenomics_mac/genefamilies_stratified_gene_family_uniref.parquet and provide it to accessParquetData() in the 'local files' argument.
spf_ex
#> class: TreeSummarizedExperiment 
#> dim: 4 14 
#> metadata(0):
#> assays(1): rpk_abundance
#> rownames(4): UniRef90_T4BVE4|g__Bacteroides.s__Bacteroides_vulgatus
#>   UniRef90_T4BVE4|unclassified
#>   UniRef90_T4BVE4|g__Bacteroides.s__Bacteroides_thetaiotaomicron
#>   UniRef90_T4BVE4|g__Parabacteroides.s__Parabacteroides_distasonis
#> rowData names(4): gene_family gene_family_uniref gene_family_genus
#>   gene_family_species
#> colnames(14): 1d949e2f-8bdb-48b7-bcf5-171c37d9ad66
#>   d2b9638a-8c15-4d1d-b5cd-3efeeeed0f2f ...
#>   631e73ff-42b9-47ca-b2a0-8dcc7558af6c
#>   529f3a3a-5ef3-495d-93af-4aa316b2cbf4
#> colData names(43): uuid humann_header ...
#>   MazmanianS_DumitrescuDG_uncurated_unitn_file_paths
#>   MazmanianS_DumitrescuDG_uncurated_study_name
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL

# Query 2: WildR-specific gene family with subset of samples
wildr_ex <- loadParquetData(con_gs, "genefamilies_stratified",
    filter_values = list(gene_family_uniref = "UniRef90_A0A1B1SA57",
                        uuid = wildr_ids))
#> 'genefamilies_stratified' is a large data type, and collecting the query can take a while. To avoid going through the Hugging Face API, download the source file hf://datasets/waldronlab/metagenomics_mac/genefamilies_stratified_gene_family_uniref.parquet and provide it to accessParquetData() in the 'local files' argument.
wildr_ex
#> class: TreeSummarizedExperiment 
#> dim: 2 14 
#> metadata(0):
#> assays(1): rpk_abundance
#> rownames(2):
#>   UniRef90_A0A1B1SA57|g__Muribaculum.s__Muribaculum_intestinale
#>   UniRef90_A0A1B1SA57|unclassified
#> rowData names(4): gene_family gene_family_uniref gene_family_genus
#>   gene_family_species
#> colnames(14): 985f49c5-a0d2-428d-98f4-b458b0c2c0da
#>   1e626f8b-ce4e-4a6a-ae29-82e7ec86b8b8 ...
#>   4509ee29-d50a-493e-94bf-ac283f46c0dc
#>   074be6f7-95e8-4874-a959-f6ea3c4f9a57
#> colData names(43): uuid humann_header ...
#>   MazmanianS_DumitrescuDG_uncurated_unitn_file_paths
#>   MazmanianS_DumitrescuDG_uncurated_study_name
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL

# Query 3: Multiple gene families across all samples in study
# This works but transfers more data (all samples vs. filtered subset)
all_ex <- loadParquetData(con_gs, "genefamilies_stratified",
                filter_values = list(gene_family_uniref = c("UniRef90_T4BVE4",
                                                        "UniRef90_A0A1B1SA57"),
                                    uuid = selected_samples$uuid))
#> 'genefamilies_stratified' is a large data type, and collecting the query can take a while. To avoid going through the Hugging Face API, download the source file hf://datasets/waldronlab/metagenomics_mac/genefamilies_stratified_gene_family_uniref.parquet and provide it to accessParquetData() in the 'local files' argument.
all_ex
#> class: TreeSummarizedExperiment 
#> dim: 6 28 
#> metadata(0):
#> assays(1): rpk_abundance
#> rownames(6): UniRef90_T4BVE4|g__Bacteroides.s__Bacteroides_vulgatus
#>   UniRef90_T4BVE4|unclassified ...
#>   UniRef90_A0A1B1SA57|g__Muribaculum.s__Muribaculum_intestinale
#>   UniRef90_A0A1B1SA57|unclassified
#> rowData names(4): gene_family gene_family_uniref gene_family_genus
#>   gene_family_species
#> colnames(28): 1d949e2f-8bdb-48b7-bcf5-171c37d9ad66
#>   d2b9638a-8c15-4d1d-b5cd-3efeeeed0f2f ...
#>   4509ee29-d50a-493e-94bf-ac283f46c0dc
#>   074be6f7-95e8-4874-a959-f6ea3c4f9a57
#> colData names(43): uuid humann_header ...
#>   MazmanianS_DumitrescuDG_uncurated_unitn_file_paths
#>   MazmanianS_DumitrescuDG_uncurated_study_name
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):
#> rowLinks: NULL
#> rowTree: NULL
#> colLinks: NULL
#> colTree: NULL

Key Takeaways

  1. Always filter on sorted columns when querying remote large files
  2. Pre-filter sample IDs using metadata before querying parquet files
  3. Balance specificity vs. breadth: More specific queries (fewer samples/features) are faster
  4. For filtering on non-sorted columns (e.g., gene_family_species), retrieve data first, then filter the TreeSummarizedExperiment in R

When to Download Locally

Consider downloading parquet files if:

  - You need to query the same large files repeatedly
  - You need to filter extensively on non-sorted columns
  - You’re working in an environment with fast local disk but slow or unreliable internet
  - You’re hitting rate limits despite using sorted-column strategies

See the Full Workflow vignette for local file examples.

sessionInfo()
#> R Under development (unstable) (2026-03-28 r89738)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.4 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
#>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
#>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
#>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
#>  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
#> 
#> time zone: Etc/UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] DT_0.34.0                        DBI_1.3.0                       
#> [3] dplyr_1.2.0                      parkinsonsMetagenomicData_0.99.0
#> 
#> loaded via a namespace (and not attached):
#>  [1] SummarizedExperiment_1.41.1     httr2_1.2.2                    
#>  [3] xfun_0.57                       bslib_0.10.0                   
#>  [5] htmlwidgets_1.6.4               Biobase_2.71.0                 
#>  [7] lattice_0.22-9                  tzdb_0.5.0                     
#>  [9] yulab.utils_0.2.4               vctrs_0.7.2                    
#> [11] tools_4.7.0                     generics_0.1.4                 
#> [13] curl_7.0.0                      stats4_4.7.0                   
#> [15] parallel_4.7.0                  tibble_3.3.1                   
#> [17] blob_1.3.0                      pkgconfig_2.0.3                
#> [19] Matrix_1.7-5                    dbplyr_2.5.2                   
#> [21] desc_1.4.3                      S4Vectors_0.49.0               
#> [23] assertthat_0.2.1                lifecycle_1.0.5                
#> [25] stringr_1.6.0                   compiler_4.7.0                 
#> [27] treeio_1.35.0                   textshaping_1.0.5              
#> [29] Biostrings_2.79.5               Seqinfo_1.1.0                  
#> [31] codetools_0.2-20                htmltools_0.5.9                
#> [33] sass_0.4.10                     yaml_2.3.12                    
#> [35] lazyeval_0.2.2                  pkgdown_2.2.0                  
#> [37] pillar_1.11.1                   crayon_1.5.3                   
#> [39] jquerylib_0.1.4                 tidyr_1.3.2                    
#> [41] BiocParallel_1.45.0             SingleCellExperiment_1.33.2    
#> [43] DelayedArray_0.37.0             cachem_1.1.0                   
#> [45] abind_1.4-8                     nlme_3.1-169                   
#> [47] tidyselect_1.2.1                digest_0.6.39                  
#> [49] stringi_1.8.7                   duckdb_1.5.1                   
#> [51] purrr_1.2.1                     arrow_23.0.1.2                 
#> [53] TreeSummarizedExperiment_2.19.0 fastmap_1.2.0                  
#> [55] grid_4.7.0                      cli_3.6.5                      
#> [57] SparseArray_1.11.11             magrittr_2.0.4                 
#> [59] S4Arrays_1.11.1                 ape_5.8-1                      
#> [61] withr_3.0.2                     readr_2.2.0                    
#> [63] rappdirs_0.3.4                  bit64_4.6.0-1                  
#> [65] rmarkdown_2.31                  XVector_0.51.0                 
#> [67] matrixStats_1.5.0               bit_4.6.0                      
#> [69] otel_0.2.0                      hms_1.1.4                      
#> [71] ragg_1.5.2                      evaluate_1.0.5                 
#> [73] knitr_1.51                      GenomicRanges_1.63.1           
#> [75] IRanges_2.45.0                  rlang_1.1.7                    
#> [77] Rcpp_1.1.1                      glue_1.8.0                     
#> [79] tidytree_0.4.7                  BiocGenerics_0.57.0            
#> [81] vroom_1.7.0                     jsonlite_2.0.0                 
#> [83] R6_2.6.1                        MatrixGenerics_1.23.0          
#> [85] systemfonts_1.3.2               fs_2.0.1