Package Overview
This package retrieves, stores, and handles specific output files produced by the curatedMetagenomicsNextflow pipeline. For additional utility functions for analyzing the data in these files, including taxonomy handling and statistical tests, see biobakeryUtils.
Available Data
The ASAP-MAC initiative has collected a number of Parkinson’s Disease-focused studies for metadata curation, uniform processing, and meta-analysis. Progress on collecting these datasets can be followed in the parkinsons_data_search repository. Most of the studies listed in the table parkinson_shotgun_datasets.tsv are available for retrieval with this package.
To browse the available data, load the sampleMetadata object included in this package. Additionally, calling biobakery_files() will provide a list of the different output file types that are available for each sample.
Sample Metadata
Metadata for all samples present within parkinsonsMetagenomicData is available through the included data.frame sampleMetadata. This data.frame contains both curated and uncurated features; uncurated features are prefixed with “uncurated_”. The curated features currently include the following:
- curation_id
- study_name
- sample_id
- subject_id
- target_condition
- target_condition_ontology_term_id
- control
- control_ontology_term_id
- age
- age_group
- age_group_ontology_term_id
- age_unit
- age_unit_ontology_term_id
- sex
- sex_ontology_term_id
- disease
- disease_ontology_term_id
- curator
- BioProject
- BioSample
- NCBI_accession
- uuid
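The “uncurated_” prefix makes it straightforward to separate the two groups of columns. The following is a minimal sketch, not part of the package: `split_features()` is a hypothetical helper, and `toy_metadata` is a small made-up data.frame standing in for sampleMetadata so that the example runs without the package installed.

```r
# Toy stand-in for sampleMetadata (made-up values, for illustration only).
toy_metadata <- data.frame(
  study_name    = c("StudyA", "StudyA", "StudyB"),
  sample_id     = c("s1", "s2", "s3"),
  age           = c(61, 70, 58),
  uncurated_age = c("61", "70 years", "58"),
  uncurated_sex = c("M", "female", "F"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split a metadata data.frame into curated and
# uncurated columns based on the column-name prefix.
split_features <- function(metadata, prefix = "uncurated_") {
  is_uncurated <- startsWith(colnames(metadata), prefix)
  list(
    curated   = metadata[, !is_uncurated, drop = FALSE],
    uncurated = metadata[, is_uncurated, drop = FALSE]
  )
}

parts <- split_features(toy_metadata)

# Inspect the column names in each group.
print(colnames(parts$curated))
print(colnames(parts$uncurated))

# Count samples per study using a curated column.
print(table(parts$curated$study_name))
```

With the package loaded, the same helper could presumably be applied to the real object as `split_features(sampleMetadata)`.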
Output File Types
- MetaPhlAn
  - viral_clusters
  - relative_abundance
  - marker_abundance
  - marker_presence
- StrainPhlAn
  - strainphlan_markers
- HUMAnN
  - genefamilies
  - genefamilies_cpm
  - genefamilies_relab
  - genefamilies_stratified
  - genefamilies_unstratified
  - genefamilies_cpm_stratified
  - genefamilies_relab_stratified
  - genefamilies_cpm_unstratified
  - genefamilies_relab_unstratified
  - pathabundance
  - pathabundance_cpm
  - pathabundance_relab
  - pathabundance_stratified
  - pathabundance_unstratified
  - pathabundance_cpm_stratified
  - pathabundance_relab_stratified
  - pathabundance_cpm_unstratified
  - pathabundance_relab_unstratified
  - pathcoverage_unstratified
  - pathcoverage_stratified
  - pathcoverage
- FastQC
  - fastqc
- KneadData
  - kneaddata_log
Data Hosting
While the sample metadata are available within this package, the various output files are hosted remotely due to their size and number. There are therefore two options for data retrieval.
Google Cloud Storage
The initial output location of the pipeline is the Google Cloud Bucket gs://metagenomics-mac, which requires credentials for access. Creating these credentials is covered in the Google Cloud Storage vignette; the owner of the Google Cloud Project that contains the Bucket must follow those steps and provide you with the resulting credentials. The data are stored in the Bucket as individual files for each sample and output type. Once you have access, they can be retrieved with the functions and workflows detailed in the Google Cloud Storage vignette.
Hugging Face
While Google Cloud Storage is a good place to access the data as soon as they have been processed, it requires credentialed access and more file wrangling. As a simpler alternative, the data have been combined into parquet files and hosted publicly on Hugging Face in the metagenomics_mac repo. Smaller example files featuring data from 10 samples each can be found at metagenomics_mac_examples. These files can be accessed easily through the DuckDB R client, and the functions and workflows detailed in the Parquet File vignette streamline the process further.
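To give a sense of what querying a parquet file through the DuckDB R client looks like, here is a minimal sketch using only the DBI and duckdb packages. It does not use this package's own functions, and it does not touch the hosted files: it writes a small made-up table to a temporary local parquet file and queries it. The table contents and column names (`uuid`, `clade_name`, `relative_abundance`) are assumptions for illustration and may not match the schema of the hosted files.

```r
library(DBI)
library(duckdb)

# In-memory DuckDB connection.
con <- dbConnect(duckdb::duckdb())

# Made-up stand-in for a combined parquet file (hypothetical columns).
toy <- data.frame(
  uuid               = c("a1", "a1", "b2"),
  clade_name         = c("k__Bacteria", "k__Archaea", "k__Bacteria"),
  relative_abundance = c(98.5, 1.5, 100),
  stringsAsFactors = FALSE
)

# Write the toy table to a temporary parquet file.
parquet_path <- normalizePath(
  tempfile(fileext = ".parquet"),
  winslash = "/", mustWork = FALSE
)
dbWriteTable(con, "toy", toy)
dbExecute(con, sprintf("COPY toy TO '%s' (FORMAT PARQUET)", parquet_path))

# Query the parquet file directly, keeping only the rows for one sample.
res <- dbGetQuery(
  con,
  sprintf(
    "SELECT clade_name, relative_abundance
       FROM read_parquet('%s')
      WHERE uuid = 'a1'
      ORDER BY relative_abundance DESC",
    parquet_path
  )
)
print(res)

dbDisconnect(con, shutdown = TRUE)
```

Because the filter is part of the SQL query, DuckDB reads only the rows and columns that are needed, which is what makes this approach practical for large combined files. For a remotely hosted file, the local path would be replaced by the file's URL, with DuckDB's httpfs extension loaded; the Parquet File vignette covers the package's own workflow for this.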