Full Data Preparation
This vignette describes the process of generating the data available in this package. The main steps are data acquisition, processing, transformation, and metadata curation.
Data Acquisition
The first step in making the data collected in parkinsonsMetagenomicData accessible is to find and obtain sequences and metadata. Datasets relevant to both Parkinson’s disease and the microbiome are discovered through a Google Scholar search, and the associated sequences and metadata are gathered, or requested if necessary. The main repository associated with the data acquisition step is parkinsons_data_search. The table parkinson_shotgun_datasets.tsv describes the datasets discovered, and most of them are available for retrieval with this package.
Data Processing Pipeline
Once the datasets have been acquired, the sequences are uniformly processed with the bioBakery-based pipeline curatedMetagenomicsNextflow. This pipeline performs basic quality control and host-read removal with KneadData, taxonomic profiling with MetaPhlAn, and functional profiling with HUMAnN. Many of the files produced directly by this pipeline are detailed in the Available File Types vignette. The pipeline also outputs files containing information such as the exact commands and software versions used in each step, but these are not publicly available at this time. In the meantime, MetaPhlAn output files generally record the database version and the exact command used to generate them within the header (this information is carried into the parquet format).
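As an illustration, the header of a MetaPhlAn profile can be inspected directly: the lines beginning with # record the database version and the command used. The following is a minimal sketch in R, assuming a hypothetical local profile file name:
# Read only the commented header lines of a MetaPhlAn profile
# (the file name is hypothetical; substitute a profile produced by the pipeline)
profile <- "sample_metaphlan_bugs_list.tsv"
header_lines <- grep("^#", readLines(profile), value = TRUE)
cat(header_lines, sep = "\n")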
Reproducing the Data
If you are interested in reproducing these analyses from scratch,
simply run the curatedMetagenomicsNextflow
pipeline with the following parameters set in nextflow.config:
// KneadData parameters
organism_database = 'human_genome' // Alternative: 'mouse_C57BL'
// MetaPhlAn parameters
metaphlan_index = 'latest'
// HUMAnN parameters
chocophlan = 'full'
uniref = 'uniref90_diamond'
// Process control parameters
skip_humann = false // Set to true to skip HUMAnN processing
Other parameters referenced in nextflow.config pertain
to the environment the pipeline is set up in and will vary according to
individual users’ needs. For example, if you would like the pipeline
output to go to a Google Bucket rather than a local directory, you will
need to specify the address of that bucket as the value for
publish_dir. You will also need to make sure that the
environment variable GOOGLE_APPLICATION_CREDENTIALS is set
to the location of a JSON keyfile for that bucket. For more details, the
curatedMetagenomicsNextflow
repo contains a number of example scripts and accompanying profiles used
to run the pipeline on various HPC systems. The pMD data were generated using
the script submit_unitn.sh,
with the appropriate credentials filled in. This script is simply a
wrapper for the Nextflow run command:
nextflow run ASAP-MAC/metagenomicsNextflowMAC --metadata_tsv=$metadata_tsv -profile unitn -with-weblog https://nf-telemetry-819875667022.us-central1.run.app/nextflow-telemetry/events
An individual input table was supplied for each dataset’s run when submitting the job:
qsub -N job_name -v metadata_tsv=/absolute/path/to/samples.tsv submit_unitn.sh
These tables contain UUIDs mapped to each original sample (discussed further below) and can be found in the parkinsonsManualCuration repository in the uuid_mapping/input_tables/kneaddata_v1 and uuid_mapping/input_tables/kneaddata_v2 directories.
We generated a UUID for each sample as a redundant de-identification measure, and these UUIDs are what the resulting output files reference, including the files hosted on Hugging Face. This step is not necessary for reproducing the analyses. The UUIDs and their mappings to the original sample metadata are available in the parkinsonsManualCuration repository.
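For illustration, a mapping of this kind can be produced in R. This is only a sketch of the general approach; the file paths and column names are hypothetical, and the actual mapping is produced by the uuids.R script described below:
# Sketch: assign one UUID per SRA run accession and write an input table
# (paths and column names are hypothetical; see uuids.R for the actual code)
library(uuid)
accessions <- readLines("SraAccList.txt")   # one run accession per line
mapping <- data.frame(
  uuid = replicate(length(accessions), UUIDgenerate()),
  run_accession = accessions
)
write.table(mapping, "QianY_2020.tsv",
            sep = "\t", quote = FALSE, row.names = FALSE)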
Example Pipeline Run
For example, to run the samples in the dataset “QianY_2020.tsv”, we
first download the accession
list from SRA. Then, we assign a UUID to each accession to create an
input
table. This can be done with the uuids.R
script. Then, we set up the pipeline
in our environment of choice. If using the UniTn HPC, unitn_setup.md
has some helpful tips. We configure the pipeline according to the
guidelines above, and finally run the script submit_unitn.sh
with the following command:
# assuming we are working in /home/user/workdir/
qsub -N job_name -v metadata_tsv=/home/user/workdir/QianY_2020.tsv submit_unitn.sh
The pipeline will now run, and all output files will be deposited to
the location that is specified for publish_dir in nextflow.config.
Pipeline Output Transformation
Once the pipeline has been run on each dataset and the output files have been produced and stored in a Google Cloud Bucket, they are transformed into parquet format. This is done so that the individual files from each sample can be combined into a single file for each type of output, and so that the resulting single file can be filtered efficiently. The associated process and scripts can be found in the parquet_generation repo. Here is a summarized list of the various transformations performed:
- standardization of column names
- conversion of extra header information into a tabular format
- splitting of the full taxonomic string into individual levels (a brief sketch follows this list)
- sorting data by specific relevant columns to ease filtering
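As a small sketch of the taxonomic-string transformation, the example below splits a full MetaPhlAn clade string into one column per rank. The input row and column names are illustrative only; the actual code lives in the parquet_generation repo:
# Sketch: split a full taxonomic string into one column per rank
# (column names are illustrative; see parquet_generation for the real code)
library(tidyr)
profile <- data.frame(
  clade_name = "k__Bacteria|p__Firmicutes|c__Clostridia|o__Eubacteriales|f__Lachnospiraceae|g__Blautia|s__Blautia_wexlerae",
  relative_abundance = 1.23
)
separate(profile, clade_name,
         into = c("kingdom", "phylum", "class", "order",
                  "family", "genus", "species"),
         sep = "\\|", fill = "right")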
After these transformations have been applied, the resulting parquet files are published in the metagenomics_mac and metagenomics_mac_examples repositories on Hugging Face. These files are then accessed directly by parkinsonsMetagenomicData through DuckDB’s Hugging Face-specific protocol.
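To give a sense of the access mechanism, the published parquet files can be queried with DuckDB’s httpfs extension over hf:// paths. The following is a minimal sketch; the hf:// path is a placeholder, not a real file, so consult the metagenomics_mac repository on Hugging Face for actual file names:
# Sketch: query a published parquet file on Hugging Face with DuckDB
# (the hf:// path is a placeholder; substitute a real org and file name)
library(DBI)
library(duckdb)
con <- dbConnect(duckdb())
dbExecute(con, "INSTALL httpfs")
dbExecute(con, "LOAD httpfs")
dbGetQuery(con, "
  SELECT *
  FROM 'hf://datasets/<org>/metagenomics_mac/<file>.parquet'
  LIMIT 5
")
dbDisconnect(con, shutdown = TRUE)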
Metadata Curation
In parallel with the uniform sequence processing, sample metadata is manually curated to a brief but uniform schema. This
process is documented in the parkinsonsManualCuration
repository. Original metadata is preserved and indicated with the
uncurated_ prefix, and is automatically attached to data as
it is accessed from Hugging Face. Metadata for each dataset is curated
in a dedicated script and output to an individual CSV. All CSVs are then
combined into a single table and saved as
sampleMetadata.rda, which is then copied to the data file of the same name here in parkinsonsMetagenomicData.
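The final combination step amounts to binding the per-dataset CSVs and saving the result. The following is a minimal sketch in R, assuming the curated CSVs sit in a single local directory; the directory name is hypothetical, and the actual curation scripts are in parkinsonsManualCuration:
# Sketch: combine per-dataset curated CSVs into sampleMetadata.rda
# (the directory name is hypothetical; see parkinsonsManualCuration
# for the actual curation scripts)
csv_files <- list.files("curated", pattern = "\\.csv$", full.names = TRUE)
sampleMetadata <- do.call(rbind, lapply(csv_files, read.csv))
save(sampleMetadata, file = "sampleMetadata.rda")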