The research question in hand
Table of contents
- Intro
- Let’s start by looking at where an organism of interest is found in public data sets
- Limitations
- a placeholder for conceptual information behind branchwater and chillfilter
- Installation Festival
Intro
We’re going to get started with a workflow we can run entirely in the browser.
The initial step to all research is defining a question. What is it you want to know more about?
For our work in bioinformatics, this question is often exploratory. An exploratory analysis can also be a hypothesis-generating analysis. In other words, a question such as “What environmental samples contain Candida albicans?” may lead to a specific hypothesis such as “Candida albicans may be identified in a growing number of samples over time because of its opportunistic pathogenic nature and the increased use of sequencing technology.”

These steps may take some time because the web-based infrastructure takes a while to run. Prepare some lecture material for students while waiting…
Let’s start by looking at where an organism of interest is found in public data sets
Pick a species to experiment with, either from the list below or one of your own choosing:
- Escherichia coli
- Staphylococcus aureus
- Pseudomonas aeruginosa
- Mycobacterium tuberculosis
- Helicobacter pylori
- Bacillus subtilis
- Klebsiella pneumoniae
- Streptococcus pneumoniae
- Listeria monocytogenes
- Haemophilus influenzae
- Salmonella enterica
- Chlamydia trachomatis
- Enterococcus faecalis
- Acinetobacter baumannii
- Staphylococcus epidermidis
- Streptococcus mutans
- Neisseria gonorrhoeae
- Vibrio cholerae
- Bacillus cereus
- Mycobacterium avium
Use the NCBI Web site (instructions below) to download a genome for your selected species:
The National Center for Biotechnology Information (NCBI)
A GenBank (GCA) genome assembly contains assembled genome sequences submitted by investigators or sequencing centers to GenBank or another member of the International Nucleotide Sequence Database Collaboration (INSDC). The GenBank (GCA) assembly is an archival record that is owned by the submitter and may or may not include annotation. A RefSeq (GCF) genome assembly represents an NCBI-derived copy of a submitted GenBank (GCA) assembly maintained by NCBI and includes annotation.
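The GCA/GCF distinction is visible in the accession itself, so it can be checked programmatically. Below is a minimal Python sketch, assuming the usual layout of a `GCA_` or `GCF_` prefix followed by nine digits and a version number. The function name `classify_assembly` is hypothetical and is not part of any NCBI tool.

```python
import re

# Assembly accessions look like GCF_000005845.2: prefix, nine digits, version.
ACCESSION_PATTERN = re.compile(r"^(GCA|GCF)_(\d{9})\.(\d+)$")

SOURCES = {"GCA": "GenBank", "GCF": "RefSeq"}


def classify_assembly(accession: str) -> str:
    """Return 'GenBank' or 'RefSeq' for an assembly accession (hypothetical helper)."""
    match = ACCESSION_PATTERN.match(accession.strip())
    if match is None:
        raise ValueError(f"not an assembly accession: {accession!r}")
    return SOURCES[match.group(1)]


if __name__ == "__main__":
    for acc in ["GCF_000005845.2", "GCA_000005845.2"]:
        print(acc, "->", classify_assembly(acc))
```

A check like this can be handy when a table mixes accessions from both sources and only the RefSeq copies are wanted.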
- Navigate to https://www.ncbi.nlm.nih.gov/datasets/genome/

- Type and select the species you are interested in exploring:

- When downloading a single genome from the NCBI, start with the species reference genome (the green check mark):

- Select the three vertical dots and click Download:
- Choose the RefSeq Genome Sequence to Download:

Alternate place to grab genomes: the Genome Taxonomy Database (GTDB)
Importantly and increasingly, this dataset includes draft genomes of uncultured microorganisms obtained from metagenomes and single cells, ensuring improved genomic representation of the microbial world. All genomes are independently quality controlled using CheckM before inclusion in GTDB.
- Navigate to the link above:

- Search for the species you are interested in exploring and select one of the Accession links:
- Select the GTDB Representative of Species:
- Scroll down to NCBI Metadata:
- Click Download:
- Choose the RefSeq Genome Sequence to Download:

With a FASTA file downloaded, search all environmental DNA samples available in the public collection of the Sequence Read Archive (SRA) using the branchwater tool (this tool is from the Titus/Colton lab):
- Upload the FASTA file from the previous section and click Submit:
- Look through the results and click Download CSV:
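At this point, we have all the samples that contain the genome you selected, down to a 90% containment Average Nucleotide Identity (cANI). cANI is an estimate of nucleotide identity derived from containment, that is, from the fraction of the genome's k-mers found in the sample. A cANI of 90% therefore means the tool estimates that the matching sequence in the sample is approximately 90% identical to your genome, which is not the same as 90% of the genome being present.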
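The relationship between containment and cANI can be sketched numerically. The Python example below uses the simple point estimate cANI ≈ containment^(1/k), where k is the k-mer size. Both the k-mer size of 21 and the use of this simplified formula are assumptions made for illustration; the tool itself may apply a more refined estimator.

```python
def containment_to_cani(containment: float, ksize: int = 21) -> float:
    """Simple point estimate: a k-mer is shared only if all k bases match,
    so containment ~ ANI**k and therefore ANI ~ containment**(1/k)."""
    if not 0.0 <= containment <= 1.0:
        raise ValueError("containment must be between 0 and 1")
    return containment ** (1.0 / ksize)


def cani_to_containment(cani: float, ksize: int = 21) -> float:
    """Inverse of the estimate above."""
    if not 0.0 <= cani <= 1.0:
        raise ValueError("cANI must be between 0 and 1")
    return cani ** ksize


if __name__ == "__main__":
    for c in (1.0, 0.5, 0.1):
        print(f"containment={c:.2f} -> cANI~{containment_to_cani(c):.3f}")
```

Under these assumptions, a cANI of 0.90 corresponds to a containment of roughly 0.11, so the two columns in the results table should not be read interchangeably.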
- With the accessions available from our SRA search, we can look into the make-up of the environmental DNA using chill-filter:
To download from the browser, the file must be < 5GB. We suggest starting with a file that is smaller, since it will need to download to your computer!
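Choosing a sample from the downloaded CSV can also be done with a few lines of Python. The sketch below ranks rows by cANI and prints the top accessions. The column names `acc` and `cANI`, and the inline example rows, are assumptions made for illustration, so check them against the header of your own CSV file.

```python
import csv
import io

# Made-up rows standing in for the downloaded CSV; accessions are placeholders.
EXAMPLE_CSV = """acc,cANI,containment
SAMPLE_A,0.93,0.22
SAMPLE_B,0.99,0.81
SAMPLE_C,0.91,0.14
"""


def top_samples(handle, n=2, acc_col="acc", score_col="cANI"):
    """Return the n (accession, score) pairs with the highest score."""
    rows = csv.DictReader(handle)
    scored = [(row[acc_col], float(row[score_col])) for row in rows]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:n]


if __name__ == "__main__":
    # Swap io.StringIO(...) for open("your_results.csv", newline="") on real data.
    for acc, score in top_samples(io.StringIO(EXAMPLE_CSV)):
        print(acc, score)
```

The CSV does not necessarily report file size, so the 5GB limit still has to be checked on the SRA page for the chosen accession.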
- Use the Branchwater results to identify a sample whose other constituent genomes you want to know.
- Download that sample in one of two ways:
1. Navigate to the SRA:
  - Search for the sample by its accession number:
  - Click on the Run link:
  - Go to FASTA/FASTQ download
2. Navigate to the SRA Run Browser:
  - Search for the sample accession of your choice:
  - Go to FASTA/FASTQ download
To submit this to the chill-filter Web site:
- Go to https://chill-filter.sourmash.bio/
- Browse your computer to find the FASTQ file you just downloaded:
- Hit Submit to see the results.
- Click the table links to see a further breakdown of the sample:

This shows what other organisms are there.
We can now ask questions such as: what is correlated with our first search organism?
Limitations
Why is this approach limited?
- Because it is a point-and-click adventure (GUI and browser)
- Because you are at the whims of the developers (like us!), and those people often don’t (can’t) know anything about your specific scientific question!
- For example, the map gif at the beginning of this document was derived from the same information the developer used to create their map on the branchwater results page.
To truly unlock your computational potential we must incorporate:
- a terminal
- scripts
- workflows
a placeholder for conceptual information behind branchwater and chillfilter

Installation Festival
Currently, we intend to introduce you to R and Python.
Git - a version control system that records all the changes to the files within a directory. It also allows users to link any local repository to a remote server (GitHub) for easy collaboration and more. On Windows, installing it should also provide a bash emulator terminal called Git Bash.
R and RStudio - a statistical programming language, and an editor for it, with a massive ecosystem, active community, and wide user base in bioinformatics.
Miniconda - a customizable environment manager that is packaged with the latest version of Python. It allows users to create environments containing various software combinations through conda install.
Optional: Pixi, the ‘new kid’ environment manager that utilizes the conda architecture to simplify the management of environments.
Verification for Sanity
A standard way of verifying an installation is to run the installed software with the --version flag.
conda --version
If the command returned something like conda 24.11.3, you have installed conda! Now to configure.
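The same check can be repeated for every tool in one go. The Python sketch below looks each command up on the PATH and runs it with `--version`; the helper name `tool_version` and the list of tools are assumptions for illustration, and the sketch simply reports "not found" for anything that is missing.

```python
import shutil
import subprocess


def tool_version(command: str):
    """Return the first line printed by `<command> --version`, or None if
    the command is not on the PATH or exits with an error."""
    path = shutil.which(command)
    if path is None:
        return None
    try:
        result = subprocess.run(
            [path, "--version"], capture_output=True, text=True, timeout=60
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    if result.returncode != 0:
        return None
    # Some tools print their version to stderr rather than stdout.
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else None


if __name__ == "__main__":
    for tool in ("conda", "git", "R"):
        version = tool_version(tool)
        print(f"{tool}: {version if version else 'not found'}")
```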
Configuration for Ease-of-use
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
These commands add the defaults, bioconda, and conda-forge channels to ~/.condarc, a configuration file stored as a “dotfile” in your home directory.
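The order of these commands matters because `conda config --add` places each new channel at the top of the list, and channels nearer the top take priority. Below is a minimal Python sketch that imitates this prepend behaviour. It does not call conda, and the function name `add_channels` is hypothetical.

```python
def add_channels(commands):
    """Imitate successive `conda config --add channels X` calls."""
    channels = []
    for name in commands:
        if name in channels:
            channels.remove(name)  # re-adding moves the channel to the top
        channels.insert(0, name)
    return channels


if __name__ == "__main__":
    order = add_channels(["defaults", "bioconda", "conda-forge"])
    print(order)  # ['conda-forge', 'bioconda', 'defaults']
```

With the three commands above, conda-forge ends up first and is searched with the highest priority, followed by bioconda and then defaults.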
Creating a conda environment for downloading large files from the SRA
If you want to download a sample that is larger than 5GB, it’s best to use sra-tools and awscli.
Both can be installed through conda with a command of the form:
conda create -n <environment-name> -y <software-list...>
To find the conda package name for a piece of software, search for it on the Anaconda website:
- https://anaconda.org/bioconda/sra-tools
For an environment to download files from the SRA:
conda create -n sra -y sra-tools awscli
Enter the environment that contains the software:
conda activate sra
To download a large file from the SRA:
mkdir -p sra/ && aws s3 cp --quiet --no-sign-request s3://sra-pub-run-odp/sra/<sample-id>/<sample-id> sra/<sample-id>.sra || prefetch --quiet <sample-id> -o sra/<sample-id>.sra
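The one-liner above tries the public S3 bucket first and falls back to `prefetch` only if that fails. The same logic can be sketched in Python, reusing exactly the commands and flags shown above. The function names are hypothetical, and the sketch assumes `aws` and `prefetch` are available in the active environment.

```python
import subprocess
from pathlib import Path


def build_commands(sample_id: str, outdir: str = "sra"):
    """Return the two download commands, in the order they should be tried."""
    target = f"{outdir}/{sample_id}.sra"
    s3_url = f"s3://sra-pub-run-odp/sra/{sample_id}/{sample_id}"
    return [
        ["aws", "s3", "cp", "--quiet", "--no-sign-request", s3_url, target],
        ["prefetch", "--quiet", sample_id, "-o", target],
    ]


def download(sample_id: str, outdir: str = "sra") -> bool:
    """Try each command in turn; stop at the first one that succeeds."""
    Path(outdir).mkdir(parents=True, exist_ok=True)  # same as mkdir -p
    for command in build_commands(sample_id, outdir):
        try:
            if subprocess.run(command).returncode == 0:
                return True
        except FileNotFoundError:
            continue  # tool not installed; try the next one
    return False
```

Calling `download` with a real run accession in place of `<sample-id>` would write `sra/<sample-id>.sra`, and a return value of `False` means both routes failed.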
To parse the sra file into a usable format:
fasterq-dump sra/<sample-id>.sra --skip-technical --split-files --progress --threads 4 --bufsize 1000MB --curcache 10000MB --mem 16GB
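Once `fasterq-dump` has finished, a quick read count is a useful sanity check on its output. The Python sketch below counts records in an uncompressed FASTQ file, assuming the standard four-lines-per-record layout. The file name in the usage comment is a placeholder, since the actual names depend on the accession and on `--split-files`.

```python
def count_reads(path):
    """Count records and bases in an uncompressed four-line FASTQ file."""
    reads = 0
    bases = 0
    with open(path) as handle:
        for line_number, line in enumerate(handle):
            position = line_number % 4
            if position == 0 and not line.startswith("@"):
                raise ValueError(f"line {line_number + 1}: expected '@' header")
            if position == 1:  # the sequence line of each record
                reads += 1
                bases += len(line.strip())
    return reads, bases


if __name__ == "__main__":
    import sys

    if len(sys.argv) > 1:  # e.g. python count_reads.py <sample-id>_1.fastq
        print(count_reads(sys.argv[1]))
```

For paired-end data, the two split files should report the same number of reads; a mismatch suggests the conversion was interrupted.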
This would be a great stopping point for the tangent of workflows and big data processing!!!
## Examples to scripted results
---

With the CSV containing the metadata of the environmental samples containing your selected genome:

{: .important }
The various scripts included in this document use the following R packages or libraries.

```r
library(tidyverse) # A collection of libraries for data analysis and visualization. Including ggplot2, dplyr, tidyr, readr, purrr, tibble, stringr, forcats, lubridate.
#library(dplyr) # A data manipulation package.
#library(lubridate) # A package to extend R's ability to parse and manage Date and Time data classes.
#library(ggplot2) # A data visualization package
library(stringdist) # Matching, distance, and fuzzy finding strings in R at speed!
library(text2vec) # Text and NLP processing in R
library(pheatmap) # Make pretty heatmaps in R
library(gganimate) # Animate the plots in R
library(clipr) # Easily read and write to the clipboard
library(maps) # Basic maps in R
```

### What is the association of `Country` to `Organism`?
---

### What is the association of `cANI` to `Containment`?
---