Module 8 – Bioinformatics and Data Management
In this week’s discussion class, we covered data management and bioinformatic pipelines, with a focus on the importance of data management, storage, and security. We also read the Data Management Plan document, which outlines the requirements and content of a data management plan. Data management plans are often a required part of grant proposals.
We discussed the following questions related to data management, storage, and security:
- How many places do you save your data? I save my data in three places other than my local hard drive.
- Local disk drive, external hard drive, cloud, or other? I save my data locally, on an external hard drive, and in two cloud locations.
- Do you have paper and digital copies? I keep paper copies of lab notes, field notes, and manuscripts, but all of these are also stored digitally. I also keep paper copies of journal articles and books (in addition to having them stored digitally).
This week we examined the article by Callahan et al. (2017), in which the authors argue that amplicon sequence variants (ASVs) should be used instead of operational taxonomic units (OTUs) in analyses of high-throughput marker-gene sequencing data. ASV methods have previously been shown to be as good as, and sometimes better than, OTU methods in sensitivity and specificity. Their higher resolution helps distinguish a single target species within a community where many other species of the same genus are present. The biggest advantage offered by the ASV method (as argued by the authors) is that “ASVs […] combine the benefits for subsequent analysis of closed-reference and de novo OTUs: ASVs are reusable across studies, reproducible in future data sets and are not limited by incomplete reference databases.”
- Closed-reference methods = a read that is sufficiently similar to a sequence in a reference database is recruited into the corresponding OTU. When you use this method, you lose biological variation that isn’t in the reference database.
- De novo methods = reads are grouped into OTUs based on pairwise sequence similarities. Because the relative abundances within a sample change and other parameters differ between communities, the resulting OTUs depend on the data set, so you cannot compare de novo OTUs from two different data sets.
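To make the two OTU approaches more concrete, here is a minimal Python sketch. It is a toy illustration, not how any real pipeline is implemented: the function names (`identity`, `closed_reference`, `de_novo`), the 10-base sequences, the 0.9 threshold, and the single-entry reference are all made up for the example, and similarity is a simple position-by-position comparison rather than a real alignment.

```python
from collections import Counter

def identity(a, b):
    """Fraction of matching positions (toy stand-in for alignment identity;
    assumes the two sequences have equal length)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def closed_reference(reads, reference, threshold):
    """Assign each unique read to its best reference hit, or None if no hit
    reaches the threshold (that read's variation is lost)."""
    assignments = {}
    for read in set(reads):
        best_name = max(reference, key=lambda name: identity(read, reference[name]))
        hit = identity(read, reference[best_name]) >= threshold
        assignments[read] = best_name if hit else None
    return assignments

def de_novo(reads, threshold):
    """Greedy clustering: the most abundant sequences become centroids first,
    so OTU numbering depends on the data set being clustered."""
    centroids, labels = [], {}
    for seq, _count in Counter(reads).most_common():
        for i, centroid in enumerate(centroids, start=1):
            if identity(seq, centroid) >= threshold:
                labels[seq] = f"OTU{i}"
                break
        else:
            centroids.append(seq)
            labels[seq] = f"OTU{len(centroids)}"
    return labels

# Hypothetical toy data: two organisms, two studies with different abundances.
SEQ_A = "ACGTACGTAC"
SEQ_C = "TTTTGGGGCC"
reference = {"ref_A": SEQ_A}          # SEQ_C is missing from the database
study1 = [SEQ_A] * 5 + [SEQ_C] * 2
study2 = [SEQ_C] * 6 + [SEQ_A] * 3

closed1 = closed_reference(study1, reference, threshold=0.9)
otus1 = de_novo(study1, threshold=0.9)
otus2 = de_novo(study2, threshold=0.9)

print(closed1[SEQ_C])                 # None: not in the reference, so it is dropped
print(otus1[SEQ_A], otus2[SEQ_A])     # OTU1 OTU2: same organism, different labels
```

In this toy case the closed-reference approach discards the organism that is absent from the database, while the de novo approach gives the same organism a different OTU label in each study, which is why the labels cannot be compared directly.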
ASV methods also use a de novo process, and, because of this, ASVs cannot be inferred from a single read; the smallest unit of data from which they can be inferred is a sample. The handy thing about ASVs is that they are consistent labels, because they represent a biological reality: the actual DNA sequence of the organism. This means that they can be compared between different samples and studies. In this way, ASV methods are able to overcome the limitations of both de novo and closed-reference OTU methods.
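The sketch below illustrates why ASVs need at least a sample and why their labels are comparable. It is only a toy heuristic and not the error model used by real ASV tools: the function `infer_asvs_toy`, the `min_fold` cutoff, and the sequences are hypothetical, and reads are assumed to have equal length. A sequence is treated as a sequencing error if it is one mismatch away from a much more abundant sequence in the same sample.

```python
from collections import Counter

def hamming(a, b):
    """Number of mismatched positions (assumes equal-length sequences)."""
    return sum(x != y for x, y in zip(a, b))

def infer_asvs_toy(sample_reads, min_fold=10):
    """Keep a sequence unless a one-mismatch neighbour in the same sample
    is at least min_fold times more abundant (toy error rule)."""
    counts = Counter(sample_reads)
    asvs = set()
    for seq, n in counts.items():
        is_error = any(
            hamming(seq, other) == 1 and m >= min_fold * n
            for other, m in counts.items()
            if other != seq
        )
        if not is_error:
            asvs.add(seq)
    return asvs

# Hypothetical toy data from two independent studies.
TRUE1 = "ACGTACGTAC"
ERR = "ACGTACGTAA"      # one mismatch away from TRUE1
TRUE2 = "TTTTGGGGCC"
TRUE3 = "GGGGCCCCAA"

sample_a = [TRUE1] * 50 + [ERR] * 2 + [TRUE2] * 20
sample_b = [TRUE2] * 40 + [TRUE3] * 30

asvs_a = infer_asvs_toy(sample_a)
asvs_b = infer_asvs_toy(sample_b)

# The label is the sequence itself, so a plain set intersection compares studies.
shared = asvs_a & asvs_b
print(shared == {TRUE2})          # True

# A single read carries no abundance information, so the error is not detected.
print(infer_asvs_toy([ERR]))      # {'ACGTACGTAA'}
```

Because each ASV is labeled by its own sequence, the two studies can be processed separately and still compared afterwards. The last line shows the flip side: with only one read there is nothing to compare it against, so the error cannot be told apart from a real sequence.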
We discussed the following questions related to the Callahan et al. (2017) paper:
- Are the authors persuasive in their support of ASVs?
- Did they give a fair treatment to OTUs?
- Are there times when OTUs may be more effective?
- “…the smallest unit of data from which ASVs can be inferred is a sample.” What are the consequences of this?