This is Data Science!
Although we titled this workshop Bioinformatics for STEM High School Teachers, the needs expressed by the attendees amounted to a desire to learn fundamental data science or, put another way, a desire to teach students to use rigorous logic to test an experimental hypothesis through a transparent, reproducible process. The key to reproducible results is having both similar and divergent replicates (i.e., case versus control samples with sufficient statistical power and explainable metadata differences).
Consider a task like digging a hole, which boils down to moving dirt from one spot to another. You can use a variety of tools that scale in power, but also in required expertise: your hands, a stick, a garden spade, a full-size spade, or an excavator. Each will let you move dirt, but some tools are better suited to the specific goals of the task. Digging a hole at the beach can easily be done with your hands or a stick, while preparing a site for sewer main access is best done with an excavator.
Our job as data scientists is to know which computational tools and skills to use, at the correct time and in the correct order.
Fundamental data science can be summarized by three rules:
- Basic coding skills
- Data management
- Plotting
Basic coding
Writing scripts in a high-level programming language is the easiest entry point into the data science field. Python is the industry and research standard, not because it is the best language for any given task, but because it is generally considered the second-best language for every task.
Python Basics
- Variables
- Assignment operators versus equivalence operators
- Positional and keyword arguments
- Indexing
- Arithmetic operators
- Class types
- Class methods
- Built-in functions
- The Standard Library
Most of these basics are covered in sufficient detail by the Software Carpentry tutorial from our in-person sessions.
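As a quick refresher, the short script below touches each item in the list once. It is only a minimal sketch: the variable names and the `gc_fraction` helper are made up for illustration and do not come from the workshop materials.

```python
import math  # a module from the standard library

# Variables: "=" assigns a value to a name, "==" tests equivalence.
sample_count = 12
is_dozen = sample_count == 12        # a bool, not an assignment

# Arithmetic operators
mean_reads = (100 + 250 + 175) / 3   # "/" always returns a float
remainder = sample_count % 5         # modulo

# Indexing is zero-based; negative indices count from the end.
bases = ["A", "C", "G", "T"]
first_base = bases[0]
last_base = bases[-1]
middle = bases[1:3]                  # slice: positions 1 and 2

# Class types and class methods
sequence = "acgt"
sequence_type = type(sequence)       # the str class
upper_sequence = sequence.upper()    # a method that belongs to str


# Positional and keyword arguments (hypothetical helper function)
def gc_fraction(seq, ndigits=2):
    """Return the fraction of G and C characters in seq, rounded."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    return round(gc / len(seq), ndigits)


gc_default = gc_fraction("acgt")              # positional argument only
gc_precise = gc_fraction("acgtg", ndigits=3)  # keyword argument

# Built-in functions and the standard library
total = sum([1, 2, 3])
root = math.sqrt(16)
```

Printing any of these names (for example `print(middle)`) is a good way to check your understanding of each line.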
Data management
Once you grasp how to use a programming language, you can start to explore the data you have accumulated. This is a crucial point in a project. You must know the data you are using to be able to do anything else with it.
This may seem trivial and obvious once stated, but so is breaking up hard-packed soil with a tiller or pick before digging with a shovel.
In summary, always print the stats! Doing so gives you a high-level understanding of the data you hope to use, which is crucial when you need to manipulate the data to tell a story with plots.
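One way to "print the stats" using only the Python standard library is sketched below. The `case` and `control` lists are invented placeholder values and `summarize` is a hypothetical helper; with real data you would load your own measurements instead.

```python
import statistics

# Hypothetical measurements, for illustration only.
case = [5.1, 6.3, 5.8, 7.2, 6.0]
control = [4.2, 4.8, 5.0, 4.5, 4.9]


def summarize(values):
    """Return basic summary statistics for a list of numbers."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Print one summary per group so the groups can be compared side by side.
for name, values in [("case", case), ("control", control)]:
    print(name, summarize(values))
```

Even this small summary answers useful first questions: how many observations each group has, whether the ranges overlap, and whether the mean and median disagree enough to suggest skew or outliers.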
Plotting
Plotting, much like data exploration, is a vital part of data science. While summary statistics can help raise a user's awareness of the data they are working with, visualizing the data in different forms is required to grasp the nuances hidden in the numbers.
Always plot some basic plots!
- Boxplots for comparisons
- Scatterplots for relationships
- Lineplots for time series
- Histograms for distributions
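In practice these plots are usually drawn with a plotting library such as matplotlib (`plt.boxplot`, `plt.scatter`, `plt.plot`, `plt.hist`). To show the idea behind the last item without any extra installation, here is a minimal sketch of a text histogram built with the standard library only. The `text_histogram` helper and the sample values are hypothetical, and the sketch assumes integer data and an integer bin width.

```python
from collections import Counter


def text_histogram(values, bin_width):
    """Return one text line per bin, with one '#' per value in that bin.

    Assumes integer values and an integer bin_width.
    """
    # Map each value to the lower edge of the bin that contains it.
    counts = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    # Walk every bin from the lowest to the highest edge so that
    # empty bins still appear as gaps in the distribution.
    for edge in range(min(counts), max(counts) + bin_width, bin_width):
        lines.append(f"{edge:>5} | {'#' * counts[edge]}")
    return lines


# Hypothetical values, for illustration only.
values = [1, 2, 2, 3, 3, 3, 4, 4, 7, 12]
for line in text_histogram(values, bin_width=2):
    print(line)
```

The cluster of values, the empty bins, and the lone high value are all visible at a glance, which a mean or median alone would not reveal.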