Practice Data for Project Development

Colloquially, learning scripting and computational skills is binned into two categories. First, there is “tutorial hell”: a cycle of consuming tutorial-like content without ever breaking out to produce anything the learner sees as valuable. This is fine for the narrow applications of software with well-crafted tutorials that answer the learner’s narrow question. But what if the data is complex, the question is undefined, and there is no content to answer the question? Then the learner is trapped with nowhere to go.

Tutorial hell is a metaphysical space where beginner programmers are sometimes consigned as a punishment for flying too close to the sun of learning.

This tends to be the starting point for every coder. The common advice to avoid or break free of “tutorial hell” is: start projects right away! You do not need anything more than the very basics to begin writing software that can do neat things. Basic projects are a great tool for building understanding by applying the knowledge you are learning. There are many benefits to taking ownership of a project (no matter how small) as soon as possible.

  1. Most people learn more by solving their own problems and overcoming challenges. This is not something you can achieve by following a tutorial; tutorials are normally curated to avoid the pitfalls and major challenges in the computational space.

  2. Ownership of and interest in a project you care about is an intrinsic motivator. Tutorials present material for the most users with the easiest inputs available. Courses cover a broad range of topics to give students strong fundamentals. Both tutorials and course materials are vital to the computational learning space. Still, there will be topics that are not interesting. When you can imagine the end result, even the boring tasks or topics are easier to learn because you know what you are learning the material for.

  3. To become a well-rounded explorer of the vast realm of computational skills, you must acknowledge that you will have shallow and deep skills across many areas. Outside of a formal program, this tends to be self-guided. Many consider the basics to be: navigating and using terminal applications, using version control software, an intermediate understanding of 1–2 programming languages, and workflow management. This says nothing about front-end development like building websites or applications for users, developing or applying novel algorithms and/or statistical methods, or systems-focused skills like application programming interfaces, cloud computing, and performance profiling. Simply put, breadth yields context while depth gives power. Working through many projects will show you where your interest lies and where you would like to focus.

For any project, there is one thing you need beyond fundamental skills: data!

Reading in data

Reading in data is “simple” in Python or R. The complexity comes from the various skills that experts chunk into this one action:

  1. You must have a data file
  2. You must know the data file type
  3. You must know where the data file is in your system
  4. You must be able to run python or R
  5. You must know the command and syntax to read in that data file

This is roughly equivalent to opening a file in spreadsheet software, but it requires different skills that many people have not needed to develop in order to use a computer.
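Steps 3 and 4 trip up beginners most often. A quick way to check where Python is running from, and whether a file is actually visible from there, is the standard library’s `pathlib` (the path below is a made-up example; substitute your own):

```python
from pathlib import Path

# Print the directory Python is currently running from;
# relative paths like "path/to/file.csv" are resolved from here
print(Path.cwd())

# Check that the file exists before trying to read it
# ("data/example.csv" is a hypothetical path -- use your own)
data_file = Path("data/example.csv")
print(data_file.exists())  # False unless the file is really there
```

If `exists()` prints `False`, fix the path (or move the file) before reaching for `read_csv`.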

Reading in a file in Python:

import pandas as pd # use the functions stored in the pandas module and rename it to pd for this script

df = pd.read_csv("path/to/file.csv")

# Or for an Excel file
xl = pd.ExcelFile("path/filename.xlsx")
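The `ExcelFile` object is most useful when a workbook has several sheets: you can list them and then read one at a time. A small self-contained sketch (it writes a throwaway two-sheet workbook first, which assumes the `openpyxl` engine pandas uses for `.xlsx` files is installed):

```python
import pandas as pd

# Create a small two-sheet workbook just for this demo
with pd.ExcelWriter("demo.xlsx") as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="first", index=False)
    pd.DataFrame({"b": [3, 4]}).to_excel(writer, sheet_name="second", index=False)

xl = pd.ExcelFile("demo.xlsx")
print(xl.sheet_names)    # ['first', 'second']
df = xl.parse("second")  # read one named sheet into a DataFrame
print(df)
```

`pd.read_excel("demo.xlsx", sheet_name="second")` does the same thing in one call; `ExcelFile` avoids re-opening the workbook when you need several sheets.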

Reading in a file in R:

df <- read.csv("path/to/file.csv")

# Or an Excel file
library("readxl") # Loading the `readxl` library of functions to use

# xlsx files
xl <- read_excel("my_file.xlsx")

Coding in data

A reproducible example is a quick, easy-to-run script that has the data built in! Having the data generated by the script allows you to run various tests by adjusting the data and the script.

The biggest benefit of coding the data into the script is that you can easily share the standalone script with others. This makes requests for help with fixing bugs much easier!

# import the modules you will need to use
import pandas as pd

# read CSV file
csv_file = 'path/to/your_data.csv'  # replace with your actual filename
df = pd.read_csv(csv_file)

# For every column in the dataframe, df, convert
# the first 10 non-NA values to a list of
# values. Print the column name and the list
# in a format that can be copied and pasted
# into the script you need reproducible data
# to run and share with others.
for col in df.columns:
    values = df[col].dropna().head(10).tolist()
    print(f"{col} = {values}")
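The printed lists can then be pasted straight into a new standalone script to rebuild a small DataFrame with the same shape as your real data. A sketch using invented column names and values:

```python
import pandas as pd

# Paste the printed `column = [...]` lines here;
# these names and values are made up for illustration
age = [56, 69, 46]
income = [21864.85, 29498.27, 59544.58]

# Rebuild a miniature version of the original DataFrame
df = pd.DataFrame({"age": age, "income": income})
print(df)
```

Anyone you send this to can run it immediately, with no data file attached.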

But let’s generate random data instead:

import pandas as pd
import numpy as np

# To reproducibly create
# a random set of data points,
# set the seed before
# creating the data
np.random.seed(42)

# Generate sample data for later use
n_rows = 100
data = {
    'id': range(1, n_rows + 1),
    'age': np.random.randint(18, 70, size=n_rows),
    'income': np.random.normal(50000, 15000, size=n_rows).round(2),
    'gender': np.random.choice(['Male', 'Female', 'Other'], size=n_rows),
    'signup_date': pd.date_range(start='2022-01-01', periods=n_rows, freq='D')
}

# Convert the dictionary to a DataFrame
df = pd.DataFrame(data)

The dataframe looks something like this:

df.head()

   id  age    income  gender signup_date
0   1   56  21864.85  Female  2022-01-01
1   2   69  29498.27   Other  2022-01-02
2   3   46  59544.58   Other  2022-01-03
3   4   32  36399.19  Female  2022-01-04
4   5   60  57140.64    Male  2022-01-05

The summary statistics are:

df.describe()

               id         age        income          signup_date
count  100.000000  100.000000    100.000000                  100
mean    50.500000   43.350000  50584.245200  2022-02-19 12:00:00
min      1.000000   19.000000  20901.330000  2022-01-01 00:00:00
25%     25.750000   31.750000  38368.137500  2022-01-25 18:00:00
50%     50.500000   42.000000  49051.090000  2022-02-19 12:00:00
75%     75.250000   57.000000  61242.652500  2022-03-16 06:00:00
max    100.000000   69.000000  94154.950000  2022-04-10 00:00:00
std     29.011492   14.904663  15273.856456                  NaN
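Because the seed is fixed, anyone who runs the script gets byte-for-byte identical data. You can convince yourself of this by drawing the same values twice (a small check using the same `randint` call as above, wrapped in a helper function invented for this demo):

```python
import numpy as np

def sample_ages(seed):
    # Re-seed, then draw ages exactly as in the script above
    np.random.seed(seed)
    return np.random.randint(18, 70, size=5)

# Same seed -> identical draws every time
print(np.array_equal(sample_ages(42), sample_ages(42)))  # True

# A different seed gives a different (but equally reproducible) dataset
print(sample_ages(0))
```

This is why sharing the seed alongside the generation code is as good as sharing the data itself.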

Example Data

World Happiness Data

Datasaurus Dozen Data

All data available from Seaborn data

All the data available from ‘The Python Graph Gallery’ website

Real data from a Material Scientist for noisy data

Huge List of Awesome Public Datasets