Practice Data for Project Development
Colloquially, learning scripting and computational skills is binned into two categories. The first is “tutorial hell”: a cycle of consuming tutorial-like content without ever breaking out to produce anything the learner sees as valuable. This is fine for the narrow applications of software with well-crafted tutorials that answer the learner’s narrow question. But what if the data is complex, the question is undefined, and there is no content to answer it? Then the learner is trapped with nowhere to go.
This tends to be the starting point for every coder. The common advice to avoid or break free of “tutorial hell” is: start projects right away! You do not need anything more than the very basics to begin writing software that does neat things. Basic projects are a great tool for building understanding by applying the knowledge you are learning. There are many benefits to taking ownership of a project (no matter how small) as soon as possible.
Most people understand more by solving their own problems and overcoming challenges. This is not something you can achieve by following a tutorial; tutorials are normally curated to avoid the pitfalls and major challenges of the computational space.
By starting a project you care about, ownership and interest become intrinsic motivators. Tutorials present material for the broadest audience with the easiest inputs available. Courses cover a broad range of topics to give students strong fundamentals. Both tutorials and course materials are vital to the computational learning space, and there will be topics that are not interesting. When you can imagine the end result, even the boring tasks or topics are easier to learn about because you know what you are learning the material for.
To become a well-rounded explorer of the vast realm of computational skills, you must acknowledge that you will have shallow and deep skills across many areas. Outside of a formal program, this tends to be self-guided. Many consider the basics to be: navigating and using terminal applications, using version control software, an intermediate understanding of 1-2 programming languages, and workflow management. This says nothing about front-end development like building websites or applications for users, developing or applying novel algorithms and/or statistical methods, or systems-focused work like application programming interfaces, cloud computing, and performance profiling. Simply put, breadth yields context while depth gives power. Working through many projects will show you where your interests lie and where you would like to focus.
For any project, there is one thing you need outside of fundamental skills: Data!
Reading in data
Reading in data is “simple” in Python or R. The complexity comes from the various skills that experts chunk into this single action.
- You must have a data file
- You must know the data file type
- You must know where the data file is in your system
- You must be able to run Python or R
- You must know the command and syntax to read in that data file
This is roughly equivalent to opening a file in spreadsheet software, but it requires skills that many people have never needed to develop in order to use a computer.
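The checklist above can be sketched as one small helper. This is an illustration, not part of pandas itself: `load_table` is a hypothetical name, and the sketch assumes you only care about CSV and Excel files.

```python
from pathlib import Path

import pandas as pd


def load_table(path):
    """Read a CSV or Excel file into a DataFrame, with clearer errors.

    A minimal sketch of the "reading in data" checklist:
    the file must exist, you must know its type, and you
    must call the matching reader function.
    """
    p = Path(path)
    # You must know where the data file is on your system
    if not p.exists():
        raise FileNotFoundError(f"No file at {p.resolve()}")
    # You must know the data file type
    if p.suffix == ".csv":
        return pd.read_csv(p)
    if p.suffix in (".xlsx", ".xls"):
        return pd.read_excel(p)
    raise ValueError(f"Unrecognized file type: {p.suffix}")
```

Failing early with a clear message is the point: a missing file or an unexpected extension is reported before pandas produces a harder-to-read error.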
Reading in a file in Python:
import pandas as pd # use the functions stored in the pandas module and rename it to pd for this script
df = pd.read_csv("path/to/file.csv")
# Or for an Excel file (read_excel returns a DataFrame directly)
xl = pd.read_excel("path/filename.xlsx")
Reading in a file in R:
df <- read.csv("path/to/file.csv")
# Or an Excel file
library("readxl") # Loading the `readxl` library of functions to use
# xlsx files
xl <- read_excel("my_file.xlsx")
Data set examples
Coding in data
Coding data example
A reproducible example is a quick, easy-to-run script that has the data built in! Having the data generated by the script lets you run various tests by adjusting the data and the script.
The biggest benefit of coding the data into the script is that you can easily share the standalone script with others. This makes requests for help with fixing bugs much easier!
# import the modules you will need to use
import pandas as pd
# read CSV file
csv_file = 'path/to/your_data.csv' # replace with your actual filename
df = pd.read_csv(csv_file)
# For every column in the dataframe, df, convert
# the first 10 non-NA values to a list of
# values. Print the column name and the list
# in a format that can be copied and pasted
# into the script you need reproducible data
# to run and share with others.
for col in df.columns:
    values = df[col].dropna().head(10).tolist()
    print(f"{col} = {values}")
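The printed lines can be pasted straight into a new script and rebuilt into a DataFrame. The column names and values below are made-up examples of what the loop above might print:

```python
import pandas as pd

# Paste the printed lists here; these names and values
# are illustrative, not from any real data set
age = [56, 69, 46, 32, 60]
income = [21864.85, 29498.27, 59544.58, 36399.19, 57140.64]

# Rebuild a small DataFrame from the pasted lists
df = pd.DataFrame({"age": age, "income": income})
```

Anyone you send this to can now run your script and see the same bug you see, without you attaching the original data file.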
But let's generate random data to help instead.
import pandas as pd
import numpy as np
# To reproducibly create a random set
# of data points, set the seed before
# creating the data
np.random.seed(42)
# Generate sample data for later use
n_rows = 100
data = {
    'id': range(1, n_rows + 1),
    'age': np.random.randint(18, 70, size=n_rows),
    'income': np.random.normal(50000, 15000, size=n_rows).round(2),
    'gender': np.random.choice(['Male', 'Female', 'Other'], size=n_rows),
    'signup_date': pd.date_range(start='2022-01-01', periods=n_rows, freq='D')
}
# Convert the dictionary to a DataFrame
df = pd.DataFrame(data)
The dataframe looks something like this:
df.head()
id age income gender signup_date
0 1 56 21864.85 Female 2022-01-01
1 2 69 29498.27 Other 2022-01-02
2 3 46 59544.58 Other 2022-01-03
3 4 32 36399.19 Female 2022-01-04
4 5 60 57140.64 Male 2022-01-05
The summary statistics are:
df.describe()
id age income signup_date
count 100.000000 100.000000 100.000000 100
mean 50.500000 43.350000 50584.245200 2022-02-19 12:00:00
min 1.000000 19.000000 20901.330000 2022-01-01 00:00:00
25% 25.750000 31.750000 38368.137500 2022-01-25 18:00:00
50% 50.500000 42.000000 49051.090000 2022-02-19 12:00:00
75% 75.250000 57.000000 61242.652500 2022-03-16 06:00:00
max 100.000000 69.000000 94154.950000 2022-04-10 00:00:00
std 29.011492 14.904663 15273.856456 NaN
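Generated data also lets you practice the reading-in step from earlier: write the frame to disk, then read it back. This sketch recreates a small version of the data above; `practice_data.csv` is an arbitrary filename.

```python
import numpy as np
import pandas as pd

# Recreate a small version of the generated data (same seed)
np.random.seed(42)
n_rows = 5
df = pd.DataFrame({
    "id": range(1, n_rows + 1),
    "age": np.random.randint(18, 70, size=n_rows),
})

# Write it out, then read it back in -- the same
# skill as opening any other CSV file
df.to_csv("practice_data.csv", index=False)
df2 = pd.read_csv("practice_data.csv")
```

Because the seed is fixed, every run produces the same file, so the round trip is a repeatable exercise.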
Example Data
All of the data sets available from Seaborn data
All of the data sets available from ‘The Python Graph Gallery’ website
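Seaborn's example data can be pulled straight from Python with `load_dataset` (it downloads from Seaborn's online data repository, so it needs an internet connection; "tips" is one of the bundled set names):

```python
import seaborn as sns

# Download one of seaborn's bundled example data
# sets as a pandas DataFrame
tips = sns.load_dataset("tips")
print(tips.head())
```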