Data Analytics Lesson 1: Loading and Understanding Your Data

Introduction to Data Analytics

Welcome to your first data analytics lesson! Data analytics is like being a detective - we use data to answer questions and discover patterns. Today, we’ll learn how to load data into Python and take our first look at what we’re working with.

What You’ll Need

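Before we start, we need some Python tools (called libraries) that help us work with data. pandas and numpy have to be installed first (for example with pip install pandas numpy), while os comes with Python. Once they are installed, we import them: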

# These are the tools we'll use for data analysis
import pandas as pd    # For working with data tables
import numpy as np     # For math operations
import os             # For working with files

# Don't worry if you see warnings - that's normal!

Loading Your First Dataset

Let’s start with a simple dataset about students and their test scores. In real life, data often comes in CSV files (Comma Separated Values).

# First, let's create some sample student data to practice with
student_data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eva', 'Frank', 'Grace', 'Henry'],
    'Age': [16, 17, 16, 18, 17, 16, 17, 18],
    'Grade': ['10th', '11th', '10th', '12th', '11th', '10th', '11th', '12th'],
    'Math_Score': [85, 92, 78, 95, 88, 76, 91, 87],
    'Science_Score': [89, 87, 82, 98, 85, 79, 93, 89],
    'English_Score': [92, 85, 88, 91, 94, 83, 89, 86]
}

# Convert this into a DataFrame (think of it as a digital spreadsheet)
df = pd.DataFrame(student_data)

# Let's see what our data looks like
print("Our Student Dataset:")
print(df)

Output:

Our Student Dataset:
      Name  Age Grade  Math_Score  Science_Score  English_Score
0    Alice   16  10th          85             89             92
1      Bob   17  11th          92             87             85
2  Charlie   16  10th          78             82             88
3    Diana   18  12th          95             98             91
4      Eva   17  11th          88             85             94
5    Frank   16  10th          76             79             83
6    Grace   17  11th          91             93             89
7    Henry   18  12th          87             89             86

Understanding Your Data Structure

Now let’s explore what we have. Think of this like getting to know a new friend - we want to learn basic facts about our data:

# How big is our dataset? (rows and columns)
print("Dataset shape (rows, columns):", df.shape)

# What types of information do we have?
print("\nColumn names:")
print(df.columns.tolist())

# What kind of data is in each column?
print("\nData types:")
print(df.dtypes)

Expected Output:

Dataset shape (rows, columns): (8, 6)

Column names:
['Name', 'Age', 'Grade', 'Math_Score', 'Science_Score', 'English_Score']

Data types:
Name             object
Age               int64
Grade            object
Math_Score        int64
Science_Score     int64
English_Score     int64
dtype: object
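
The separate checks above can also be done in one call. As a minimal sketch, .info() prints the row count, column names, non-null counts, and data types together. The three-row mini_df below is a made-up stand-in for the student table so the block runs on its own; it is not part of the lesson's dataset.

```python
import pandas as pd

# Hypothetical three-row stand-in for the student dataset
mini_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [16, 17, 16],
    'Math_Score': [85, 92, 78],
})

# .info() prints a one-stop summary: index range, columns, non-null counts, dtypes
mini_df.info()

# The same facts are also available as values you can use in code
n_rows, n_cols = mini_df.shape
numeric_columns = mini_df.select_dtypes(include='number').columns.tolist()
print("Rows:", n_rows, "Columns:", n_cols)
print("Numeric columns:", numeric_columns)
```

Use .info() when you want a quick look, and .shape or .dtypes when you need the values inside your own code.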

Getting Basic Information About Your Data

Let’s learn some basic facts about our student data:

# Get a quick summary of numerical columns
print("Basic Statistics Summary:")
print(df.describe())

# Count how many students are in each grade
print("\nStudents per grade:")
print(df['Grade'].value_counts())

# Look at just the first few rows (useful for large datasets)
print("\nFirst 3 students:")
print(df.head(3))

# Look at the last few rows
print("\nLast 3 students:")
print(df.tail(3))

Expected Output:

Basic Statistics Summary:
             Age  Math_Score  Science_Score  English_Score
count   8.000000    8.000000       8.000000       8.000000
mean   16.875000   86.500000      87.750000      88.500000
std     0.834523    6.654751       6.017831       3.741657
min    16.000000   76.000000      79.000000      83.000000
25%    16.000000   83.250000      84.250000      85.750000
50%    17.000000   87.500000      88.000000      88.500000
75%    17.250000   91.250000      90.000000      91.250000
max    18.000000   95.000000      98.000000      94.000000

Students per grade:
10th    3
11th    3
12th    2
Name: Grade, dtype: int64

First 3 students:
      Name  Age Grade  Math_Score  Science_Score  English_Score
0    Alice   16  10th          85             89             92
1      Bob   17  11th          92             87             85
2  Charlie   16  10th          78             82             88

Last 3 students:
    Name  Age Grade  Math_Score  Science_Score  English_Score
5  Frank   16  10th          76             79             83
6  Grace   17  11th          91             93             89
7  Henry   18  12th          87             89             86
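
To see where the numbers in the .describe() summary come from, here is a small sketch that computes a few of them by hand for the Math_Score column. The scores are copied from the student dataset; the variable names are only for illustration.

```python
import pandas as pd

# Math scores copied from the student dataset above
math = pd.Series([85, 92, 78, 95, 88, 76, 91, 87], name='Math_Score')

mean_score = math.mean()        # sum of the scores divided by how many there are
median_score = math.median()    # the middle value: the "50%" row in describe()
std_score = math.std()          # sample standard deviation (divides by n - 1)
q1 = math.quantile(0.25)        # the "25%" row
q3 = math.quantile(0.75)        # the "75%" row

print("Mean:", mean_score)
print("Median:", median_score)
print("Standard deviation:", round(std_score, 6))
print("25% and 75%:", q1, q3)
```

The "std" row measures how spread out the scores are, and the percentage rows tell you the score below which that share of students falls.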

Loading Data from a Real File

In real projects, you’ll often load data from files. Here’s how to do it:

# If you had a CSV file, you would load it like this:
# df = pd.read_csv('student_data.csv')

# Let's save our data to a file first, then load it back
df.to_csv('student_scores.csv', index=False)  # Save without row numbers

# Now load it back (this is what you'd do with real data)
loaded_df = pd.read_csv('student_scores.csv')

print("Data loaded from file:")
print(loaded_df.head())

# Check if it's the same as our original data
print(f"\nAre they identical? {df.equals(loaded_df)}")
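
Two things are worth seeing for yourself here: what a CSV file looks like inside, and how to check that a file exists before loading it. The sketch below is one possible way to do both, using the os library. The file names are hypothetical, and a temporary folder is used so that no real files are touched.

```python
import os
import tempfile
import pandas as pd

small_df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Math_Score': [85, 92]})

# Work in a temporary folder that is deleted automatically afterwards
with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, 'example_scores.csv')   # hypothetical file name
    small_df.to_csv(csv_path, index=False)

    # A CSV file is plain text: one header line, then one line per row
    with open(csv_path) as f:
        raw_text = f.read()
    print(raw_text)

    # Check that the file exists before trying to load it
    if os.path.exists(csv_path):
        reloaded = pd.read_csv(csv_path)
        print("Loaded", len(reloaded), "rows")
    else:
        print("File not found:", csv_path)

    # A path that was never created is reported as missing
    missing_path = os.path.join(folder, 'no_such_file.csv')
    file_is_missing = not os.path.exists(missing_path)
    print("Missing file detected:", file_is_missing)
```

Checking first lets you print a friendly message instead of letting pd.read_csv stop the program with a FileNotFoundError.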

Handling Missing Data

Real data is often messy and has missing values. Let’s see how to deal with this:

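# Let's create some data with missing values to practice
import copy

# Use a deep copy: a plain .copy() would share the score lists with
# student_data, so the original data would be changed too
messy_data = copy.deepcopy(student_data)
messy_data['Math_Score'][2] = None    # Charlie's math score is missing
messy_data['Age'][5] = None          # Frank's age is missing

df_messy = pd.DataFrame(messy_data)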

print("Data with missing values:")
print(df_messy)

# Check for missing data
print("\nMissing values per column:")
print(df_messy.isnull().sum())

# Fill in missing values (we'll learn better methods later)
df_clean = df_messy.fillna(df_messy.mean(numeric_only=True))
print("\nAfter filling missing values with averages:")
print(df_clean)
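
Filling with the average is only one option. As a rough sketch of two alternatives, the block below drops incomplete rows with .dropna() and fills with the median instead of the mean. The scores are invented for this example (including one unusually low score) and are not from the student dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical scores with one missing value and one unusually low score
scores = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Math_Score': [85.0, 92.0, np.nan, 20.0],
})

# Option 1: drop every row that has a missing value
dropped = scores.dropna()

# Option 2: fill with the median, which is less affected by extreme values than the mean
median_value = scores['Math_Score'].median()
filled = scores.fillna({'Math_Score': median_value})

print("Rows before:", len(scores), "and after dropna:", len(dropped))
print("Mean:", round(scores['Math_Score'].mean(), 2), "Median:", median_value)
print(filled)
```

Dropping rows is simple but throws data away, while filling keeps every row at the cost of inventing a value. Which one fits depends on how much data is missing and what question you are asking.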

Key Learning Points

DataFrames are like digital spreadsheets - they organize data in rows and columns
Always explore your data first - use .shape, .head(), .describe() to understand what you have
Real data is often messy - missing values and errors are common
CSV files are the most common way to store and share data
pandas is your best friend for working with data in Python

Practice Exercises

Create Your Own Dataset: Make a DataFrame with information about your favorite movies (title, year, rating, genre)
Explore the Data: Use .describe(), .head(), and .info() on your movie dataset
Save and Load: Save your dataset to a CSV file and load it back
Handle Missing Data: Add some missing values to your dataset and practice filling them in

What’s Next?

Now that you can load and examine data, in the next lesson we’ll learn how to ask good questions about our data and start exploring patterns. We’ll discover what kinds of questions our student dataset can answer!

Next: Exploring Data and Asking Questions

I, Oliver Bonham-Carter 👋

Associate Professor in The Department of Computer and Information Science, Allegheny College