Introduction to Pandas: Your First Data Manipulation Library

Learn pandas for data science from scratch. Master DataFrames, Series, data loading, basic operations, and data manipulation fundamentals. Complete beginner’s guide with examples.

Introduction

After mastering NumPy for numerical array operations, you possess powerful tools for working with homogeneous numerical data. However, real-world datasets rarely consist of simple numeric arrays. Instead, they contain mixed data types: customer names as strings, purchase amounts as floats, order dates as timestamps, and product categories as text. They arrive as tables with labeled columns where each column represents a different variable. They include missing values, require filtering and grouping, and need joining with other datasets. NumPy alone handles these scenarios awkwardly, requiring you to manage multiple arrays, track column meanings manually, and write extensive code for common operations. Pandas solves these problems elegantly.

Pandas, short for “panel data,” provides high-level data structures and operations designed specifically for data manipulation and analysis. The library’s two primary structures, Series (one-dimensional labeled arrays) and DataFrame (two-dimensional labeled tables), bring spreadsheet-like and SQL-like capabilities to Python while integrating seamlessly with NumPy’s numerical computing power. If NumPy provides the computational engine, pandas provides the interface that makes working with real-world datasets natural and intuitive. You can load data from CSV, Excel, databases, and dozens of other formats. You can filter rows, select columns, group by categories, merge datasets, handle missing values, and transform data, all with readable, expressive code.

Pandas has become the standard tool for data manipulation in Python, so ubiquitous that “doing data science in Python” essentially means “using pandas.” Every data science workflow begins with pandas: loading raw data, exploring its structure, cleaning and transforming it, and preparing it for analysis or modeling. Machine learning libraries like scikit-learn accept pandas DataFrames directly. Visualization libraries integrate with pandas seamlessly. Learning pandas represents not just learning another library but gaining the primary interface through which you will interact with data throughout your career. The investment in learning pandas thoroughly pays immediate dividends in productivity and capability.

This comprehensive guide introduces pandas from first principles through practical competence. You will learn what pandas is and why it dominates data manipulation in Python, how Series and DataFrames structure data with labels and mixed types, how to create DataFrames from various sources, how to explore and understand DataFrame structure and contents, how to access data through indexing and selection, and common operations for viewing and summarizing data. You will also discover how pandas builds on NumPy while providing higher-level abstractions, best practices for working with DataFrames, and how to think about data manipulation problems in pandas terms. By the end, you will confidently load and explore datasets, understanding DataFrames thoroughly and recognizing when pandas is the right tool.

What Is Pandas and Why Does It Matter?

Pandas provides data structures and tools designed for practical data analysis in Python. Created by Wes McKinney in 2008 while working in finance, pandas addressed the need for flexible, high-performance tools for quantitative analysis. The library quickly expanded beyond finance to become the standard data manipulation library across all domains, from scientific research to business analytics to machine learning preprocessing.

The name “pandas” derives from “panel data,” an econometrics term for multi-dimensional structured datasets. While pandas now extends far beyond panel data specifically, the name stuck. The library focuses on tabular data: datasets organized as rows of observations and columns of variables, much like spreadsheets or database tables. This focus aligns perfectly with how most real-world data arrives and how analysts naturally think about data.

Pandas builds directly on NumPy, using NumPy arrays internally for storage and computation. However, pandas adds critical capabilities that NumPy lacks:

Labeled axes: DataFrames have labeled rows (index) and columns (column names), letting you reference data by meaningful names rather than numeric positions. This labeling makes code self-documenting and prevents position-based errors.

Mixed data types: While NumPy arrays must contain homogeneous types, DataFrame columns can each have different types. One column might contain strings, another floats, and another dates.

Missing data handling: Real datasets contain missing values. Pandas provides explicit support for missing data through NaN values and comprehensive methods for detecting, filling, or removing missing values.

Database-like operations: Pandas implements SQL-like operations including filtering, grouping, joining, and aggregating through Python syntax. You can filter rows based on complex conditions, group data by categories and compute statistics, merge datasets like database joins, and reshape data between wide and long formats.

Time series functionality: Pandas includes extensive capabilities for working with time-indexed data, from date parsing and conversion to resampling and rolling windows.

I/O operations: Pandas reads and writes data in dozens of formats including CSV, Excel, JSON, SQL databases, Parquet, and more, with intelligent handling of data types, missing values, and encoding.

While NumPy excels at numerical computation on arrays, pandas excels at the broader task of data wrangling: loading messy real-world data, exploring it, cleaning it, transforming it, and preparing it for analysis. Most data science projects spend the majority of time in this wrangling phase, making pandas indispensable.
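
To make the contrast concrete, here is a small sketch (with made-up names and salaries) showing the same filtering task done with parallel NumPy arrays versus one labeled DataFrame:

```python
import numpy as np
import pandas as pd

# With plain NumPy you juggle parallel arrays and remember what each one means
names = np.array(['Alice', 'Bob'])
salaries = np.array([70000.0, 85000.0])
high_earners = names[salaries > 80000]

# With pandas the same data lives in one labeled table
df = pd.DataFrame({'name': ['Alice', 'Bob'], 'salary': [70000.0, 85000.0]})
high_df = df[df['salary'] > 80000]

print(high_earners)              # ['Bob']
print(high_df['name'].tolist())  # ['Bob']
```

Both approaches work here, but the DataFrame keeps the name and salary for each person bound together, so there is no way for the two "columns" to silently fall out of sync.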

Installing and Importing Pandas

Install pandas using conda or pip:

Bash
# Using conda (recommended, handles dependencies better)
conda install pandas

# Using pip
pip install pandas

Import pandas with the standard alias:

Python
import pandas as pd

The pd alias is the universal convention. Always use it for consistency with documentation and other code.

Verify installation and check version:

Python
print(pd.__version__)

Pandas evolves continuously, so the version matters. Version 1.3 or later provides modern functionality; version 2.0 introduced some breaking changes alongside significant improvements.

Pandas uses NumPy internally, but importing pandas does not make the np name available in your code. Import NumPy explicitly whenever you need it:

Python
import pandas as pd
import numpy as np  # Import explicitly; pandas does not expose it for you

Understanding Series: Pandas’ One-Dimensional Structure

Before exploring DataFrames, understanding Series provides a foundation for how pandas structures data. A Series is a one-dimensional labeled array that can hold any data type.

Create a Series from a list:

Python
import pandas as pd

# Simple Series from list
s = pd.Series([1, 2, 3, 4, 5])
print(s)

Output:

Python
0    1
1    2
2    3
3    4
4    5
dtype: int64

The left column shows the index (labels), the right shows values. By default, pandas creates a numeric index starting at 0.

Create Series with custom index:

Python
s = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
print(s)

Output:

Python
a    100
b    200
c    300
dtype: int64

Now labels are meaningful strings rather than positions.
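
Labels also change how slicing behaves. A small sketch: label-based slices include both endpoints, while positional slices (via iloc) exclude the stop, like ordinary Python lists:

```python
import pandas as pd

s = pd.Series([100, 200, 300, 400], index=['a', 'b', 'c', 'd'])

# Label-based slicing includes BOTH endpoints
print(s['a':'c'])   # rows a, b, c

# Positional slicing excludes the stop, like normal Python slicing
print(s.iloc[0:2])  # rows a, b
```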

Create Series from dictionary:

Python
data = {'Alice': 85, 'Bob': 92, 'Charlie': 78}
s = pd.Series(data)
print(s)

Output:

Python
Alice      85
Bob        92
Charlie    78
dtype: int64

Dictionary keys become the index automatically.
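
If you pass both a dictionary and an explicit index, pandas selects and orders entries by the index; labels missing from the dictionary become NaN, which also forces the dtype to float. A small example (with an invented label, Diana):

```python
import pandas as pd

data = {'Alice': 85, 'Bob': 92, 'Charlie': 78}

# The index picks and orders entries; 'Diana' is not in the dictionary,
# so her value is NaN and the dtype becomes float64
s = pd.Series(data, index=['Bob', 'Alice', 'Diana'])
print(s)
```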

Access Series elements like dictionaries:

Python
s = pd.Series([100, 200, 300], index=['a', 'b', 'c'])

print(s['a'])  # 100
print(s['b'])  # 200

Or by position using iloc (plain s[0] positional access on a labeled Series is deprecated in recent pandas versions, because an integer is ambiguous between a label and a position):

Python
print(s.iloc[0])  # 100
print(s.iloc[1])  # 200

Series support vectorized operations like NumPy arrays:

Python
s = pd.Series([1, 2, 3, 4, 5])

print(s * 2)      # Multiply all elements by 2
print(s ** 2)     # Square all elements
print(s > 3)      # Boolean mask
print(s[s > 3])   # Filter using mask
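
One behavior that distinguishes Series arithmetic from NumPy arithmetic is index alignment: operations match elements by label, not by position. A small sketch with two partially overlapping indexes:

```python
import pandas as pd

a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20, 30], index=['y', 'z', 'w'])

# Arithmetic aligns on labels, not positions; labels present in only
# one Series produce NaN in the result
total = a + b
print(total)
```

This alignment is a feature (it prevents accidentally adding mismatched rows), but the resulting NaN values can surprise newcomers.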

Series have attributes describing them:

Python
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])

print(s.values)   # Underlying NumPy array: [1 2 3 4 5]
print(s.index)    # Index object: Index(['a', 'b', 'c', 'd', 'e'])
print(s.dtype)    # Data type: int64
print(s.shape)    # Shape: (5,)
print(s.size)     # Number of elements: 5

While Series are useful independently, they primarily serve as building blocks for DataFrames, where each column is a Series.

Understanding DataFrames: Pandas’ Two-Dimensional Structure

DataFrames represent the primary pandas data structure: two-dimensional labeled tables with columns potentially containing different types. Think of DataFrames as spreadsheets or database tables in Python.

Create a DataFrame from a dictionary:

Python
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['Boston', 'Seattle', 'Chicago', 'Austin'],
    'Salary': [70000, 85000, 90000, 75000]
}

df = pd.DataFrame(data)
print(df)

Output:

Python
      Name  Age     City  Salary
0    Alice   25   Boston   70000
1      Bob   30  Seattle   85000
2  Charlie   35  Chicago   90000
3    Diana   28   Austin   75000

Dictionary keys become column names. Pandas creates a numeric index (0, 1, 2, 3) automatically.

Each column is a Series:

Python
print(type(df['Name']))  # <class 'pandas.core.series.Series'>
print(df['Age'])

Create DataFrame from list of dictionaries:

Python
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'Boston'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Seattle'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]

df = pd.DataFrame(data)
print(df)

Each dictionary becomes a row.
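
The dictionaries need not share the same keys. Any key absent from a row becomes NaN in that row, as this small sketch shows:

```python
import pandas as pd

# Rows with different keys: missing entries become NaN
data = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob', 'City': 'Seattle'},
]
df = pd.DataFrame(data)
print(df)
```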

Create DataFrame from NumPy array:

Python
import numpy as np

arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])

df = pd.DataFrame(arr, columns=['A', 'B', 'C'])
print(df)

Output:

Python
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9

Specify custom index when creating DataFrame:

Python
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}

df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])
print(df)

Output:

Python
         Name  Age
row1    Alice   25
row2      Bob   30
row3  Charlie   35

Exploring DataFrame Structure

Before analyzing data, understand its structure and contents. Pandas provides many methods for exploration.

View first few rows:

Python
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'Salary': [70000, 85000, 90000, 75000, 80000]
})

print(df.head())  # First 5 rows by default
print(df.head(3))  # First 3 rows

View last few rows:

Python
print(df.tail())  # Last 5 rows by default
print(df.tail(2))  # Last 2 rows

Get DataFrame information:

Python
print(df.info())

Output:

Bash
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Name    5 non-null      object
 1   Age     5 non-null      int64 
 2   Salary  5 non-null      int64 
dtypes: int64(2), object(1)
memory usage: 248.0+ bytes

This shows the number of rows, the column names, data types, non-null counts, and memory usage.

Get summary statistics:

Python
print(df.describe())

Output:

Python
             Age        Salary
count   5.000000      5.000000
mean   30.000000  80000.000000
std     3.807887   7905.694150
min    25.000000  70000.000000
25%    28.000000  75000.000000
50%    30.000000  80000.000000
75%    32.000000  85000.000000
max    35.000000  90000.000000

This computes statistics for numeric columns automatically.
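
By default describe() skips non-numeric columns. The include parameter extends the summary; for string columns it reports count, unique, top (most frequent value), and freq, as in this sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
})

# Summarize only the string (object) columns: count, unique, top, freq
desc = df.describe(include='object')
print(desc)

# Or summarize every column at once
print(df.describe(include='all'))
```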

Access DataFrame attributes:

Python
print(df.shape)       # (5, 3) - 5 rows, 3 columns
print(df.columns)     # Column names
print(df.index)       # Row index
print(df.dtypes)      # Data types of each column
print(df.size)        # Total elements: 15
print(len(df))        # Number of rows: 5

Check for missing values:

Python
print(df.isnull().sum())  # Count missing values per column
print(df.notnull().sum())  # Count non-missing values

Get column names as list:

Python
columns = df.columns.tolist()
print(columns)  # ['Name', 'Age', 'Salary']

Selecting Data: Columns and Rows

Pandas provides multiple ways to select data, each suited for different scenarios.

Select single column (returns Series):

Python
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [70000, 85000, 90000]
})

# Two equivalent ways
ages = df['Age']
ages = df.Age  # Dot notation - only works if name is valid Python identifier

print(ages)
print(type(ages))  # <class 'pandas.core.series.Series'>

Select multiple columns (returns DataFrame):

Python
# Pass list of column names
subset = df[['Name', 'Age']]
print(subset)
print(type(subset))  # <class 'pandas.core.frame.DataFrame'>

Note the double brackets: outer brackets for indexing, inner brackets for list.

Select rows by position using iloc:

Python
# First row
print(df.iloc[0])  # Returns Series

# First three rows
print(df.iloc[0:3])  # Returns DataFrame

# Specific rows
print(df.iloc[[0, 2]])  # First and third rows

Select rows by label using loc:

Python
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}, index=['row1', 'row2', 'row3'])

# Select by index label
print(df.loc['row1'])

# Select multiple rows
print(df.loc[['row1', 'row3']])

Select rows and columns together:

Python
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [70000, 85000, 90000]
})

# Rows 0-1, columns Name and Age
print(df.loc[0:1, ['Name', 'Age']])

# All rows, specific columns
print(df.loc[:, ['Name', 'Salary']])

# Specific rows and columns by position
print(df.iloc[0:2, 0:2])
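
A common stumbling block: with a default RangeIndex, loc slices look like iloc slices but behave differently, because loc slices by label and includes the stop value while iloc slices by position and excludes it:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                   'Age': [25, 30, 35]})

# loc slices by LABEL and includes the stop value...
print(len(df.loc[0:1]))   # 2 rows (labels 0 and 1)

# ...while iloc slices by POSITION and excludes it, like Python lists
print(len(df.iloc[0:1]))  # 1 row (position 0 only)
```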

Filter rows based on conditions:

Python
# Boolean indexing
print(df[df['Age'] > 28])

# Multiple conditions
print(df[(df['Age'] > 25) & (df['Salary'] < 90000)])

# Using query method
print(df.query('Age > 28'))
print(df.query('Age > 25 and Salary < 90000'))
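
Two other filtering tools worth knowing early: isin matches a column against a set of allowed values, and the .str accessor applies string methods element-wise. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'City': ['Boston', 'Seattle', 'Chicago'],
})

# isin tests membership against a list of allowed values
subset = df[df['City'].isin(['Boston', 'Chicago'])]
print(subset['Name'].tolist())    # ['Alice', 'Charlie']

# String methods work element-wise through the .str accessor
starts_c = df[df['City'].str.startswith('C')]
print(starts_c['Name'].tolist())  # ['Charlie']
```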

Modifying DataFrames

Add new columns:

Python
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [70000, 85000, 90000]
})

# Add constant value
df['Country'] = 'USA'

# Add calculated column
df['Salary_K'] = df['Salary'] / 1000

# Add using apply
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Senior')

print(df)

Modify existing values:

Python
# Modify entire column
df['Age'] = df['Age'] + 1

# Modify based on condition
df.loc[df['Age'] > 30, 'Salary'] = df['Salary'] * 1.1

# Modify single value
df.loc[0, 'Name'] = 'Alice Smith'

Delete columns:

Python
# Drop column (returns new DataFrame)
df_new = df.drop('Salary_K', axis=1)

# Drop column in place
df.drop('Salary_K', axis=1, inplace=True)

# Drop multiple columns
df.drop(['Country', 'Age_Group'], axis=1, inplace=True)

Delete rows:

Python
# Drop rows by index
df.drop([0, 2], axis=0, inplace=True)

# Drop rows based on condition
df = df[df['Age'] > 25]

Rename columns:

Python
df.rename(columns={'Name': 'Full_Name', 'Age': 'Years'}, inplace=True)
print(df.columns)

Common DataFrame Operations

Sort by values:

Python
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [70000, 85000, 90000]
})

# Sort by one column
df_sorted = df.sort_values('Age')

# Sort descending
df_sorted = df.sort_values('Age', ascending=False)

# Sort by multiple columns
df_sorted = df.sort_values(['Age', 'Salary'], ascending=[True, False])

Reset index:

Python
df_reset = df.reset_index(drop=True)  # Drop old index

Set index:

Python
df.set_index('Name', inplace=True)
print(df)

Count values:

Python
# Count occurrences in a column
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Senior')
counts = df['Age_Group'].value_counts()
print(counts)

Handling Missing Data

Create DataFrame with missing values:

Python
import numpy as np

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, np.nan, 35, 28],
    'Salary': [70000, 85000, np.nan, 75000]
})

print(df)

Detect missing values:

Python
print(df.isnull())  # Returns boolean DataFrame
print(df.isnull().sum())  # Count per column

Drop rows with missing values:

Python
df_clean = df.dropna()  # Drop any row with any missing value
df_clean = df.dropna(subset=['Age'])  # Drop if Age is missing

Fill missing values:

Python
# Fill with specific value
df_filled = df.fillna(0)

# Fill with column mean (assign back; chained inplace fillna on a
# single column is deprecated in recent pandas versions)
df['Age'] = df['Age'].fillna(df['Age'].mean())

# Forward fill (use previous value); fillna(method='ffill') is deprecated
df = df.ffill()

Best Practices for Working with DataFrames

Follow these patterns for effective pandas usage:

Use meaningful column names without spaces:

Python
# Good
df.columns = ['name', 'age', 'salary']

# Bad: spaces complicate access (forces df['Full Name'], no dot notation)
df.columns = ['Full Name', 'Age', 'Salary']

Chain operations for readability:

Python
result = (df
    .query('Age > 25')
    .sort_values('Salary', ascending=False)
    .head(10)
)

Avoid modifying DataFrames during iteration:

Python
# Don't do this
for idx, row in df.iterrows():
    df.loc[idx, 'New_Col'] = row['Age'] * 2

# Do this instead (vectorized)
df['New_Col'] = df['Age'] * 2
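
For conditional logic, the same vectorized principle applies: np.where acts as a vectorized if/else, replacing a row-by-row loop or apply with a single array operation. A small sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35]})

# np.where is a vectorized if/else: no Python-level loop required
df['Age_Group'] = np.where(df['Age'] < 30, 'Young', 'Senior')
print(df)
```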

Use copy() to avoid unintended modifications:

Python
df_subset = df[['Name', 'Age']].copy()  # Creates independent copy

Conclusion

Pandas transforms Python into a powerful platform for data manipulation, providing intuitive interfaces for the operations that consume most data science time: loading data, exploring structure, cleaning issues, filtering observations, selecting features, and transforming formats. Understanding DataFrames and Series thoroughly provides a foundation for all pandas work, from simple data exploration through complex data wrangling pipelines.

The transition from NumPy arrays to pandas DataFrames requires embracing labeled axes and mixed types. Instead of tracking what each array position means, you reference data by meaningful names. Instead of managing multiple related arrays, you work with unified DataFrames. This higher-level abstraction makes code more readable and less error-prone while maintaining NumPy’s computational efficiency underneath.

This introduction provides a foundation for the coming articles covering reading data from files, advanced selection and filtering, grouping and aggregation, merging and joining, and specialized operations. Practice creating DataFrames, exploring their structure, selecting data, and performing basic operations. Build intuition for when to use loc versus iloc, how indexing works, and how operations compose together. With solid DataFrame fundamentals, you will confidently tackle real datasets and recognize pandas as the indispensable tool it is for data science.
