Introduction
After mastering NumPy for numerical array operations, you possess powerful tools for working with homogeneous numerical data. However, real-world datasets rarely consist of simple numeric arrays. Instead, they contain mixed data types: customer names as strings, purchase amounts as floats, order dates as timestamps, and product categories as text. They arrive as tables with labeled columns where each column represents a different variable. They include missing values, require filtering and grouping, and need joining with other datasets. NumPy alone handles these scenarios awkwardly, requiring you to manage multiple arrays, track column meanings manually, and write extensive code for common operations. Pandas solves these problems elegantly.
Pandas, short for “panel data,” provides high-level data structures and operations designed specifically for data manipulation and analysis. The library’s two primary structures, Series (one-dimensional labeled arrays) and DataFrame (two-dimensional labeled tables), bring spreadsheet-like and SQL-like capabilities to Python while integrating seamlessly with NumPy’s numerical computing power. If NumPy provides the computational engine, pandas provides the interface that makes working with real-world datasets natural and intuitive. You can load data from CSV, Excel, databases, and dozens of other formats. You can filter rows, select columns, group by categories, merge datasets, handle missing values, and transform data, all with readable, expressive code.
Pandas has become the standard tool for data manipulation in Python, so ubiquitous that “doing data science in Python” essentially means “using pandas.” Every data science workflow begins with pandas: loading raw data, exploring its structure, cleaning and transforming it, and preparing it for analysis or modeling. Machine learning libraries like scikit-learn accept pandas DataFrames directly. Visualization libraries integrate with pandas seamlessly. Learning pandas represents not just learning another library but gaining the primary interface through which you will interact with data throughout your career. The investment in learning pandas thoroughly pays immediate dividends in productivity and capability.
This comprehensive guide introduces pandas from first principles through practical competence. You will learn what pandas is and why it dominates data manipulation in Python, how Series and DataFrames structure data with labels and mixed types, how to create DataFrames from various sources, how to explore and understand DataFrame structure and contents, how to access data through indexing and selection, and common operations for viewing and summarizing data. You will also discover how pandas builds on NumPy while providing higher-level abstractions, best practices for working with DataFrames, and how to think about data manipulation problems in pandas terms. By the end, you will confidently load and explore datasets, understanding DataFrames thoroughly and recognizing when pandas is the right tool.
What Is Pandas and Why Does It Matter?
Pandas provides data structures and tools designed for practical data analysis in Python. Created by Wes McKinney in 2008 while working in finance, pandas addressed the need for flexible, high-performance tools for quantitative analysis. The library quickly expanded beyond finance to become the standard data manipulation library across all domains, from scientific research to business analytics to machine learning preprocessing.
The name “pandas” derives from “panel data,” an econometrics term for multi-dimensional structured datasets. While pandas now extends far beyond panel data specifically, the name stuck. The library focuses on tabular data: datasets organized as rows of observations and columns of variables, much like spreadsheets or database tables. This focus aligns perfectly with how most real-world data arrives and how analysts naturally think about data.
Pandas builds directly on NumPy, using NumPy arrays internally for storage and computation. However, pandas adds critical capabilities that NumPy lacks; a short sketch after this list shows several of them in action:
Labeled axes: DataFrames have labeled rows (index) and columns (column names), letting you reference data by meaningful names rather than numeric positions. This labeling makes code self-documenting and prevents position-based errors.
Mixed data types: While NumPy arrays must contain homogeneous types, DataFrame columns can each have different types. One column might contain strings, another floats, and another dates.
Missing data handling: Real datasets contain missing values. Pandas provides explicit support for missing data through NaN values and comprehensive methods for detecting, filling, or removing missing values.
Database-like operations: Pandas implements SQL-like operations including filtering, grouping, joining, and aggregating through Python syntax. You can filter rows based on complex conditions, group data by categories and compute statistics, merge datasets like database joins, and reshape data between wide and long formats.
Time series functionality: Pandas includes extensive capabilities for working with time-indexed data, from date parsing and conversion to resampling and rolling windows.
I/O operations: Pandas reads and writes data in dozens of formats including CSV, Excel, JSON, SQL databases, Parquet, and more, with intelligent handling of data types, missing values, and encoding.
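A minimal sketch, with illustrative names and values, showing several of these capabilities working together:
import pandas as pd
import numpy as np
# Mixed column types, a labeled index, and a missing value in one table
orders = pd.DataFrame({
    'customer': ['Alice', 'Bob', 'Alice'],
    'amount': [19.99, np.nan, 42.50],
    'category': ['books', 'games', 'books']
}, index=['order1', 'order2', 'order3'])
print(orders['amount'].fillna(0)) # Explicit missing-data handling
print(orders.groupby('category')['amount'].sum()) # SQL-like grouping and aggregation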
While NumPy excels at numerical computation on arrays, pandas excels at the broader task of data wrangling: loading messy real-world data, exploring it, cleaning it, transforming it, and preparing it for analysis. Most data science projects spend the majority of time in this wrangling phase, making pandas indispensable.
Installing and Importing Pandas
Install pandas using conda or pip:
# Using conda (recommended, handles dependencies better)
conda install pandas
# Using pip
pip install pandas
Import pandas with the standard alias:
import pandas as pd
The pd alias is a universal convention. Always use it for consistency with documentation and other code.
Verify installation and check version:
print(pd.__version__)
Pandas evolves continuously, so the version matters. Version 1.3 or later provides modern functionality, though version 2.0 introduced some breaking changes alongside significant improvements.
Pandas is built on top of NumPy and imports it internally, but that does not place NumPy in your own namespace. Import NumPy explicitly whenever you call its functions directly:
import pandas as pd
import numpy as np # Still needed for direct NumPy calls
Understanding Series: Pandas’ One-Dimensional Structure
Before exploring DataFrames, understanding Series provides a foundation for how pandas structures data. A Series is a one-dimensional labeled array that can hold any data type.
Create a Series from a list:
import pandas as pd
# Simple Series from list
s = pd.Series([1, 2, 3, 4, 5])
print(s)
Output:
0 1
1 2
2 3
3 4
4 5
dtype: int64
The left column shows the index (labels), the right shows values. By default, pandas creates a numeric index starting at 0.
Create Series with custom index:
s = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
print(s)
Output:
a 100
b 200
c 300
dtype: int64
Now labels are meaningful strings rather than positions.
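One detail worth noting: slicing with labels includes both endpoints, unlike Python’s usual positional slicing:
print(s['a':'c']) # Label slices are inclusive: prints all three elements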
Create Series from dictionary:
data = {'Alice': 85, 'Bob': 92, 'Charlie': 78}
s = pd.Series(data)
print(s)
Output:
Alice 85
Bob 92
Charlie 78
dtype: int64
Dictionary keys become the index automatically.
Access Series elements like dictionaries:
s = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
print(s['a']) # 100
print(s['b']) # 200
Or by position, like NumPy arrays. Plain bracket access such as s[0] is deprecated for positional lookups in recent pandas versions, so prefer .iloc:
print(s.iloc[0]) # 100
print(s.iloc[1]) # 200
Series support vectorized operations like NumPy arrays:
s = pd.Series([1, 2, 3, 4, 5])
print(s * 2) # Multiply all elements by 2
print(s ** 2) # Square all elements
print(s > 3) # Boolean mask
print(s[s > 3]) # Filter using mask
Series have attributes describing them:
s = pd.Series([1, 2, 3, 4, 5], index=['a', 'b', 'c', 'd', 'e'])
print(s.values) # Underlying NumPy array: [1 2 3 4 5]
print(s.index) # Index object: Index(['a', 'b', 'c', 'd', 'e'])
print(s.dtype) # Data type: int64
print(s.shape) # Shape: (5,)
print(s.size) # Number of elements: 5
While Series are useful independently, they primarily serve as building blocks for DataFrames, where each column is a Series.
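One more behavior that distinguishes Series from NumPy arrays: arithmetic between two Series aligns on index labels rather than positions. A short sketch:
s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20, 30], index=['b', 'c', 'd'])
print(s1 + s2) # 'b' and 'c' add by label; 'a' and 'd' become NaN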
Understanding DataFrames: Pandas’ Two-Dimensional Structure
DataFrames represent the primary pandas data structure: two-dimensional labeled tables with columns potentially containing different types. Think of DataFrames as spreadsheets or database tables in Python.
Create a DataFrame from a dictionary:
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['Boston', 'Seattle', 'Chicago', 'Austin'],
    'Salary': [70000, 85000, 90000, 75000]
}
df = pd.DataFrame(data)
print(df)
Output:
      Name  Age     City  Salary
0    Alice   25   Boston   70000
1      Bob   30  Seattle   85000
2  Charlie   35  Chicago   90000
3    Diana   28   Austin   75000
Dictionary keys become column names. Pandas creates a numeric index (0, 1, 2, 3) automatically.
Each column is a Series:
print(type(df['Name'])) # <class 'pandas.core.series.Series'>
print(df['Age'])
Create DataFrame from list of dictionaries:
data = [
    {'Name': 'Alice', 'Age': 25, 'City': 'Boston'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Seattle'},
    {'Name': 'Charlie', 'Age': 35, 'City': 'Chicago'}
]
df = pd.DataFrame(data)
print(df)
Each dictionary becomes a row.
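If one dictionary lacks a key that others provide, pandas fills that cell with NaN rather than raising an error:
data = [
    {'Name': 'Alice', 'Age': 25},
    {'Name': 'Bob'} # No 'Age' key
]
df = pd.DataFrame(data)
print(df) # Bob's Age appears as NaN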
Create DataFrame from NumPy array:
import numpy as np
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
df = pd.DataFrame(arr, columns=['A', 'B', 'C'])
print(df)
Output:
   A  B  C
0  1  2  3
1  4  5  6
2  7  8  9
Specify custom index when creating DataFrame:
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}
df = pd.DataFrame(data, index=['row1', 'row2', 'row3'])
print(df)
Output:
         Name  Age
row1    Alice   25
row2      Bob   30
row3  Charlie   35
Exploring DataFrame Structure
Before analyzing data, understand its structure and contents. Pandas provides many methods for exploration.
View first few rows:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [25, 30, 35, 28, 32],
    'Salary': [70000, 85000, 90000, 75000, 80000]
})
print(df.head()) # First 5 rows by default
print(df.head(3)) # First 3 rows
View last few rows:
print(df.tail()) # Last 5 rows by default
print(df.tail(2)) # Last 2 rows
Get DataFrame information:
df.info() # info() prints its report directly, so print() is unnecessary
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Name    5 non-null      object
 1   Age     5 non-null      int64
 2   Salary  5 non-null      int64
dtypes: int64(2), object(1)
memory usage: 248.0+ bytes
This shows the number of rows, column names, non-null counts, data types, and memory usage.
Get summary statistics:
print(df.describe())
Output:
             Age        Salary
count   5.000000      5.000000
mean   30.000000  80000.000000
std     3.807887   7905.694150
min    25.000000  70000.000000
25%    28.000000  75000.000000
50%    30.000000  80000.000000
75%    32.000000  85000.000000
max    35.000000  90000.000000
This computes statistics for numeric columns automatically.
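By default describe() covers only numeric columns. To summarize string columns as well (count, unique values, most frequent value), pass an include argument:
print(df.describe(include='object')) # String columns only
print(df.describe(include='all')) # Every column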
Access DataFrame attributes:
print(df.shape) # (5, 3) - 5 rows, 3 columns
print(df.columns) # Column names
print(df.index) # Row index
print(df.dtypes) # Data types of each column
print(df.size) # Total elements: 15
print(len(df)) # Number of rows: 5
Check for missing values:
print(df.isnull().sum()) # Count missing values per column
print(df.notnull().sum()) # Count non-missing values
Get column names as list:
columns = df.columns.tolist()
print(columns) # ['Name', 'Age', 'Salary']
Selecting Data: Columns and Rows
Pandas provides multiple ways to select data, each suited for different scenarios.
Select single column (returns Series):
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [70000, 85000, 90000]
})
# Two equivalent ways
ages = df['Age']
ages = df.Age # Dot notation - only works if name is valid Python identifier
print(ages)
print(type(ages)) # <class 'pandas.core.series.Series'>
Select multiple columns (returns DataFrame):
# Pass list of column names
subset = df[['Name', 'Age']]
print(subset)
print(type(subset)) # <class 'pandas.core.frame.DataFrame'>
Note the double brackets: the outer brackets index the DataFrame, the inner brackets form a list of column names.
Select rows by position using iloc:
# First row
print(df.iloc[0]) # Returns Series
# First three rows
print(df.iloc[0:3]) # Returns DataFrame
# Specific rows
print(df.iloc[[0, 2]]) # First and third rows
Select rows by label using loc:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35]
}, index=['row1', 'row2', 'row3'])
# Select by index label
print(df.loc['row1'])
# Select multiple rows
print(df.loc[['row1', 'row3']])
Select rows and columns together:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [70000, 85000, 90000]
})
# Rows 0-1, columns Name and Age
print(df.loc[0:1, ['Name', 'Age']])
# All rows, specific columns
print(df.loc[:, ['Name', 'Salary']])
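# For reading or writing a single cell, at (label-based) and iat
# (position-based) are concise, fast alternatives:
print(df.at[0, 'Name']) # 'Alice'
print(df.iat[0, 2]) # 70000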
# Specific rows and columns by position
print(df.iloc[0:2, 0:2])
Filter rows based on conditions:
# Boolean indexing
print(df[df['Age'] > 28])
# Multiple conditions
print(df[(df['Age'] > 25) & (df['Salary'] < 90000)])
# Using query method
print(df.query('Age > 28'))
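# query can also reference Python variables with the @ prefix
# (min_age is an illustrative variable, not part of the original example)
min_age = 25
print(df.query('Age > @min_age'))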
print(df.query('Age > 25 and Salary < 90000'))
Modifying DataFrames
Add new columns:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [70000, 85000, 90000]
})
# Add constant value
df['Country'] = 'USA'
# Add calculated column
df['Salary_K'] = df['Salary'] / 1000
# Add using apply
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Senior')
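# A vectorized alternative to apply for simple two-way logic,
# using NumPy (assumes numpy is imported as np)
df['Age_Group'] = np.where(df['Age'] < 30, 'Young', 'Senior')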
print(df)
Modify existing values:
# Modify entire column
df['Age'] = df['Age'] + 1
# Modify based on condition
df.loc[df['Age'] > 30, 'Salary'] = df['Salary'] * 1.1
# Modify single value
df.loc[0, 'Name'] = 'Alice Smith'
Delete columns:
# Drop column (returns new DataFrame)
df_new = df.drop('Salary_K', axis=1)
# Drop column in place
df.drop('Salary_K', axis=1, inplace=True)
# Drop multiple columns
df.drop(['Country', 'Age_Group'], axis=1, inplace=True)
Delete rows:
# Drop rows by index
df.drop([0, 2], axis=0, inplace=True)
# Drop rows based on condition
df = df[df['Age'] > 25]
Rename columns:
df.rename(columns={'Name': 'Full_Name', 'Age': 'Years'}, inplace=True)
print(df.columns)
Common DataFrame Operations
Sort by values:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'Salary': [70000, 85000, 90000]
})
# Sort by one column
df_sorted = df.sort_values('Age')
# Sort descending
df_sorted = df.sort_values('Age', ascending=False)
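# Sort by the index itself rather than by column values
df_sorted = df.sort_index()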
# Sort by multiple columns
df_sorted = df.sort_values(['Age', 'Salary'], ascending=[True, False])
Reset index:
df_reset = df.reset_index(drop=True) # Drop old index
Set index:
df.set_index('Name', inplace=True)
print(df)
Count values:
# Count occurrences in a column
df['Age_Group'] = df['Age'].apply(lambda x: 'Young' if x < 30 else 'Senior')
counts = df['Age_Group'].value_counts()
print(counts)
Handling Missing Data
Create DataFrame with missing values:
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, np.nan, 35, 28],
    'Salary': [70000, 85000, np.nan, 75000]
})
print(df)
Detect missing values:
print(df.isnull()) # Returns boolean DataFrame
print(df.isnull().sum()) # Count per column
Drop rows with missing values:
df_clean = df.dropna() # Drop any row with any missing value
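# how='all' drops only rows where every value is missing
df_clean = df.dropna(how='all')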
df_clean = df.dropna(subset=['Age']) # Drop if Age is missing
Fill missing values:
# Fill with specific value
df_filled = df.fillna(0)
# Fill with column mean (plain assignment avoids the deprecated inplace-on-a-column pattern)
df['Age'] = df['Age'].fillna(df['Age'].mean())
# Forward fill (use previous value); fillna(method='ffill') is deprecated in favor of ffill()
df = df.ffill()
Best Practices for Working with DataFrames
Follow these patterns for effective pandas usage:
Use meaningful column names without spaces:
# Good
df.columns = ['name', 'age', 'salary']
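# A common cleanup pattern when column names arrive messy:
# normalize them in one pass with the .str accessor
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_')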
# Avoid spaces (complicates access)
df.columns = ['Full Name', 'Age', 'Salary']
Chain operations for readability:
result = (df
          .query('Age > 25')
          .sort_values('Salary', ascending=False)
          .head(10)
)
Avoid modifying DataFrames during iteration:
# Don't do this
for idx, row in df.iterrows():
df.loc[idx, 'New_Col'] = row['Age'] * 2
# Do this instead (vectorized)
df['New_Col'] = df['Age'] * 2
Use copy() to avoid unintended modifications:
df_subset = df[['Name', 'Age']].copy() # Creates independent copy
Without .copy(), the subset may be a view of the original, and modifying it can trigger pandas’ SettingWithCopyWarning or silently propagate changes back to the original DataFrame.
Conclusion
Pandas transforms Python into a powerful platform for data manipulation, providing intuitive interfaces for the operations that consume most data science time: loading data, exploring structure, cleaning messy values, filtering observations, selecting features, and transforming formats. Understanding DataFrames and Series thoroughly provides the foundation for all pandas work, from simple data exploration through complex data wrangling pipelines.
The transition from NumPy arrays to pandas DataFrames requires embracing labeled axes and mixed types. Instead of tracking what each array position means, you reference data by meaningful names. Instead of managing multiple related arrays, you work with unified DataFrames. This higher-level abstraction makes code more readable and less error-prone while maintaining NumPy’s computational efficiency underneath.
This introduction lays the foundation for the coming articles covering reading data from files, advanced selection and filtering, grouping and aggregation, merging and joining, and specialized operations. Practice creating DataFrames, exploring their structure, selecting data, and performing basic operations. Build intuition for when to use loc versus iloc, how indexing works, and how operations compose together. With solid DataFrame fundamentals, you will confidently tackle real datasets and recognize pandas as the indispensable tool it is for data science.