Geospatial data represents information tied to locations on Earth — points (a store location), lines (a road), or polygons (a country boundary). In Python, Shapely provides the geometry objects (Point, LineString, Polygon), GeoPandas extends pandas DataFrames to hold these geometries alongside regular data, and Folium or Matplotlib with contextily creates maps. The most common geospatial data science task is the spatial join: finding which polygon (country, city, zip code, census tract) each point (customer, event, sensor) falls within — using gpd.sjoin() to merge datasets by spatial relationship rather than a shared key column.
Introduction
Location matters. A restaurant’s success depends on foot traffic from nearby offices and apartments. A retailer’s sales correlate with neighborhood income and competitors within five miles. An epidemiologist tracking disease spread needs to know which census tracts have outbreaks and which hospitals are within driving distance. A logistics company routing deliveries must minimize total distance traveled across hundreds of stops.
All of these are geospatial problems — problems where the spatial dimension of data is analytically important. Geospatial analysis is a specialized but increasingly central part of data science, and Python has a mature ecosystem for it. You don’t need to become a geographic information systems (GIS) expert to do productive geospatial data science, but you do need to understand a handful of foundational concepts — coordinate systems, geometry types, and spatial operations — before the code makes sense.
This article teaches you working with geospatial data from the ground up: what geospatial data is, the key Python libraries, coordinate reference systems, reading and writing geospatial formats, creating and manipulating geometries, performing spatial joins and operations, calculating distances, and building maps. By the end you’ll be equipped to add the spatial dimension to any data science project.
Foundational Concepts
Geometry Types
Geospatial data uses three fundamental geometry types, each representing a different kind of geographic feature:
Point: A single location with longitude and latitude coordinates. Used for stores, customers, events, sensor locations, cities.
LineString: An ordered sequence of points forming a line. Used for roads, rivers, flight paths, trade routes.
Polygon: A closed shape defined by a ring of coordinates. Used for country/state/city boundaries, zip code areas, census tracts, buildings, lakes.
Multi-variants: MultiPoint, MultiLineString, MultiPolygon for features that consist of multiple disconnected parts (e.g., Hawaii is a MultiPolygon — multiple islands that together form one US state).
Coordinate Reference Systems (CRS)
A Coordinate Reference System (CRS) defines how the coordinates in your data map to actual locations on Earth. This is the most important and most frequently misunderstood concept in geospatial work.
Geographic CRS: Coordinates expressed as longitude and latitude in degrees. The most common is WGS84 / EPSG:4326 — used by GPS, Google Maps, and virtually all web mapping services. Coordinates look like (-97.7431, 30.2672) for Austin, Texas.
Projected CRS: Coordinates expressed in meters (or feet) on a flat surface, after projecting the spherical Earth onto a 2D plane. Different projections are optimized for different parts of the world. EPSG:3857 (Web Mercator) is used for web maps. UTM (Universal Transverse Mercator) zones provide accurate measurements within local regions.
Why this matters: Distance calculations and area calculations must be done in a projected CRS (in meters), not a geographic CRS (in degrees). One degree of longitude in Austin is approximately 96 km, but one degree of longitude near the poles is only a few km. If you calculate distances in degrees, your results will be wildly wrong at non-equatorial latitudes.
# The CRS workflow:
# 1. Load data (often in WGS84 / EPSG:4326)
# 2. Reproject to a local projected CRS for accurate measurements
# 3. Do distance/area calculations
# 4. Reproject back to WGS84 for mapping if neededThe EPSG Code System
EPSG codes are standardized identifiers for coordinate reference systems:
| EPSG Code | Name | Used For |
|---|---|---|
| 4326 | WGS84 (lat/lon degrees) | GPS, global datasets, web APIs |
| 3857 | Web Mercator | Web maps (Google Maps, OpenStreetMap tiles) |
| 32614 | UTM Zone 14N | Central USA (accurate measurements) |
| 27700 | British National Grid | UK datasets |
| 4269 | NAD83 | US federal datasets |
Installing the Libraries
pip install geopandas shapely folium contextily pyproj fiona- geopandas: Spatial extension of pandas — the primary workhorse
- shapely: Geometry objects and spatial operations
- folium: Interactive web maps using Leaflet.js
- contextily: Adds basemap tiles to matplotlib maps
- pyproj: CRS transformation
- fiona: Reads/writes vector geospatial formats (used by GeoPandas)
Shapely: Geometry Objects in Python
Shapely provides the fundamental geometry objects. Before using GeoPandas, understanding Shapely geometries is essential.
from shapely.geometry import Point, LineString, Polygon, MultiPoint, MultiPolygon
from shapely import wkt, wkb
import numpy as np
# ── Points ────────────────────────────────────────────────────────
# Point(longitude, latitude) — NOTE: x (longitude) comes FIRST
austin = Point(-97.7431, 30.2672)
new_york = Point(-74.0060, 40.7128)
london = Point(-0.1276, 51.5074)
tokyo = Point(139.6917, 35.6895)
print(f"Austin: {austin}") # POINT (-97.7431 30.2672)
print(f"Type: {austin.geom_type}") # Point
print(f"X (lon): {austin.x}") # -97.7431
print(f"Y (lat): {austin.y}") # 30.2672
print(f"Coords: {list(austin.coords)}") # [(-97.7431, 30.2672)]
# ── LineStrings ───────────────────────────────────────────────────
route = LineString([
(-97.7431, 30.2672), # Austin
(-95.3698, 29.7604), # Houston
(-90.0715, 29.9511), # New Orleans
(-84.3880, 33.7490), # Atlanta
])
print(f"\nRoute length (degrees): {route.length:.4f}")
# WARNING: This is length in DEGREES, not km!
# For km, use a projected CRS (see below)
print(f"Start point: {route.coords[0]}")
print(f"Number of vertices: {len(route.coords)}")
print(f"Bounding box: {route.bounds}") # (min_x, min_y, max_x, max_y)
# ── Polygons ──────────────────────────────────────────────────────
# Define the exterior ring (must close — first = last point)
texas_simplified = Polygon([
(-106.65, 31.90), # El Paso area
(-100.00, 28.00), # South
(-97.00, 26.00), # South tip
(-93.50, 29.75), # East
(-94.00, 33.55), # Northeast
(-100.00, 36.50), # Panhandle
(-103.00, 36.50), # West panhandle
(-106.65, 31.90), # Back to start
])
print(f"\nTexas (simplified):")
print(f" Area (sq degrees): {texas_simplified.area:.4f}")
print(f" Perimeter (degrees): {texas_simplified.length:.4f}")
print(f" Centroid: {texas_simplified.centroid}")
print(f" Bounds: {texas_simplified.bounds}")
# Polygon with a hole (ring) — exterior + list of interior rings
polygon_with_hole = Polygon(
exterior=[(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)],
holes=[ [(3, 3), (7, 3), (7, 7), (3, 7), (3, 3)]]
)
print(f"\nPolygon area with hole: {polygon_with_hole.area}") # 100 - 16 = 84Spatial Relationships and Operations
from shapely.geometry import Point, Polygon, LineString
from shapely.ops import unary_union
# Define some geometries
store_a = Point(0, 0)
store_b = Point(5, 0)
store_c = Point(10, 0)
city_area = Polygon([(-2, -2), (8, -2), (8, 8), (-2, 8), (-2, -2)])
suburb = Polygon([(6, -3), (12, -3), (12, 5), (6, 5), (6, -3)])
# ── Spatial predicates (return True/False) ────────────────────────
print("Spatial relationships:")
print(f" store_a within city_area: {store_a.within(city_area)}") # True
print(f" store_b within city_area: {store_b.within(city_area)}") # True
print(f" store_c within city_area: {store_c.within(city_area)}") # False
print(f" store_a within suburb: {store_a.within(suburb)}") # False
print(f" city_area intersects suburb: {city_area.intersects(suburb)}") # True
print(f" city_area contains store_a: {city_area.contains(store_a)}") # True
print(f" city_area touches suburb: {city_area.touches(suburb)}") # False
# ── Distance ──────────────────────────────────────────────────────
dist_ab = store_a.distance(store_b)
dist_ac = store_a.distance(store_c)
print(f"\nDistance A→B (degrees): {dist_ab}") # 5.0
print(f"Distance A→C (degrees): {dist_ac}") # 10.0
# In a projected CRS (meters), these would be in kilometers
# ── Geometric operations ──────────────────────────────────────────
# Buffer: expand a geometry by a distance
store_catchment = store_a.buffer(3) # Circle of radius 3 around store_a
print(f"\nBuffer type: {store_catchment.geom_type}") # Polygon
print(f"Buffer area: {store_catchment.area:.4f}") # π × 3² ≈ 28.27
# Union: combine two polygons
combined_area = city_area.union(suburb)
print(f"\nUnion type: {combined_area.geom_type}") # Polygon or MultiPolygon
# Intersection: overlapping portion
overlap = city_area.intersection(suburb)
print(f"Intersection area: {overlap.area:.4f}") # The overlapping region
# Difference: city_area minus the suburb overlap
city_only = city_area.difference(suburb)
print(f"City-only area: {city_only.area:.4f}")
# Convex hull: smallest convex polygon containing all points
points = [Point(1, 1), Point(3, 5), Point(7, 2), Point(5, 8), Point(2, 6)]
from shapely.geometry import MultiPoint
hull = MultiPoint(points).convex_hull
print(f"\nConvex hull type: {hull.geom_type}") # PolygonGeoPandas: Spatial DataFrames
GeoPandas extends pandas DataFrames with a geometry column that holds Shapely geometry objects. Every GeoDataFrame knows its CRS and supports spatial operations applied to the entire column at once.
Creating GeoDataFrames
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point
import numpy as np
# ── From a pandas DataFrame with lat/lon columns ──────────────────
stores_df = pd.DataFrame({
"store_id": ["S001", "S002", "S003", "S004", "S005"],
"store_name": ["Downtown Austin", "North Austin", "Round Rock",
"Cedar Park", "Pflugerville"],
"latitude": [30.2672, 30.4012, 30.5085, 30.5230, 30.4394],
"longitude": [-97.7431, -97.7207, -97.6784, -97.7999, -97.6201],
"revenue_2024":[1250000, 870000, 1050000, 920000, 680000]
})
# Convert to GeoDataFrame — specify geometry and CRS
stores_gdf = gpd.GeoDataFrame(
stores_df,
geometry=gpd.points_from_xy(stores_df["longitude"], stores_df["latitude"]),
crs="EPSG:4326" # WGS84 — the standard for lat/lon GPS coordinates
)
print(type(stores_gdf)) # <class 'geopandas.geodataframe.GeoDataFrame'>
print(stores_gdf.crs) # EPSG:4326
print(stores_gdf.geometry.dtype) # geometry
print(stores_gdf.head())
# ── From a file (Shapefile, GeoJSON, GeoPackage) ──────────────────
# Load a US states shapefile (from Natural Earth or US Census TIGER)
# states_gdf = gpd.read_file("data/us_states/cb_2023_us_state_20m.shp")
# Load from GeoJSON URL directly
countries_gdf = gpd.read_file(
"https://raw.githubusercontent.com/datasets/geo-countries/master/data/countries.geojson"
)
print(f"\nCountries: {len(countries_gdf)} features")
print(countries_gdf[["ADMIN", "ISO_A3", "geometry"]].head())
print(f"CRS: {countries_gdf.crs}")
# ── From GeoPandas built-in datasets (for practice) ───────────────
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
print(f"\nWorld dataset columns: {world.columns.tolist()}")
print(f"CRS: {world.crs}")CRS Transformation
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point
# Stores in WGS84 (lat/lon degrees)
stores_wgs84 = gpd.GeoDataFrame(
{"store_id": ["S001", "S002", "S003"],
"revenue": [1250000, 870000, 1050000]},
geometry=[
Point(-97.7431, 30.2672),
Point(-97.7207, 30.4012),
Point(-97.6784, 30.5085)
],
crs="EPSG:4326"
)
# Project to UTM Zone 14N (meters) for distance calculations in Texas
# EPSG:32614 = WGS84 / UTM Zone 14N
stores_utm = stores_wgs84.to_crs("EPSG:32614")
print("WGS84 coordinates (degrees):")
print(stores_wgs84.geometry.iloc[0]) # POINT (-97.7431 30.2672)
print("\nUTM Zone 14N coordinates (meters):")
print(stores_utm.geometry.iloc[0]) # POINT (596453.15 3348574.82)
# NOW distances are in meters
dist_m = stores_utm.geometry.iloc[0].distance(stores_utm.geometry.iloc[1])
print(f"\nDistance S001 → S002: {dist_m/1000:.2f} km") # ~15.3 km
# Don't forget to reproject back for mapping:
stores_back = stores_utm.to_crs("EPSG:4326")Reading and Writing Geospatial Files
import geopandas as gpd
# ── Reading formats ───────────────────────────────────────────────
# Shapefile (collection of .shp, .dbf, .shx files)
states = gpd.read_file("data/cb_2023_us_state_20m.shp")
# GeoJSON
regions = gpd.read_file("data/regions.geojson")
# GeoPackage (preferred modern format — single file, supports multiple layers)
parcels = gpd.read_file("data/land_parcels.gpkg")
parcels = gpd.read_file("data/land_parcels.gpkg", layer="residential") # Specific layer
# CSV with lat/lon
import pandas as pd
df = pd.read_csv("data/customers_with_coordinates.csv")
customers_gdf = gpd.GeoDataFrame(
df,
geometry=gpd.points_from_xy(df["longitude"], df["latitude"]),
crs="EPSG:4326"
)
# From PostGIS database
# from sqlalchemy import create_engine
# engine = create_engine("postgresql://user:pass@host/db")
# gdf = gpd.read_postgis("SELECT * FROM stores", engine, geom_col="location")
# ── Writing formats ───────────────────────────────────────────────
stores_gdf.to_file("output/stores.geojson", driver="GeoJSON")
stores_gdf.to_file("output/stores.gpkg", driver="GPKG")
stores_gdf.to_file("output/stores/stores.shp") # Shapefile (multiple files)
# Write to Parquet (fast, preserves CRS)
stores_gdf.to_parquet("output/stores.parquet")
stores_loaded = gpd.read_parquet("output/stores.parquet")Spatial Joins: The Core Geospatial Operation
The spatial join is the most important geospatial operation for data science. It answers: “Which polygon does each point fall within?” — assigning geographic context (country, state, zip code, census tract) to individual records.
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point
import numpy as np
# Synthetic customer data with locations
np.random.seed(42)
n_customers = 500
# Random customers scattered around Austin metro area
customers_df = pd.DataFrame({
"customer_id": [f"CUST_{i:04d}" for i in range(n_customers)],
"longitude": np.random.uniform(-98.0, -97.4, n_customers),
"latitude": np.random.uniform(30.1, 30.7, n_customers),
"total_spend": np.random.lognormal(mean=5.5, sigma=1.2, size=n_customers)
})
customers_gdf = gpd.GeoDataFrame(
customers_df,
geometry=gpd.points_from_xy(customers_df["longitude"], customers_df["latitude"]),
crs="EPSG:4326"
)
# Load zip code polygons for the Austin area
# (In practice, download from US Census TIGER or similar)
# For this example, we'll create simplified synthetic zip code areas
zip_codes = gpd.GeoDataFrame({
"zip_code": ["78701", "78702", "78703", "78704", "78745", "78750"],
"city_name": ["Austin Downtown", "East Austin", "West Austin",
"South Austin", "South Congress", "Northwest Austin"],
"median_income": [65000, 48000, 92000, 71000, 58000, 88000]
}, geometry=[
Point(-97.74, 30.27).buffer(0.05), # Rough city centers as circles
Point(-97.72, 30.26).buffer(0.05),
Point(-97.77, 30.28).buffer(0.05),
Point(-97.75, 30.24).buffer(0.05),
Point(-97.77, 30.22).buffer(0.05),
Point(-97.77, 30.42).buffer(0.05),
], crs="EPSG:4326")
# ── The spatial join ──────────────────────────────────────────────
# Assign each customer to a zip code
customers_with_zip = gpd.sjoin(
customers_gdf, # Left: points (customers)
zip_codes, # Right: polygons (zip codes)
how="left", # Keep all customers, even those outside any zip
predicate="within" # Spatial relationship: point is within polygon
)
print(f"Customers before join: {len(customers_gdf)}")
print(f"Customers after join: {len(customers_with_zip)}")
print(customers_with_zip[["customer_id", "total_spend", "zip_code", "city_name"]].head(10))
# Analyze: spending by zip code
spending_by_zip = (
customers_with_zip
.groupby("zip_code")
.agg(
n_customers=("customer_id", "count"),
total_revenue=("total_spend", "sum"),
avg_spend=("total_spend", "mean")
)
.round(2)
.sort_values("total_revenue", ascending=False)
.reset_index()
)
spending_by_zip = spending_by_zip.merge(
zip_codes[["zip_code", "city_name", "median_income"]],
on="zip_code"
)
print("\nRevenue by zip code:")
print(spending_by_zip.to_string(index=False))
# Spatial join predicates available:
# "within": point is within polygon
# "contains": geometry contains the other
# "intersects": geometries share any space
# "touches": geometries share boundary but not interior
# "crosses": geometry crosses another
# "overlaps": geometries overlap but neither contains the otherNearest Neighbor Join
Find the closest feature (e.g., the nearest store to each customer):
import geopandas as gpd
import numpy as np
# Find nearest store to each customer
customers_with_nearest_store = gpd.sjoin_nearest(
customers_gdf.to_crs("EPSG:32614"), # Project to meters first!
stores_gdf.to_crs("EPSG:32614"),
how="left",
distance_col="distance_to_store_m" # Column to store the distance
)
customers_with_nearest_store["distance_to_store_km"] = (
customers_with_nearest_store["distance_to_store_m"] / 1000
)
print("\nNearest store analysis:")
print(customers_with_nearest_store[
["customer_id", "store_name", "distance_to_store_km", "total_spend"]
].head(10).round(2))
# Average distance to nearest store
avg_dist = customers_with_nearest_store["distance_to_store_km"].mean()
print(f"\nAverage distance to nearest store: {avg_dist:.2f} km")
# Customers more than 20km from any store
far_customers = customers_with_nearest_store[
customers_with_nearest_store["distance_to_store_km"] > 20
]
print(f"Customers > 20km from nearest store: {len(far_customers):,}")Distance Calculations
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point
import numpy as np
def calculate_distances_km(
gdf_points: gpd.GeoDataFrame,
reference_point: Point,
crs_meters: str = "EPSG:32614"
) -> pd.Series:
"""
Calculate distance in kilometers from each point to a reference point.
Parameters
----------
gdf_points : GeoDataFrame
Points to measure from (in any CRS).
reference_point : Point
Reference point (must be in the same CRS as gdf_points).
crs_meters : str
A projected CRS in meters for accurate distance calculation.
Returns
-------
pd.Series
Distance in kilometers for each row.
"""
# Project to meters for accurate distances
gdf_proj = gdf_points.to_crs(crs_meters)
# Project the reference point to the same CRS
ref_gdf = gpd.GeoDataFrame(
geometry=[reference_point], crs=gdf_points.crs
).to_crs(crs_meters)
ref_proj = ref_gdf.geometry.iloc[0]
# Calculate distances in meters, convert to km
distances_km = gdf_proj.geometry.distance(ref_proj) / 1000
return distances_km
# Distance from each store to the Austin city center
austin_center = Point(-97.7431, 30.2672)
stores_gdf["dist_to_center_km"] = calculate_distances_km(
stores_gdf, austin_center
)
print("Stores by distance to Austin center:")
print(stores_gdf[["store_name", "dist_to_center_km", "revenue_2024"]]
.sort_values("dist_to_center_km").round(2).to_string(index=False))
def haversine_distance_km(
lat1: float, lon1: float,
lat2: float, lon2: float
) -> float:
"""
Calculate the great-circle distance between two points
using the Haversine formula.
More accurate than Euclidean distance for lat/lon coordinates.
Returns distance in kilometers.
"""
R = 6371 # Earth's radius in km
lat1_r, lat2_r = np.radians(lat1), np.radians(lat2)
dlat = np.radians(lat2 - lat1)
dlon = np.radians(lon2 - lon1)
a = np.sin(dlat/2)**2 + np.cos(lat1_r) * np.cos(lat2_r) * np.sin(dlon/2)**2
c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
return R * c
# Vectorized haversine for pandas DataFrames
def haversine_series(
lat1: pd.Series, lon1: pd.Series,
lat2: float, lon2: float
) -> pd.Series:
"""Vectorized haversine for a column of coordinates vs. a single point."""
R = 6371
lat1_r = np.radians(lat1)
lat2_r = np.radians(lat2)
dlat = np.radians(lat2 - lat1)
dlon = np.radians(lon2 - lon1)
a = np.sin(dlat/2)**2 + np.cos(lat1_r) * np.cos(lat2_r) * np.sin(dlon/2)**2
c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
return R * c
# Find customers within 10 km of downtown Austin
customers_df["dist_to_downtown_km"] = haversine_series(
customers_df["latitude"], customers_df["longitude"],
30.2672, -97.7431 # Austin downtown coordinates
)
downtown_customers = customers_df[customers_df["dist_to_downtown_km"] <= 10]
print(f"\nCustomers within 10km of downtown Austin: {len(downtown_customers):,}")
print(f"Their average spend: ${downtown_customers['total_spend'].mean():.2f}")Spatial Aggregation: Choropleth Analysis
Aggregating point data to polygon areas is a fundamental geospatial pattern:
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors
def spatial_aggregation(
points_gdf: gpd.GeoDataFrame,
polygons_gdf: gpd.GeoDataFrame,
polygon_id_col: str,
agg_col: str,
agg_func: str = "sum"
) -> gpd.GeoDataFrame:
"""
Aggregate point values by containing polygon.
Classic pattern: count/sum customers per zip code,
sales per census tract, events per neighborhood.
"""
# Spatial join: assign polygon to each point
joined = gpd.sjoin(
points_gdf,
polygons_gdf[[polygon_id_col, "geometry"]],
how="left",
predicate="within"
)
# Aggregate
agg_result = (
joined
.groupby(polygon_id_col)[agg_col]
.agg(agg_func)
.reset_index()
.rename(columns={agg_col: f"{agg_col}_{agg_func}"})
)
# Join aggregated values back to polygon GeoDataFrame
result_gdf = polygons_gdf.merge(agg_result, on=polygon_id_col, how="left")
result_gdf[f"{agg_col}_{agg_func}"] = (
result_gdf[f"{agg_col}_{agg_func}"].fillna(0)
)
return result_gdf
# Aggregate customer revenue by zip code and visualize
revenue_by_zip_gdf = spatial_aggregation(
customers_gdf, zip_codes, "zip_code", "total_spend"
)
# Plot a choropleth map
fig, ax = plt.subplots(figsize=(10, 8))
revenue_by_zip_gdf.plot(
column="total_spend_sum",
ax=ax,
legend=True,
legend_kwds={"label": "Total Revenue ($)", "shrink": 0.6},
cmap="YlOrRd", # Yellow → Orange → Red color scheme
edgecolor="white",
linewidth=0.5
)
# Plot store locations on top
stores_gdf.plot(
ax=ax,
color="blue",
markersize=100,
marker="*",
label="Stores",
zorder=5
)
ax.set_title("Customer Revenue by Zip Code — Austin Metro", fontsize=14, fontweight="bold")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.legend(fontsize=10)
ax.axis("equal")
plt.tight_layout()
plt.savefig("output/choropleth_revenue.png", dpi=150, bbox_inches="tight")
plt.show()
print("Choropleth map saved.")Building Interactive Maps with Folium
Folium creates interactive HTML maps using Leaflet.js — perfect for exploring geospatial data and sharing results with stakeholders:
import folium
from folium.plugins import MarkerCluster, HeatMap
import geopandas as gpd
import json
def create_interactive_map(
center_lat: float = 30.35,
center_lon: float = -97.75,
zoom: int = 10
) -> folium.Map:
"""Create a base interactive map centered on Austin."""
return folium.Map(
location=[center_lat, center_lon],
zoom_start=zoom,
tiles="CartoDB positron" # Clean, minimal basemap
)
m = create_interactive_map()
# ── Add store markers ─────────────────────────────────────────────
for _, store in stores_gdf.iterrows():
folium.CircleMarker(
location=[store.geometry.y, store.geometry.x],
radius=12,
color="navy",
fill=True,
fill_color="royalblue",
fill_opacity=0.8,
popup=folium.Popup(
f"<b>{store['store_name']}</b><br>"
f"Revenue: ${store['revenue_2024']:,.0f}",
max_width=200
),
tooltip=store["store_name"]
).add_to(m)
# ── Add customer heatmap ──────────────────────────────────────────
heat_data = [[row["latitude"], row["longitude"]]
for _, row in customers_df.iterrows()]
HeatMap(
heat_data,
radius=15,
blur=20,
gradient={0.2: "blue", 0.5: "lime", 0.8: "orange", 1.0: "red"},
name="Customer Density"
).add_to(m)
# ── Add zip code polygons ─────────────────────────────────────────
# Convert revenue data to dict for styling
revenue_dict = revenue_by_zip_gdf.set_index("zip_code")["total_spend_sum"].to_dict()
max_rev = max(revenue_dict.values()) if revenue_dict else 1
def style_zip(feature):
"""Color zip codes by revenue intensity."""
zip_code = feature["properties"].get("zip_code", "")
revenue = revenue_dict.get(zip_code, 0)
intensity = revenue / max_rev
# Interpolate color: light yellow → deep red
r = int(255)
g = int(255 * (1 - intensity * 0.8))
b = int(100 * (1 - intensity))
color = f"#{r:02x}{g:02x}{b:02x}"
return {
"fillColor": color,
"fillOpacity": 0.4,
"color": "white",
"weight": 1.5
}
folium.GeoJson(
zip_codes.__geo_interface__,
style_function=style_zip,
tooltip=folium.GeoJsonTooltip(
fields=["zip_code", "city_name"],
aliases=["Zip Code:", "Neighborhood:"]
),
name="Zip Codes"
).add_to(m)
# ── Add layer control ─────────────────────────────────────────────
folium.LayerControl().add_to(m)
# Save the interactive map
m.save("output/austin_stores_map.html")
print("Interactive map saved: output/austin_stores_map.html")Common Geospatial Data Science Patterns
Pattern 1: Catchment Area Analysis
import geopandas as gpd
import pandas as pd
def catchment_area_analysis(
stores_gdf: gpd.GeoDataFrame,
customers_gdf: gpd.GeoDataFrame,
radius_km: float = 10.0,
projected_crs: str = "EPSG:32614"
) -> pd.DataFrame:
"""
For each store, find all customers within a radius and compute metrics.
Parameters
----------
stores_gdf : GeoDataFrame
Store locations (Points).
customers_gdf : GeoDataFrame
Customer locations with spend data (Points).
radius_km : float
Catchment radius in kilometers.
projected_crs : str
Metric CRS for accurate distance calculations.
Returns
-------
pd.DataFrame
One row per store with catchment metrics.
"""
stores_proj = stores_gdf.to_crs(projected_crs)
customers_proj = customers_gdf.to_crs(projected_crs)
radius_m = radius_km * 1000
results = []
for _, store in stores_proj.iterrows():
# Create a circular catchment buffer
catchment = store.geometry.buffer(radius_m)
# Find customers within the catchment
in_catchment = customers_proj[
customers_proj.geometry.within(catchment)
]
results.append({
"store_id": store.get("store_id", ""),
"store_name": store.get("store_name", ""),
"catchment_radius_km": radius_km,
"n_customers_in_catchment": len(in_catchment),
"total_catchment_spend": in_catchment["total_spend"].sum()
if "total_spend" in in_catchment.columns else 0,
"avg_catchment_spend": in_catchment["total_spend"].mean()
if "total_spend" in in_catchment.columns and len(in_catchment) > 0 else 0,
})
return pd.DataFrame(results).round(2)
catchment_report = catchment_area_analysis(
stores_gdf, customers_gdf, radius_km=15.0
)
print("Catchment Area Analysis (15km radius):")
print(catchment_report.sort_values("total_catchment_spend", ascending=False)
.to_string(index=False))Pattern 2: Spatial Clustering
from sklearn.cluster import DBSCAN
import numpy as np
import geopandas as gpd
import pandas as pd
def spatial_cluster_customers(
customers_gdf: gpd.GeoDataFrame,
eps_km: float = 2.0,
min_samples: int = 5
) -> gpd.GeoDataFrame:
"""
Cluster customers by geographic proximity using DBSCAN.
DBSCAN is preferred over K-Means for spatial clustering because
it discovers clusters of arbitrary shape and handles noise (outliers).
Parameters
----------
customers_gdf : GeoDataFrame
Customer points with spend data.
eps_km : float
Maximum distance between neighbors (km) to be in the same cluster.
min_samples : int
Minimum customers needed to form a cluster.
Returns
-------
GeoDataFrame
Input GeoDataFrame with cluster_id column added (-1 = noise/outlier).
"""
# Project to meters for accurate distance
gdf_proj = customers_gdf.to_crs("EPSG:32614")
# Extract coordinates as numpy array
coords = np.column_stack([
gdf_proj.geometry.x,
gdf_proj.geometry.y
])
# DBSCAN with epsilon in meters
dbscan = DBSCAN(
eps=eps_km * 1000, # Convert km to meters
min_samples=min_samples,
metric="euclidean",
algorithm="ball_tree",
n_jobs=-1
)
clusters = dbscan.fit_predict(coords)
result = customers_gdf.copy()
result["cluster_id"] = clusters
n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
n_noise = (clusters == -1).sum()
print(f"Spatial clustering: {n_clusters} clusters, {n_noise} noise points")
return result
clustered = spatial_cluster_customers(
customers_gdf, eps_km=2.0, min_samples=8
)
# Cluster summary
cluster_summary = (
clustered[clustered["cluster_id"] >= 0]
.groupby("cluster_id")
.agg(
n_customers = ("customer_id", "count"),
total_spend = ("total_spend", "sum"),
avg_spend = ("total_spend", "mean"),
centroid_lat = ("latitude", "mean"),
centroid_lon = ("longitude", "mean")
)
.round(2)
.sort_values("total_spend", ascending=False)
.reset_index()
)
print("\nCluster Summary:")
print(cluster_summary.head(10).to_string(index=False))Pattern 3: Point-in-Polygon with Large Datasets
import geopandas as gpd
import pandas as pd
import numpy as np
def efficient_point_in_polygon(
points_df: pd.DataFrame,
polygons_gdf: gpd.GeoDataFrame,
lat_col: str = "latitude",
lon_col: str = "longitude",
polygon_id_col: str = "polygon_id",
chunk_size: int = 100_000
) -> pd.Series:
"""
Efficiently assign polygon IDs to a large DataFrame of points.
Processes in chunks to avoid memory issues.
Returns a Series of polygon IDs (NaN for points outside all polygons).
"""
results = []
n_chunks = (len(points_df) + chunk_size - 1) // chunk_size
for i in range(n_chunks):
chunk = points_df.iloc[i * chunk_size : (i + 1) * chunk_size]
chunk_gdf = gpd.GeoDataFrame(
chunk,
geometry=gpd.points_from_xy(chunk[lon_col], chunk[lat_col]),
crs="EPSG:4326"
)
joined = gpd.sjoin(
chunk_gdf[["geometry"]],
polygons_gdf[[polygon_id_col, "geometry"]],
how="left",
predicate="within"
)
results.append(joined[polygon_id_col])
if i % 5 == 0:
print(f" Processed chunk {i+1}/{n_chunks} "
f"({(i+1)*chunk_size:,} / {len(points_df):,} rows)")
return pd.concat(results)
# Assign zip codes to 500K customers efficiently
# customers_large_df["zip_code"] = efficient_point_in_polygon(
# customers_large_df, zip_codes, polygon_id_col="zip_code"
# )Summary
Geospatial data science opens the spatial dimension for any analysis — it’s the difference between knowing that sales are high in “Austin” and knowing they’re concentrated in a 2-mile radius around the university, or that a proposed new store location has 45,000 potential customers within a 10km drive.
The foundational concepts — geometry types (Point, LineString, Polygon), coordinate reference systems (WGS84 for lat/lon, UTM for accurate measurements), and the critical rule that distance and area calculations must use a projected CRS in meters — are the prerequisite for everything else. The core Python stack is Shapely for geometry objects, GeoPandas for spatial DataFrames, and Folium or Matplotlib for maps.
The most important single operation is the spatial join (gpd.sjoin()): assigning geographic context (zip code, census tract, sales territory) to point data (customers, events, sensors). This single operation unlocks most geospatial data science work — once every record has geographic context, standard pandas groupby, aggregation, and visualization take over.
Key Takeaways
- Geospatial data uses three geometry types: Point (single location), LineString (path), and Polygon (area) — corresponding to Shapely’s
Point,LineString, andPolygonclasses in Python - Coordinate Reference Systems are the most important concept to get right: WGS84 (EPSG:4326) stores coordinates as latitude/longitude degrees and is used for GPS and web maps; projected CRS like UTM zones store coordinates in meters and are required for accurate distance and area calculations
- Always reproject to a metric CRS before calculating distances or areas —
gdf.to_crs("EPSG:32614")converts to UTM Zone 14N (Texas) in meters;geometry.distance(other)then returns meters, not degrees gpd.sjoin()is the core geospatial operation for data science — it performs a spatial join that assigns polygon attributes (zip code, census tract, city) to point data (customers, events) based on spatial relationships rather than a shared keygpd.sjoin_nearest()finds the nearest feature with an optionaldistance_colparameter that returns the distance in the GeoDataFrame’s units — always use a metric CRS before calling this- Folium creates interactive HTML maps with markers, heatmaps, choropleths, and layer controls in pure Python — ideal for sharing exploratory geospatial findings with stakeholders
- For large-scale spatial joins (millions of points), process in chunks with
efficient_point_in_polygon()to avoid memory errors — GeoPandas uses an R-tree spatial index internally to make joins fast - The standard geospatial data science workflow: load points (customers, events) → load polygons (zip codes, census tracts) → spatial join to assign context → aggregate by polygon → visualize with choropleth








