Working with Geospatial Data in Python

Learn to work with geospatial data in Python. Master coordinate systems, Shapely geometries, GeoPandas, spatial joins, choropleth maps, distance calculations, and common geospatial analysis patterns.

Working with Geospatial Data in Python

Geospatial data represents information tied to locations on Earth — points (a store location), lines (a road), or polygons (a country boundary). In Python, Shapely provides the geometry objects (Point, LineString, Polygon), GeoPandas extends pandas DataFrames to hold these geometries alongside regular data, and Folium or Matplotlib with contextily creates maps. The most common geospatial data science task is the spatial join: finding which polygon (country, city, zip code, census tract) each point (customer, event, sensor) falls within — using gpd.sjoin() to merge datasets by spatial relationship rather than a shared key column.

Introduction

Location matters. A restaurant’s success depends on foot traffic from nearby offices and apartments. A retailer’s sales correlate with neighborhood income and competitors within five miles. An epidemiologist tracking disease spread needs to know which census tracts have outbreaks and which hospitals are within driving distance. A logistics company routing deliveries must minimize total distance traveled across hundreds of stops.

All of these are geospatial problems — problems where the spatial dimension of data is analytically important. Geospatial analysis is a specialized but increasingly central part of data science, and Python has a mature ecosystem for it. You don’t need to become a geographic information systems (GIS) expert to do productive geospatial data science, but you do need to understand a handful of foundational concepts — coordinate systems, geometry types, and spatial operations — before the code makes sense.

This article teaches you working with geospatial data from the ground up: what geospatial data is, the key Python libraries, coordinate reference systems, reading and writing geospatial formats, creating and manipulating geometries, performing spatial joins and operations, calculating distances, and building maps. By the end you’ll be equipped to add the spatial dimension to any data science project.

Foundational Concepts

Geometry Types

Geospatial data uses three fundamental geometry types, each representing a different kind of geographic feature:

Point: A single location with longitude and latitude coordinates. Used for stores, customers, events, sensor locations, cities.

LineString: An ordered sequence of points forming a line. Used for roads, rivers, flight paths, trade routes.

Polygon: A closed shape defined by a ring of coordinates. Used for country/state/city boundaries, zip code areas, census tracts, buildings, lakes.

Multi-variants: MultiPoint, MultiLineString, MultiPolygon for features that consist of multiple disconnected parts (e.g., Hawaii is a MultiPolygon — multiple islands that together form one US state).

Coordinate Reference Systems (CRS)

A Coordinate Reference System (CRS) defines how the coordinates in your data map to actual locations on Earth. This is the most important and most frequently misunderstood concept in geospatial work.

Geographic CRS: Coordinates expressed as longitude and latitude in degrees. The most common is WGS84 / EPSG:4326 — used by GPS, Google Maps, and virtually all web mapping services. Coordinates look like (-97.7431, 30.2672) for Austin, Texas.

Projected CRS: Coordinates expressed in meters (or feet) on a flat surface, after projecting the spherical Earth onto a 2D plane. Different projections are optimized for different parts of the world. EPSG:3857 (Web Mercator) is used for web maps. UTM (Universal Transverse Mercator) zones provide accurate measurements within local regions.

Why this matters: Distance calculations and area calculations must be done in a projected CRS (in meters), not a geographic CRS (in degrees). One degree of longitude in Austin is approximately 96 km, but one degree of longitude near the poles is only a few km. If you calculate distances in degrees, your results will be wildly wrong at non-equatorial latitudes.

Plaintext
# The CRS workflow:
# 1. Load data (often in WGS84 / EPSG:4326)
# 2. Reproject to a local projected CRS for accurate measurements
# 3. Do distance/area calculations
# 4. Reproject back to WGS84 for mapping if needed

The EPSG Code System

EPSG codes are standardized identifiers for coordinate reference systems:

EPSG CodeNameUsed For
4326WGS84 (lat/lon degrees)GPS, global datasets, web APIs
3857Web MercatorWeb maps (Google Maps, OpenStreetMap tiles)
32614UTM Zone 14NCentral USA (accurate measurements)
27700British National GridUK datasets
4269NAD83US federal datasets

Installing the Libraries

Python
pip install geopandas shapely folium contextily pyproj fiona
  • geopandas: Spatial extension of pandas — the primary workhorse
  • shapely: Geometry objects and spatial operations
  • folium: Interactive web maps using Leaflet.js
  • contextily: Adds basemap tiles to matplotlib maps
  • pyproj: CRS transformation
  • fiona: Reads/writes vector geospatial formats (used by GeoPandas)

Shapely: Geometry Objects in Python

Shapely provides the fundamental geometry objects. Before using GeoPandas, understanding Shapely geometries is essential.

Python
from shapely.geometry import Point, LineString, Polygon, MultiPoint, MultiPolygon
from shapely import wkt, wkb
import numpy as np

# ── Points ────────────────────────────────────────────────────────
# Point(longitude, latitude) — NOTE: x (longitude) comes FIRST
austin     = Point(-97.7431, 30.2672)
new_york   = Point(-74.0060, 40.7128)
london     = Point(-0.1276,  51.5074)
tokyo      = Point(139.6917, 35.6895)

print(f"Austin: {austin}")        # POINT (-97.7431 30.2672)
print(f"Type: {austin.geom_type}")  # Point
print(f"X (lon): {austin.x}")    # -97.7431
print(f"Y (lat): {austin.y}")    # 30.2672
print(f"Coords: {list(austin.coords)}")  # [(-97.7431, 30.2672)]

# ── LineStrings ───────────────────────────────────────────────────
route = LineString([
    (-97.7431, 30.2672),   # Austin
    (-95.3698, 29.7604),   # Houston
    (-90.0715, 29.9511),   # New Orleans
    (-84.3880, 33.7490),   # Atlanta
])

print(f"\nRoute length (degrees): {route.length:.4f}")
# WARNING: This is length in DEGREES, not km!
# For km, use a projected CRS (see below)

print(f"Start point: {route.coords[0]}")
print(f"Number of vertices: {len(route.coords)}")
print(f"Bounding box: {route.bounds}")  # (min_x, min_y, max_x, max_y)

# ── Polygons ──────────────────────────────────────────────────────
# Define the exterior ring (must close — first = last point)
texas_simplified = Polygon([
    (-106.65, 31.90),  # El Paso area
    (-100.00, 28.00),  # South
    (-97.00,  26.00),  # South tip
    (-93.50,  29.75),  # East
    (-94.00,  33.55),  # Northeast
    (-100.00, 36.50),  # Panhandle
    (-103.00, 36.50),  # West panhandle
    (-106.65, 31.90),  # Back to start
])

print(f"\nTexas (simplified):")
print(f"  Area (sq degrees): {texas_simplified.area:.4f}")
print(f"  Perimeter (degrees): {texas_simplified.length:.4f}")
print(f"  Centroid: {texas_simplified.centroid}")
print(f"  Bounds: {texas_simplified.bounds}")

# Polygon with a hole (ring) — exterior + list of interior rings
polygon_with_hole = Polygon(
    exterior=[(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)],
    holes=[   [(3, 3), (7, 3), (7, 7),   (3, 7),  (3, 3)]]
)
print(f"\nPolygon area with hole: {polygon_with_hole.area}")  # 100 - 16 = 84

Spatial Relationships and Operations

Python
from shapely.geometry import Point, Polygon, LineString
from shapely.ops import unary_union

# Define some geometries
store_a   = Point(0, 0)
store_b   = Point(5, 0)
store_c   = Point(10, 0)

city_area = Polygon([(-2, -2), (8, -2), (8, 8), (-2, 8), (-2, -2)])
suburb    = Polygon([(6, -3), (12, -3), (12, 5), (6, 5), (6, -3)])

# ── Spatial predicates (return True/False) ────────────────────────
print("Spatial relationships:")
print(f"  store_a within city_area: {store_a.within(city_area)}")      # True
print(f"  store_b within city_area: {store_b.within(city_area)}")      # True
print(f"  store_c within city_area: {store_c.within(city_area)}")      # False
print(f"  store_a within suburb:    {store_a.within(suburb)}")          # False
print(f"  city_area intersects suburb: {city_area.intersects(suburb)}") # True
print(f"  city_area contains store_a:  {city_area.contains(store_a)}")  # True
print(f"  city_area touches suburb:    {city_area.touches(suburb)}")    # False

# ── Distance ──────────────────────────────────────────────────────
dist_ab = store_a.distance(store_b)
dist_ac = store_a.distance(store_c)
print(f"\nDistance A→B (degrees): {dist_ab}")   # 5.0
print(f"Distance A→C (degrees): {dist_ac}")   # 10.0
# In a projected CRS (meters), these would be in kilometers

# ── Geometric operations ──────────────────────────────────────────
# Buffer: expand a geometry by a distance
store_catchment = store_a.buffer(3)   # Circle of radius 3 around store_a
print(f"\nBuffer type: {store_catchment.geom_type}")  # Polygon
print(f"Buffer area: {store_catchment.area:.4f}")    # π × 3² ≈ 28.27

# Union: combine two polygons
combined_area = city_area.union(suburb)
print(f"\nUnion type: {combined_area.geom_type}")    # Polygon or MultiPolygon

# Intersection: overlapping portion
overlap = city_area.intersection(suburb)
print(f"Intersection area: {overlap.area:.4f}")      # The overlapping region

# Difference: city_area minus the suburb overlap
city_only = city_area.difference(suburb)
print(f"City-only area: {city_only.area:.4f}")

# Convex hull: smallest convex polygon containing all points
points = [Point(1, 1), Point(3, 5), Point(7, 2), Point(5, 8), Point(2, 6)]
from shapely.geometry import MultiPoint
hull = MultiPoint(points).convex_hull
print(f"\nConvex hull type: {hull.geom_type}")       # Polygon

GeoPandas: Spatial DataFrames

GeoPandas extends pandas DataFrames with a geometry column that holds Shapely geometry objects. Every GeoDataFrame knows its CRS and supports spatial operations applied to the entire column at once.

Creating GeoDataFrames

Python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point
import numpy as np

# ── From a pandas DataFrame with lat/lon columns ──────────────────
stores_df = pd.DataFrame({
    "store_id":    ["S001", "S002", "S003", "S004", "S005"],
    "store_name":  ["Downtown Austin", "North Austin", "Round Rock",
                    "Cedar Park", "Pflugerville"],
    "latitude":    [30.2672, 30.4012, 30.5085, 30.5230, 30.4394],
    "longitude":   [-97.7431, -97.7207, -97.6784, -97.7999, -97.6201],
    "revenue_2024":[1250000, 870000, 1050000, 920000, 680000]
})

# Convert to GeoDataFrame — specify geometry and CRS
stores_gdf = gpd.GeoDataFrame(
    stores_df,
    geometry=gpd.points_from_xy(stores_df["longitude"], stores_df["latitude"]),
    crs="EPSG:4326"   # WGS84 — the standard for lat/lon GPS coordinates
)

print(type(stores_gdf))           # <class 'geopandas.geodataframe.GeoDataFrame'>
print(stores_gdf.crs)             # EPSG:4326
print(stores_gdf.geometry.dtype)  # geometry
print(stores_gdf.head())

# ── From a file (Shapefile, GeoJSON, GeoPackage) ──────────────────
# Load a US states shapefile (from Natural Earth or US Census TIGER)
# states_gdf = gpd.read_file("data/us_states/cb_2023_us_state_20m.shp")

# Load from GeoJSON URL directly
countries_gdf = gpd.read_file(
    "https://raw.githubusercontent.com/datasets/geo-countries/master/data/countries.geojson"
)
print(f"\nCountries: {len(countries_gdf)} features")
print(countries_gdf[["ADMIN", "ISO_A3", "geometry"]].head())
print(f"CRS: {countries_gdf.crs}")

# ── From GeoPandas built-in datasets (for practice) ───────────────
world = gpd.read_file(gpd.datasets.get_path("naturalearth_lowres"))
print(f"\nWorld dataset columns: {world.columns.tolist()}")
print(f"CRS: {world.crs}")

CRS Transformation

Python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Stores in WGS84 (lat/lon degrees)
stores_wgs84 = gpd.GeoDataFrame(
    {"store_id": ["S001", "S002", "S003"],
     "revenue":  [1250000, 870000, 1050000]},
    geometry=[
        Point(-97.7431, 30.2672),
        Point(-97.7207, 30.4012),
        Point(-97.6784, 30.5085)
    ],
    crs="EPSG:4326"
)

# Project to UTM Zone 14N (meters) for distance calculations in Texas
# EPSG:32614 = WGS84 / UTM Zone 14N
stores_utm = stores_wgs84.to_crs("EPSG:32614")

print("WGS84 coordinates (degrees):")
print(stores_wgs84.geometry.iloc[0])  # POINT (-97.7431 30.2672)

print("\nUTM Zone 14N coordinates (meters):")
print(stores_utm.geometry.iloc[0])    # POINT (596453.15 3348574.82)

# NOW distances are in meters
dist_m = stores_utm.geometry.iloc[0].distance(stores_utm.geometry.iloc[1])
print(f"\nDistance S001 → S002: {dist_m/1000:.2f} km")   # ~15.3 km

# Don't forget to reproject back for mapping:
stores_back = stores_utm.to_crs("EPSG:4326")

Reading and Writing Geospatial Files

Python
import geopandas as gpd

# ── Reading formats ───────────────────────────────────────────────
# Shapefile (collection of .shp, .dbf, .shx files)
states  = gpd.read_file("data/cb_2023_us_state_20m.shp")

# GeoJSON
regions = gpd.read_file("data/regions.geojson")

# GeoPackage (preferred modern format — single file, supports multiple layers)
parcels = gpd.read_file("data/land_parcels.gpkg")
parcels = gpd.read_file("data/land_parcels.gpkg", layer="residential")  # Specific layer

# CSV with lat/lon
import pandas as pd
df = pd.read_csv("data/customers_with_coordinates.csv")
customers_gdf = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df["longitude"], df["latitude"]),
    crs="EPSG:4326"
)

# From PostGIS database
# from sqlalchemy import create_engine
# engine = create_engine("postgresql://user:pass@host/db")
# gdf = gpd.read_postgis("SELECT * FROM stores", engine, geom_col="location")

# ── Writing formats ───────────────────────────────────────────────
stores_gdf.to_file("output/stores.geojson", driver="GeoJSON")
stores_gdf.to_file("output/stores.gpkg", driver="GPKG")
stores_gdf.to_file("output/stores/stores.shp")   # Shapefile (multiple files)

# Write to Parquet (fast, preserves CRS)
stores_gdf.to_parquet("output/stores.parquet")
stores_loaded = gpd.read_parquet("output/stores.parquet")

Spatial Joins: The Core Geospatial Operation

The spatial join is the most important geospatial operation for data science. It answers: “Which polygon does each point fall within?” — assigning geographic context (country, state, zip code, census tract) to individual records.

Python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point
import numpy as np

# Synthetic customer data with locations
np.random.seed(42)
n_customers = 500

# Random customers scattered around Austin metro area
customers_df = pd.DataFrame({
    "customer_id": [f"CUST_{i:04d}" for i in range(n_customers)],
    "longitude":   np.random.uniform(-98.0, -97.4, n_customers),
    "latitude":    np.random.uniform(30.1, 30.7, n_customers),
    "total_spend": np.random.lognormal(mean=5.5, sigma=1.2, size=n_customers)
})

customers_gdf = gpd.GeoDataFrame(
    customers_df,
    geometry=gpd.points_from_xy(customers_df["longitude"], customers_df["latitude"]),
    crs="EPSG:4326"
)

# Load zip code polygons for the Austin area
# (In practice, download from US Census TIGER or similar)
# For this example, we'll create simplified synthetic zip code areas
zip_codes = gpd.GeoDataFrame({
    "zip_code":  ["78701", "78702", "78703", "78704", "78745", "78750"],
    "city_name": ["Austin Downtown", "East Austin", "West Austin",
                   "South Austin", "South Congress", "Northwest Austin"],
    "median_income": [65000, 48000, 92000, 71000, 58000, 88000]
}, geometry=[
    Point(-97.74, 30.27).buffer(0.05),   # Rough city centers as circles
    Point(-97.72, 30.26).buffer(0.05),
    Point(-97.77, 30.28).buffer(0.05),
    Point(-97.75, 30.24).buffer(0.05),
    Point(-97.77, 30.22).buffer(0.05),
    Point(-97.77, 30.42).buffer(0.05),
], crs="EPSG:4326")

# ── The spatial join ──────────────────────────────────────────────
# Assign each customer to a zip code
customers_with_zip = gpd.sjoin(
    customers_gdf,          # Left: points (customers)
    zip_codes,              # Right: polygons (zip codes)
    how="left",             # Keep all customers, even those outside any zip
    predicate="within"      # Spatial relationship: point is within polygon
)

print(f"Customers before join: {len(customers_gdf)}")
print(f"Customers after join:  {len(customers_with_zip)}")
print(customers_with_zip[["customer_id", "total_spend", "zip_code", "city_name"]].head(10))

# Analyze: spending by zip code
spending_by_zip = (
    customers_with_zip
    .groupby("zip_code")
    .agg(
        n_customers=("customer_id", "count"),
        total_revenue=("total_spend", "sum"),
        avg_spend=("total_spend", "mean")
    )
    .round(2)
    .sort_values("total_revenue", ascending=False)
    .reset_index()
)

spending_by_zip = spending_by_zip.merge(
    zip_codes[["zip_code", "city_name", "median_income"]],
    on="zip_code"
)
print("\nRevenue by zip code:")
print(spending_by_zip.to_string(index=False))

# Spatial join predicates available:
# "within":     point is within polygon
# "contains":   geometry contains the other
# "intersects": geometries share any space
# "touches":    geometries share boundary but not interior
# "crosses":    geometry crosses another
# "overlaps":   geometries overlap but neither contains the other

Nearest Neighbor Join

Find the closest feature (e.g., the nearest store to each customer):

Python
import geopandas as gpd
import numpy as np

# Find nearest store to each customer
customers_with_nearest_store = gpd.sjoin_nearest(
    customers_gdf.to_crs("EPSG:32614"),  # Project to meters first!
    stores_gdf.to_crs("EPSG:32614"),
    how="left",
    distance_col="distance_to_store_m"   # Column to store the distance
)

customers_with_nearest_store["distance_to_store_km"] = (
    customers_with_nearest_store["distance_to_store_m"] / 1000
)

print("\nNearest store analysis:")
print(customers_with_nearest_store[
    ["customer_id", "store_name", "distance_to_store_km", "total_spend"]
].head(10).round(2))

# Average distance to nearest store
avg_dist = customers_with_nearest_store["distance_to_store_km"].mean()
print(f"\nAverage distance to nearest store: {avg_dist:.2f} km")

# Customers more than 20km from any store
far_customers = customers_with_nearest_store[
    customers_with_nearest_store["distance_to_store_km"] > 20
]
print(f"Customers > 20km from nearest store: {len(far_customers):,}")

Distance Calculations

Python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point
import numpy as np

def calculate_distances_km(
    gdf_points: gpd.GeoDataFrame,
    reference_point: Point,
    crs_meters: str = "EPSG:32614"
) -> pd.Series:
    """
    Calculate distance in kilometers from each point to a reference point.

    Parameters
    ----------
    gdf_points : GeoDataFrame
        Points to measure from (in any CRS).
    reference_point : Point
        Reference point (must be in the same CRS as gdf_points).
    crs_meters : str
        A projected CRS in meters for accurate distance calculation.

    Returns
    -------
    pd.Series
        Distance in kilometers for each row.
    """
    # Project to meters for accurate distances
    gdf_proj = gdf_points.to_crs(crs_meters)

    # Project the reference point to the same CRS
    ref_gdf = gpd.GeoDataFrame(
        geometry=[reference_point], crs=gdf_points.crs
    ).to_crs(crs_meters)
    ref_proj = ref_gdf.geometry.iloc[0]

    # Calculate distances in meters, convert to km
    distances_km = gdf_proj.geometry.distance(ref_proj) / 1000
    return distances_km


# Distance from each store to the Austin city center
austin_center = Point(-97.7431, 30.2672)

stores_gdf["dist_to_center_km"] = calculate_distances_km(
    stores_gdf, austin_center
)
print("Stores by distance to Austin center:")
print(stores_gdf[["store_name", "dist_to_center_km", "revenue_2024"]]
      .sort_values("dist_to_center_km").round(2).to_string(index=False))


def haversine_distance_km(
    lat1: float, lon1: float,
    lat2: float, lon2: float
) -> float:
    """
    Calculate the great-circle distance between two points
    using the Haversine formula.

    More accurate than Euclidean distance for lat/lon coordinates.
    Returns distance in kilometers.
    """
    R = 6371  # Earth's radius in km
    lat1_r, lat2_r = np.radians(lat1), np.radians(lat2)
    dlat = np.radians(lat2 - lat1)
    dlon = np.radians(lon2 - lon1)

    a = np.sin(dlat/2)**2 + np.cos(lat1_r) * np.cos(lat2_r) * np.sin(dlon/2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
    return R * c


# Vectorized haversine for pandas DataFrames
def haversine_series(
    lat1: pd.Series, lon1: pd.Series,
    lat2: float, lon2: float
) -> pd.Series:
    """Vectorized haversine for a column of coordinates vs. a single point."""
    R = 6371
    lat1_r = np.radians(lat1)
    lat2_r = np.radians(lat2)
    dlat = np.radians(lat2 - lat1)
    dlon = np.radians(lon2 - lon1)
    a = np.sin(dlat/2)**2 + np.cos(lat1_r) * np.cos(lat2_r) * np.sin(dlon/2)**2
    c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1-a))
    return R * c


# Find customers within 10 km of downtown Austin
customers_df["dist_to_downtown_km"] = haversine_series(
    customers_df["latitude"], customers_df["longitude"],
    30.2672, -97.7431   # Austin downtown coordinates
)

downtown_customers = customers_df[customers_df["dist_to_downtown_km"] <= 10]
print(f"\nCustomers within 10km of downtown Austin: {len(downtown_customers):,}")
print(f"Their average spend: ${downtown_customers['total_spend'].mean():.2f}")

Spatial Aggregation: Choropleth Analysis

Aggregating point data to polygon areas is a fundamental geospatial pattern:

Python
import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.colors as mcolors

def spatial_aggregation(
    points_gdf: gpd.GeoDataFrame,
    polygons_gdf: gpd.GeoDataFrame,
    polygon_id_col: str,
    agg_col: str,
    agg_func: str = "sum"
) -> gpd.GeoDataFrame:
    """
    Aggregate point values by containing polygon.

    Classic pattern: count/sum customers per zip code,
    sales per census tract, events per neighborhood.
    """
    # Spatial join: assign polygon to each point
    joined = gpd.sjoin(
        points_gdf,
        polygons_gdf[[polygon_id_col, "geometry"]],
        how="left",
        predicate="within"
    )

    # Aggregate
    agg_result = (
        joined
        .groupby(polygon_id_col)[agg_col]
        .agg(agg_func)
        .reset_index()
        .rename(columns={agg_col: f"{agg_col}_{agg_func}"})
    )

    # Join aggregated values back to polygon GeoDataFrame
    result_gdf = polygons_gdf.merge(agg_result, on=polygon_id_col, how="left")
    result_gdf[f"{agg_col}_{agg_func}"] = (
        result_gdf[f"{agg_col}_{agg_func}"].fillna(0)
    )

    return result_gdf


# Aggregate customer revenue by zip code and visualize
revenue_by_zip_gdf = spatial_aggregation(
    customers_gdf, zip_codes, "zip_code", "total_spend"
)

# Plot a choropleth map
fig, ax = plt.subplots(figsize=(10, 8))

revenue_by_zip_gdf.plot(
    column="total_spend_sum",
    ax=ax,
    legend=True,
    legend_kwds={"label": "Total Revenue ($)", "shrink": 0.6},
    cmap="YlOrRd",           # Yellow → Orange → Red color scheme
    edgecolor="white",
    linewidth=0.5
)

# Plot store locations on top
stores_gdf.plot(
    ax=ax,
    color="blue",
    markersize=100,
    marker="*",
    label="Stores",
    zorder=5
)

ax.set_title("Customer Revenue by Zip Code — Austin Metro", fontsize=14, fontweight="bold")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.legend(fontsize=10)
ax.axis("equal")
plt.tight_layout()
plt.savefig("output/choropleth_revenue.png", dpi=150, bbox_inches="tight")
plt.show()
print("Choropleth map saved.")

Building Interactive Maps with Folium

Folium creates interactive HTML maps using Leaflet.js — perfect for exploring geospatial data and sharing results with stakeholders:

Python
import folium
from folium.plugins import MarkerCluster, HeatMap
import geopandas as gpd
import json

def create_interactive_map(
    center_lat: float = 30.35,
    center_lon: float = -97.75,
    zoom: int = 10
) -> folium.Map:
    """Create a base interactive map centered on Austin."""
    return folium.Map(
        location=[center_lat, center_lon],
        zoom_start=zoom,
        tiles="CartoDB positron"    # Clean, minimal basemap
    )


m = create_interactive_map()

# ── Add store markers ─────────────────────────────────────────────
for _, store in stores_gdf.iterrows():
    folium.CircleMarker(
        location=[store.geometry.y, store.geometry.x],
        radius=12,
        color="navy",
        fill=True,
        fill_color="royalblue",
        fill_opacity=0.8,
        popup=folium.Popup(
            f"<b>{store['store_name']}</b><br>"
            f"Revenue: ${store['revenue_2024']:,.0f}",
            max_width=200
        ),
        tooltip=store["store_name"]
    ).add_to(m)

# ── Add customer heatmap ──────────────────────────────────────────
heat_data = [[row["latitude"], row["longitude"]]
             for _, row in customers_df.iterrows()]
HeatMap(
    heat_data,
    radius=15,
    blur=20,
    gradient={0.2: "blue", 0.5: "lime", 0.8: "orange", 1.0: "red"},
    name="Customer Density"
).add_to(m)

# ── Add zip code polygons ─────────────────────────────────────────
# Convert revenue data to dict for styling
revenue_dict = revenue_by_zip_gdf.set_index("zip_code")["total_spend_sum"].to_dict()
max_rev = max(revenue_dict.values()) if revenue_dict else 1

def style_zip(feature):
    """Color zip codes by revenue intensity."""
    zip_code = feature["properties"].get("zip_code", "")
    revenue  = revenue_dict.get(zip_code, 0)
    intensity = revenue / max_rev

    # Interpolate color: light yellow → deep red
    r = int(255)
    g = int(255 * (1 - intensity * 0.8))
    b = int(100 * (1 - intensity))
    color = f"#{r:02x}{g:02x}{b:02x}"

    return {
        "fillColor":   color,
        "fillOpacity": 0.4,
        "color":       "white",
        "weight":      1.5
    }

folium.GeoJson(
    zip_codes.__geo_interface__,
    style_function=style_zip,
    tooltip=folium.GeoJsonTooltip(
        fields=["zip_code", "city_name"],
        aliases=["Zip Code:", "Neighborhood:"]
    ),
    name="Zip Codes"
).add_to(m)

# ── Add layer control ─────────────────────────────────────────────
folium.LayerControl().add_to(m)

# Save the interactive map
m.save("output/austin_stores_map.html")
print("Interactive map saved: output/austin_stores_map.html")

Common Geospatial Data Science Patterns

Pattern 1: Catchment Area Analysis

Python
import geopandas as gpd
import pandas as pd

def catchment_area_analysis(
    stores_gdf: gpd.GeoDataFrame,
    customers_gdf: gpd.GeoDataFrame,
    radius_km: float = 10.0,
    projected_crs: str = "EPSG:32614"
) -> pd.DataFrame:
    """
    For each store, find all customers within a radius and compute metrics.

    Parameters
    ----------
    stores_gdf : GeoDataFrame
        Store locations (Points).
    customers_gdf : GeoDataFrame
        Customer locations with spend data (Points).
    radius_km : float
        Catchment radius in kilometers.
    projected_crs : str
        Metric CRS for accurate distance calculations.

    Returns
    -------
    pd.DataFrame
        One row per store with catchment metrics.
    """
    stores_proj    = stores_gdf.to_crs(projected_crs)
    customers_proj = customers_gdf.to_crs(projected_crs)

    radius_m = radius_km * 1000
    results  = []

    for _, store in stores_proj.iterrows():
        # Create a circular catchment buffer
        catchment = store.geometry.buffer(radius_m)

        # Find customers within the catchment
        in_catchment = customers_proj[
            customers_proj.geometry.within(catchment)
        ]

        results.append({
            "store_id":         store.get("store_id", ""),
            "store_name":       store.get("store_name", ""),
            "catchment_radius_km": radius_km,
            "n_customers_in_catchment": len(in_catchment),
            "total_catchment_spend": in_catchment["total_spend"].sum()
                if "total_spend" in in_catchment.columns else 0,
            "avg_catchment_spend": in_catchment["total_spend"].mean()
                if "total_spend" in in_catchment.columns and len(in_catchment) > 0 else 0,
        })

    return pd.DataFrame(results).round(2)


catchment_report = catchment_area_analysis(
    stores_gdf, customers_gdf, radius_km=15.0
)
print("Catchment Area Analysis (15km radius):")
print(catchment_report.sort_values("total_catchment_spend", ascending=False)
      .to_string(index=False))

Pattern 2: Spatial Clustering

Python
from sklearn.cluster import DBSCAN
import numpy as np
import geopandas as gpd
import pandas as pd

def spatial_cluster_customers(
    customers_gdf: gpd.GeoDataFrame,
    eps_km: float = 2.0,
    min_samples: int = 5
) -> gpd.GeoDataFrame:
    """
    Cluster customers by geographic proximity using DBSCAN.

    DBSCAN is preferred over K-Means for spatial clustering because
    it discovers clusters of arbitrary shape and handles noise (outliers).

    Parameters
    ----------
    customers_gdf : GeoDataFrame
        Customer points with spend data.
    eps_km : float
        Maximum distance between neighbors (km) to be in the same cluster.
    min_samples : int
        Minimum customers needed to form a cluster.

    Returns
    -------
    GeoDataFrame
        Input GeoDataFrame with cluster_id column added (-1 = noise/outlier).
    """
    # Project to meters for accurate distance
    gdf_proj = customers_gdf.to_crs("EPSG:32614")

    # Extract coordinates as numpy array
    coords = np.column_stack([
        gdf_proj.geometry.x,
        gdf_proj.geometry.y
    ])

    # DBSCAN with epsilon in meters
    dbscan = DBSCAN(
        eps=eps_km * 1000,   # Convert km to meters
        min_samples=min_samples,
        metric="euclidean",
        algorithm="ball_tree",
        n_jobs=-1
    )

    clusters = dbscan.fit_predict(coords)

    result = customers_gdf.copy()
    result["cluster_id"] = clusters

    n_clusters = len(set(clusters)) - (1 if -1 in clusters else 0)
    n_noise    = (clusters == -1).sum()
    print(f"Spatial clustering: {n_clusters} clusters, {n_noise} noise points")

    return result


clustered = spatial_cluster_customers(
    customers_gdf, eps_km=2.0, min_samples=8
)

# Cluster summary
cluster_summary = (
    clustered[clustered["cluster_id"] >= 0]
    .groupby("cluster_id")
    .agg(
        n_customers  = ("customer_id", "count"),
        total_spend  = ("total_spend",  "sum"),
        avg_spend    = ("total_spend",  "mean"),
        centroid_lat = ("latitude",     "mean"),
        centroid_lon = ("longitude",    "mean")
    )
    .round(2)
    .sort_values("total_spend", ascending=False)
    .reset_index()
)

print("\nCluster Summary:")
print(cluster_summary.head(10).to_string(index=False))

Pattern 3: Point-in-Polygon with Large Datasets

Python
import geopandas as gpd
import pandas as pd
import numpy as np

def efficient_point_in_polygon(
    points_df: pd.DataFrame,
    polygons_gdf: gpd.GeoDataFrame,
    lat_col: str = "latitude",
    lon_col: str = "longitude",
    polygon_id_col: str = "polygon_id",
    chunk_size: int = 100_000
) -> pd.Series:
    """
    Efficiently assign polygon IDs to a large DataFrame of points.
    Processes in chunks to avoid memory issues.

    Returns a Series of polygon IDs (NaN for points outside all polygons).
    """
    results = []
    n_chunks = (len(points_df) + chunk_size - 1) // chunk_size

    for i in range(n_chunks):
        chunk = points_df.iloc[i * chunk_size : (i + 1) * chunk_size]

        chunk_gdf = gpd.GeoDataFrame(
            chunk,
            geometry=gpd.points_from_xy(chunk[lon_col], chunk[lat_col]),
            crs="EPSG:4326"
        )

        joined = gpd.sjoin(
            chunk_gdf[["geometry"]],
            polygons_gdf[[polygon_id_col, "geometry"]],
            how="left",
            predicate="within"
        )

        results.append(joined[polygon_id_col])

        if i % 5 == 0:
            print(f"  Processed chunk {i+1}/{n_chunks} "
                  f"({(i+1)*chunk_size:,} / {len(points_df):,} rows)")

    return pd.concat(results)


# Assign zip codes to 500K customers efficiently
# customers_large_df["zip_code"] = efficient_point_in_polygon(
#     customers_large_df, zip_codes, polygon_id_col="zip_code"
# )

Summary

Geospatial data science opens the spatial dimension for any analysis — it’s the difference between knowing that sales are high in “Austin” and knowing they’re concentrated in a 2-mile radius around the university, or that a proposed new store location has 45,000 potential customers within a 10km drive.

The foundational concepts — geometry types (Point, LineString, Polygon), coordinate reference systems (WGS84 for lat/lon, UTM for accurate measurements), and the critical rule that distance and area calculations must use a projected CRS in meters — are the prerequisite for everything else. The core Python stack is Shapely for geometry objects, GeoPandas for spatial DataFrames, and Folium or Matplotlib for maps.

The most important single operation is the spatial join (gpd.sjoin()): assigning geographic context (zip code, census tract, sales territory) to point data (customers, events, sensors). This single operation unlocks most geospatial data science work — once every record has geographic context, standard pandas groupby, aggregation, and visualization take over.

Key Takeaways

  • Geospatial data uses three geometry types: Point (single location), LineString (path), and Polygon (area) — corresponding to Shapely’s Point, LineString, and Polygon classes in Python
  • Coordinate Reference Systems are the most important concept to get right: WGS84 (EPSG:4326) stores coordinates as latitude/longitude degrees and is used for GPS and web maps; projected CRS like UTM zones store coordinates in meters and are required for accurate distance and area calculations
  • Always reproject to a metric CRS before calculating distances or areasgdf.to_crs("EPSG:32614") converts to UTM Zone 14N (Texas) in meters; geometry.distance(other) then returns meters, not degrees
  • gpd.sjoin() is the core geospatial operation for data science — it performs a spatial join that assigns polygon attributes (zip code, census tract, city) to point data (customers, events) based on spatial relationships rather than a shared key
  • gpd.sjoin_nearest() finds the nearest feature with an optional distance_col parameter that returns the distance in the GeoDataFrame’s units — always use a metric CRS before calling this
  • Folium creates interactive HTML maps with markers, heatmaps, choropleths, and layer controls in pure Python — ideal for sharing exploratory geospatial findings with stakeholders
  • For large-scale spatial joins (millions of points), process in chunks with efficient_point_in_polygon() to avoid memory errors — GeoPandas uses an R-tree spatial index internally to make joins fast
  • The standard geospatial data science workflow: load points (customers, events) → load polygons (zip codes, census tracts) → spatial join to assign context → aggregate by polygon → visualize with choropleth
Share:
Subscribe
Notify of
0 Comments

Discover More

Implementing Linear Regression from Scratch in Python

Implementing Linear Regression from Scratch in Python

Learn to implement linear regression from scratch in Python using NumPy. Build gradient descent, the…

Ohm’s Law: Relationship Between Voltage, Current and Resistance

Learn about Ohm’s Law, its applications and practical examples. Discover how voltage, current and resistance…

Moving into Data Science from a Business Background

Learn how to transition from business roles to data science. Discover how your business acumen…

What Is System Performance Monitoring?

What Is System Performance Monitoring?

Learn what system performance monitoring is, which metrics matter, how operating systems track CPU, memory,…

Anomaly Detection: Finding Outliers in Your Data

Anomaly Detection: Finding Outliers in Your Data

Master anomaly detection from first principles. Learn Isolation Forest, Local Outlier Factor, One-Class SVM, statistical…

Operator Overloading in C++: Making Your Classes Intuitive

Operator Overloading in C++: Making Your Classes Intuitive

Learn C++ operator overloading to create intuitive custom classes. Master arithmetic, comparison, stream, and assignment…

Click For More
0
Would love your thoughts, please comment.x
()
x