Geospatial Data Cleaning

Definition

Geospatial data cleaning is the process of detecting and fixing issues—invalid geometries, duplicates, spikes, attribute typos, CRS mismatches, and topology errors—that erode analytical reliability. Because spatial operations amplify small defects, cleaning is a prerequisite to trustworthy maps and models. The outcome is not just prettier data but reproducible processes and documented quality thresholds.

Application

Cities clean address points before geocoding, ecologists remove GPS outliers, and utilities snap linework to network rules. Pipelines codify these steps so new data enters the lake in analysis‑ready form.

FAQ

What are common geometry fixes?

Make geometries valid (repair self‑intersections and unclosed rings), dissolve slivers, snap nearby nodes, and simplify within a tolerance. Always verify that topology constraints (shared boundaries, no gaps or overlaps) survive each fix.
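Two of these fixes can be sketched in plain Python: closing an unclosed ring and snapping nearby nodes onto a tolerance grid so they coincide exactly. This is a minimal illustration with made-up coordinates; production repairs (e.g. self‑intersections) are better left to a geometry library.

```python
def close_ring(coords):
    """Append the first vertex if the ring is not closed."""
    if coords and coords[0] != coords[-1]:
        return coords + [coords[0]]
    return list(coords)

def snap_to_grid(coords, tolerance=0.001):
    """Round coordinates to the nearest grid cell so that nodes
    within the tolerance land on identical values."""
    def snap(v):
        return round(v / tolerance) * tolerance
    return [(snap(x), snap(y)) for x, y in coords]

ring = [(0.0, 0.0), (1.0004, 0.0), (1.0, 1.0)]   # open, slightly misaligned
fixed = close_ring(snap_to_grid(ring, tolerance=0.01))
print(fixed)
```

Snapping before closing matters: if the first and last vertices differ only by sub‑tolerance noise, snapping makes them equal and no duplicate vertex is appended.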

How do you detect location outliers?

Use bounding boxes, speed/acceleration limits for tracks, density‑based clustering, and cross‑checks against reference layers such as coastlines or parcels.
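The speed‑limit check for tracks can be sketched as follows: any fix that would require travelling faster than a plausible maximum from the last accepted fix is dropped. The 30 m/s threshold and the sample track are illustrative assumptions.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def filter_by_speed(track, max_speed=30.0):
    """track: list of (timestamp_s, lat, lon). Keep fixes whose implied
    speed from the last kept fix is at or below max_speed (m/s)."""
    kept = [track[0]]
    for t, lat, lon in track[1:]:
        t0, lat0, lon0 = kept[-1]
        dt = t - t0
        if dt > 0 and haversine_m(lat0, lon0, lat, lon) / dt <= max_speed:
            kept.append((t, lat, lon))
    return kept

track = [(0, 51.0, 0.0), (10, 51.0001, 0.0),
         (20, 52.0, 0.0),               # ~111 km jump in 10 s: a spike
         (30, 51.0003, 0.0)]
clean = filter_by_speed(track)
```

Comparing against the last *kept* fix, rather than the immediate predecessor, prevents one spike from knocking out the valid fixes that follow it.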

Can attribute errors break spatial analysis?

Yes. Wrong units (feet recorded as metres, for example) or mistyped category values can bias results without any visible geometry error. Validate attribute domains, units, and join keys alongside geometries.
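A domain-and-range check can be sketched against a hypothetical parcels table; the `landuse` domain, the `area_m2` field, and its plausible bounds are all illustrative assumptions.

```python
# Allowed category values for the (hypothetical) landuse field.
VALID_LANDUSE = {"residential", "commercial", "industrial", "agricultural"}

def validate_record(rec):
    """Return a list of attribute issues found in one record."""
    issues = []
    if rec.get("landuse") not in VALID_LANDUSE:
        issues.append(f"landuse out of domain: {rec.get('landuse')!r}")
    area = rec.get("area_m2")
    # Tiny values often signal areas entered in hectares, not square metres.
    if not isinstance(area, (int, float)) or not (1 <= area <= 1e7):
        issues.append(f"area_m2 out of range: {area!r}")
    return issues

records = [
    {"landuse": "residential", "area_m2": 650.0},
    {"landuse": "Residental", "area_m2": 0.065},  # typo + likely hectares
]
reports = [validate_record(r) for r in records]
```

Running the check yields an empty issue list for the clean record and two flagged issues for the second, which can then be routed to review rather than silently dropped.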

How do we keep cleaning auditable?

Script it, version outputs, and log changes. Store both original and corrected datasets with issue summaries for governance.
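A scripted, logged cleaning step might look like the sketch below: every change records the feature id, field, old and new value, while the original records are left untouched. The trimming/title-casing rule and field names are illustrative.

```python
import json

def clean_with_log(records, fixes):
    """fixes: {field: callable(old) -> new}. Returns corrected copies
    plus a change log; the input records are not mutated."""
    log, cleaned = [], []
    for rec in records:
        new = dict(rec)
        for field, fix in fixes.items():
            old_val = new.get(field)
            new_val = fix(old_val)
            if new_val != old_val:
                log.append({"id": rec["id"], "field": field,
                            "old": old_val, "new": new_val})
                new[field] = new_val
        cleaned.append(new)
    return cleaned, log

records = [{"id": 1, "city": " springfield "}, {"id": 2, "city": "Shelbyville"}]
cleaned, log = clean_with_log(records, {"city": lambda v: v.strip().title()})
print(json.dumps(log, indent=2))   # the audit trail, ready to version
```

Keeping both `records` and `cleaned` (plus the serialized log) gives governance exactly what the answer above calls for: originals, corrections, and an issue summary.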