Chapter 6 Data Cleaning and Quality Assurance

If there’s one task that dominates a GIS analyst’s working life, it’s data cleaning. Estimates vary, but a commonly quoted figure is that spatial data scientists spend 60–80% of their time wrangling data before any analysis happens. This is where Claude Code can make the biggest immediate difference.

6.1 Why This Chapter Matters

Data cleaning is repetitive, rule-based, and often tedious — exactly the kind of work that AI-assisted coding excels at. But it’s also where mistakes are most consequential: a bad join or a dropped geometry can silently cascade through your entire analysis.

The approach in this chapter is simple: let Claude write the cleaning code, but always validate the results yourself.

6.2 Common GIS Data Cleaning Tasks

Here are the types of tasks where Claude Code adds the most value, along with example prompts you might use:

6.2.1 Geometry validation and repair

  • Check for invalid geometries: “Read in data/raw/boundaries.gpkg and check which features have invalid geometries. Report how many are invalid and what the issues are.”
  • Repair geometries: “Use st_make_valid() to fix any invalid geometries and save the result to data/processed/boundaries_clean.gpkg”
  • Remove empty geometries: “Filter out any features with empty or null geometries”

6.2.2 CRS management

  • Check and standardise CRS: “What CRS is this dataset in? Reproject it to EPSG:27700 (British National Grid)”
  • Handle mixed CRS datasets: “I have three shapefiles that might be in different projections. Check each one and reproject them all to EPSG:27700”

6.2.3 Attribute cleaning

  • Standardise column names: “Rename all columns to lowercase snake_case”
  • Handle missing values: “Show me a summary of NA values by column, then drop rows where the ‘area_name’ field is missing”
  • Fix encoding issues: “Some of the place names have garbled characters. Try to fix the encoding, assuming the original was UTF-8”
  • Standardise categories: “The ‘land_use’ column has inconsistent values — ‘Residential’, ‘residential’, ‘RESIDENTIAL’, and ‘res.’ all appear. Standardise these.”
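
To make the attribute-cleaning prompts above concrete, here is a minimal sketch of the logic you would expect Claude to produce. It is written in plain Python with only the standard library so that it runs without any GIS packages; the prompts in this chapter would more likely yield sf or dplyr code. The sample rows, the alias table LAND_USE_ALIASES, and all function names are hypothetical.

```python
import re

# Hypothetical alias table: each variant spelling maps to one canonical label.
LAND_USE_ALIASES = {
    "residential": "residential",
    "res.": "residential",
}


def to_snake_case(name):
    """Turn 'Area Name' or 'landUse' into 'area_name' / 'land_use'."""
    # Split camelCase by inserting an underscore before an interior capital.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name.strip())
    # Collapse every run of non-alphanumeric characters into one underscore.
    name = re.sub(r"[^0-9A-Za-z]+", "_", name)
    return name.strip("_").lower()


def na_counts(rows):
    """Count missing values (None or blank strings) per column."""
    counts = {}
    for row in rows:
        for col, val in row.items():
            missing = val is None or (isinstance(val, str) and not val.strip())
            counts[col] = counts.get(col, 0) + int(missing)
    return counts


def standardise_category(value, aliases):
    """Map a raw category to its canonical label.

    Values not in the alias table are only trimmed and lowercased,
    so they remain visible for manual review.
    """
    if value is None:
        return None
    key = value.strip().lower()
    return aliases.get(key, key)


# Made-up sample rows standing in for an attribute table.
rows = [
    {"Area Name": "Northfield", "landUse": "Residential"},
    {"Area Name": None, "landUse": "RESIDENTIAL"},
    {"Area Name": "Eastgate", "landUse": "res."},
]

renamed = [{to_snake_case(k): v for k, v in row.items()} for row in rows]
missing_by_column = na_counts(renamed)

# Drop rows with no area_name, then standardise the land_use labels.
cleaned = [
    {**row, "land_use": standardise_category(row["land_use"], LAND_USE_ALIASES)}
    for row in renamed
    if row["area_name"] is not None
]

print(missing_by_column)
print(cleaned)
```

Keeping the alias table explicit, rather than letting the code guess at matches, means every category merge is a decision you can read and review.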

6.2.4 Joins and merges

  • Spatial joins: “Join the points dataset to the polygon boundaries using st_join. Keep all points even if they don’t fall within a polygon.”
  • Attribute joins: “Join the census data CSV to the boundary polygons on the ‘ward_code’ field. Flag any codes that don’t match.”
  • Identifying join issues: “How many records from each dataset matched? Show me the unmatched records from both sides.”
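
The attribute-join prompts above come down to one check: which keys matched, and which were left over on each side. The following is a rough sketch of that check in plain Python, assuming each ward code appears at most once in the census table. The function name join_report, the "matched" flag, and the ward codes are invented for illustration; they are not from any real dataset.

```python
def join_report(boundaries, census, key="ward_code"):
    """Left-join census rows onto boundary rows and list unmatched keys.

    Assumes `key` is unique within `census`; if it is not, later rows
    silently overwrite earlier ones in the lookup below.
    """
    census_by_key = {row[key]: row for row in census}
    boundary_keys = {row[key] for row in boundaries}

    joined = []
    for boundary in boundaries:
        match = census_by_key.get(boundary[key])
        merged = dict(boundary)
        if match is not None:
            merged.update({k: v for k, v in match.items() if k != key})
        # Flag every row so unmatched boundaries are kept, not dropped.
        merged["matched"] = match is not None
        joined.append(merged)

    return {
        "joined": joined,
        "unmatched_boundaries": sorted(boundary_keys - set(census_by_key)),
        "unmatched_census": sorted(set(census_by_key) - boundary_keys),
    }


# Made-up ward codes for illustration only.
boundaries = [
    {"ward_code": "W01", "ward_name": "Northfield"},
    {"ward_code": "W02", "ward_name": "Eastgate"},
    {"ward_code": "W03", "ward_name": "Riverside"},
]
census = [
    {"ward_code": "W01", "population": 1200},
    {"ward_code": "W02", "population": 950},
    {"ward_code": "W99", "population": 400},
]

report = join_report(boundaries, census)
print(report["unmatched_boundaries"])
print(report["unmatched_census"])
```

Reporting the leftovers from both sides matters because a left join on its own only shows boundaries without census data, never census rows without a boundary.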

6.3 Building a QA Checklist

A good practice is to ask Claude to generate a QA report at the end of any cleaning step:

“Generate a QA summary for the cleaned dataset: number of features, CRS, column names and types, count of NAs per column, geometry type, and bounding box.”

This gives you a quick sanity check before moving on. You might even make it a standard function in every project and describe it in your CLAUDE.md, so that Claude can reuse it as a template.
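
As an illustration of what such a reusable QA function might look like, here is a minimal sketch in plain Python. It assumes the features are GeoJSON-style dictionaries with "properties" and "geometry" keys, that the CRS is passed in as a label because the features themselves do not carry it, and that there are no GeometryCollections (these would be counted as empty here). The names qa_summary and iter_positions are hypothetical.

```python
from collections import Counter


def iter_positions(coords):
    """Yield (x, y) pairs from nested GeoJSON-style coordinate arrays."""
    if coords and isinstance(coords[0], (int, float)):
        yield coords[0], coords[1]
    else:
        for part in coords:
            yield from iter_positions(part)


def qa_summary(features, crs=None):
    """Summarise feature count, columns, NAs, geometry types and bounding box."""
    na_per_column = Counter()
    column_types = {}
    geometry_types = Counter()
    xs, ys = [], []

    for feature in features:
        for col, val in (feature.get("properties") or {}).items():
            na_per_column[col] += 1 if val is None else 0
            types = column_types.setdefault(col, set())
            if val is not None:
                types.add(type(val).__name__)

        geometry = feature.get("geometry")
        if not geometry or not geometry.get("coordinates"):
            geometry_types["(empty)"] += 1
            continue
        geometry_types[geometry["type"]] += 1
        for x, y in iter_positions(geometry["coordinates"]):
            xs.append(x)
            ys.append(y)

    return {
        "n_features": len(features),
        "crs": crs,
        "columns": {col: sorted(t) for col, t in column_types.items()},
        "na_per_column": dict(na_per_column),
        "geometry_types": dict(geometry_types),
        "bbox": (min(xs), min(ys), max(xs), max(ys)) if xs else None,
    }


# Made-up features with coordinates in metres.
features = [
    {"properties": {"area_name": "Northfield", "population": 1200},
     "geometry": {"type": "Point", "coordinates": [400100.0, 300200.0]}},
    {"properties": {"area_name": None, "population": 950},
     "geometry": {"type": "Polygon", "coordinates": [[
         [400000.0, 300000.0], [400400.0, 300000.0],
         [400400.0, 300300.0], [400000.0, 300000.0]]]}},
    {"properties": {"area_name": "Eastgate", "population": None},
     "geometry": None},
]

summary = qa_summary(features, crs="EPSG:27700")
for field, value in summary.items():
    print(f"{field}: {value}")
```

Returning a dictionary rather than printing directly makes it easy to compare the summary from before a cleaning step with the one from after it.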

6.4 Things to Watch Out For

  • Silent geometry drops. Some operations quietly remove features with invalid geometries. Always compare your row count before and after cleaning.
  • CRS assumptions. If a dataset has no CRS metadata, Claude will guess, and it might guess wrong. Always verify with a visual check in QGIS or by plotting the data against a layer whose CRS you trust.
  • Encoding traps. Shapefiles are particularly prone to character encoding issues. If attribute values look garbled after cleaning, ask Claude to try different encodings.
  • Join cardinality. A spatial join between points and overlapping polygons can produce duplicate rows, because a point that falls inside two polygons is returned once per polygon. Make sure you know whether a one-to-one or one-to-many result is expected.
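
Several of these checks are simple enough to automate. The sketch below shows three of them in plain Python: a row-count comparison, a duplicate-ID check for join output, and a rough coordinate-range test for a guessed CRS. The range test is only a heuristic and does not replace the visual check; it assumes the nominal British National Grid extent of 0 to 700,000 m east and 0 to 1,300,000 m north. All function names and sample values are hypothetical.

```python
from collections import Counter

# Nominal extent of the British National Grid, in metres.
BNG_EASTING = (0, 700_000)
BNG_NORTHING = (0, 1_300_000)


def dropped_rows(before, after):
    """Return how many rows a step removed (negative if rows were added)."""
    return before - after


def duplicated_ids(rows, id_field):
    """Return {id: count} for every ID that appears more than once."""
    counts = Counter(row[id_field] for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


def looks_like_degrees(bbox):
    """True if the bounding box fits inside the longitude/latitude range."""
    xmin, ymin, xmax, ymax = bbox
    return -180 <= xmin <= xmax <= 180 and -90 <= ymin <= ymax <= 90


def within_bng_extent(bbox):
    """True if the bounding box fits inside the nominal BNG extent."""
    xmin, ymin, xmax, ymax = bbox
    return (BNG_EASTING[0] <= xmin and xmax <= BNG_EASTING[1]
            and BNG_NORTHING[0] <= ymin and ymax <= BNG_NORTHING[1])


# Made-up join output: point 2 fell inside two overlapping polygons.
joined = [
    {"point_id": 1, "zone": "A"},
    {"point_id": 2, "zone": "A"},
    {"point_id": 2, "zone": "B"},
    {"point_id": 3, "zone": None},
]

print(dropped_rows(before=120, after=117))
print(duplicated_ids(joined, "point_id"))
print(looks_like_degrees((-3.2, 51.4, -3.1, 51.5)))
print(within_bng_extent((-3.2, 51.4, -3.1, 51.5)))
```

A bounding box that looks like degrees while the layer is labelled EPSG:27700 is a strong hint that the CRS was assigned rather than reprojected.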