← Back to Work

Costbook ETL Pipeline

Turning messy HVAC supplier price books into clean, importable costbook data using a Bronze/Silver/Gold medallion architecture. Ingests Excel and PDF, normalizes with LLM-powered transformation, enriches with AHRI efficiency ratings, and exports import-ready files. Deployed on AWS EKS.

ETLDoclingClaude SonnetFirecrawlAHRIAWS EKSPython
FieldPulse 2025 Production
Costbook ETL Pipeline

Context

HVAC contractors need accurate part costs to quote jobs

FieldPulse serves HVAC contractors who install, repair, and maintain heating and cooling systems. Before a technician can price a job, they need to know what the parts cost. HVAC suppliers publish price books for this purpose, but every supplier does it differently -- some deliver Excel files with deeply nested columns, others send PDFs with embedded tables. Terminology varies across manufacturers, equipment ratings reference AHRI certification numbers that require separate lookups, and nothing maps cleanly to anything else. FieldPulse's Costbook feature lets contractors store and query these costs when building quotes, but the gap was getting from raw supplier files to data that could actually live in the system. Before this pipeline, it was a manual re-entry process.

Problem

Three layers of complexity standing between raw files and usable data

This was not a simple format conversion. The challenge broke into three distinct dimensions that each required a different kind of solution.

Input diversity

Supplier price books arrive as Excel files with irregular layouts, PDFs with embedded tables, or both. No standard format exists. The pipeline needed to handle genuine structural messiness without human cleanup before ingestion.

Domain complexity

HVAC systems are component-based -- a complete system is an outdoor condensing unit paired with an indoor air handler, a furnace, a thermostat, and accessories. Price books list these individually. The pipeline needed to understand which components form a matched system, not just extract flat line items.

Efficiency enrichment

Many components carry AHRI certification numbers for independently tested efficiency ratings (SEER, EER, HSPF). These ratings were not in the supplier files at all -- the pipeline needed to look them up externally and attach them to produce complete Costbook entries.

Approach

Medallion architecture with an LLM at the transformation layer

The pipeline follows a Bronze, Silver, Gold medallion pattern where data fidelity and structure increase at each stage. Rather than building rigid schema parsers that would break on every new supplier format, the system uses Claude Sonnet as the transformation engine -- handling semantic variation across column names, units, and grouping conventions without per-supplier custom code.

1

Bronze Layer -- Raw Ingestion

Docling extracts embedded tables from PDFs into structured JSON. Pandas absorbs irregular Excel layouts -- merged cells, multi-row headers, blank separator rows -- into flat representations. The bronze output is a faithful replica of the original. Nothing is renamed, inferred, or dropped.

2

Silver Layer -- LLM Transform + System Assembly + AHRI Lookup

Claude Sonnet handles semantic normalization, mapping varied supplier column names to FieldPulse's internal taxonomy. System assembly groups individual components into matched HVAC systems. Firecrawl queries the AHRI Directory to pull certified efficiency ratings for each model number.

3

Gold Layer -- Validated Export

Generates import-ready Excel files built to FieldPulse's Costbook specification. Once exported, contractors import directly into the platform to reference accurate, supplier-current pricing when building quotes and estimates.

Architecture

End-to-end pipeline from raw supplier files to importable data

Each layer operates independently and is queryable on its own. The principle: never lose the raw input, make transformation explicit and auditable, and validate against the import schema before anything reaches production.

Costbook ETL Pipeline Architecture

Bronze → Silver → Gold medallion architecture

Outcome

Manual re-entry replaced with an automated pipeline

Excel + PDF

Input Formats Handled

3

Pipeline Stages (B/S/G)

AHRI

Automated Efficiency Lookup

EKS

Production Deployment

Contractors onboarding with multiple supplier relationships no longer needed someone to re-key price book data by hand. They imported the supplier file, the pipeline ran, and the Costbook was populated with normalized, enriched, import-ready data.

Next Project

Customer Onboarding App →
All Projects