carBot: building automotive intelligence from messy market data.
carBot is an internal automotive intelligence system that turns noisy car listings into normalized records, market cohorts, deal scores, VIN/SPZ context, alerting, and a dashboard built for repeated analyst workflows.
The interesting part of carBot is not that it collects listings. The hard part is that the source material is semistructured, multilingual, duplicated, inconsistent, and constantly changing. A useful system has to preserve the original evidence, extract the fields analysts actually need, quantify uncertainty, and keep the resulting market model current enough to support decisions.
The challenge: car listings are not clean data.
Used-car markets look simple from the outside: title, price, mileage, year, photos, seller, and maybe a VIN. In practice, the same vehicle can appear across sources with different wording, different currencies, partial specifications, missing mileage, non-standard trims, auction-specific fields, and inconsistent status updates. The problem is closer to data engineering for a moving, adversarial-looking corpus than to ordinary CRUD ingestion.
carBot was designed around that reality. It keeps raw evidence when useful, creates a structured representation only after validation, and treats confidence as a first-class output rather than hiding parser uncertainty. This matters because a wrong year, mileage, fuel type, or trim can distort the market cohort and produce a misleading deal score.
- Descriptions mix seller prose, abbreviations, Czech number formats, trim names, warranty text, dates, and irrelevant numeric values.
- The same physical car can reappear across marketplaces, auctions, or dealers. VIN-aware deduplication prevents double-counting.
- Some cohorts have enough recent data; rare trims or power bands need fallbacks before they can be scored responsibly.
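The VIN-aware deduplication mentioned above can be sketched roughly as follows. This is a minimal illustration, not carBot's implementation: the `Listing` shape, the `first_seen` ordering key, and the "earliest sighting is canonical" rule are all assumptions made for the example. The only external fact used is that standard 17-character VINs never contain the letters I, O, or Q. The VIN strings in the usage note are synthetic placeholders.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Standard VINs are 17 characters and never contain I, O, or Q.
VIN_PATTERN = re.compile(r"^[A-HJ-NPR-Z0-9]{17}$")


@dataclass
class Listing:
    id: int
    source: str
    first_seen: int                      # hypothetical ordering key (e.g. a Unix timestamp)
    vin: Optional[str] = None
    duplicate_of_id: Optional[int] = None


def normalize_vin(raw: Optional[str]) -> Optional[str]:
    """Return an upper-cased VIN if it has a plausible shape, otherwise None."""
    if not raw:
        return None
    candidate = re.sub(r"[\s-]", "", raw).upper()
    return candidate if VIN_PATTERN.match(candidate) else None


def mark_vin_duplicates(listings: list[Listing]) -> list[Listing]:
    """Point later sightings of a VIN at the earliest listing that carried it."""
    canonical: dict[str, int] = {}
    for listing in sorted(listings, key=lambda item: (item.first_seen, item.id)):
        vin = normalize_vin(listing.vin)
        if vin is None:
            continue  # no usable VIN: leave the row to other dedup signals
        if vin in canonical:
            listing.duplicate_of_id = canonical[vin]
        else:
            canonical[vin] = listing.id
    return listings
```

Calling `mark_vin_duplicates` on two rows whose VIN fields read `testvna0000000001` and `TESTVNA-0000000001` would mark the later one as a duplicate of the earlier one, because both normalize to the same string. Rows without a valid VIN are left untouched rather than guessed at.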
System architecture.
At a high level, carBot is a Python 3.11+ asynchronous pipeline with a PostgreSQL core, a FastAPI inspection/control API, a Next.js analyst frontend, and separate subsystems for market modeling, scoring, alert delivery, VIN context, and SPZ recognition. The project is intentionally modular: each source adapter can fail independently, while the post-processing stages run behind serialized locks so expensive recompute work does not collide.
The core record is deliberately boring.
The listing table carries both raw input fields and normalized fields. This sounds mundane, but it is one of the most important engineering choices in the system: analysts can inspect what the parser believed, compare it with the original text, and fix normalization rules without losing provenance.
{
  "source": "marketplace_a",
  "url_fingerprint": "stable_dedup_key",
  "title_raw": "seller-provided title",
  "price_raw": "raw localized price text",
  "price_czk": 389000,
  "price_confidence": "high",
  "mileage_km": 142000,
  "mileage_confidence": "medium",
  "year": 2018,
  "make": "normalized_make",
  "model": "normalized_model",
  "trim": "normalized_trim_or_null",
  "vin": "validated_or_null",
  "duplicate_of_id": null,
  "deal_score": 82.4,
  "score_details": {
    "scoring_method": "regression_3d",
    "expected_price_czk": 445000,
    "deal_pct": 12.6,
    "regression_samples": 41
  }
}
Why the parser works.
The parser does not assume that source pages are clean forms. It treats each field as a scored extraction result: value, confidence, source, and raw evidence. That makes the parser explainable. A mileage extracted from a labeled "km" phrase can be trusted more than a bare number. A standalone year is only accepted when the surrounding context does not look like a date, financing period, or unrelated numeric fragment.
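To make the "scored extraction result" idea concrete, here is a small sketch under stated assumptions. The `Extraction` type, the rule names, the regular expressions, and the confidence labels are all hypothetical; the article does not show the real rules. The sketch only demonstrates the two behaviours described above: a labeled "km" phrase outranks a bare number, and a year glued to a date is skipped.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Extraction:
    field: str
    value: Optional[int]
    confidence: str   # "high", "medium", "low", or "none"
    source: str       # name of the rule that produced the value
    evidence: str     # raw text the rule matched


# A number with an explicit unit, e.g. "142 000 km" or "142.000km".
LABELED_KM = re.compile(r"(?<!\d)(\d{1,3}(?:[ .\u00a0]\d{3})+|\d+)\s*km\b", re.IGNORECASE)
# A bare 5-6 digit number is only a weak mileage hint.
BARE_NUMBER = re.compile(r"\b(\d{5,6})\b")
# A 19xx/20xx number that is not part of a date such as "12.05.2026".
STANDALONE_YEAR = re.compile(r"(?<!\d)(?<!\d[./])((?:19|20)\d{2})(?!\d|[./]\d)")


def extract_mileage(text: str) -> Extraction:
    match = LABELED_KM.search(text)
    if match:
        digits = re.sub(r"\D", "", match.group(1))
        return Extraction("mileage_km", int(digits), "high", "labeled_km", match.group(0))
    match = BARE_NUMBER.search(text)
    if match:
        return Extraction("mileage_km", int(match.group(1)), "low", "bare_number", match.group(0))
    return Extraction("mileage_km", None, "none", "no_match", "")


def extract_year(text: str) -> Extraction:
    match = STANDALONE_YEAR.search(text)
    if match:
        return Extraction("year", int(match.group(1)), "medium", "standalone_year", match.group(0))
    return Extraction("year", None, "none", "no_match", "")
```

Because every result carries its rule name and matched text, an analyst reviewing a low-confidence record can see why the parser believed a value, which is the explainability property the paragraph above describes.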
The normalizer layer also encodes domain-specific catalogs and aliases. This is where carBot turns human language into comparable units: model aliases, trim patterns, fuel keywords, gearbox phrases, power units, Czech number formatting, EUR/CZK separation, and guard rails for suspicious values.
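As an illustration of Czech number formatting, EUR/CZK separation, and guard rails, the following sketch normalizes a localized price string. The marker lists, the plausibility ranges, and the confidence labels are assumptions chosen for the example, not values from carBot. Note that the currencies are kept apart and no conversion is attempted.

```python
import re
from typing import NamedTuple, Optional


class Price(NamedTuple):
    amount: Optional[int]
    currency: Optional[str]   # "CZK", "EUR", or None when no marker was found
    confidence: str


# Hypothetical marker lists and plausibility bounds, for illustration only.
CURRENCY_MARKERS = {"EUR": ("€", "eur"), "CZK": ("kč", "czk")}
PLAUSIBLE_RANGE = {"CZK": (5_000, 50_000_000), "EUR": (200, 2_000_000)}


def normalize_price(raw: str) -> Price:
    # Non-breaking spaces are common thousands separators in scraped text.
    text = raw.replace("\u00a0", " ").lower()
    currency = next(
        (code for code, marks in CURRENCY_MARKERS.items() if any(mark in text for mark in marks)),
        None,
    )
    # Thousands are grouped with spaces or dots: "389 000", "389.000".
    match = re.search(r"\d{1,3}(?:[ .]\d{3})+|\d+", text)
    if match is None:
        return Price(None, currency, "none")
    amount = int(re.sub(r"[ .]", "", match.group(0)))
    if currency is None:
        return Price(amount, None, "low")      # a number without a currency is weak evidence
    low, high = PLAUSIBLE_RANGE[currency]
    confidence = "high" if low <= amount <= high else "low"
    return Price(amount, currency, confidence)
```

A suspicious value is not discarded here; it is returned with low confidence so that it stays inspectable while downstream stages can decide to ignore it.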
# conceptual parser contract, not production source
from datetime import date

def parse_listing(raw):
    # Combine every text source so the extractors see the full evidence.
    text = join_evidence(raw.title, raw.description, raw.params)
    current_year = date.today().year
    price = normalize_price(raw.price_text)
    # Guards bound each value to a plausible range.
    mileage = extract_mileage(text).with_guard(min_km=0, max_km=900_000)
    year = extract_year(text).with_guard(min_year=1980, max_year=current_year + 1)
    make = extract_make(raw.title, raw.description)
    model = extract_model(raw.title, make.value)
    trim = extract_trim(text, make.value, model.value)
    return ParsedListing(
        fields=[price, mileage, year, make, model, trim],
        coverage=coverage_tracker.snapshot(),
        parser_version=PARSER_VERSION,
    )
This confidence-aware approach is what keeps the downstream model honest. Low-confidence or missing fields can still be useful for search and inspection, but they should not silently poison pricing cohorts or high-value alerts.
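One way to express that gate is a small predicate that decides whether a record may enter pricing cohorts. The field names, the confidence ranking, and the "medium" floor below are assumptions for illustration; the article does not state the actual thresholds.

```python
# Hypothetical ordering of confidence labels, weakest to strongest.
CONFIDENCE_RANK = {"none": 0, "low": 1, "medium": 2, "high": 3}
# Fields assumed to matter for pricing; the real list may differ.
PRICING_FIELDS = ("price", "mileage", "year")


def usable_for_pricing(confidences: dict[str, str], minimum: str = "medium") -> bool:
    """True when every pricing field meets the minimum confidence."""
    floor = CONFIDENCE_RANK[minimum]
    return all(
        CONFIDENCE_RANK.get(confidences.get(field, "none"), 0) >= floor
        for field in PRICING_FIELDS
    )
```

A record that fails this check would still be stored and searchable; it is only excluded from cohort statistics and high-value alerts.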
Market modeling: comparable cars before clever scoring.
A naive deal detector compares a price to an average. carBot does more work before assigning a score. It builds market cohorts from comparable cars, clips price outliers with an IQR pass, prefers recent data, falls back to full history for thin cohorts, and widens the year band only when needed. That gives the scoring engine a baseline that is local to the vehicle family instead of global to the whole market.
| Stage | Purpose | Why it matters |
|---|---|---|
| Recent exact cohort | Use recent listings for the same make/model/year/fuel/power/seller context. | Keeps pricing sensitive to current market movement. |
| IQR clipping | Remove extreme price outliers before calculating medians. | Prevents broken listings, damaged cars, or unrealistic prices from moving the baseline. |
| Full-history fallback | Use older records when the recent cohort is too thin. | Rare variants still get a responsible comparison set. |
| Two-year widening | Broaden only cohorts that remain under-sampled. | Balances sample size against generation/facelift drift. |
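The selection stages in the table (recent exact cohort, full-history fallback, year widening) can be sketched as a simple cascade. Everything numeric here is an assumption: the minimum sample count, the recency window, and the reading of "two-year widening" as two model years either side are not specified in the article. IQR clipping is left out of this sketch and would run on whichever cohort is returned.

```python
from dataclasses import dataclass

MIN_SAMPLES = 8     # hypothetical minimum cohort size
RECENT_DAYS = 90    # hypothetical recency window
YEAR_BAND = 2       # one reading of "two-year widening"; the article does not define it


@dataclass(frozen=True)
class Comparable:
    year: int
    price_czk: int
    age_days: int    # days since the listing was last seen


def select_cohort(target_year: int, comparables: list[Comparable]) -> tuple[str, list[Comparable]]:
    """Return the first stage that yields enough samples, plus its rows."""
    same_year = [c for c in comparables if c.year == target_year]
    recent = [c for c in same_year if c.age_days <= RECENT_DAYS]
    if len(recent) >= MIN_SAMPLES:
        return "recent_exact", recent
    if len(same_year) >= MIN_SAMPLES:
        return "full_history", same_year
    widened = [c for c in comparables if abs(c.year - target_year) <= YEAR_BAND]
    if len(widened) >= MIN_SAMPLES:
        return "year_widened", widened
    return "insufficient", []
```

Returning the stage name alongside the rows keeps the fallback visible, so a score built on a widened cohort can be labeled as such instead of looking as reliable as one built on recent exact matches.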
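-- conceptual cohort computation
-- cohort_key is an illustrative composite key built from the cohort dimensions
WITH candidate_prices AS (
    SELECT
        concat_ws('|', make, model, year, fuel_type, seller_type) AS cohort_key,
        price_czk,
        mileage_km
    FROM listings
    WHERE listing_status = 'active'
      AND duplicate_of_id IS NULL
      AND price_czk IS NOT NULL
),
quartiles AS (
    -- per-cohort quartiles that define the IQR fence
    SELECT
        cohort_key,
        percentile_cont(0.25) WITHIN GROUP (ORDER BY price_czk) AS q1,
        percentile_cont(0.75) WITHIN GROUP (ORDER BY price_czk) AS q3
    FROM candidate_prices
    GROUP BY cohort_key
),
clipped AS (
    SELECT c.cohort_key, c.price_czk, c.mileage_km
    FROM candidate_prices AS c
    JOIN quartiles AS q USING (cohort_key)
    WHERE c.price_czk BETWEEN q.q1 - 1.5 * (q.q3 - q.q1)
                          AND q.q3 + 1.5 * (q.q3 - q.q1)
)
SELECT
    cohort_key,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY price_czk) AS median_price_czk,
    percentile_cont(0.5) WITHIN GROUP (ORDER BY mileage_km) AS median_mileage_km,
    COUNT(*) AS sample_count
FROM clipped
GROUP BY cohort_key;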
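The IQR pass is easy to check by hand on a small cohort. The sketch below uses Python's `statistics.quantiles` with `method="inclusive"`, which interpolates linearly between data points in the same way a continuous percentile does. The prices and the minimum cohort size of four are invented for illustration.

```python
from statistics import median, quantiles


def iqr_clip(prices: list[int], k: float = 1.5) -> list[int]:
    """Keep prices inside [Q1 - k*IQR, Q3 + k*IQR]; tiny cohorts pass through unchanged."""
    if len(prices) < 4:          # assumed minimum; too few points for stable quartiles
        return list(prices)
    q1, _, q3 = quantiles(prices, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [p for p in prices if low <= p <= high]


# Six ordinary listings plus two unrealistic ones.
prices_czk = [200_000, 210_000, 220_000, 230_000, 240_000, 250_000, 900_000, 950_000]
kept = iqr_clip(prices_czk)

print(median(prices_czk))  # 235000.0, pulled upward by the two outliers
print(median(kept))        # 225000.0, the median of the six ordinary listings
```

For this sample, Q1 is 217,500 and Q3 is 412,500, so the upper fence sits at 705,000 and both extreme listings are dropped before the median is taken.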
Deal scoring: regression first, cohort fallback.
carBot's scoring engine combines guard rails with multiple estimation paths. Duplicates are not scored. Listings missing required identity fields are separated from scoreable records. Then the engine tries a 3D regression model using mileage, age, and power where enough samples exist; falls back to a 2D regression model using mileage and age; and finally falls back to the cohort median when regression is unavailable or fails quality gates.
This matters because two cars in the same broad cohort can have very different expected prices. Mileage, age, power band, seller type, and damage language all influence whether a listing is genuinely interesting or merely cheap for a reason.
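The regression-then-fallback order can be sketched with an ordinary least-squares fit and two quality gates. This is an assumption-heavy illustration: the article does not say which regression method, sample minimum, or quality metric carBot uses, so plain OLS, `MIN_SAMPLES`, and an R² floor stand in for them. Mileage and price are expressed in thousands to keep the arithmetic well conditioned.

```python
from statistics import mean

MIN_SAMPLES = 6   # hypothetical sample gate
MIN_R2 = 0.5      # hypothetical quality gate


def solve(a, b):
    """Solve a small linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return None                      # singular: features are collinear
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit(rows, prices):
    """Return OLS coefficients (intercept first), or None if a gate fails."""
    if len(rows) < MIN_SAMPLES:
        return None
    X = [[1.0, *row] for row in rows]
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(X, prices)) for i in range(k)]
    beta = solve(xtx, xty)
    if beta is None:
        return None
    avg = mean(prices)
    ss_res = sum((y - sum(b * v for b, v in zip(beta, x))) ** 2 for x, y in zip(X, prices))
    ss_tot = sum((y - avg) ** 2 for y in prices)
    if ss_tot == 0 or 1 - ss_res / ss_tot < MIN_R2:
        return None
    return beta


def estimate(listing, samples, cohort_median):
    """Try 3D, then 2D regression, then the cohort median; return (method, price) or None."""
    for method, keys in (("regression_3d", ("mileage", "age", "power")),
                         ("regression_2d", ("mileage", "age"))):
        if any(listing.get(key) is None for key in keys):
            continue
        usable = [s for s in samples if all(s.get(key) is not None for key in keys)]
        beta = fit([tuple(s[key] for key in keys) for s in usable], [s["price"] for s in usable])
        if beta is not None:
            features = [1.0, *(listing[key] for key in keys)]
            return method, sum(b * v for b, v in zip(beta, features))
    return ("cohort_median", cohort_median) if cohort_median is not None else None
```

The method name travels with the estimate, so the stored score details can record which path produced the expected price.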
The scoring path has three moving parts:
- Regression: predicts expected price from mileage, age, and optionally power when the cohort has enough samples and the model quality is acceptable.
- Cohort fallback: uses median market price and median mileage when regression is unavailable, weak, or too sparse.
- Damage penalty: applies a penalty when non-negated damage keywords are detected, and persists damage flags to avoid repeated expensive scans.
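Detecting "non-negated" damage keywords could look something like the sketch below. The keyword list, the negation words, the three-token window, and the penalty size are all hypothetical, and English terms are used purely for readability; a real catalog for this market would need Czech vocabulary and its own negation rules. The dictionary cache stands in for persisted damage flags.

```python
import re

# Illustrative vocabulary only; not carBot's catalog.
DAMAGE_TERMS = ("damaged", "accident", "crashed", "flood")
NEGATIONS = ("no", "not", "never", "without")
WINDOW = 3               # how many preceding tokens may negate a keyword
DAMAGE_PENALTY = 15.0    # hypothetical penalty in score points


def damage_flags(description: str) -> list[str]:
    """Return damage keywords that are not preceded by a nearby negation."""
    tokens = re.findall(r"[a-z]+", description.lower())
    flags = []
    for index, token in enumerate(tokens):
        if token in DAMAGE_TERMS:
            preceding = tokens[max(0, index - WINDOW):index]
            if not any(word in NEGATIONS for word in preceding):
                flags.append(token)
    return flags


def damage_penalty(listing_id: int, description: str, cache: dict[int, list[str]]) -> float:
    """Scan a description once per listing and reuse the stored flags afterwards."""
    if listing_id not in cache:
        cache[listing_id] = damage_flags(description)
    return DAMAGE_PENALTY if cache[listing_id] else 0.0
```

With this rule, "Never crashed, no accident history" produces no flags, while "Car was crashed in 2020, repaired" is flagged. A token window is a crude heuristic and will misread some sentences, which is one reason to keep the flags inspectable.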
# conceptual scoring order
def score_listing(listing, cohorts, regression_2d, regression_3d):
    if listing.duplicate_of_id:
        return unscored("duplicate")
    if missing_identity_fields(listing):
        return unscored("missing_fields")
    if listing.price_czk is None:
        return unscored("no_price")
    estimate = try_regression_3d(listing, regression_3d)
    estimate = estimate or try_regression_2d(listing, regression_2d)
    estimate = estimate or lookup_cohort_median(listing, cohorts)
    if estimate is None:
        return unscored("no_cohort")
    deal_pct = (estimate.expected_price - listing.price_czk) / estimate.expected_price * 100
    score = clamp(50 + deal_pct * 2.5 - damage_penalty(listing), 0, 100)
    return scored(score, method=estimate.method, deal_pct=deal_pct)
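A few worked numbers make the conceptual formula easier to read. The sketch below reimplements only the arithmetic from the scoring order above; the prices and the penalty value are invented, and the guard against a non-positive expected price is an addition for the example.

```python
def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def deal_score(expected_price: float, asking_price: float, damage_penalty: float = 0.0):
    """Return (deal_pct, score) using the conceptual formula from the text."""
    if expected_price <= 0:
        raise ValueError("expected_price must be positive")
    deal_pct = (expected_price - asking_price) / expected_price * 100
    score = clamp(50 + deal_pct * 2.5 - damage_penalty, 0, 100)
    return deal_pct, score


print(deal_score(400_000, 400_000))        # (0.0, 50.0): priced exactly at the estimate
print(deal_score(400_000, 350_000))        # (12.5, 81.25): 12.5% below the estimate
print(deal_score(400_000, 300_000))        # (25.0, 100): 112.5 clamped to the ceiling
print(deal_score(400_000, 500_000))        # (-25.0, 0): -12.5 clamped to the floor
print(deal_score(400_000, 350_000, 20.0))  # (12.5, 61.25): same car with a damage penalty
```

Read this way, a score of 50 means "priced at the estimate", and with no penalty the scale saturates once a listing is 20% below or above its expected price.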
Operations: the runner is part of the product.
Unattended data systems fail in boring ways: one source slows down, a parser starts returning nulls, a market recompute overlaps another expensive job, or a notification loop sends duplicate alerts. carBot treats runtime control as a first-class subsystem. Each source runs in its own async loop with independent state. Post-pipeline work such as market recompute, scoring, alert matching, and VIN profile computation is serialized behind a lock, with cooldown logic to avoid piling up redundant work.
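The runner behaviour described above (independent source loops, a serialized post-pipeline, and a cooldown) maps naturally onto `asyncio`. The sketch below is an assumption about shape, not carBot's runner: the class and function names, the status dictionary, and the 60-second cooldown are hypothetical, and the fetch and recompute steps are stand-ins.

```python
import asyncio
import time


class PostPipeline:
    """Serializes expensive recompute work and skips runs that fall inside a cooldown."""

    def __init__(self, cooldown_s: float):
        self._lock = asyncio.Lock()
        self._cooldown_s = cooldown_s
        self._last_run = float("-inf")
        self.runs = 0

    async def maybe_run(self) -> bool:
        async with self._lock:                      # only one recompute at a time
            if time.monotonic() - self._last_run < self._cooldown_s:
                return False                        # a recent run already covered this work
            await self._recompute()
            self._last_run = time.monotonic()
            self.runs += 1
            return True

    async def _recompute(self) -> None:
        await asyncio.sleep(0.01)  # stand-in for market recompute, scoring, alert matching


async def source_loop(name, fetch, post, cycles, status):
    """Run one source independently; a failure is recorded, not propagated."""
    for _ in range(cycles):
        try:
            await fetch()
            status[name] = "ok"
        except Exception as exc:
            status[name] = f"error: {exc}"
        await post.maybe_run()


async def main():
    async def healthy():
        await asyncio.sleep(0)

    async def broken():
        raise RuntimeError("timeout")

    post = PostPipeline(cooldown_s=60)
    status = {}
    await asyncio.gather(
        source_loop("source_a", healthy, post, 3, status),
        source_loop("source_b", broken, post, 3, status),
    )
    return status, post.runs


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In this run the failing source never stops the healthy one, and six completed cycles trigger a single recompute because every later request lands inside the cooldown.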
The FastAPI layer exposes health, source status, listings, low-confidence records, deals, price history, market stats, sold stats, VIN history, auctions, runner control, watch filters, and alert settings. The Next.js frontend turns those endpoints into an operations surface: browse listings, inspect low-confidence parser output, review price changes, monitor auctions, inspect VIN flags, control runner state, and tune alert filters.
VIN, SPZ, and identity graph enrichment.
Vehicles are not just rows. The same car can move between sources, change price, appear in an auction, disappear, return later, or be sold and relisted. carBot uses VIN-aware profiles and event history to connect these appearances. A separate SPZ subsystem performs image-based license plate detection and OCR on listing photos, then validates possible plate strings by country format and confidence.
The public takeaway is the architecture, not the private data. VINs, plates, raw images, and registry results are not publishable artifacts. What matters technically is the separation of concerns: detection, OCR, validation, confidence scoring, result storage, and analyst-facing inspection are independent enough to be improved without rewriting the core listing pipeline.
# conceptual enrichment flow
for listing in listings_without_identity_context:
    vin_profile = lookup_or_compute_vin_profile(listing.vin)
    plate_candidates = detect_plates_from_public_listing_images(listing.image_refs)
    enriched = merge_context(
        listing=listing,
        vin_profile=vin_profile,
        plate_candidates=validated_candidates(plate_candidates),
    )
    store_enrichment(enriched, redact_sensitive_outputs=True)
Alerting: matching is pure, delivery is deduplicated.
The alert engine is split into pure matching logic and side-effecting delivery. Watch filters define desired make/model/trim/fuel/year/mileage/price/score constraints. Matching runs in memory over candidate listings. Delivery is deduplicated in the database by listing, filter, and event type, so a new match does not repeatedly notify unless there is a new event such as a later price drop.
This structure makes alerting testable. The filter matcher can be validated without Telegram, network access, or a live runner. The delivery layer can then focus on formatting, rate-limit handling, retries, and recording successful sends.
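The split between pure matching and deduplicated delivery can be sketched as follows. The `WatchFilter` fields are a subset chosen for brevity, the listing is a plain dictionary, and an in-memory SQLite table with a composite primary key stands in for the production database; all of these are assumptions for the example.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WatchFilter:
    id: int
    make: Optional[str] = None
    max_price_czk: Optional[int] = None
    min_score: Optional[float] = None


def matches(watch: WatchFilter, listing: dict) -> bool:
    """Pure predicate: no I/O, so it can be tested without a runner or network."""
    if watch.make is not None and listing.get("make") != watch.make:
        return False
    if watch.max_price_czk is not None:
        price = listing.get("price_czk")
        if price is None or price > watch.max_price_czk:
            return False
    if watch.min_score is not None:
        score = listing.get("deal_score")
        if score is None or score < watch.min_score:
            return False
    return True


def open_ledger() -> sqlite3.Connection:
    """Create the delivery ledger; the primary key enforces one send per event."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE sent_alerts ("
        "listing_id INTEGER, filter_id INTEGER, event_type TEXT, "
        "PRIMARY KEY (listing_id, filter_id, event_type))"
    )
    return conn


def claim_delivery(conn: sqlite3.Connection, listing_id: int, filter_id: int, event_type: str) -> bool:
    """Return True only for the first claim of a (listing, filter, event type) triple."""
    cursor = conn.execute(
        "INSERT OR IGNORE INTO sent_alerts VALUES (?, ?, ?)",
        (listing_id, filter_id, event_type),
    )
    conn.commit()
    return cursor.rowcount == 1
```

A delivery worker would call `claim_delivery` before sending and skip the message when it returns `False`. Because the uniqueness rule lives in the database rather than in process memory, a restarted runner does not resend alerts it already delivered.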
What carBot is useful for.
- Deal discovery: rank listings by expected market value instead of raw price alone.
- Price-drop monitoring: detect meaningful changes and send alerts to configured watch filters.
- Market research: inspect comparable cohorts, median prices, trim distributions, and sold-listing behavior.
- Parser quality work: surface low-confidence records so extraction rules can be improved with evidence.
- Auction monitoring: track estimated price, current bid, hammer price, end time, and related listing context.
- Vehicle identity analysis: connect duplicate or recurring vehicles through VIN-aware history and enrichment caches.
- Operational intelligence: monitor source health, runner status, cycle metrics, error rates, and pipeline freshness.
This article intentionally does not publish credentials, source-specific request recipes, private endpoints, raw listing dumps, phone numbers, VIN/SPZ examples, anti-abuse bypass details, production screenshots, or internal security deliverables. The goal is to document the engineering model and capabilities without exposing sensitive implementation material.
Why this is interesting.
carBot is useful because it sits between two extremes. A spreadsheet is too weak: it cannot maintain live market baselines, parser confidence, identity history, alert deduplication, or analyst workflows. A generic web scraper is also too weak: it can fetch pages, but it does not understand what makes a car comparable, when a deal score is invalid, or why a parser claim should be trusted.
The system works because it treats automotive data as a domain model, not as page text. The core loop is simple but disciplined: acquire evidence, normalize carefully, persist state, compute comparable baselines, score with fallbacks, expose uncertainty, and make the workflow inspectable. That is the difference between collecting listings and building a decision system.
Future outlook.
The next useful direction is deeper feedback between analyst review and parser behavior. Low-confidence queues already identify weak extraction cases; the natural extension is a correction loop that turns review outcomes into regression tests, catalog updates, and measurable coverage improvements. On the market side, richer trim and option modeling can make expected-price estimates more sensitive without overfitting small cohorts.
Longer term, carBot points toward a broader pattern for vertical intelligence systems: domain-specific ingestion, confidence-aware normalization, explainable scoring, enrichment under strict privacy boundaries, and an operations surface that makes the automation accountable.