Agent²
“Voice-to-new-home-in-under-90-seconds”
2026
Demo
If you want a home between $500k-$600k on Zillow, there are 100+ pages to scroll through. We automated that. Agent² takes a spoken request over a phone call, extracts structured search criteria, scrapes and ranks Zillow listings, and submits a contact form on the top match. End to end in under 90 seconds. Form submission succeeds 94% of the time across 200+ test runs. The judge at GenAI Genesis 2026 called it one of the most technically impressive projects he'd seen.
Home buyers spend 10+ hours a week manually browsing listings and filling out contact forms. The whole workflow is automatable from a single spoken description. Why aren't we doing that yet?
Stack
What was hard
Dual-Stage Noise Suppression on Telephony Audio
Raw 8kHz telephony audio from Telnyx contains HVAC hum, road noise, and GSM compression artifacts that dragged transcription accuracy down to ~71% (a word error rate of ~29%).
Two-stage pipeline: SpeexDSP for adaptive echo cancellation and AGC as the first pass, then RNNoise for neural residual noise removal on what SpeexDSP leaves behind.
Result: Word error rate dropped from ~29% to 8.3% across 150 test calls recorded in cars, kitchens, and outdoors.
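For reference, word error rate is conventionally the word-level edit distance (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The write-up does not say which tool measured it, so the function below is only a minimal sketch of that standard definition; the lowercasing and whitespace tokenization are assumptions, not the project's actual evaluation script.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("bedrooms" -> "bedroom") and one deletion ("toronto")
# against a five-word reference.
wer = word_error_rate(
    "three bedrooms near downtown toronto",
    "three bedroom near downtown",
)
```

Because the denominator is the reference length, WER can exceed 100% when the transcript contains many inserted words.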
Structured Criteria Extraction from Conversational Speech
People describe homes the way they think out loud: "maybe three or four bedrooms, not too expensive." Contradictory, underspecified. A naive extraction produces garbage search criteria.
I engineered a constrained extraction prompt that maps freeform speech to a typed schema (bedrooms, price range, location, must-haves, dealbreakers). A real-time validation pass flags ambiguity back to the caller before anything hits the search pipeline.
Result: 96.2% extraction accuracy on a 200-utterance test set. Validation pass caught 89% of remaining edge cases.
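A rough sketch of what the typed schema and validation pass could look like. The field groups (bedrooms, price range, location, must-haves, dealbreakers) come from the description above; the exact field names, the `find_ambiguities` helper, and the wording of the follow-up questions are all hypothetical, and the real prompt and validation rules are not shown here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SearchCriteria:
    # Hypothetical field names; None means the caller never specified it.
    bedrooms_min: Optional[int] = None
    bedrooms_max: Optional[int] = None
    price_min: Optional[int] = None
    price_max: Optional[int] = None
    location: Optional[str] = None
    must_haves: list[str] = field(default_factory=list)
    dealbreakers: list[str] = field(default_factory=list)


def find_ambiguities(c: SearchCriteria) -> list[str]:
    """Return follow-up questions to ask the caller before searching."""
    questions = []
    if c.location is None:
        questions.append("Which city or neighbourhood should I search in?")
    if c.price_max is None:
        questions.append("What is the most you want to spend?")
    elif c.price_min is not None and c.price_min > c.price_max:
        questions.append("Your minimum price is above your maximum. Which is right?")
    if c.bedrooms_min is None and c.bedrooms_max is None:
        questions.append("How many bedrooms do you need?")
    # A feature cannot be both required and forbidden.
    required = {m.lower() for m in c.must_haves}
    forbidden = {d.lower() for d in c.dealbreakers}
    for feature in sorted(required & forbidden):
        questions.append(
            f"You listed {feature} as both a must-have and a dealbreaker. Which is it?"
        )
    return questions


# "maybe three or four bedrooms, not too expensive": the bedroom range is
# usable, but price and location are still unspecified.
criteria = SearchCriteria(bedrooms_min=3, bedrooms_max=4)
questions = find_ambiguities(criteria)
```

Here `questions` holds two follow-ups (location and maximum price), which would be asked on the call before the criteria reach the search pipeline.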
Anti-Bot Evasion for Automated Form Submission
Zillow's contact forms use fingerprinting, rate limiting, and behavioral heuristics. Default Playwright automation gets through about 12% of the time.
Human-like interaction patterns throughout: randomized mouse trajectories, gaussian-distributed keystroke timing, viewport-realistic scroll behavior, and residential proxy rotation through ScraperAPI.
Result: 94% submission success rate across 200+ test runs. Zero account bans.
Architecture
How it works
Audio Processing
Telnyx SIP trunk routes inbound calls to the EC2 instance. Raw 8kHz audio passes through SpeexDSP (adaptive echo cancellation + AGC) then RNNoise (neural residual noise removal). Two-stage approach cuts word error rate from ~29% to 8.3% across real-world recording conditions.
Criteria Extraction
PersonaPlex (4-bit quantized on EC2 GPU) runs real-time voice interaction during the call. FastAPI backend applies a constrained extraction prompt to map the cleaned transcript to a typed schema. A real-time validation pass flags ambiguity back to the caller before the criteria touch the search pipeline. 96.2% extraction accuracy on a 200-utterance test set.
Scraping and Automation
Validated criteria feed into ScraperAPI to query Zillow. Results ranked by a weighted scoring function across price delta, commute distance, and feature match. Top listing triggers Playwright with the full anti-detection stack: randomized mouse trajectories, gaussian keystroke timing, viewport-realistic scrolling, and residential proxy rotation. 94% submission success across 200+ test runs.
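A minimal sketch of a weighted scoring function over the three signals named above (price delta, commute distance, feature match). The weights, the normalization of each term to the 0..1 range, and the `Listing` fields are illustrative assumptions; the project's actual weights and formula are not given.

```python
from dataclasses import dataclass


@dataclass
class Listing:
    # Hypothetical shape of a scraped listing.
    address: str
    price: int
    commute_km: float
    features: set[str]


def score(listing: Listing, target_price: int, max_commute_km: float,
          wanted: set[str], weights=(0.5, 0.3, 0.2)) -> float:
    """Weighted sum of three terms, each normalized to the range 0..1."""
    w_price, w_commute, w_features = weights
    # 1.0 at the target price, falling linearly with relative distance from it.
    price_score = max(0.0, 1.0 - abs(listing.price - target_price) / target_price)
    # 1.0 for a zero-length commute, 0.0 at or beyond the acceptable maximum.
    commute_score = max(0.0, 1.0 - listing.commute_km / max_commute_km)
    # Fraction of requested features the listing actually has.
    feature_score = len(wanted & listing.features) / len(wanted) if wanted else 1.0
    return w_price * price_score + w_commute * commute_score + w_features * feature_score


def rank(listings, target_price, max_commute_km, wanted):
    """Best match first."""
    return sorted(
        listings,
        key=lambda l: score(l, target_price, max_commute_km, wanted),
        reverse=True,
    )


listings = [
    Listing("12 Far St", 600_000, 25.0, {"garage"}),
    Listing("8 Near Ave", 550_000, 10.0, {"garage", "yard"}),
]
ranked = rank(listings, target_price=550_000, max_commute_km=30.0,
              wanted={"garage", "yard"})
```

Normalizing each term before weighting keeps a large raw quantity such as price in dollars from swamping the other two signals.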