Production Notes

Speak Oculus

Behind the Scenes

2026

Synopsis

Speak Oculus came from wanting to practice French without the pressure of talking to native speakers. It gives you an AI-powered conversational partner for language practice.

The Problem

When you're learning a language, the hardest part is actually speaking it. Apps teach you vocab and grammar, but when it comes to real conversation, there's a huge gap. Talking to native speakers can feel intimidating, and most practice tools are just flashcards in disguise. I wanted something that feels like chatting with a patient friend who happens to speak the language.

Starring

OpenAI · React Native · WebSockets · Node.js

Production Challenges

Real-Time Conversational AI

Building low-latency conversational interactions over WebSockets meant carefully managing streaming responses and audio playback timing so it actually feels natural.
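To make that concrete, here is a minimal sketch of the kind of gapless chunk scheduler this implies: each incoming audio chunk is scheduled to start exactly where the previous one ends, and the first chunk is buffered briefly so later chunks have time to arrive. The class name, prebuffer policy, and constants are illustrative, not the app's actual code.

```typescript
// Assumed constants matching the design notes: 24kHz PCM16 in 40ms chunks.
const SAMPLE_RATE = 24_000;
const CHUNK_MS = 40;

class GaplessScheduler {
  private nextStartTime = 0; // seconds on the audio clock
  private started = false;

  // Returns the time (in seconds) at which this chunk should begin playing.
  schedule(chunkSamples: number, now: number): number {
    if (!this.started) {
      // Buffer the first chunk: start one chunk-length in the future so
      // the next chunk can arrive before this one finishes.
      this.nextStartTime = now + CHUNK_MS / 1000;
      this.started = true;
    }
    // Never schedule in the past; otherwise butt chunks end-to-end.
    const startAt = Math.max(now, this.nextStartTime);
    this.nextStartTime = startAt + chunkSamples / SAMPLE_RATE;
    return startAt;
  }

  // Call on interruption (e.g. barge-in) to drop the queue and re-buffer.
  reset(): void {
    this.started = false;
    this.nextStartTime = 0;
  }
}
```

On the client this start time would feed something like the Web Audio clock or a native player's scheduled-start API; the point is that back-to-back chunks share a boundary instead of each starting "as soon as possible."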

Mobile Audio Pipeline

Dealing with microphone input, speech-to-text, and text-to-speech in React Native across iOS and Android surfaced all kinds of platform-specific audio quirks that each needed their own workaround.

Natural Conversation Flow

Making the AI feel like a real conversation partner instead of a chatbot took a lot of prompt tuning for natural turn-taking, corrections, and encouragement.

Production Design

The system uses a three-tier WebSocket architecture: a React Native mobile client (Expo SDK 50+) connects over WebSocket to a lightweight Node.js relay server on AWS EC2 (us-east-1), which maintains a persistent WebSocket connection to the OpenAI Realtime API (gpt-realtime-mini). The relay exists for two reasons: the API key never touches the client, and co-locating with OpenAI's servers in us-east-1 saves around 200ms of round-trip latency. We chose a custom relay over a managed service like API Gateway WebSocket because we needed full control over binary audio frame routing; API Gateway adds overhead and complexity for raw PCM streaming that a simple Node.js server handles natively. The tradeoff is that we manage our own uptime, but for a real-time audio app the latency savings were non-negotiable.

Audio flows as 24kHz PCM16 in 40ms chunks, with a gapless playback scheduling algorithm that buffers the first chunk before starting to eliminate jitter gaps. We went with 40ms chunks instead of larger buffers because smaller chunks give tighter conversational turn-taking at the cost of more WebSocket messages. Testing showed that anything above 60ms felt laggy for natural conversation, so 40ms was the sweet spot between responsiveness and overhead.

The client implements optimistic barge-in (around 120ms) using hardware echo cancellation (Android's VOICE_COMMUNICATION audioSource). Three consecutive 40ms frames exceeding a speech threshold trigger an instant playback stop, which is 200 to 500ms faster than waiting for server-side VAD. We deliberately chose client-side barge-in detection over relying on OpenAI's server-side VAD because the round-trip delay made server-side interruption feel unnatural. The tradeoff is occasional false positives from background noise, but tuning the speech threshold and requiring three consecutive frames keeps those rare.
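The barge-in rule above boils down to a small state machine. Here is a sketch of that check, three consecutive 40ms frames above an RMS speech threshold fire a stop; the threshold value and class name are illustrative, not the shipped implementation.

```typescript
// Sketch of optimistic barge-in: 3 consecutive loud 40ms frames ≈ 120ms.
class BargeInDetector {
  private consecutive = 0;

  constructor(
    private readonly threshold: number, // RMS speech threshold (0..1), illustrative
    private readonly framesNeeded = 3,  // 3 x 40ms frames
  ) {}

  // Feed one mic frame (normalized float samples). Returns true when the
  // user is judged to be speaking and AI playback should stop immediately.
  feed(frame: Float32Array): boolean {
    let sum = 0;
    for (const s of frame) sum += s * s;
    const rms = Math.sqrt(sum / frame.length);

    // Requiring consecutive loud frames filters out one-off noise spikes.
    this.consecutive = rms > this.threshold ? this.consecutive + 1 : 0;
    return this.consecutive >= this.framesNeeded;
  }
}
```

The consecutive-frame counter is what keeps background-noise false positives rare: a single loud frame resets to zero as soon as a quiet one follows.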
Computer vision uses accelerometer-based stability detection (a 1.2s hold threshold) to auto-capture frames, which are cropped and compressed to 384x384 JPEG (roughly 25KB) and injected into OpenAI's multimodal format through the relay. We picked accelerometer-based capture over a manual shutter button because hands-free operation matters when you are mid-conversation, and the stability threshold prevents blurry or accidental captures.

Vocabulary tracking runs as a tool-call loop: when the user drops an English word mid-sentence, the AI logs it as a "gap word" and naturally recasts it, following SLA research principles (one correction per turn, three-turn cooldown).

State persists client-side via AsyncStorage for call history, tutor agents, and gap words across sessions. We chose client-side persistence over a cloud database because the app works offline-first and we wanted users to own their data without creating accounts; the tradeoff is that progress does not sync across devices.
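The correction pacing rule (one correction per turn, three-turn cooldown) is simple enough to sketch directly. This gate is illustrative, the class and method names are not the app's real API, but it shows the intended rhythm: a correction fires, then the next three turns let gap words pass uncorrected.

```typescript
// Sketch of the SLA-inspired correction pacing: at most one correction
// per turn, followed by a three-turn cooldown. Names are hypothetical.
class CorrectionGate {
  private cooldownLeft = 0;

  constructor(private readonly cooldownTurns = 3) {}

  // Called once per conversational turn. Returns true if the tutor may
  // issue a correction (a recast of a logged gap word) this turn.
  nextTurn(hasGapWord: boolean): boolean {
    if (this.cooldownLeft > 0) {
      this.cooldownLeft--; // still cooling down; let it slide
      return false;
    }
    if (hasGapWord) {
      this.cooldownLeft = this.cooldownTurns;
      return true;
    }
    return false;
  }
}
```

In the real app this decision lives in the prompt and tool-call loop rather than client code, but the cadence is the same: corrections are rationed so the conversation never turns into a grammar drill.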