This application demonstrates OpenAI Realtime API usage on an ESP32-S3 device with a 5-inch HMI LCD panel. It provides a graphical user interface (GUI) for configuring WiFi settings and entering your OpenAI API key, then establishes a WebRTC connection with the OpenAI Realtime API. Audio input is sent to the model, which returns text responses and a transcription of the audio.
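For orientation, the signaling half of that WebRTC setup amounts to posting a locally generated SDP offer over HTTPS and reading back the SDP answer. The sketch below shows how such an exchange might look with ESP-IDF's `esp_http_client`; the endpoint URL, model name, and single-step exchange (no ephemeral session token) are assumptions, not this project's actual code.

```c
// Hedged sketch: SDP offer/answer exchange with the Realtime API over HTTPS.
// The URL, model, and direct use of the API key are assumptions.
#include <stdio.h>
#include <string.h>
#include "esp_http_client.h"
#include "esp_crt_bundle.h"   // requires CONFIG_MBEDTLS_CERTIFICATE_BUNDLE

// Posts `offer` (local SDP) and fills `answer` with the remote SDP on success.
static esp_err_t exchange_sdp(const char *api_key, const char *offer,
                              char *answer, size_t answer_len)
{
    esp_http_client_config_t cfg = {
        .url = "https://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview", // assumed endpoint
        .method = HTTP_METHOD_POST,
        .crt_bundle_attach = esp_crt_bundle_attach,  // TLS via the built-in certificate bundle
        .buffer_size = 4096,
    };
    esp_http_client_handle_t client = esp_http_client_init(&cfg);

    char auth[256];
    snprintf(auth, sizeof(auth), "Bearer %s", api_key);
    esp_http_client_set_header(client, "Authorization", auth);
    esp_http_client_set_header(client, "Content-Type", "application/sdp");

    esp_err_t err = esp_http_client_open(client, strlen(offer));
    if (err == ESP_OK) {
        esp_http_client_write(client, offer, strlen(offer));
        esp_http_client_fetch_headers(client);
        int read = esp_http_client_read_response(client, answer, (int)answer_len - 1);
        answer[read > 0 ? read : 0] = '\0';          // SDP answer used to complete the peer connection
    }
    esp_http_client_cleanup(client);
    return err;
}
```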
- Embedded Device Focus: Designed for ELECROW CrowPanel Advance 5.0-HMI. For detailed device hardware information, see Device Hardware Documentation.
- Real-time Communication: Establishes a WebRTC connection with OpenAI Realtime API.
- Voice Interaction: Transcribes audio input and displays the model’s text responses.
- OpenAI Responses API: When the mic is toggled off, the transcription of the captured/streamed audio is sent to the OpenAI Responses API for final processing.
- User-friendly GUI: Built using LVGL 8.4.
- Session Persistence: WiFi settings and session configurations are saved in non-volatile storage (NVS); see the sketch after this list.
- Easy Build & Flash: Build from source using ESP-IDF v5.4 or flash prebuilt images.
- LLM Function Calling: Map natural language requests into robot control functions (movement, speed, headlights, music) using OpenAI API’s function calling.
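The Session Persistence feature above can be implemented with the standard ESP-IDF NVS API. A minimal sketch follows, using a placeholder namespace and key names rather than this project's actual schema.

```c
// Minimal NVS persistence sketch (placeholder namespace/keys, not the app's actual schema).
// Assumes nvs_flash_init() has already been called during startup.
#include <stdbool.h>
#include <stddef.h>
#include "nvs_flash.h"
#include "nvs.h"

static void save_wifi_credentials(const char *ssid, const char *password)
{
    nvs_handle_t handle;
    if (nvs_open("storage", NVS_READWRITE, &handle) == ESP_OK) {  // "storage" is a placeholder namespace
        nvs_set_str(handle, "wifi_ssid", ssid);
        nvs_set_str(handle, "wifi_pass", password);
        nvs_commit(handle);                                       // flush the writes to flash
        nvs_close(handle);
    }
}

static bool load_wifi_ssid(char *out, size_t out_len)
{
    nvs_handle_t handle;
    if (nvs_open("storage", NVS_READONLY, &handle) != ESP_OK) {
        return false;
    }
    esp_err_t err = nvs_get_str(handle, "wifi_ssid", out, &out_len);  // out_len is updated in place
    nvs_close(handle);
    return err == ESP_OK;
}
```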
- Install ESP-IDF framework v5.4.
- Clone the repository.
- Dependencies are installed via the framework component manager (see `idf_component.yml`).
- Build and flash using the following commands:

  ```
  idf.py build
  idf.py -p PORT flash
  ```
- Use `flash_tool.exe` to flash the prebuilt images.
- WiFi Setup: Navigate to the WiFi tab and enter your SSID and password.
- Authentication: Go to the Auth tab and input your OpenAI API key (non-free tier account required).
- Mic Control & Realtime Communication:
  - Tap the on-screen mic button to start and stop audio capture.
  - While the mic is on, audio is streamed to the OpenAI Realtime API for live transcription.
  - When you tap the mic off, the complete audio request is sent to the OpenAI Responses API for final processing.
  - Transcriptions, final responses, and any invoked functions are displayed in the terminal (see the mic-toggle sketch after this list).
- Function Calling & Supported Commands: Users can speak naturally (exact phrasing isn’t required) and the model will map intents into robot actions. Example requests include:
  - Movement:
    - Direct: “move forward”, “turn right”
    - Indirect: “go ahead a bit”, “spin to the left”
  - Speed Adjustment:
    - Direct: “go faster”, “go slower”
    - Indirect: “speed up”, “take it easy on the throttle”
  - Headlights:
    - Direct: “turn headlights on”, “turn headlights off”
    - Indirect: “it’s too dark here”, “lights, please”
  - Audio:
    - Direct: “play music”
    - Indirect: “start some tunes”, “let’s have some background music”

  Internally, these map to the functions `control_robot_movement(direction)`, `change_robot_speed(speed)`, `robot_headlights(headlights_state)`, and `play_music()`. Any unsupported or invalid request triggers `reject_request()` (see the dispatcher sketch after this list).
- Wireless Module: A wireless module is used to send control commands to the robot.
- Session Controls: Use the terminal to clear the screen or disconnect and stop communication.
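To make the mic flow above concrete, here is an illustrative toggle handler. Every function it calls is a hypothetical placeholder for this project's audio and HTTP plumbing, not its real API.

```c
// Illustrative mic-toggle flow; all called functions are hypothetical placeholders.
#include <stdbool.h>

extern void start_audio_capture(void);                // placeholder: begin streaming mic audio
extern void stop_audio_capture(void);                 // placeholder: stop the stream
extern const char *get_live_transcript(void);         // placeholder: transcription from the Realtime API
extern void send_to_responses_api(const char *text);  // placeholder: final processing request

static bool s_mic_on = false;

// Called when the on-screen mic button is tapped.
void on_mic_toggled(void)
{
    if (!s_mic_on) {
        start_audio_capture();                          // mic on: stream audio for live transcription
    } else {
        stop_audio_capture();                           // mic off: stop streaming...
        send_to_responses_api(get_live_transcript());   // ...and send the request to the Responses API
    }
    s_mic_on = !s_mic_on;
}
```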
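And for the function-calling step, a possible dispatch routine using cJSON (bundled with ESP-IDF) is sketched below. The function names are the ones listed above; the JSON field names, string-typed parameters, and overall structure are assumptions.

```c
// Sketch of dispatching a model function call; the JSON shape and parameter types are assumed.
#include <string.h>
#include "cJSON.h"

extern void control_robot_movement(const char *direction);
extern void change_robot_speed(const char *speed);
extern void robot_headlights(const char *headlights_state);
extern void play_music(void);
extern void reject_request(void);

void dispatch_function_call(const char *name, const char *arguments_json)
{
    cJSON *args = cJSON_Parse(arguments_json);  // e.g. {"direction": "forward"}

    if (strcmp(name, "control_robot_movement") == 0) {
        control_robot_movement(cJSON_GetStringValue(cJSON_GetObjectItem(args, "direction")));
    } else if (strcmp(name, "change_robot_speed") == 0) {
        change_robot_speed(cJSON_GetStringValue(cJSON_GetObjectItem(args, "speed")));
    } else if (strcmp(name, "robot_headlights") == 0) {
        robot_headlights(cJSON_GetStringValue(cJSON_GetObjectItem(args, "headlights_state")));
    } else if (strcmp(name, "play_music") == 0) {
        play_music();
    } else {
        reject_request();  // anything unsupported or invalid is rejected
    }

    cJSON_Delete(args);
}
```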
- ESP-IDF Components: All dependencies are listed in the `idf_component.yml` file and are downloaded automatically.
- LVGL 8.4: Used for the user interface.
- ESP WebRTC Examples: Heavily inspired by Espressif's WebRTC Solution.