# From Pixels to Molecules: A Video-to-Scent Translation Pipeline Powered by Multimodal LLMs
VLM2SMELL is a pioneering research framework designed to bridge the gap between Visual Perception and Olfactory Inference. By leveraging the advanced sequence understanding capabilities of modern Visual Language Models (specifically Gemini 2.5 Flash), this system transforms raw video footage into structured chemical and sensory olfactory profiles.
Unlike traditional frame-by-frame analysis, VLM2SMELL treats video as a continuous temporal sequence, allowing it to:
- Extract Visual Ground Truth: Identify objects, states, and thermal cues over time.
- Infer Olfactory Events: Detect activities that release scent (e.g., slicing lemon, frying steak, rain on asphalt).
- Map to Chemistry: Translate semantic descriptions into specific odorant molecules (e.g., Limonene, Maillard reaction products, Geosmin).
This project is tailored for HCI researchers and developers exploring multi-sensory digital experiences, immersive environments, and digital scent technologies.
- Sequence-First Architecture: Analyzes full video sequences to capture temporal context (e.g., understanding the transition from "whole onion" $\to$ "chopped onion" $\to$ "sautéed onion").
- Visual-Olfactory Separation (VOS): Strictly enforces a two-step logic (Visual Evidence $\to$ Chemical Inference) to minimize hallucinations and ground predictions in visual facts.
- Chemical Mapping: Outputs structured data including Scent Category, Descriptors, Intensity, and Candidate Molecules.
- Ground Truth Preservation: Automatically extracts and saves indexed frames for manual verification against the generated JSON report.
- Gemini 2.5 Flash Integration: Utilizes Google's latest efficient multimodal model for fast and accurate long-context understanding.
This version introduces a rigorous Scientific Protocol to ensure reproducibility, quantifiability, and physical consistency.
Instead of vague descriptions, the visual engine now outputs precise metrics:
- `proximity`: Categorical distance (near/mid/far).
- `frame_coverage`: Estimated fraction of screen area occupied (0.0–1.0).
- `activity_level`: Standardized motion intensity (low/medium/high).
- `proximity_trend`: Dynamic state (approaching/receding/stable).
Olfactory intensity is no longer guessed but calculated based on first principles: $$ \text{Intensity} \approx (\text{Base Volatility} \times \text{Activity}) \times \text{Frame Coverage} \times \text{Proximity Factor} $$
- Activity Multiplier: High activity (e.g., splashing) boosts volatility by up to 2.5x.
- Proximity Factor: Distance causes linear decay (e.g., Far = 0.2x multiplier).
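Putting the formula and multipliers together as a quick sketch (only the 2.5x and 0.2x values are given above; the low/medium/mid multipliers are illustrative assumptions):

```python
# Minimal sketch of the intensity formula above. Only the 2.5x (high activity)
# and 0.2x (far proximity) multipliers come from this README; the other values
# are illustrative assumptions.
ACTIVITY_MULT = {"low": 1.0, "medium": 1.5, "high": 2.5}
PROXIMITY_MULT = {"near": 1.0, "mid": 0.6, "far": 0.2}

def estimate_intensity(base_volatility: float, activity_level: str,
                       frame_coverage: float, proximity: str) -> float:
    """Intensity ~ (Base Volatility x Activity) x Frame Coverage x Proximity Factor."""
    return (base_volatility * ACTIVITY_MULT[activity_level]
            * frame_coverage * PROXIMITY_MULT[proximity])

# Lemon oil (high base volatility) being squeezed near the camera:
# 0.8 * 2.5 * 0.4 * 1.0 = 0.8
print(estimate_intensity(0.8, "high", 0.4, "near"))
```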
To prevent LLM "laziness", we enforce strict constraints:
- Max Interval Duration: No high-activity interval can exceed 4.0s. Long actions MUST be split to capture evolution.
- Zero Tolerance: Merging distinct spatial states (Near -> Far) into one interval is forbidden.
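These rules are mechanical enough to check automatically. A minimal sketch, assuming a hypothetical interval schema with `proximity_start`/`proximity_end` fields (not the project's actual schema):

```python
# Sketch of a QC check for the two constraints above. The interval field
# names are assumed for illustration, not taken from the project.
def check_interval(interval: dict) -> list[str]:
    problems = []
    duration = interval["end"] - interval["start"]
    # Rule 1: no high-activity interval may exceed 4.0 s.
    if interval["activity_level"] == "high" and duration > 4.0:
        problems.append(f"high-activity interval of {duration:.1f}s exceeds 4.0s")
    # Rule 2: distinct spatial states must never be merged into one interval.
    if interval["proximity_start"] != interval["proximity_end"]:
        problems.append("interval merges distinct proximity states")
    return problems

print(check_interval({"start": 0.0, "end": 6.0, "activity_level": "high",
                      "proximity_start": "near", "proximity_end": "far"}))
```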
VLM2SMELL features a powerful Dynamic Configuration System located in `config.json`. This allows you to toggle advanced physical simulations and switch the underlying AI model for each step independently.
```json
{
  "project_settings": {
    "target_fps": 4
  },
  "step1_visual_config": {
    "model_name": "gemini-2.5-flash",
    "detect_temperature_cues": true,   // Look for steam, boiling, frost
    "detect_airflow_indicators": true, // Look for wind, smoke direction
    "detect_humidity_visuals": false,
    "detect_spatial_context": true     // Indoor vs Outdoor
  },
  "step2_olfactory_config": {
    "model_name": "gemini-2.5-flash",
    "apply_thermodynamics": true,       // Heat = Higher Volatility
    "apply_aerodynamics": true,         // Wind = Dispersion
    "apply_hygrometry": false,
    "apply_spatial_concentration": true // Confined space = Accumulation
  }
}
```

(The `//` comments are annotations for this README; strict JSON does not allow them, so keep the actual `config.json` comment-free.)
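A minimal sketch of consuming this configuration from Python (key names match the example above; the loader itself is illustrative, not the project's actual code):

```python
# Read config.json and honor the toggles shown above. Key names match the
# example config; this loader is illustrative, not the project's actual code.
import json

with open("config.json", encoding="utf-8") as f:
    config = json.load(f)

fps = config["project_settings"]["target_fps"]
visual_cfg = config["step1_visual_config"]
olfactory_cfg = config["step2_olfactory_config"]

if visual_cfg["detect_temperature_cues"]:
    print("Step 1: watching for steam, boiling, and frost cues")
if olfactory_cfg["apply_thermodynamics"]:
    print("Step 2: heat raises volatility")
print(f"Models: {visual_cfg['model_name']} / {olfactory_cfg['model_name']} at {fps} FPS")
```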
The system prompts are now decoupled from the code for easier research iteration:

- `step1_visual.txt`: Instructions for the Visual Analysis engine.
- `step2_olfactory.txt`: Guidelines and rules for the Chemical Inference engine.
You can edit these text files directly to test new prompting strategies without modifying the Python code.
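Since the prompts are plain text, loading them is a one-line file read, e.g.:

```python
# The prompts are plain text files, so a strategy swap is just a file edit.
from pathlib import Path

step1_prompt = Path("step1_visual.txt").read_text(encoding="utf-8")
step2_prompt = Path("step2_olfactory.txt").read_text(encoding="utf-8")
```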
## Quick Start

Get up and running in minutes!

1. Clone & Install:

   ```bash
   git clone https://github.com/yourusername/VLM2SMELL.git
   cd VLM2SMELL
   pip install -r requirements.txt
   ```

2. Configure API Key: Create a `.env` file and add your Google Gemini API Key:

   ```
   GOOGLE_API_KEY=your_actual_api_key_here
   ```

3. Run Analysis:

   ```bash
   python3 main.py "test_videos/test video 5.mp4"
   ```

   The report will be generated in the `output_reports/` folder.
## Requirements

- Python 3.9 or higher
- A Google Cloud Project with the Gemini API enabled
- An API Key from Google AI Studio
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/VLM2SMELL.git
   cd VLM2SMELL
   ```

2. Install dependencies: It is recommended to use a virtual environment.

   ```bash
   python3 -m venv venv
   source venv/bin/activate  # On Windows: venv\Scripts\activate
   pip install -r requirements.txt
   ```

3. Environment Configuration: Create a `.env` file in the root directory:

   ```bash
   echo "GOOGLE_API_KEY=your_api_key_here" > .env
   ```
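At runtime, such a `.env` file is typically consumed like this (a sketch assuming the common `python-dotenv` package; check `requirements.txt` for the project's actual dependency):

```python
import os
from dotenv import load_dotenv  # assumes python-dotenv is installed

load_dotenv()  # reads .env from the current working directory
api_key = os.getenv("GOOGLE_API_KEY")
if not api_key:
    raise RuntimeError("GOOGLE_API_KEY not set; see the .env step above")
```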
## Usage

The core script `main.py` handles the entire pipeline: frame extraction, VLM inference, and report generation.
```bash
python3 main.py "path/to/video.mp4"
python3 main.py [video_path] [FPS] [options]
```

| Argument | Type | Default | Description |
|---|---|---|---|
| `video_path` | `str` | Required | Path to the input video file (e.g., `test_videos/test video 5.mp4`). |
| `FPS` | `int` | `4` | Frames per second to extract. Higher FPS = finer detail but higher API cost. |
| `--output` | `str` | `output_reports/` | Custom path for the output JSON file. |
| `--fps` | `int` | `4` | Alternative flag to specify FPS. |
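For illustration, a hypothetical `argparse` setup matching the table above (the real `main.py` may parse its arguments differently):

```python
import argparse

parser = argparse.ArgumentParser(description="VLM2SMELL video-to-scent analysis")
parser.add_argument("video_path", help="Path to the input video file")
parser.add_argument("FPS", type=int, nargs="?", default=4,
                    help="Frames per second to extract (positional form)")
parser.add_argument("--fps", type=int, help="Alternative flag to specify FPS")
parser.add_argument("--output", default="output_reports/",
                    help="Custom path for the output JSON file")
args = parser.parse_args()
fps = args.fps if args.fps is not None else args.FPS  # the flag wins over the positional
```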
Examples:

1. Standard Analysis (Recommended): Runs at 4 FPS and saves to `output_reports/`.

   ```bash
   python3 main.py "test_videos/test video 5.mp4"
   ```

2. High-Frequency Analysis (10 FPS): Better for fast-paced actions like chopping or rapid chemical reactions.

   ```bash
   python3 main.py "test_videos/test video 5.mp4" 10
   ```

3. Custom Output Filename:

   ```bash
   python3 main.py "test_videos/test video 5.mp4" --output "results/custom_report.json"
   ```

Under the new Scientific Protocol, you should verify the following in your JSON output:
- Temporal Resolution: High-intensity intervals (`activity_level: high`) should NOT exceed 4.0 seconds.
- Physical Progression: Look for logical intensity curves (e.g., approach $\to$ peak $\to$ decay) rather than static values.
- Data Consistency: Ensure `frame_coverage` and `proximity` metrics align with the `intensity` calculation.
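These checks can be scripted against the generated report. A sketch, with the report path and schema assumed for illustration:

```python
import json

with open("output_reports/report.json", encoding="utf-8") as f:  # path assumed
    report = json.load(f)

for entry in report.get("frame_log", []):
    scent = entry.get("scent", {})
    # A "High" intensity paired with far proximity and tiny frame coverage
    # contradicts the intensity formula and deserves a manual frame check.
    if (scent.get("intensity") == "High"
            and entry.get("proximity") == "far"
            and entry.get("frame_coverage", 1.0) < 0.05):
        print(f"Inspect t={entry.get('timestamp')}s: intensity/proximity mismatch")
```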
## Output

The system generates a comprehensive JSON report and a folder of extracted frames.
The JSON file contains a structured analysis of the video.
- `meta`: Metadata about the analysis (source video, timestamp, model used).
- `visual_timeline`: High-level event log (e.g., "0.5s: Lemon is sliced").
- `frame_log`: Detailed periodic analysis.
Example Snippet:
```json
{
  "timestamp": 2.5,
  "scene": "Kitchen counter close-up",
  "objects": [
    {
      "name": "Lemon",
      "visual_state": "Sliced/Juicy",
      "interaction": "Being squeezed"
    }
  ],
  "scent": {
    "category": "Citrus",
    "descriptors": ["Fresh", "Zesty", "Acidic", "Sharp"],
    "molecules": ["Limonene", "Citral", "beta-Pinene"],
    "intensity": "High",
    "reasoning": "Mechanical action (squeezing) ruptures oil glands in the flavedo (peel), releasing volatile oils."
  }
}
```

A folder named `temp_frames/<video_name>_<timestamp>` is created containing the extracted JPEG images.
Note: These frames are preserved to serve as the Ground Truth for verification. You can manually inspect `frame_000XX.jpg` to verify the VLM's description.
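A short sketch of consuming the report downstream (field names follow the example snippet above; the report path is an assumption):

```python
import json

with open("output_reports/report.json", encoding="utf-8") as f:  # path assumed
    report = json.load(f)

# One line per analyzed timestamp: category, intensity, candidate molecules.
for entry in report["frame_log"]:
    scent = entry["scent"]
    print(f'{entry["timestamp"]:6.1f}s  {scent["category"]:<10} '
          f'{scent["intensity"]:<6} {", ".join(scent["molecules"])}')
```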
## Architecture

The system follows a strict Two-Step VOS (Visual-Olfactory Separation) pipeline to ensure accuracy and minimize hallucinations.
```mermaid
graph LR
    A["Input Video"] -->|OpenCV| B["Frame Extraction"]
    B -->|JPEGs| C["Step 1: Visual Analysis"]
    C -->|"Gemini 2.5 Flash"| D["Visual Report (JSON)"]
    D -->|"Step 2: Semantic Translation"| E["Gemini 2.5 Flash"]
    E -->|"Chemical Inference"| F["Final JSON Report"]
```
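The first stage, frame extraction, can be sketched with OpenCV as below (illustrative; the project's actual extraction code may differ):

```python
import os
import cv2

def extract_frames(video_path: str, out_dir: str, target_fps: int = 4) -> int:
    """Save every Nth frame so the output approximates target_fps."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if metadata is missing
    step = max(1, round(native_fps / target_fps))
    saved = index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            cv2.imwrite(f"{out_dir}/frame_{saved:05d}.jpg", frame)
            saved += 1
        index += 1
    cap.release()
    return saved
```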
1. Step 1: Visual Understanding via VLM
   - Input: Sequence of video frames.
   - Model: Gemini 2.5 Flash (Vision + Text).
   - Task: Identify scenes, objects, actions, and physical state changes. Strictly forbidden from inferring smells.
   - Output: Pure visual semantic data.
   - Reliability Enforcement: Includes strict coverage validation (>95% duration) and auto-retry logic to prevent hallucinated summaries.

2. Step 2: Semantic-to-Chemical Translation via LLM
   - Input: The structured visual report from Step 1.
   - Model: Gemini 2.5 Flash (text-only mode).
   - Task: Map visual triggers (e.g., "sliced lemon") to olfactory data (e.g., "Limonene", "Citrus").
   - Strategy: Uses Guideline-Based Prompting (explicit rules for Intensity, Molecular Complexity, and Causal Reasoning) instead of simple few-shot examples to ensure scientific accuracy across diverse scenarios.
   - Output: The final JSON report with populated `scent` fields.
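Condensed into code, the two steps look roughly like this with the `google-generativeai` SDK (a sketch under assumed paths, not the project's actual implementation):

```python
import os
from pathlib import Path
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-flash")

# Step 1: frames in, pure visual semantics out (no smell inference allowed).
frames = [{"mime_type": "image/jpeg", "data": p.read_bytes()}
          for p in sorted(Path("temp_frames/demo").glob("*.jpg"))]  # path assumed
visual_report = model.generate_content(
    [Path("step1_visual.txt").read_text(), *frames]).text

# Step 2: text-only translation of visual evidence into olfactory chemistry.
final_report = model.generate_content(
    [Path("step2_olfactory.txt").read_text(), visual_report]).text
print(final_report)
```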
## Troubleshooting

- Error: `Video file not found`
  - Ensure the path to your video file is correct. Use absolute paths if unsure.
- Error: `Google API key not found`
  - Make sure you have created the `.env` file in the root directory and defined `GOOGLE_API_KEY`.
- Error: `429 Resource Exhausted`
  - You may have hit the rate limit for the Gemini API. Wait a minute and try again, or check your quota in Google AI Studio.
- Analysis is too slow?
  - Try reducing the FPS (e.g., use `1` or `2` FPS).
  - Ensure your video is not excessively long (recommended < 2 minutes).
## Contributing

Contributions are welcome! We are especially looking for:
- Improved system prompts for better chemical accuracy.
- Support for other VLM models (e.g., GPT-4o, Claude 3.5).
- Real-time processing capabilities.
To contribute:

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
## Changelog

- Paradigm Shift: Transitioned from qualitative description to quantitative computation.
- Visual Engine: Introduced `proximity`, `frame_coverage`, and `activity_level` metrics.
- Inference Engine: Implemented physically grounded intensity formulas (Base × Activity × Proximity).
- Reliability: Added "Divine Constraints" (max 4.0s interval) to enforce high temporal resolution.
- Validation: Achieved 100% pass rate on `test video 5` automated QC benchmarks.
- Physics Engine: Added toggleable modules for Thermodynamics, Aerodynamics, and Hygrometry.
- Configuration: Introduced `config.json` for granular control over model selection and physics rules.
- Prompt Engineering: Decoupled prompts into `step1_visual.txt` and `step2_olfactory.txt`.
- Core Architecture: Established the Two-Step VOS (Visual-Olfactory Separation) pipeline.
- Basic Capabilities: Frame extraction, basic visual captioning, and simple scent mapping.
## License

Distributed under the MIT License. See `LICENSE` for more information.
Note: This project uses Google's Generative AI. Ensure you comply with their usage policies.