2 changes: 2 additions & 0 deletions .gitignore
@@ -150,5 +150,7 @@ dmypy.json

 # generated
 *_spark/
+metastore_db/
+spark-warehouse/
 *_pipeline/
 .vscode/
27 changes: 22 additions & 5 deletions README.md
@@ -31,6 +31,7 @@ This repository contains different [Jupyter Notebooks](https://jupyter.org) to d
- [Experimenting Locally](#experimenting-locally)
- [Using Docker](#using-docker)
- [On the Machine (Linux/x64 \& arm64)](#on-the-machine-linuxx64--arm64)
- [Optional: Spark and Databricks Support](#optional-spark-and-databricks-support)
- [Notebooks](#notebooks)
- [Overview](#overview)
- [Descriptions](#descriptions)
@@ -93,13 +94,29 @@ The following commands will set up a Python environment with necessary Python li
```
 $ git clone https://github.com/getml/getml-demo.git
 $ cd getml-demo
-$ pipx install hatch
-$ hatch env create
-$ hatch shell
-$ pip install -r requirements.txt
-$ jupyter-lab
+$ pipx install uv
+$ uv run jupyter-lab
```

#### Optional: Spark and Databricks Support

Some notebooks (e.g., `imdb.ipynb`, `online_retail.ipynb`) demonstrate exporting features to Spark SQL. For these, you need to install additional dependencies:

> [!IMPORTANT]
> The `spark` and `databricks` dependency groups are **mutually exclusive** and cannot be installed together. The `--isolated` flag runs the command in a temporary environment without affecting your main installation.

**For local Spark execution** (running Spark locally on your machine):
```
$ uv run --group spark --isolated jupyter-lab
```

**For Databricks integration** (connecting to Databricks compute):
```
$ uv run --group databricks jupyter-lab
```

See [integration/databricks/README.md](integration/databricks/README.md) for Databricks setup instructions.
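For orientation, here is a minimal sketch of what that export looks like, assuming a trained getML pipeline named `pipe` (see the notebooks themselves for the complete, authoritative flow):

```python
import getml

# Assumption: `pipe` is a trained getml.pipeline.Pipeline, as built in
# imdb.ipynb or online_retail.ipynb.
# Transpile the learned features into Spark SQL.
spark_sql_features = pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql)

# Write one .sql script per feature to a local folder
# (folders matching *_spark/ are covered by .gitignore).
spark_sql_features.save("imdb_spark")
```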

> [!TIP]
> Install the [Enterprise trial version](https://getml.com/latest/enterprise/request-trial) via the [Install getML on Linux guide](https://getml.com/latest/install/packages/linux#install-getml-on-linux) to try the Enterprise features.

1 change: 1 addition & 0 deletions integration/__init__.py
@@ -0,0 +1 @@
"""Integration modules for connecting getML with external platforms."""
119 changes: 119 additions & 0 deletions integration/databricks/README.md
@@ -0,0 +1,119 @@
# Databricks Data Integration

This directory contains modules for ingesting data from GCS into Databricks Delta Lake and preparing population tables for getML feature engineering.


## Prerequisites

- **Python 3.12**
- **Databricks Free Edition account** (or higher tier)
- **Databricks CLI** installed

## Setup

### 1. Install Databricks CLI

```bash
# macOS
brew install databricks/tap/databricks

# Linux & macOS & Windows
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
```

See the [Databricks CLI installation docs](https://docs.databricks.com/gcp/en/dev-tools/cli/install) for more details.

### 2. Install Dependencies with uv

> [!IMPORTANT]
> The `databricks` dependency group uses `databricks-connect`, which **cannot be installed alongside `pyspark`**. These packages are mutually exclusive. If you need local Spark execution (e.g., for notebooks like `imdb.ipynb`), use `uv run --group spark --isolated` instead to run in a temporary isolated environment.

```bash
# From the repository root
cd getml-demo

# Install uv if not already installed
pipx install uv

# Run JupyterLab after installing the dependencies from the databricks group
uv run --group databricks jupyter-lab
```

### 3. Authenticate with Databricks

```bash
# Get your workspace URL from your Databricks Free Edition account
# It looks like: https://<workspace-id>.cloud.databricks.com

databricks auth login --host https://<your-workspace>.cloud.databricks.com
```

This will open a browser for OAuth authentication. After successful login, your credentials are cached locally.

### 4. Verify Authentication

```bash
databricks auth profiles
```

You should see your workspace listed.
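Optionally, verify from Python that `databricks-connect` can reach your workspace. A minimal sketch, assuming the CLI credentials cached by the login step and a `databricks-connect` version recent enough to support serverless sessions:

```python
from databricks.connect import DatabricksSession

# Assumption: authentication is picked up from the credentials cached by
# `databricks auth login` above.
spark = DatabricksSession.builder.serverless().getOrCreate()

# Trivial remote query; should print rows 0 through 4.
print(spark.range(5).collect())
```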

## Usage

### Python API (Recommended)

Use the modules directly in notebooks or scripts:

> **Collaborator:** References non-existent module: `preparation` doesn't exist in this PR.

```python
from integration.databricks.data import ingestion

# Load raw data from GCS to Databricks
loaded_tables = ingestion.load_from_gcs(
    bucket="https://static.getml.com/datasets/jaffle_shop/",
    destination_schema="jaffle_shop",
)
print(f"Loaded {len(loaded_tables)} tables")
```

### Load Specific Tables

```python
from integration.databricks.data import ingestion

# Load only the tables you need
ingestion.load_from_gcs(
    destination_schema="RAW",
    tables=["raw_customers", "raw_orders", "raw_items", "raw_products"],
)
```
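To confirm that the ingested tables actually landed, one option is a quick `SHOW TABLES` query over the same kind of session, assuming the `RAW` schema name used above:

```python
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.serverless().getOrCreate()

# The four raw_* tables loaded above should appear in the listing.
spark.sql("SHOW TABLES IN RAW").show()
```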

## Troubleshooting

### Authentication Errors

```bash
# Re-authenticate
databricks auth login --host https://<your-workspace>.cloud.databricks.com

# Check your profile
databricks auth env
```

### Python Version Issues

Databricks serverless requires Python 3.12:

> **Collaborator:** Irrelevant troubleshooting: the Python version requirement belongs in pyproject.toml, not the README.

```bash
python --version # Should show 3.12.x

# If not, install Python 3.12 and recreate the venv
brew install python@3.12 # macOS
```

### Connection Timeout

Free Edition has limited compute resources. If you see timeouts:
- Wait a few minutes and retry (a serverless cold start can take anywhere from a few seconds to several minutes; see the retry sketch below)
- Check your quota in the Databricks workspace
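
If cold starts keep causing failures, a small retry loop is one possible workaround. A sketch, assuming the error is a transient startup timeout (the exact exception type varies, so it is caught broadly here):

```python
import time

from databricks.connect import DatabricksSession

spark = None
for attempt in range(1, 4):
    try:
        spark = DatabricksSession.builder.serverless().getOrCreate()
        break
    except Exception as err:  # deliberately broad: timeout types vary
        print(f"Attempt {attempt} failed ({err}); retrying in 60 seconds")
        time.sleep(60)
```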


1 change: 1 addition & 0 deletions integration/databricks/__init__.py
@@ -0,0 +1 @@
"""Databricks integration for getML demos."""
Empty file.