2 changes: 2 additions & 0 deletions .gitignore
@@ -150,5 +150,7 @@ dmypy.json

 # generated
 *_spark/
+metastore_db/
+spark-warehouse/
 *_pipeline/
 .vscode/
27 changes: 22 additions & 5 deletions README.md
@@ -31,6 +31,7 @@ This repository contains different [Jupyter Notebooks](https://jupyter.org) to d
- [Experimenting Locally](#experimenting-locally)
- [Using Docker](#using-docker)
- [On the Machine (Linux/x64 \& arm64)](#on-the-machine-linuxx64--arm64)
- [Optional: Spark and Databricks Support](#optional-spark-and-databricks-support)
- [Notebooks](#notebooks)
- [Overview](#overview)
- [Descriptions](#descriptions)
@@ -93,13 +94,29 @@ The following commands will set up a Python environment with necessary Python li
```
 $ git clone https://github.com/getml/getml-demo.git
 $ cd getml-demo
-$ pipx install hatch
-$ hatch env create
-$ hatch shell
-$ pip install -r requirements.txt
-$ jupyter-lab
+$ pipx install uv
+$ uv run jupyter-lab
```

#### Optional: Spark and Databricks Support

Some notebooks (e.g., `imdb.ipynb`, `online_retail.ipynb`) demonstrate exporting features to Spark SQL. For these, you need to install additional dependencies:

> [!IMPORTANT]
> The `spark` and `databricks` dependency groups are **mutually exclusive** and cannot be installed together. The `--isolated` flag runs the command in a temporary environment without affecting your main installation.

**For local Spark execution** (running Spark locally on your machine):
```
$ uv run --group spark --isolated jupyter-lab
```

**For Databricks integration** (connecting to Databricks compute):
```
$ uv run --group databricks jupyter-lab
```

See [integration/databricks/README.md](integration/databricks/README.md) for Databricks setup instructions.
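For orientation, here is a minimal sketch of what that export looks like, assuming a trained getML pipeline named `pipe` (see the notebooks themselves for the complete, authoritative flow):

```python
import getml

# Assumption: `pipe` is a trained getml.pipeline.Pipeline, as built in
# imdb.ipynb or online_retail.ipynb.
# Transpile the learned features into Spark SQL.
spark_sql_features = pipe.features.to_sql(dialect=getml.pipeline.dialect.spark_sql)

# Write one .sql script per feature to a local folder
# (folders matching *_spark/ are covered by .gitignore).
spark_sql_features.save("imdb_spark")
```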

> [!TIP]
> Install the [Enterprise trial version](https://getml.com/latest/enterprise/request-trial) via the [Install getML on Linux guide](https://getml.com/latest/install/packages/linux#install-getml-on-linux) to try the Enterprise features.

1 change: 1 addition & 0 deletions integration/__init__.py
@@ -0,0 +1 @@
"""Integration modules for connecting getML with external platforms."""
119 changes: 119 additions & 0 deletions integration/databricks/README.md
@@ -0,0 +1,119 @@
# Databricks Data Integration

This directory contains modules for ingesting data from GCS into Databricks Delta Lake and preparing population tables for getML feature engineering.


## Prerequisites

- **Python 3.12**
- **Databricks Free Edition account** (or higher tier)
- **Databricks CLI** installed

## Setup

### 1. Install Databricks CLI

```bash
# macOS
brew install databricks/tap/databricks

# Linux & macOS & Windows
curl -fsSL https://raw.githubusercontent.com/databricks/setup-cli/main/install.sh | sh
```

See the [Databricks CLI installation docs](https://docs.databricks.com/gcp/en/dev-tools/cli/install) for more details.

### 2. Install Dependencies with uv

> [!IMPORTANT]
> The `databricks` dependency group uses `databricks-connect`, which **cannot be installed alongside `pyspark`**. These packages are mutually exclusive. If you need local Spark execution (e.g., for notebooks like `imdb.ipynb`), use `uv run --group spark --isolated` instead to run in a temporary isolated environment.

```bash
# From the repository root
cd getml-demo

# Install uv if not already installed
pipx install uv

# Run JupyterLab after installing the dependencies from the databricks group
uv run --group databricks jupyter-lab
```

### 3. Authenticate with Databricks

```bash
# Get your workspace URL from your Databricks Free Edition account
# It looks like: https://<workspace-id>.cloud.databricks.com

databricks auth login --host https://<your-workspace>.cloud.databricks.com
```

This will open a browser for OAuth authentication. After successful login, your credentials are cached locally.

### 4. Verify Authentication

```bash
databricks auth profiles
```

You should see your workspace listed.
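Optionally, verify from Python that `databricks-connect` can reach your workspace. A minimal sketch, assuming the CLI credentials cached by the login step and a `databricks-connect` version recent enough to support serverless sessions:

```python
from databricks.connect import DatabricksSession

# Assumption: authentication is picked up from the credentials cached by
# `databricks auth login` above.
spark = DatabricksSession.builder.serverless().getOrCreate()

# Trivial remote query; should print rows 0 through 4.
print(spark.range(5).collect())
```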

## Usage

### Python API (Recommended)

Use the modules directly in notebooks or scripts:

> **Collaborator:** References non-existent module: `preparation` doesn't exist in this PR.

```python
from integration.databricks.data import ingestion

# Load raw data from GCS to Databricks
loaded_tables = ingestion.load_from_gcs(
    bucket="https://static.getml.com/datasets/jaffle_shop/",
    destination_schema="jaffle_shop",
)
print(f"Loaded {len(loaded_tables)} tables")
```

### Load Specific Tables

```python
from integration.databricks.data import ingestion

# Load only the tables you need
ingestion.load_from_gcs(
    destination_schema="RAW",
    tables=["raw_customers", "raw_orders", "raw_items", "raw_products"],
)
```
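To confirm that the ingested tables actually landed, one option is a quick `SHOW TABLES` query over the same kind of session, assuming the `RAW` schema name used above:

```python
from databricks.connect import DatabricksSession

spark = DatabricksSession.builder.serverless().getOrCreate()

# The four raw_* tables loaded above should appear in the listing.
spark.sql("SHOW TABLES IN RAW").show()
```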

## Troubleshooting

### Authentication Errors

```bash
# Re-authenticate
databricks auth login --host https://<your-workspace>.cloud.databricks.com

# Check your profile
databricks auth env
```

### Python Version Issues

Databricks serverless requires Python 3.12:

> **Collaborator:** Irrelevant troubleshooting: the Python version requirement belongs in pyproject.toml, not the README.

```bash
python --version # Should show 3.12.x

# If not, install Python 3.12 and recreate the venv
brew install python@3.12 # macOS
```

### Connection Timeout

Free Edition has limited compute resources. If you see timeouts:
- Wait a few minutes and retry (a serverless cold start can take anywhere from a few seconds to several minutes; see the retry sketch below)
- Check your quota in the Databricks workspace
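
If cold starts keep causing failures, a small retry loop is one possible workaround. A sketch, assuming the error is a transient startup timeout (the exact exception type varies, so it is caught broadly here):

```python
import time

from databricks.connect import DatabricksSession

spark = None
for attempt in range(1, 4):
    try:
        spark = DatabricksSession.builder.serverless().getOrCreate()
        break
    except Exception as err:  # deliberately broad: timeout types vary
        print(f"Attempt {attempt} failed ({err}); retrying in 60 seconds")
        time.sleep(60)
```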


1 change: 1 addition & 0 deletions integration/databricks/__init__.py
@@ -0,0 +1 @@
"""Databricks integration for getML demos."""
Empty file.