-You may need to request a quota increase for specific machine types and GPUs in the region where you plan to deploy the model. Check your GCP quotas before proceeding.
+
+For large models you may need to request a quota increase for specific machine types and GPUs in the region where you plan to deploy the model. Check your GCP quotas before proceeding.
## Step 1: Setting Up Magemaker for GCP
@@ -17,11 +17,11 @@ Run the following command to configure Magemaker for GCP Vertex AI deployment:
magemaker --cloud gcp
```
-This initializes Magemaker with the necessary configurations for deploying models to Vertex AI.
+This initializes Magemaker with the necessary environment variables (PROJECT_ID, GCLOUD_REGION, etc.) and creates/updates your `.env` file.
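+
+After the wizard completes, the GCP-related entries in `.env` look roughly like this (values below are placeholders, not real credentials):
+
+```bash
+PROJECT_ID="my-gcp-project"
+GCLOUD_REGION="us-central1"
+```
+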
## Step 2: YAML-based Deployment
-For reproducible deployments, use YAML configuration:
+For reproducible deployments, use a YAML configuration:
```sh
magemaker --deploy .magemaker_config/your-model.yaml
@@ -33,92 +33,79 @@ Example YAML for GCP deployment:
deployment: !Deployment
destination: gcp
endpoint_name: llama3-endpoint
+ instance_type: n1-standard-8 # a.k.a. machine_type in Vertex AI UI
+ accelerator_type: NVIDIA_T4 # GPU type
accelerator_count: 1
- instance_type: n1-standard-8
- accelerator_type: NVIDIA_T4
- num_gpus: 1
- quantization: null
+ num_gpus: 1 # optional – overrides accelerator_count if set
+ quantization: null # optional – bf16 / int8 / bitsandbytes
models:
- !Model
id: meta-llama/Meta-Llama-3-8B-Instruct
- location: null
- predict: null
source: huggingface
task: text-generation
+ predict: null # optional inference parameters
version: null
+ location: null # optional custom artefact path
```
+
- For gated models like llama from Meta, you have to accept terms of use for model on hugging face and adding Hugging face token to the environment are necessary for deployment to go through.
+For gated models like Llama you must (1) accept the terms of use on Hugging Face and (2) add your `HUGGING_FACE_HUB_KEY` to the `.env` file.
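+
+For example, the token line in `.env` looks like this (the value shown is a placeholder):
+
+```bash
+HUGGING_FACE_HUB_KEY="hf_xxxxxxxxxxxxxxxx"
+```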
-
-### Selecting an Appropriate Instance
-For Llama 3, a machine type such as `n1-standard-8` with an attached NVIDIA T4 GPU (`NVIDIA_T4`) is a suitable configuration for most use cases. Adjust the instance type and GPU based on your workload requirements.
+### Choosing an Instance Type
+A common configuration for Llama 3 (8B) is `n1-standard-8` with one T4 GPU. Adjust `instance_type`, `accelerator_type`, and `accelerator_count` according to your latency and cost requirements.
-If you encounter quota issues, submit a quota increase request in the GCP console under "IAM & Admin > Quotas" for the specific GPU type in your deployment region.
+If you encounter quota issues, submit a quota increase request in the GCP console under **IAM & Admin → Quotas** for the specific GPU type and region.
## Step 3: Querying the Deployed Model
-Once the deployment is complete, note down the endpoint id.
-
-You can use the interactive dropdown menu to quickly query the model.
+After deployment completes, note the **endpoint ID** printed by Magemaker.
-### Querying Models
-
-From the dropdown, select `Query a Model Endpoint` to see the list of model endpoints. Press space to select the endpoint you want to query. Enter your query in the text box and press enter to get the response.
+### Option A – Interactive Dropdown
+Use `Query a Model Endpoint` from the Magemaker menu to send prompts directly from your terminal.

-Or you can use the following code:
-```python
-from google.cloud import aiplatform
-from google.protobuf import json_format
-from google.protobuf.struct_pb2 import Value
-import json
+### Option B – Python (Vertex AI REST)
+If you prefer code, you can call the endpoint with a simple REST helper:
+
+```python
from dotenv import dotenv_values
+import google.auth
+import google.auth.transport.requests
+import requests
-def query_vertexai_endpoint_rest(
- endpoint_id: str,
- input_text: str,
- token_path: str = None
-):
- import google.auth
- import google.auth.transport.requests
- import requests
+def query_vertexai_endpoint_rest(endpoint_id: str, input_text: str, token_path: str | None = None):
+ """Send a text-generation request to a Vertex AI endpoint."""
- # TODO: this will have to come from config files
- project_id = dotenv_values('.env').get('PROJECT_ID')
- location = dotenv_values('.env').get('GCLOUD_REGION')
+ project_id = dotenv_values(".env").get("PROJECT_ID")
+ location = dotenv_values(".env").get("GCLOUD_REGION")
-
- # Get credentials
+ # Obtain an access token
if token_path:
- credentials, project = google.auth.load_credentials_from_file(token_path)
+ credentials, _ = google.auth.load_credentials_from_file(token_path)
else:
- credentials, project = google.auth.default()
-
- # Refresh token
- auth_req = google.auth.transport.requests.Request()
- credentials.refresh(auth_req)
-
- # Prepare headers and URL
+ credentials, _ = google.auth.default()
+ credentials.refresh(google.auth.transport.requests.Request())
+
headers = {
"Authorization": f"Bearer {credentials.token}",
"Content-Type": "application/json"
}
-
- url = f"https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}/locations/{location}/endpoints/{endpoint_id}:predict"
-
- # Prepare payload
+
+ url = (
+ f"https://{location}-aiplatform.googleapis.com/v1/projects/"
+ f"{project_id}/locations/{location}/endpoints/{endpoint_id}:predict"
+ )
+
payload = {
"instances": [
{
"inputs": input_text,
- # TODO: this also needs to come from configs
"parameters": {
"max_new_tokens": 100,
"temperature": 0.7,
@@ -127,20 +114,28 @@ def query_vertexai_endpoint_rest(
}
]
}
-
- # Make request
- response = requests.post(url, headers=headers, json=payload)
- print('Raw Response Content:', response.content.decode())
+ response = requests.post(url, headers=headers, json=payload, timeout=300)
+ response.raise_for_status()
return response.json()
-endpoint_id="your-endpoint-id-here"
-input_text='What are you?"'
-resp = query_vertexai_endpoint_rest(endpoint_id=endpoint_id, input_text=input_text)
-print(resp)
+# Example usage
+if __name__ == "__main__":
+ ENDPOINT_ID = "your-endpoint-id-here"
+ prompt = "What are you?"
+ print(query_vertexai_endpoint_rest(ENDPOINT_ID, prompt))
```
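+
+For a quick smoke test from the shell, the same request can be sent with `curl` (a sketch assuming the gcloud CLI is authenticated; replace the region, project, and endpoint ID placeholders with your own values):
+
+```sh
+curl -X POST \
+  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
+  -H "Content-Type: application/json" \
+  -d '{"instances": [{"inputs": "What are you?", "parameters": {"max_new_tokens": 100, "temperature": 0.7}}]}' \
+  "https://us-central1-aiplatform.googleapis.com/v1/projects/your-project-id/locations/us-central1/endpoints/your-endpoint-id:predict"
+```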
-## Conclusion
-You have successfully deployed and queried Llama 3 on GCP Vertex AI using Magemaker's interactive dropdown menu. For any questions or feedback, feel free to contact us at [support@slashml.com](mailto:support@slashml.com).
+### Option C – Built-in Helper
+Magemaker also ships with a helper function that wraps the above logic:
+
+```python
+from magemaker.gcp.query_endpoint import query_vertexai_endpoint
+answer = query_vertexai_endpoint(endpoint_name="your-endpoint-id-here", query="What are you?")
+print(answer)
+```
+
+## Conclusion
+You have successfully deployed and queried Llama 3 on GCP Vertex AI using Magemaker. For questions or feedback, reach us at [support@slashml.com](mailto:support@slashml.com).
From 6e426784ce9dc38a885291f131686a28bf38473c Mon Sep 17 00:00:00 2001
From: "pr-test1[bot]" <226697212+pr-test1[bot]@users.noreply.github.com>
Date: Tue, 16 Sep 2025 13:32:49 +0000
Subject: [PATCH 16/18] docs: sync updated_readme.md with latest code
---
updated_readme.md | 191 +++++++++++++++++-----------------------------
1 file changed, 70 insertions(+), 121 deletions(-)
diff --git a/updated_readme.md b/updated_readme.md
index bcfc60b..1e76877 100644
--- a/updated_readme.md
+++ b/updated_readme.md
@@ -1,4 +1,3 @@
-
@@ -7,7 +6,7 @@
Magemaker v0.1, by SlashML
- Deploy open source AI models to AWS in minutes.
+ Deploy open source AI models to AWS, GCP, and Azure in minutes.
@@ -21,8 +20,7 @@
About Magemaker
-
- Getting Started
+ Getting Started
- Prerequisites
- Installation
@@ -39,93 +37,81 @@
## About Magemaker
-Magemaker is a Python tool that simplifies the process of deploying an open source AI model to your own cloud. Instead of spending hours digging through documentation to figure out how to get AWS working, Magemaker lets you deploy open source AI models directly from the command line.
+Magemaker is a Python tool that simplifies the process of deploying an open-source AI model to **your** cloud. Instead of spending hours digging through platform-specific docs, Magemaker spins up production-ready endpoints on AWS SageMaker, Google Cloud Vertex AI, or Azure Machine Learning from a single CLI command.
-Choose a model from Hugging Face or SageMaker, and Magemaker will spin up a SageMaker instance with a ready-to-query endpoint in minutes.
+Choose a model from Hugging Face or AWS JumpStart, or point Magemaker at your own model artefacts, and an endpoint will be online in minutes.
## Getting Started
-Magemaker works with AWS. Azure and GCP support are coming soon!
-
-To get a local copy up and running follow these simple steps.
+Magemaker currently supports **AWS, GCP and Azure**. The first run guides you through the minimal configuration required for your chosen provider(s).
### Prerequisites
-* Python
-* An AWS account
-* Quota for AWS SageMaker instances (by default, you get 2 instances of ml.m5.xlarge for free)
-* Certain Hugging Face models (e.g. Llama2) require an access token ([hf docs](https://huggingface.co/docs/hub/en/models-gated#access-gated-models-as-a-user))
-
-### Configuration
-
-**Step 1: Set up AWS and SageMaker**
-
-To get started, you’ll need an AWS account which you can create at https://aws.amazon.com/. Then you’ll need to create access keys for SageMaker.
-
-We wrote up the steps in [Google Doc](https://docs.google.com/document/d/1NvA6uZmppsYzaOdkcgNTRl7Nb4LbpP9Koc4H_t5xNSg/edit?tab=t.0#heading=h.farbxuv3zrzm) as well.
+* Python 3.11+ (3.12 is currently unsupported due to an Azure SDK issue)
+* Cloud account(s) & sufficient quota:
+ * AWS for SageMaker
+ * GCP for Vertex AI
+ * Azure for Azure ML
+* CLI tools (optional but recommended):
+ * AWS CLI, Google Cloud SDK, Azure CLI
+* Hugging Face account & token for gated models (e.g. Llama-3)
-
-
-### Installing the package
-
-**Step 1**
+### Installation
```sh
pip install magemaker
```
-**Step 2: Running magemaker**
-
-Run it by simply doing the following:
+### First-time configuration
+Run Magemaker with your desired cloud flag. The wizard collects credentials (or points you to the relevant CLI login command), writes them to a local `.env`, and verifies quota.
```sh
-magemaker
+magemaker --cloud [aws|gcp|azure|all]
```
-If this is your first time running this command. It will configure the AWS client so you’re ready to start deploying models. You’ll be prompted to enter your Access Key and Secret here. You can also specify your AWS region. The default is us-east-1. You only need to change this if your SageMaker instance quota is in a different region.
-
-Once configured, it will create a `.env` file and save the credentials there. You can also add your Hugging Face Hub Token to this file if you have one.
-
-```sh
-HUGGING_FACE_HUB_KEY="KeyValueHere"
+Typical `.env` variables (auto-generated):
+```bash
+AWS_ACCESS_KEY_ID="..." # if aws selected
+AWS_SECRET_ACCESS_KEY="..."
+AWS_REGION="us-east-1"
+PROJECT_ID="my-gcp-project" # if gcp selected
+GCLOUD_REGION="us-central1"
+AZURE_SUBSCRIPTION_ID="..." # if azure selected
+AZURE_RESOURCE_GROUP="ml-resources"
+AZURE_WORKSPACE_NAME="ml-workspace"
+AZURE_REGION="eastus"
+HUGGING_FACE_HUB_KEY="hf_..." # optional
```
(back to top)
-
## Using Magemaker
-### Deploying models from dropdown
-
-When you run `magemaker` comamnd it will give you an interactive menu to deploy models. You can choose from a dropdown of models to deploy.
+### Interactive dropdown
+`magemaker --cloud ...` opens an interactive menu where you can:
+* Deploy a new model
+* List / delete active endpoints
+* Query an endpoint directly from the terminal
-#### Deploying Hugging Face models
-If you're deploying with Hugging Face, copy/paste the full model name from Hugging Face. For example, `google-bert/bert-base-uncased`. Note that you’ll need larger, more expensive instance types in order to run bigger models. It takes anywhere from 2 minutes (for smaller models) to 10+ minutes (for large models) to spin up the instance with your model.
+### YAML-based deployment (recommended)
+For CI/CD or reproducibility, pass a YAML config:
-#### Deploying Sagemaker models
-If you are deploying a Sagemaker model, select a framework and search from a model. If you a deploying a custom model, provide either a valid S3 path or a local path (and the tool will automatically upload it for you). Once deployed, we will generate a YAML file with the deployment and model in the `CONFIG_DIR=.magemaker_config` folder. You can modify the path to this folder by setting the `CONFIG_DIR` environment variable.
-
-#### Deploy using a yaml file
-We recommend deploying through a yaml file for reproducability and IAC. From the cli, you can deploy a model without going through all the menus. You can even integrate us with your Github Actions to deploy on PR merge. Deploy via YAML files simply by passing the `--deploy` option with local path like so:
-
-```
-magemaker --deploy .magemaker_config/bert-base-uncased.yaml
+```sh
+magemaker --deploy .magemaker_config/bert-aws.yaml
```
-Following is a sample yaml file for deploying a model the same google bert model mentioned above:
-
+Example – **AWS SageMaker (Hugging Face)**
```yaml
deployment: !Deployment
destination: aws
- # Endpoint name matches model_id for querying atm.
- endpoint_name: test-bert-uncased
+ endpoint_name: bert-uncased-dev
instance_count: 1
instance_type: ml.m5.xlarge
@@ -135,83 +121,46 @@ models:
source: huggingface
```
-Following is a yaml file for deploying a llama model from HF:
+Example – **GCP Vertex AI**
```yaml
deployment: !Deployment
- destination: aws
- endpoint_name: test-llama2-7b
- instance_count: 1
- instance_type: ml.g5.12xlarge
- num_gpus: 4
- # quantization: bitsandbytes
+ destination: gcp
+ endpoint_name: llama3-gcp
+ accelerator_count: 1
+ instance_type: n1-standard-8
+ accelerator_type: NVIDIA_T4
models:
- !Model
id: meta-llama/Meta-Llama-3-8B-Instruct
source: huggingface
- predict:
- temperature: 0.9
- top_p: 0.9
- top_k: 20
- max_new_tokens: 250
```
-#### Fine-tuning a model using a yaml file
-
-You can also fine-tune a model using a yaml file, by using the `train` option in the command and passing path to the yaml file
-
-`
-magemaker --train .magemaker_config/train-bert.yaml
-`
-
-Here is an example yaml file for fine-tuning a hugging-face model:
-
+Example – **Azure ML**
```yaml
-training: !Training
- destination: aws
- instance_type: ml.p3.2xlarge
+deployment: !Deployment
+ destination: azure
+ endpoint_name: llama3-azure
instance_count: 1
- training_input_path: s3://jumpstart-cache-prod-us-east-1/training-datasets/tc/data.csv
- hyperparameters: !Hyperparameters
- epochs: 1
- per_device_train_batch_size: 32
- learning_rate: 0.01
+ instance_type: Standard_NC24ads_A100_v4
models:
- !Model
- id: meta-textgeneration-llama-3-8b-instruct
+ id: meta-llama-meta-llama-3-8b-instruct # Azure model-catalog id
source: huggingface
```
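+
+Assuming the file above is saved as `.magemaker_config/llama3-azure.yaml`, deploying it follows the same pattern:
+
+```sh
+magemaker --deploy .magemaker_config/llama3-azure.yaml
+```
+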
+### AWS JumpStart & Custom models
+Deploy marketplace models or your own artefacts stored locally or in S3; see the [JumpStart & Custom Models](docs/concepts/jumpstart-custom-models.mdx) guide.
-
-
-
-If you’re using the `ml.m5.xlarge` instance type, here are some small Hugging Face models that work great:
-
-
-
-**Model: [google-bert/bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased)**
-
-- **Type:** Fill Mask: tries to complete your sentence like Madlibs
-- **Query format:** text string with `[MASK]` somewhere in it that you wish for the transformer to fill
--
-
-
-
-**Model: [sentence-transformers/all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)**
-
-- **Type:** Feature extraction: turns text into a 384d vector embedding for semantic search / clustering
-- **Query format:** "*type out a sentence like this one.*"
-
-
-
-
-
-### Deactivating models
-
-Any model endpoints you spin up will run continuously unless you deactivate them! Make sure to delete endpoints you’re no longer using so you don’t keep getting charged for your SageMaker instance.
+### Fine-tuning
+```sh
+magemaker --train .magemaker_config/train-bert.yaml
+```
+YAML follows the `!Training` schema; only AWS is supported for training today.
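+
+A minimal `!Training` config (adapted from the fine-tuning example previously in this README; the training data path and hyperparameters are illustrative) looks like:
+
+```yaml
+training: !Training
+  destination: aws
+  instance_type: ml.p3.2xlarge
+  instance_count: 1
+  training_input_path: s3://jumpstart-cache-prod-us-east-1/training-datasets/tc/data.csv
+  hyperparameters: !Hyperparameters
+    epochs: 1
+    per_device_train_batch_size: 32
+    learning_rate: 0.01
+
+models:
+- !Model
+  id: meta-textgeneration-llama-3-8b-instruct
+  source: huggingface
+```
+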
+### Deactivating endpoints
+Endpoints accrue cloud charges until deleted. Use the menu option **Delete a Model Endpoint** or delete the endpoint from your cloud console.
(back to top)
@@ -220,10 +169,10 @@ Any model endpoints you spin up will run continuously unless you deactivate them
## What we're working on next
-- [ ] More robust error handling for various edge cases
-- [ ] Verbose logging
-- [ ] Enabling / disabling autoscaling
-- [ ] Deployment to Azure and GCP
+- [ ] Enhanced error handling & verbose logging
+- [ ] Autoscaling controls
+- [ ] Streaming support in OpenAI-compatible proxy
+- [ ] Additional cloud-specific optimisations
(back to top)
@@ -232,9 +181,9 @@ Any model endpoints you spin up will run continuously unless you deactivate them
## Known issues
-- [ ] Querying within Magemaker currently only works with text-based model - doesn’t work with multimodal, image generation, etc.
-- [ ] Deleting a model is not instant, it may show up briefly after it was queued for deletion
-- [ ] Deploying the same model within the same minute will break
+- Query helper currently supports text-based pipelines only
+- Endpoint deletion is asynchronous; an endpoint may still appear active for a short time after the delete request
+- Deploying the same model repeatedly within ~60 seconds can fail due to name collision (timestamp workaround in progress)
(back to top)
@@ -251,6 +200,6 @@ Distributed under the Apache 2.0 License. See `LICENSE` for more information.
## Contact
-You can reach us, faizan & jneid, at [support@slashml.com](mailto:support@slashml.com).
+Questions, bugs, ideas? Reach us (Faizan & Jneid) at [support@slashml.com](mailto:support@slashml.com).
-We’d love to hear from you! We’re excited to learn how we can make this more valuable for the community and welcome any and all feedback and suggestions.
+We’d love your feedback!
From 99600ebb987bd98d096617af13d5a6fb9ee6f90d Mon Sep 17 00:00:00 2001
From: "pr-test1[bot]" <226697212+pr-test1[bot]@users.noreply.github.com>
Date: Tue, 16 Sep 2025 13:32:50 +0000
Subject: [PATCH 17/18] docs: create concepts/jumpstart-custom-models.mdx
---
concepts/jumpstart-custom-models.mdx | 88 ++++++++++++++++++++++++++++
1 file changed, 88 insertions(+)
create mode 100644 concepts/jumpstart-custom-models.mdx
diff --git a/concepts/jumpstart-custom-models.mdx b/concepts/jumpstart-custom-models.mdx
new file mode 100644
index 0000000..1df3461
--- /dev/null
+++ b/concepts/jumpstart-custom-models.mdx
@@ -0,0 +1,88 @@
+---
+title: JumpStart & Custom Models
+description: Deploy AWS JumpStart marketplace and custom models with Magemaker
+---
+
+## Overview
+Besides Hugging Face models, Magemaker can now deploy two additional model types **on AWS SageMaker**:
+
+1. **JumpStart models** – pre-packaged models provided by Amazon or third-party sellers.
+2. **Custom models** – your own fine-tuned artefacts (.tar.gz or directory) stored locally or in S3.
+
+
+
+---
+## 1 · Deploying a JumpStart Model
+
+### 1.1 Find the *model_id*
+Browse the [JumpStart Model Zoo](https://docs.aws.amazon.com/sagemaker/latest/dg/jumpstart-model-zoo.html) or call:
+```python
+from sagemaker.jumpstart.notebook_utils import list_jumpstart_models
+list_jumpstart_models()
+```
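+
+If the full list is overwhelming, the SageMaker SDK also accepts a filter expression, e.g. `list_jumpstart_models(filter="task == text2text")` (filter syntax as documented in the SageMaker Python SDK).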
+
+### 1.2 Create a YAML
+```yaml
+models:
+- !Model
+ id: huggingface-text2text-flan-t5-large # JumpStart model_id
+ source: sagemaker
+
+deployment: !Deployment
+ destination: aws
+ instance_type: ml.g5.2xlarge
+ instance_count: 1
+```
+
+### 1.3 Deploy
+```bash
+magemaker --deploy .magemaker_config/flan-t5.yaml
+```
+Magemaker handles the EULA acceptance (`accept_eula=True`) automatically.
+
+---
+## 2 · Deploying a Custom Model
+Custom models are useful when you have already trained a model locally or with SageMaker Training.
+
+### 2.1 Package Artifacts
+- For **Hugging Face** style, create a directory containing `config.json`, `tokenizer.json`, etc.
+- Optionally compress to `model.tar.gz`.
+
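+For example, a typical packaging step (directory name and paths are placeholders) would be:
+
+```bash
+# bundle the model directory into a tarball SageMaker can consume
+tar -czf model.tar.gz -C ./my-finetuned-model .
+```
+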
+### 2.2 Upload (optional)
+If your artefact is not yet on S3, Magemaker will upload it for you.
+
+### 2.3 YAML Example
+```yaml
+models:
+- !Model
+ id: my-distilbert-finetuned
+ source: huggingface
+ location: ./artifacts/distilbert.tar.gz # or s3://bucket/key
+
+deployment: !Deployment
+ destination: aws
+ instance_type: ml.m5.xlarge
+```
+
+### 2.4 Deploy
+```bash
+magemaker --deploy .magemaker_config/my-distilbert.yaml
+```
+Magemaker will:
+1. Upload the local artefact to `s3://<your-bucket>/models/my-distilbert-finetuned/` (if needed)
+2. Create a `HuggingFaceModel` pointing to that S3 path
+3. Spin up the endpoint.
+
+---
+## Tips & Quotas
+- **GPU quota** – JumpStart LLMs often need a GPU instance (`ml.g5.*`). Request a quota increase first.
+- **Endpoint names** – If `endpoint_name` is omitted, Magemaker appends a timestamp to ensure uniqueness.
+- **Cost control** – Delete endpoints when not in use: `magemaker --cloud aws → Delete a Model Endpoint`.
+
+---
+## FAQ
+**Q: Do JumpStart models work on GCP / Azure?**
+*A: Not yet. JumpStart is an AWS-specific marketplace.*
+
+**Q: Can I pass inference parameters to a JumpStart model?**
+*A: Yes. Use the same `predict:` block as with Hugging Face models.*
From 277c5e61d0e2c6c16f2f31ae2f3c7151ccb5c8a1 Mon Sep 17 00:00:00 2001
From: "pr-test1[bot]" <226697212+pr-test1[bot]@users.noreply.github.com>
Date: Tue, 16 Sep 2025 13:32:52 +0000
Subject: [PATCH 18/18] docs: create concepts/openai-proxy.mdx
---
concepts/openai-proxy.mdx | 60 +++++++++++++++++++++++++++++++++++++++
1 file changed, 60 insertions(+)
create mode 100644 concepts/openai-proxy.mdx
diff --git a/concepts/openai-proxy.mdx b/concepts/openai-proxy.mdx
new file mode 100644
index 0000000..d44ccd2
--- /dev/null
+++ b/concepts/openai-proxy.mdx
@@ -0,0 +1,60 @@
+---
+title: OpenAI-Compatible Proxy
+description: Use your SageMaker / Vertex AI / AzureML endpoint with OpenAI SDKs via Magemaker's built-in proxy
+---
+
+## Overview
+Magemaker ships with a lightweight FastAPI server (`server.py`) that exposes a **Chat Completion**-compatible API. This lets you swap OpenAI’s base URL for `http://<your-host>:8000` and keep using the official `openai` Python / JS SDKs, LangChain, or any other OpenAI-compatible client.
+
+
+The proxy is an **optional** component. You still deploy models the usual way; the server only forwards requests to the selected endpoint under the hood.
+
+
+## 1 · Start the proxy
+```bash
+python -m magemaker.server # default: 0.0.0.0:8000
+```
+
+Environment variables from your `.env` file are automatically loaded, so the server knows how to reach SageMaker / Vertex AI / Azure ML.
+
+You should see:
+```
+INFO: Started server process […]
+INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
+```
+
+## 2 · Point the OpenAI SDK to the proxy
+```python
+import openai
+
+openai.api_key = "sk-ignore" # any non-empty value works
+openai.base_url = "http://localhost:8000/v1" # important – include /v1
+
+response = openai.chat.completions.create(
+ model="your-endpoint-name", # equals SageMaker / Vertex / Azure endpoint
+ messages=[{"role": "user", "content": "Hello!"}]
+)
+print(response.choices[0].message.content)
+```
+
+### Supported routes
+| Method | Path | Description |
+| ------ | ------------------- | ----------------------------------- |
+| POST | `/v1/chat/completions` | Forwards the request to the backing endpoint |
+| GET | `/endpoints` | Lists all active endpoints found via Magemaker |
+
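+The same route can be exercised without an SDK; a minimal `curl` sketch (endpoint name and prompt are placeholders):
+
+```bash
+curl -X POST http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"model": "your-endpoint-name", "messages": [{"role": "user", "content": "Hello!"}]}'
+```
+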
+## 3 · Authentication & CORS
+The proxy does **not** enforce authentication. Run it behind a reverse proxy (NGINX, CloudFront, APIM, etc.) if you need tokens, rate-limiting, or HTTPS termination.
+
+## 4 · Deployment suggestions
+- **Docker**: Build an image with your `.env` baked in and deploy to ECS / Cloud Run / ACI.
+- **Systemd**: On a small EC2/VPS, create a service that starts `python -m magemaker.server` on boot.
+
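+A minimal Dockerfile sketch for the first option (the base image, port, and `.env` handling are assumptions to adapt to your setup):
+
+```dockerfile
+FROM python:3.11-slim
+WORKDIR /app
+RUN pip install --no-cache-dir magemaker
+# bake in credentials for the target cloud (or mount them at runtime instead)
+COPY .env .env
+EXPOSE 8000
+CMD ["python", "-m", "magemaker.server"]
+```
+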
+## 5 · Limitations
+1. Only the `chat/completions` endpoint is implemented at present.
+2. Streaming (`stream=True`) is not yet supported – responses are returned once inference completes.
+3. The proxy currently forwards to a **single** cloud endpoint per request; multi-model routing is on the roadmap.
+
+
+Do **not** expose the proxy publicly without additional security controls – anyone could burn cloud inference credits.
+