llama-api is a Python project that provides an API endpoint for running inference with the Llama 2 large language model (LLM). The project uses FastAPI to define the API and Uvicorn to serve it. The endpoint accepts a prompt and model parameters and returns the text generated by the LLM. The project is designed to be easy to set up, and the endpoint can run as a standalone server or be integrated into other projects.
Clone the project repository to your local machine:

```shell
git clone https://github.com/jmcconne/llama-api.git
```
Create and activate a Python virtual environment:

```shell
python3 -m venv venv
source venv/bin/activate
```
Install project dependencies:

```shell
pip install -r requirements.txt
```

If running on Apple Silicon, reinstall llama-cpp-python with the following command to take advantage of GPU acceleration (Metal):

```shell
CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
```
Download Llama 2 models (must be in GGUF format). For example, using the Hugging Face CLI and the popular quantized GGUF Llama 2 models from TheBloke:

```shell
huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir ./models --local-dir-use-symlinks False
```

Alternatively, you can simply copy GGUF-formatted Llama 2 models into the `./models` directory.
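To confirm that a model landed in the right place, you can list the GGUF files in `./models`. A minimal sketch; the helper name `list_gguf_models` is illustrative and not part of the project:

```python
from pathlib import Path


def list_gguf_models(models_dir: str = "./models") -> list[str]:
    """Return the names of GGUF model files in the given directory, sorted."""
    return sorted(p.name for p in Path(models_dir).glob("*.gguf"))


if __name__ == "__main__":
    # After the download step above, the list should include
    # llama-2-7b-chat.Q4_K_M.gguf
    print(list_gguf_models())
```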
Option 1 - Run locally

```shell
python main.py
```
Option 2 - Run in a Docker container

Build the Docker image:

```shell
docker build -t llama-api .
```

Create and start the Docker container:

```shell
docker run -p 8000:8000 --name llama-api llama-api
```
```python
import requests

url = "http://localhost:8000/complete"
data = {"prompt": "What are the first five prime numbers?", "model": "llama-2-7b-chat.Q4_K_M.gguf", "temp": 0}

# stream=True asks requests not to buffer the response up front;
# reading response.content still waits for the full body.
response = requests.post(url, json=data, stream=True)
print(response.content.decode("utf-8").strip())
```
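Since the request is made with `stream=True`, the response can also be consumed chunk by chunk as it arrives instead of waiting for the full body. A sketch against the same `/complete` endpoint; the `decode_stream` and `complete_streaming` helpers are illustrative, not part of the project:

```python
import requests


def decode_stream(chunks) -> str:
    """Join an iterable of byte chunks and decode the result as UTF-8 text."""
    return b"".join(chunks).decode("utf-8").strip()


def complete_streaming(url: str, data: dict) -> str:
    # Iterate over the response body as it arrives rather than buffering it all.
    response = requests.post(url, json=data, stream=True)
    response.raise_for_status()
    return decode_stream(response.iter_content(chunk_size=1024))
```

For long completions, iterating with `iter_content` lets you display partial output while the model is still generating.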
If you would like to contribute to this project, please follow these steps:
- Fork the project repository to your own GitHub account.
- Clone the forked repository to your local machine.
- Create a new branch for your changes.
- Make your changes and commit them to your branch.
- Push your branch to your forked repository.
- Open a pull request to the original project repository.