llama-api

llama-api is a Python-based project that provides an API endpoint for performing inference with the Llama 2 large language model (LLM). The project uses FastAPI to define the API and Uvicorn to serve it. The endpoint accepts a prompt and model parameters and returns the text generated by the LLM. The project is designed to be easy to set up and run, and the endpoint can be used as a standalone server or integrated into other projects.
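
For orientation, the following is a minimal sketch of what such an endpoint could look like. It is illustrative only and is not the project's actual main.py; the /complete route, request field names, and models directory layout are assumptions based on the request example in the Usage section below.

# Hypothetical sketch of a FastAPI completion endpoint backed by llama-cpp-python.
# Route name, request fields, and model directory are assumptions, not the project's actual code.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()

class CompletionRequest(BaseModel):
    prompt: str
    model: str          # GGUF filename located under ./models
    temp: float = 0.0   # sampling temperature

@app.post("/complete")
def complete(req: CompletionRequest):
    # Load the requested GGUF model from the local models directory (assumed layout).
    llm = Llama(model_path=f"./models/{req.model}")

    def generate():
        # Stream generated text back to the client as it is produced.
        for chunk in llm(req.prompt, temperature=req.temp, stream=True):
            yield chunk["choices"][0]["text"]

    return StreamingResponse(generate(), media_type="text/plain")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)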

Installation

  1. Clone the project repository to your local machine:

    git clone https://github.com/jmcconne/llama-api.git
    
  2. Create and activate a Python virtual environment:

    python3 -m venv venv
    source venv/bin/activate
    
  3. Install project dependencies:

    pip install -r requirements.txt
    

    If running on Apple Silicon, reinstall llama-cpp-python with the following command to take advantage of GPU acceleration (Metal):

    CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
    
  4. Download Llama 2 models (must be in GGUF format):

    Example using the Hugging Face CLI to download one of the popular quantized, GGUF-formatted Llama 2 models from TheBloke:

    huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir ./models --local-dir-use-symlinks False
    

    Alternatively, you can simply copy GGUF-formatted Llama 2 models into the ./models directory. A short sketch for verifying that a downloaded model loads (and, on Apple Silicon, offloads to Metal) follows this list.
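
As a quick sanity check after steps 3 and 4, the sketch below loads the downloaded GGUF model directly with llama-cpp-python and generates a few tokens. The model filename matches the huggingface-cli example above and is otherwise an assumption; n_gpu_layers=-1 offloads all layers to the GPU on Metal-enabled builds.

# Sanity-check sketch: load the GGUF model and generate a few tokens.
# Adjust the filename if you downloaded a different model.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU (Metal); set to 0 for CPU-only inference
)
output = llm("Q: What are the first five prime numbers? A:", max_tokens=32, temperature=0)
print(output["choices"][0]["text"])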

Usage

Start API endpoint

Option 1 - Run locally

python main.py

Option 2 - Run in Docker container

Build Docker image:

docker build -t llama-api .

Create and start Docker container:

docker run -p 8000:8000 --name llama-api llama-api

Send request to API

import requests

url = "http://localhost:8000/complete"
data = {"prompt": "What are the first five prime numbers?", "model": "llama-2-7b-chat.Q4_K_M.gguf", "temp": 0}
response = requests.post(url, json=data, stream=True)
print(response.content.decode("utf-8").strip())  # .content buffers the entire streamed response before decoding
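
Because the response is streamed, you can also consume it incrementally instead of waiting for the full body. The variant below is a sketch that assumes the endpoint streams plain-text chunks:

import requests

url = "http://localhost:8000/complete"
data = {"prompt": "What are the first five prime numbers?", "model": "llama-2-7b-chat.Q4_K_M.gguf", "temp": 0}

# Print text as it arrives instead of buffering the whole response.
with requests.post(url, json=data, stream=True) as response:
    for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)
print()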

Contributing

If you would like to contribute to this project, please follow these steps:

  1. Fork the project repository to your own GitHub account.
  2. Clone the forked repository to your local machine.
  3. Create a new branch for your changes.
  4. Make your changes and commit them to your branch.
  5. Push your branch to your forked repository.
  6. Open a pull request to the original project repository.
