llama-api

llama-api is a Python-based project that provides an API endpoint for performing inference with the Llama 2 large language model (LLM). The project uses FastAPI to define the API and Uvicorn to serve it. The endpoint accepts a prompt and model parameters and returns the text generated by the LLM. The project is designed to be easy to set up and run, and the endpoint can be used as a standalone server or integrated into other projects.
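
For orientation, the following is a minimal sketch of what such an endpoint could look like. It is illustrative only and is not the project's actual main.py; the /complete route, request field names, and models directory layout are assumptions based on the request example in the Usage section below.

# Hypothetical sketch of a FastAPI completion endpoint backed by llama-cpp-python.
# Route name, request fields, and model directory are assumptions, not the project's actual code.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()

class CompletionRequest(BaseModel):
    prompt: str
    model: str          # GGUF filename located under ./models
    temp: float = 0.0   # sampling temperature

@app.post("/complete")
def complete(req: CompletionRequest):
    # Load the requested GGUF model from the local models directory (assumed layout).
    llm = Llama(model_path=f"./models/{req.model}")

    def generate():
        # Stream generated text back to the client as it is produced.
        for chunk in llm(req.prompt, temperature=req.temp, stream=True):
            yield chunk["choices"][0]["text"]

    return StreamingResponse(generate(), media_type="text/plain")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)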

Installation

  1. Clone the project repository to your local machine:

    git clone https://github.com/jmcconne/llama-api.git
    
  2. Create and activate a Python virtual environment:

    python3 -m venv venv
    source venv/bin/activate
    
  3. Install project dependencies:

    pip install -r requirements.txt
    

    If running on Apple Silicon, reinstall llama-cpp-python with the following command to take advantage of GPU acceleration (Metal):

    CMAKE_ARGS="-DLLAMA_METAL=on" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir
    
  4. Download Llama 2 models (must be in GGUF format):

    Example using the Hugging Face CLI to download one of the popular quantized, GGUF-formatted Llama 2 models from TheBloke:

    huggingface-cli download TheBloke/Llama-2-7b-Chat-GGUF llama-2-7b-chat.Q4_K_M.gguf --local-dir ./models --local-dir-use-symlinks False
    

    Alternatively, you can simply copy GGUF-formatted Llama 2 models into the ./models directory. A short sketch for verifying that a downloaded model loads (and, on Apple Silicon, offloads to Metal) follows this list.
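
As a quick sanity check after steps 3 and 4, the sketch below loads the downloaded GGUF model directly with llama-cpp-python and generates a few tokens. The model filename matches the huggingface-cli example above and is otherwise an assumption; n_gpu_layers=-1 offloads all layers to the GPU on Metal-enabled builds.

# Sanity-check sketch: load the GGUF model and generate a few tokens.
# Adjust the filename if you downloaded a different model.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-2-7b-chat.Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU (Metal); set to 0 for CPU-only inference
)
output = llm("Q: What are the first five prime numbers? A:", max_tokens=32, temperature=0)
print(output["choices"][0]["text"])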

Usage

Start API endpoint

Option 1 - Run locally

python main.py

Option 2 - Run in Docker container

Build Docker image:

docker build -t llama-api .

Create and start Docker container:

docker run -p 8000:8000 --name llama-api llama-api

Send request to API

import requests

url = "http://localhost:8000/complete"
data = {"prompt": "What are the first five prime numbers?", "model": "llama-2-7b-chat.Q4_K_M.gguf", "temp": 0}
response = requests.post(url, json=data, stream=True)
print(response.content.decode("utf-8").strip())  # .content buffers the entire streamed response before decoding
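
Because the response is streamed, you can also consume it incrementally instead of waiting for the full body. The variant below is a sketch that assumes the endpoint streams plain-text chunks:

import requests

url = "http://localhost:8000/complete"
data = {"prompt": "What are the first five prime numbers?", "model": "llama-2-7b-chat.Q4_K_M.gguf", "temp": 0}

# Print text as it arrives instead of buffering the whole response.
with requests.post(url, json=data, stream=True) as response:
    for chunk in response.iter_content(chunk_size=None, decode_unicode=True):
        print(chunk, end="", flush=True)
print()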

Contributing

If you would like to contribute to this project, please follow these steps:

  1. Fork the project repository to your own GitHub account.
  2. Clone the forked repository to your local machine.
  3. Create a new branch for your changes.
  4. Make your changes and commit them to your branch.
  5. Push your branch to your forked repository.
  6. Open a pull request to the original project repository.
