The prismeai-llm microservice enables you to integrate open-source language models into your Prisme.ai environment. This service supports both LocalAI and Ollama as runtime engines, giving you flexibility in how you deploy and manage your models.

Overview

The prismeai-llm microservice allows you to:

Run Local Models

Deploy and run open-source language models directly within your infrastructure

Text Generation

Generate text completions and chat responses using various models

Embeddings

Create vector embeddings for semantic search and content understanding

OpenAI-Compatible API

Interface with models using the familiar OpenAI API format
By default, the service uses LocalAI with the pre-built image available at quay.io/go-skynet/local-ai. You can also configure it to use Ollama as an alternative runtime.
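Once deployed, you can quickly verify the OpenAI-compatible API by listing the models the service currently exposes (this assumes the service is reachable on port 5000, as in the test commands later on this page):

# List the models currently served through the OpenAI-compatible endpoint
curl http://localhost:5000/v1/models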

Installation Prerequisites

Before deploying the prismeai-llm microservice, ensure you have:

Storage Volume

A persistent volume to store model files, which can be substantial in size

Sufficient Resources

Adequate CPU, RAM, and potentially GPU resources depending on model size
Language models can be resource-intensive. We recommend starting with a 16 vCPU machine for initial deployment. For production use with larger models, consider dedicated resources and possibly GPU acceleration.
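As a starting point, the model storage can be provisioned with a standard Kubernetes PersistentVolumeClaim. The sketch below uses placeholder values (claim name, namespace, and size) to adapt to your cluster; the 50Gi request follows the storage recommendations further down this page.

kubectl apply -n apps -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: prismeai-llm-models   # placeholder claim name
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi           # enough for a few multi-gigabyte GGUF models
EOF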

Deployment Options

vLLM

When using vLLM, you only need to add your endpoints in the AI Knowledge configuration. Please refer to the AI Knowledge configuration documentation for instructions on how to add these endpoints and the corresponding credentials.

LocalAI (Default)

When using LocalAI, you’ll need to provide specific files for each model in the ./models directory:

YAML File

Configuration file describing the model

Template File

.tmpl file defining the prompt format (not needed for embedding models)

Model File

GGUF (CPU-compatible) file containing the actual model weights
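For a single chat model, the resulting contents of the models directory typically look like this (file names are illustrative and match the Mistral example used below):

# Typical contents of the models directory for one chat model
./models/mistral.yaml                         # model configuration
./models/mistral-instruct.tmpl                # prompt template (chat models only)
./models/mistral-7b-instruct-v0.2.Q4_0.gguf   # model weights (GGUF)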

Ollama Configuration

To use Ollama instead of LocalAI, modify your prismeai-apps/values.yaml configuration:
prismeai-llm:
  image:
    repository: ollama/ollama
    tag: latest
    pullPolicy: Always

  env:
    - name: OLLAMA_HOST
      value: "0.0.0.0:5000"
    - name: OLLAMA_MODELS
      value: /models/models/ollama
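
With this configuration in place, models are installed by pulling them with the Ollama CLI inside the running pod. For example (the namespace and deployment name match the restart command used later on this page; adjust the model tag as needed):

# Pulls the model into the directory configured via OLLAMA_MODELS
kubectl exec -n apps deploy/prismeai-llm -- ollama pull mistral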

Model Installation

The steps below cover model installation for LocalAI, the default runtime. With Ollama, models are instead pulled through the Ollama CLI, as shown in the Ollama Configuration section above.
1. Create Model Directory

Ensure you have a /models directory in your specified storage volume.
2. Download Model Files

For each model, you’ll need to download the GGUF model file. For example, for Mistral:
curl -L "https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_0.gguf?download=true" -o ./models/mistral-7b-instruct-v0.2.Q4_0.gguf
The file size can be several gigabytes. Q4 models are more compressed (smaller, faster, less accurate) while Q8 models are less compressed (larger, slower, more accurate).
3. Add Configuration Files

Create or copy the required YAML and template files for your model.
Example YAML (mistral.yaml):
name: mistral-7b-instruct
parameters:
  model: /models/mistral-7b-instruct-v0.2.Q4_0.gguf
  temperature: 0.7
  top_k: 40
  top_p: 0.5
context_size: 4096
template:
  chat: mistral-instruct
Example Template (mistral-instruct.tmpl):
{{- if .System }}
<s>{{ .System }}</s>
{{- end }}

{{- range $i, $message := .Messages }}
{{- if eq $message.Role "user" }}
[INST] {{ $message.Content }} [/INST]
{{- else if eq $message.Role "assistant" }}
{{ $message.Content }}
{{- end }}
{{- end }}
4. Special Configuration for Embedding Models

For embedding models like MPNET:
# Clone the model repository
git clone https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1 ./models/multi-qa-mpnet-base-dot-v1

# Create configuration file (mpnet.yaml)
cat > ./models/mpnet.yaml << EOL
name: mpnet
parameters:
  model: /models/multi-qa-mpnet-base-dot-v1
  embedding_model_type: sentence_transformers
EOL
5. Restart the Service

After adding new models, restart the service to make them available:
kubectl rollout restart deployment prismeai-llm -n apps
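
To confirm the rollout completed and the models were picked up, you can watch the deployment and tail the service logs:

kubectl rollout status deployment prismeai-llm -n apps
kubectl logs -n apps deploy/prismeai-llm --tail=100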

Microservice Testing

After deployment, test the service with these commands:
To test text generation capabilities, use this curl command:
curl http://localhost:5000/v1/chat/completions -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-2",
    "messages": [{
      "role": "user",
      "content": "Give me a random number."
    }],
    "temperature": 0.7,
    "max_token": 10,
    "stream": true
  }'
Replace phi-2 with your installed model name (e.g., mistral-7b-instruct, orca, or airoboros). You should receive a streamed response containing the generated text.
To test embedding generation, use this curl command:
curl http://localhost:5000/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bert",
    "input": "A long time ago in a galaxy far, far away"
  }'
Replace bert with your installed embedding model name (e.g., mpnet). You should receive a response containing a vector of floating-point numbers representing the text embedding.
The first inference request may take several minutes as the model is loaded into memory. Subsequent requests will be faster.

Performance Considerations

CPU Requirements

Resource Impact
  • Larger models require more CPU cores
  • 16+ vCPUs recommended for production use
  • Mistral 7B is more demanding than Phi-2
Expected Performance (16 CPU)
  • Phi-2: ~3 minutes for large context
  • Mistral 7B: ~10 minutes for large context

Memory Requirements

Resource Impact
  • Model size directly affects memory usage
  • Q4 models use less RAM than Q8 models
  • Context window size impacts memory usage
Recommendations
  • Minimum 8GB RAM for smaller models
  • 16-32GB RAM for 7B parameter models
  • Consider swap space configuration

Storage Requirements

Resource Impact
  • Model files can be several gigabytes each
  • Multiple models multiply storage needs
Recommendations
  • 10GB minimum storage
  • 50GB+ recommended for multiple models
  • Use high-performance storage when possible
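A quick way to check how much of the volume your downloaded models are using:

du -sh ./models/*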

Optimization Options

Strategies
  • Use quantized models (Q4_0, Q4_K_M) for better performance
  • Consider GPU acceleration for production
  • Adjust context size based on needs
  • Use streaming for better user experience

Troubleshooting

Symptom: Extremely slow response times (10+ minutes)
Possible Causes:
  • First inference requires loading the model into memory
  • Insufficient CPU resources
  • Large model with small resource allocation
Resolution Steps:
  1. Wait for the first inference to complete (can take 10+ minutes)
  2. Enable debug mode by setting DEBUG: true in your environment
  3. Check logs for memory or resource constraints
  4. Consider using smaller models or increasing resources
Note: On resource-constrained environments such as a MacBook M2, generation can be as slow as one token every 7 seconds.
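
Debug mode (step 2 above) can be toggled directly on the running deployment with kubectl, which triggers a new rollout; for a persistent setting, add the variable to the service's env list in prismeai-apps/values.yaml instead:

kubectl set env deployment/prismeai-llm -n apps DEBUG=true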
Symptom: Service fails to start or responds with "model not found" errors
Possible Causes:
  • Incorrect file paths in YAML configuration
  • Missing model files
  • Incomplete model download
  • Permission issues with model files
Resolution Steps:
  1. Verify model files exist in the correct location
  2. Check YAML configuration for correct paths
  3. Ensure file permissions allow the service to read files
  4. Try downloading the model files again
Symptom: Service crashes with OOM (Out of Memory) errors
Possible Causes:
  • Model too large for allocated memory
  • Multiple models loaded simultaneously
  • Large context windows in requests
Resolution Steps:
  1. Increase memory allocation
  2. Use more quantized models (Q4 instead of Q8)
  3. Reduce context size in model configuration
  4. Consider container-level memory limits
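
To confirm that a crash was indeed memory-related, inspect the pod's last termination state, which reports OOMKilled when the container exceeded its memory limit (replace the pod name placeholder with the name returned by the first command):

kubectl get pods -n apps | grep prismeai-llm
kubectl describe pod <prismeai-llm-pod-name> -n apps | grep -A 5 "Last State"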

Integration with AI Knowledge

To use your local models with Prisme.ai’s AI Knowledge:
1. Access Project Settings

Navigate to your AI Knowledge project and open the settings panel.
2. Configure Model Settings

Update the model settings to use your local models:
  • Text Generation Model: Enter the model name (e.g., “orca”, “airoboros”, “phi-2”)
  • Embedding Model: Enter the embedding model name (e.g., “bert”, “mpnet”)
If you change the embedding model of an existing project, you’ll need to create a new project instead. This is because different embedding models produce vectors of different dimensions, which affects the Redis indexing structure.
3. Save and Test

Save your settings and test with a few queries to ensure proper integration.

Advanced Configuration

To add new models to LocalAI:
  1. Find a model on Hugging Face (prefer models with GGUF versions)
  2. Download the model file to your /models directory:
    curl -L "URL_TO_MODEL" -o ./models/MODEL_NAME.gguf
    
  3. Find or create an appropriate template from the model gallery
  4. Create a YAML configuration file
  5. Restart the service
For detailed instructions, refer to the LocalAI documentation.
For production environments, GPU acceleration can dramatically improve performance:
  1. Use a GPU-enabled container image
  2. Configure the appropriate GPU drivers and runtime
  3. Update the model configuration to use GPU resources
  4. Adjust container resource allocation to include GPU access
With proper GPU acceleration, models like Mistral 7B can achieve near real-time performance similar to cloud services.
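As an illustration of the resource allocation step, GPU access is granted at the container level through a standard Kubernetes resource limit. The patch below is a sketch: it assumes the NVIDIA device plugin is installed on the cluster and that the container is named prismeai-llm; in practice you would set the same limit through your Helm values.

kubectl patch deployment prismeai-llm -n apps --patch '
spec:
  template:
    spec:
      containers:
        - name: prismeai-llm        # container name is an assumption
          resources:
            limits:
              nvidia.com/gpu: 1     # requires the NVIDIA device plugin
'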
To customize how prompts are formatted for specific models:
  1. Create a new template file (e.g., custom-format.tmpl)
  2. Define the prompt format, using variables like .System, .Messages, etc.
  3. Reference the template in your model’s YAML configuration
This allows you to optimize prompt formats for different model architectures and fine-tunings.