Local LLM
Deploy and configure local language models in your Prisme.ai environment with the prismeai-llm microservice
The `prismeai-llm` microservice enables you to integrate open-source language models into your Prisme.ai environment. This service supports both LocalAI and Ollama as runtime engines, giving you flexibility in how you deploy and manage your models.
Overview
The `prismeai-llm` microservice allows you to:
- Run Local Models: deploy and run open-source language models directly within your infrastructure
- Text Generation: generate text completions and chat responses using various models
- Embeddings: create vector embeddings for semantic search and content understanding
- OpenAI-Compatible API: interface with models using the familiar OpenAI API format
By default, the service uses LocalAI with the pre-built image available at `quay.io/go-skynet/local-ai`. You can also configure it to use Ollama as an alternative runtime.
Installation Prerequisites
Before deploying the `prismeai-llm` microservice, ensure you have:
- Storage Volume: a persistent volume to store model files, which can be substantial in size
- Sufficient Resources: adequate CPU, RAM, and potentially GPU resources depending on model size
Language models can be resource-intensive. We recommend starting with a 16 vCPU machine for initial deployment. For production use with larger models, consider dedicated resources and possibly GPU acceleration.
Deployment Options
vLLM
When using vLLM, you only need to add your endpoints in the AI Knowledge configuration. Please refer to the AI Knowledge configuration documentation for instructions on how to add these endpoints and the corresponding credentials.
LocalAI (Default)
When using LocalAI, you'll need to provide specific files for each model in the `./models` directory:
- YAML File: a configuration file describing the model
- Template File: a `.tmpl` file defining the prompt format (not needed for embedding models)
- Model File: a GGUF (CPU-compatible) file containing the actual model weights
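For example, a `./models` directory holding a Mistral chat model and an MPNET embedding model might look like this (file names are illustrative):

```text
models/
├── mistral.yaml                            # model configuration
├── mistral-instruct.tmpl                   # prompt template
├── mistral-7b-instruct-v0.2.Q4_K_M.gguf    # model weights (GGUF)
└── mpnet.yaml                              # embedding model configuration (no .tmpl needed)
```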
Ollama Configuration
To use Ollama instead of LocalAI, modify your `prismeai-apps/values.yaml` configuration:
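The exact keys depend on your chart version; as a rough sketch, switching the runtime typically means pointing the service at the Ollama image and its model directory (the key names below are illustrative, not the chart's definitive schema):

```yaml
# Illustrative values.yaml excerpt -- adapt key names to your chart version
prismeai-llm:
  image:
    repository: ollama/ollama     # use the Ollama runtime instead of LocalAI
    tag: latest
  env:
    - name: OLLAMA_MODELS         # where Ollama stores downloaded models
      value: /models              # should point at your persistent volume
```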
Model Installation
Create Model Directory
Ensure you have a `/models` directory in your specified storage volume.
Download Model Files
For each model, you’ll need to download the GGUF model file. For example, for Mistral:
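For instance, a quantized build can be fetched from Hugging Face (the repository and file name below are illustrative; pick the quantization that fits your hardware):

```bash
# Download a quantized Mistral 7B Instruct GGUF into the models volume
wget -P /models \
  https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.2-GGUF/resolve/main/mistral-7b-instruct-v0.2.Q4_K_M.gguf
```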
The file size can be several gigabytes. Q4 models are more compressed (smaller, faster, less accurate) while Q8 models are less compressed (larger, slower, more accurate).
Add Configuration Files
Create or copy the required YAML and template files for your model.
Example YAML (mistral.yaml):
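A minimal LocalAI model definition might look like the following; the model file, context size, and template names are illustrative and must match the files present in `./models`:

```yaml
name: mistral-7b-instruct
parameters:
  model: mistral-7b-instruct-v0.2.Q4_K_M.gguf   # GGUF file in ./models
  temperature: 0.2
context_size: 4096
threads: 8
template:
  chat: mistral-instruct        # refers to mistral-instruct.tmpl
  completion: mistral-instruct
```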
Example Template (mistral-instruct.tmpl):
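A minimal template following the Mistral instruct format; LocalAI renders the incoming prompt into the template variables (the variables available depend on your LocalAI version):

```text
[INST] {{.Input}} [/INST]
```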
Special Configuration for Embedding Models
For embedding models like MPNET:
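Embedding models only need a YAML file with embeddings enabled; no `.tmpl` file is required. A sketch, assuming a sentence-transformers-style backend (backend and model names vary by LocalAI version):

```yaml
name: mpnet
embeddings: true
backend: sentencetransformers     # backend name depends on your LocalAI version
parameters:
  model: all-mpnet-base-v2
```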
Restart the Service
After adding new models, restart the service to make them available:
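On Kubernetes this is typically a rollout restart (the deployment name and namespace below are illustrative):

```bash
kubectl rollout restart deployment/prismeai-llm -n prismeai
```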
Access the Ollama Container
Connect to your Ollama container:
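For a Kubernetes deployment, open a shell in the pod (deployment name and namespace are illustrative):

```bash
kubectl exec -it deployment/prismeai-llm -n prismeai -- /bin/sh
```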
Download Models
Download models from the Ollama library:
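For example, to pull Mistral:

```bash
ollama pull mistral
```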
Models are automatically downloaded to the directory specified by the `OLLAMA_MODELS` environment variable.
Verify Model Installation
Check that models were correctly downloaded:
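Inside the container, list the locally available models:

```bash
ollama list
```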
This should show all available models.
For offline installation, you can run Ollama on an internet-connected machine, then copy the model files (typically from `~/.ollama/models` on macOS, `/usr/share/ollama/.ollama/models` on Linux, or `C:\Users\<username>\.ollama\models` on Windows) to your target environment.
Microservice Testing
After deployment, test the service with these commands:
Test Text Generation
To test text generation capabilities, use this curl command:
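A minimal request against the OpenAI-compatible chat endpoint; the example assumes you can reach the service on LocalAI's default port 8080 (for example via `kubectl port-forward`):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-2",
    "messages": [{"role": "user", "content": "Write a one-sentence greeting."}],
    "stream": true
  }'
```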
Replace `phi-2` with your installed model name (e.g., `mistral-7b-instruct`, `orca`, or `airoboros`).
You should receive a streamed response containing the generated text.
Test Embedding Generation
To test embedding generation, use this curl command:
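Again assuming the service is reachable on port 8080:

```bash
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bert",
    "input": "Prisme.ai supports local embedding models."
  }'
```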
Replace `bert` with your installed embedding model name (e.g., `mpnet`).
You should receive a response containing a vector of floating-point numbers representing the text embedding.
The first inference request may take several minutes as the model is loaded into memory. Subsequent requests will be faster.
Performance Considerations
CPU Requirements
Resource Impact
- Larger models require more CPU cores
- 16+ vCPUs recommended for production use
- Mistral 7B is more demanding than Phi-2
Expected Performance (16 CPU)
- Phi-2: ~3 minutes for large context
- Mistral 7B: ~10 minutes for large context
Memory Requirements
Resource Impact
- Model size directly affects memory usage
- Q4 models use less RAM than Q8 models
- Context window size impacts memory usage
Recommendations
- Minimum 8GB RAM for smaller models
- 16-32GB RAM for 7B parameter models
- Consider swap space configuration
Storage Requirements
Resource Impact
- Model files can be several gigabytes each
- Multiple models multiply storage needs
Recommendations
- 10GB minimum storage
- 50GB+ recommended for multiple models
- Use high-performance storage when possible
Optimization Options
Strategies
- Use quantized models (Q4_0, Q4_K_M) for better performance
- Consider GPU acceleration for production
- Adjust context size based on needs
- Use streaming for better user experience
Troubleshooting
Slow Response Times
Symptom: Extremely slow response times (10+ minutes)
Possible Causes:
- First inference requires loading the model into memory
- Insufficient CPU resources
- Large model with small resource allocation
Resolution Steps:
- Wait for the first inference to complete (can take 10+ minutes)
- Enable debug mode by setting `DEBUG: true` in your environment
- Check logs for memory or resource constraints
- Consider using smaller models or increasing resources
Note: in resource-constrained environments such as a MacBook M2, generation can slow to roughly one token every 7 seconds.
Model Loading Errors
Symptom: Service fails to start or responds with model not found errors
Possible Causes:
- Incorrect file paths in YAML configuration
- Missing model files
- Incomplete model download
- Permission issues with model files
Resolution Steps:
- Verify model files exist in the correct location
- Check YAML configuration for correct paths
- Ensure file permissions allow the service to read files
- Try downloading the model files again
Out of Memory Errors
Symptom: Service crashes with OOM (Out of Memory) errors
Possible Causes:
- Model too large for allocated memory
- Multiple models loaded simultaneously
- Large context windows in requests
Resolution Steps:
- Increase memory allocation
- Use more aggressively quantized models (Q4 instead of Q8)
- Reduce context size in model configuration
- Consider container-level memory limits
Integration with AI Knowledge
To use your local models with Prisme.ai’s AI Knowledge:
Access Project Settings
Navigate to your AI Knowledge project and open the settings panel.
Configure Model Settings
Update the model settings to use your local models:
- Text Generation Model: Enter the model name (e.g., “orca”, “airoboros”, “phi-2”)
- Embedding Model: Enter the embedding model name (e.g., “bert”, “mpnet”)
If you change the embedding model of an existing project, you’ll need to create a new project instead. This is because different embedding models produce vectors of different dimensions, which affects the Redis indexing structure.
Save and Test
Save your settings and test with a few queries to ensure proper integration.
Advanced Configuration
Adding New Models
To add new models to LocalAI:
- Find a model on Hugging Face (prefer models with GGUF versions)
- Download the model file to your `/models` directory (see the sketch after this list)
- Find or create an appropriate template from the model gallery
- Create a YAML configuration file
- Restart the service
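As an illustration, the steps above might look like this for a Phi-2 GGUF build (repository URL, file names, and template name are illustrative):

```bash
# Download the GGUF weights into the models volume
wget -P /models \
  https://huggingface.co/TheBloke/phi-2-GGUF/resolve/main/phi-2.Q4_K_M.gguf

# Create the YAML configuration (the referenced template file must also exist)
cat > /models/phi-2.yaml <<'EOF'
name: phi-2
parameters:
  model: phi-2.Q4_K_M.gguf
context_size: 2048
template:
  chat: phi-2-chat
  completion: phi-2-chat
EOF
```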
For detailed instructions, refer to the LocalAI documentation.
GPU Acceleration
For production environments, GPU acceleration can dramatically improve performance:
- Use a GPU-enabled container image
- Configure the appropriate GPU drivers and runtime
- Update the model configuration to use GPU resources
- Adjust container resource allocation to include GPU access
With proper GPU acceleration, models like Mistral 7B can achieve near real-time performance similar to cloud services.
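For example, with a CUDA-enabled LocalAI image you would typically offload layers to the GPU in the model YAML (a sketch; field names and values depend on your LocalAI version and available GPU memory):

```yaml
name: mistral-7b-instruct
f16: true
gpu_layers: 35                  # number of layers offloaded to the GPU
parameters:
  model: mistral-7b-instruct-v0.2.Q4_K_M.gguf
context_size: 4096
template:
  chat: mistral-instruct
  completion: mistral-instruct
```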
Custom Model Templates
To customize how prompts are formatted for specific models:
- Create a new template file (e.g., `custom-format.tmpl`)
- Define the prompt format, using variables like `.System`, `.Messages`, etc. (a sample template follows below)
- Reference the template in your model's YAML configuration
This allows you to optimize prompt formats for different model architectures and fine-tunings.
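For instance, an Alpaca-style template could look like this (a sketch; the exact variables available depend on your LocalAI version and backend):

```text
### System:
{{.SystemPrompt}}

### User:
{{.Input}}

### Assistant:
```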