Triton Inference Server for Multilingual Sentiment Analysis
This documentation is a guide to understanding and running the multilingual sentiment analysis project, which uses NVIDIA's Triton Inference Server behind a FastAPI proxy.
Project Overview
This project implements a multilingual sentiment analysis service. Dedicated models handle English and Arabic text, and a multilingual model covers all other languages.
Key Components
- Triton Inference Server: Manages model execution and scaling
- FastAPI Proxy: Provides a user-friendly REST API for the sentiment analysis service
- Sentiment Analysis Pipeline: Each language model is split into two components (see the sketch after this list):
  - Tokenizer: Prepares text for the model
  - Model: Runs inference on the tokenized inputs
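The tokenizer components run on Triton's Python backend. The following is a minimal sketch of what such a model.py can look like; the tensor names (TEXT, input_ids, attention_mask) and the Hugging Face checkpoint are illustrative assumptions, not the project's actual configuration:

import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer

class TritonPythonModel:
    def initialize(self, args):
        # Placeholder checkpoint; each language model ships its own tokenizer.
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

    def execute(self, requests):
        responses = []
        for request in requests:
            # "TEXT" is an assumed input name; check the tokenizer's config.pbtxt.
            raw = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()
            texts = [t.decode("utf-8") for t in raw.flatten()]
            enc = self.tokenizer(texts, padding=True, truncation=True, return_tensors="np")
            responses.append(pb_utils.InferenceResponse(output_tensors=[
                pb_utils.Tensor("input_ids", enc["input_ids"].astype(np.int64)),
                pb_utils.Tensor("attention_mask", enc["attention_mask"].astype(np.int64)),
            ]))
        return responses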
Features
- Language detection and routing to the appropriate model
- Automated model management (load/unload based on usage)
- Detailed sentiment analysis results with probabilities
- Simple REST API interface
Project Structure
ai/Triton/
├── app/                         # FastAPI application
│   ├── main.py                  # FastAPI server code
│   └── start.sh                 # Script to start both Triton and FastAPI
├── models/                      # Triton model repository
│   ├── sentiment_ar/            # Arabic sentiment ensemble model
│   │   ├── 1/                   # Model version directory
│   │   └── config.pbtxt         # Ensemble configuration
│   ├── sentiment_ar-model/      # Arabic sentiment ONNX model
│   │   ├── 1/                   # Model version with model.onnx
│   │   └── config.pbtxt         # Model configuration
│   ├── sentiment_ar-tokenizer/  # Arabic tokenizer
│   │   ├── 1/                   # Model version with model.py
│   │   └── config.pbtxt         # Tokenizer configuration
│   ├── sentiment_en/            # English sentiment ensemble model
│   │   └── ...                  # Similar structure as Arabic model
│   └── sentiment_multi/         # Multilingual sentiment ensemble model
│       └── ...                  # Similar structure as the other models
└── notebooks/                   # Test scripts and notebooks
    └── test_fastapi_proxy.py    # Script to test the API
Prerequisites
- NVIDIA GPU with CUDA support
- Docker
- NVIDIA Container Toolkit
- Python 3.10+ (for local development/testing)
Getting Started
Step 1: Build the Docker Image
Build the Docker image using the provided Dockerfile:
docker build -t triton:1 .
This builds an image based on NVIDIA's Triton Inference Server (version 24.12-py3) with the additional dependencies needed for the FastAPI proxy.
Step 2: Start the Container
Use Docker Compose to start the container:
docker-compose up -d
This will:
- Start the Triton container with GPU access
- Mount the ./models and ./app directories into the container
- Map the necessary ports:
  - 8000: Triton HTTP API
  - 8001: Triton gRPC API
  - 8002: Triton Metrics API
  - 8005: FastAPI Proxy API
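For reference, a compose service matching these steps might look roughly like the sketch below. The service name, image tag, mount targets, and the sleep command that keeps the container available for docker exec are assumptions; defer to the docker-compose.yml shipped with the project.

services:
  triton:
    image: triton:1
    container_name: triton
    command: sleep infinity   # keep the container up so you can exec in and run start.sh (assumption)
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - ./models:/models
      - ./app:/app
    ports:
      - "8000:8000"   # Triton HTTP API
      - "8001:8001"   # Triton gRPC API
      - "8002:8002"   # Triton Metrics API
      - "8005:8005"   # FastAPI proxy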
Step 3: Start the Services
Connect to the container and start the Triton server and FastAPI proxy:
docker exec -it triton /bin/bash
cd /app
./start.sh
This script:
- Starts the Triton server with all sentiment models loaded (Arabic, English, and Multilingual)
- Waits for the Triton server to become ready
- Starts the FastAPI proxy server on port 8005
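A start.sh with this behavior could be sketched as follows. The explicit model-control mode, the model names passed to --load-model, and the in-container paths are assumptions based on the descriptions above; the project's actual script may differ.

#!/bin/bash
# Start Triton with the three sentiment ensembles loaded explicitly (paths are assumptions).
tritonserver --model-repository=/models \
             --model-control-mode=explicit \
             --load-model=sentiment_ar --load-model=sentiment_en --load-model=sentiment_multi &

# Wait until Triton reports ready on its HTTP health endpoint.
until curl -sf http://localhost:8000/v2/health/ready; do
  sleep 1
done

# Start the FastAPI proxy on port 8005.
uvicorn main:app --host 0.0.0.0 --port 8005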
API Usage
Sentiment Analysis Endpoint
POST /models/sentiment
Request body:
{
  "text": "I really enjoyed this movie!",
  "lang": "en"
}
The lang parameter can be:
- "en" for English text
- "ar" for Arabic text
- Any other value will use the multilingual model
Example response:
{
  "predicted_class_label": "positive",
  "predicted_probability": 0.95,
  "class_probabilities": {
    "positive": 0.95,
    "negative": 0.02,
    "neutral": 0.03
  },
  "model_used": "sentiment_en",
  "raw_text_length": 28
}
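As a quick end-to-end check, the endpoint can be called with Python's requests library; the host and port assume the default setup described above:

import requests

# Assumes the FastAPI proxy is reachable on localhost:8005.
resp = requests.post(
    "http://localhost:8005/models/sentiment",
    json={"text": "I really enjoyed this movie!", "lang": "en"},
    timeout=30,
)
resp.raise_for_status()
result = resp.json()
print(result["predicted_class_label"], result["predicted_probability"])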
Health Check
GET /health
Returns the status of the server and loaded models.
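For example, assuming the proxy runs locally on port 8005:
curl http://localhost:8005/health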
Model Management Endpoints
POST /models/{model_key}/load
POST /models/{model_key}/unload
Where model_key can be:
- ar_sentiment - Arabic sentiment model
- en_sentiment - English sentiment model
- multi_sentiment - Multilingual sentiment model
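For example, to load and then unload the Arabic model (again assuming the default host and port):
curl -X POST http://localhost:8005/models/ar_sentiment/load
curl -X POST http://localhost:8005/models/ar_sentiment/unload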
Configuration
Model Management
The FastAPI proxy includes a model management system that automatically loads and unloads models based on usage. The configuration in main.py includes:
MODEL_MANAGEMENT = {
    "enabled": True,         # Enable/disable model management
    "idle_threshold": 3600,  # Unload models after 1 hour (3600 seconds) of inactivity
    "check_interval": 300,   # Check for idle models every 5 minutes (300 seconds)
    "always_loaded": [],     # Models that should never be unloaded
    "load_timeout": 30       # Timeout for model loading in seconds
}
Testing
You can test the API using the provided test script:
python notebooks/test_fastapi_proxy.py
This script can test:
- Server health
- Sentiment analysis for different languages
- Model loading/unloading
- Performance metrics
Troubleshooting
- GPU not detected: Make sure the NVIDIA Container Toolkit is installed and your Docker Compose file includes the correct GPU configuration.
- Model loading errors: Check the Triton server logs:
  docker exec -it triton cat /var/log/triton/server.log
- API errors: Check the FastAPI logs:
  docker exec -it triton cat /var/log/fastapi.log
- Model management issues: The model management system automatically unloads models after a period of inactivity. If you need a model to stay loaded, add its key to the always_loaded list in the MODEL_MANAGEMENT configuration, as shown below.
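For example, to keep the English model resident, add its key (as used by the model management endpoints) to the list in main.py:

"always_loaded": ["en_sentiment"],  # models that should never be unloaded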
Performance Optimization
- GPU Batch Size: Adjust the max_batch_size parameter in the model's config.pbtxt to optimize throughput.
- Instance Count: Increase or decrease the number of model instances in the instance_group section of the model's config.pbtxt:
  instance_group [
    {
      kind: KIND_GPU
      count: 2  # Adjust based on your GPU memory
    }
  ]
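Beyond instance counts, Triton's dynamic batcher can also raise throughput by grouping concurrent requests into larger batches. A minimal config.pbtxt fragment follows; the batch size and queue delay are illustrative values to tune against your latency budget:

max_batch_size: 8
dynamic_batching {
  max_queue_delay_microseconds: 100
}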
Adding New Language Models
To add a new language model:
- Add the model files to the models directory following the existing structure
- Update the MODEL_CONFIG dictionary in main.py to include your new model (see the sketch after this list)
- Update the start.sh script to load your new model
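The exact shape of MODEL_CONFIG is defined in main.py. Purely as a hypothetical illustration of the second step, an entry for a new French model might look like this; the key names inside the entry are invented for the example:

# Hypothetical entry: mirror the structure of the existing entries in main.py.
MODEL_CONFIG["fr_sentiment"] = {
    "model_name": "sentiment_fr",                   # Triton ensemble name (assumption)
    "labels": ["negative", "neutral", "positive"],  # output class order (assumption)
}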
Conclusion
This project provides a scalable and efficient solution for multilingual sentiment analysis using NVIDIA's Triton Inference Server. The FastAPI proxy makes it easy to integrate with other applications through a simple REST API.