
Triton Inference Server for Multilingual Sentiment Analysis

This documentation provides a comprehensive guide to understanding and running the multilingual sentiment analysis project using NVIDIA's Triton Inference Server with a FastAPI proxy.

Project Overview

This project implements a multilingual sentiment analysis service built on NVIDIA's Triton Inference Server and fronted by a FastAPI proxy. Dedicated models handle English and Arabic, and a multilingual model covers other languages.

Key Components

  1. Triton Inference Server: Manages model execution and scaling
  2. FastAPI Proxy: Provides a user-friendly REST API for the sentiment analysis service
  3. Sentiment Analysis Pipeline: Each language model is split into two components:
    • Tokenizer: Prepares text for the model
    • Model: Runs inference on the tokenized inputs

Features

  • Language detection and routing to the appropriate model
  • Automated model management (load/unload based on usage)
  • Detailed sentiment analysis results with probabilities
  • Simple REST API interface

Project Structure

ai/Triton/
├── app/                           # FastAPI application
│   ├── main.py                    # FastAPI server code
│   └── start.sh                   # Script to start both Triton and FastAPI
├── models/                        # Triton model repository
│   ├── sentiment_ar/              # Arabic sentiment ensemble model
│   │   ├── 1/                     # Model version directory
│   │   └── config.pbtxt           # Ensemble configuration
│   ├── sentiment_ar-model/        # Arabic sentiment ONNX model
│   │   ├── 1/                     # Model version with model.onnx
│   │   └── config.pbtxt           # Model configuration
│   ├── sentiment_ar-tokenizer/    # Arabic tokenizer
│   │   ├── 1/                     # Model version with model.py
│   │   └── config.pbtxt           # Tokenizer configuration
│   ├── sentiment_en/              # English sentiment ensemble model
│   │   └── ...                    # Similar structure as Arabic model
│   └── sentiment_multi/           # Multilingual sentiment ensemble model
│       └── ...                    # Similar structure as other models
└── notebooks/                     # Test scripts and notebooks
    └── test_fastapi_proxy.py      # Script to test the API
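
Each *-tokenizer component runs on Triton's Python backend, so its 1/model.py implements the TritonPythonModel interface. The sketch below shows the general shape of such a file, assuming a string input named TEXT and integer outputs named input_ids and attention_mask; the tensor names and the tokenizer checkpoint are illustrative rather than taken from this repository.

# Illustrative sketch of a tokenizer model.py for Triton's Python backend.
# Tensor names ("TEXT", "input_ids", "attention_mask") and the checkpoint are assumptions.
import numpy as np
import triton_python_backend_utils as pb_utils
from transformers import AutoTokenizer


class TritonPythonModel:
    def initialize(self, args):
        # Load the Hugging Face tokenizer once per model instance.
        self.tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")

    def execute(self, requests):
        responses = []
        for request in requests:
            # TYPE_STRING inputs arrive as numpy arrays of bytes.
            text = pb_utils.get_input_tensor_by_name(request, "TEXT").as_numpy()
            texts = [t.decode("utf-8") for t in text.flatten()]

            enc = self.tokenizer(texts, padding=True, truncation=True, return_tensors="np")

            out_ids = pb_utils.Tensor("input_ids", enc["input_ids"].astype(np.int64))
            out_mask = pb_utils.Tensor("attention_mask", enc["attention_mask"].astype(np.int64))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_ids, out_mask]))
        return responses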

Prerequisites

  • NVIDIA GPU with CUDA support
  • Docker
  • NVIDIA Container Toolkit
  • Python 3.10+ (for local development/testing)

Getting Started

Step 1: Build the Docker Image

Build the Docker image using the provided Dockerfile:

docker build -t triton:1 .

This builds an image based on NVIDIA's Triton Inference Server (version 24.12-py3) with the additional dependencies needed for the FastAPI proxy.
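
The Dockerfile lives in the repository; conceptually it extends the Triton base image with the proxy's Python dependencies, roughly as in this sketch (the exact package list is an assumption):

# Illustrative sketch; the repository's Dockerfile is authoritative.
FROM nvcr.io/nvidia/tritonserver:24.12-py3

# Python dependencies for the FastAPI proxy and the Python-backend tokenizers
# (the exact package set is an assumption).
RUN pip install --no-cache-dir fastapi uvicorn requests transformers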

Step 2: Start the Container

Use Docker Compose to start the container:

docker-compose up -d

This will:

  • Start the Triton container with GPU access
  • Mount the ./models and ./app directories into the container
  • Map the necessary ports:
    • 8000: Triton HTTP API
    • 8001: Triton gRPC API
    • 8002: Triton Metrics API
    • 8005: FastAPI Proxy API
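
The repository's docker-compose.yml is authoritative; a sketch consistent with the mounts and ports above could look like this (the service name, in-container mount points, and the idle command are assumptions):

# Illustrative sketch only; see the repository's docker-compose.yml.
services:
  triton:
    image: triton:1
    container_name: triton
    volumes:
      - ./models:/models
      - ./app:/app
    ports:
      - "8000:8000"   # Triton HTTP API
      - "8001:8001"   # Triton gRPC API
      - "8002:8002"   # Triton Metrics API
      - "8005:8005"   # FastAPI proxy
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    command: sleep infinity   # services are started manually in Step 3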

Step 3: Start the Services

Connect to the container and start the Triton server and FastAPI proxy:

docker exec -it triton /bin/bash
cd /app
./start.sh

This script:

  1. Starts the Triton server with all sentiment models loaded (Arabic, English, and Multilingual)
  2. Waits for the Triton server to become ready
  3. Starts the FastAPI proxy server on port 8005
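
For orientation, that sequence can be sketched roughly as follows; the flags, paths, and readiness check are assumptions, and app/start.sh remains the authoritative version:

#!/bin/bash
# Illustrative sketch of start.sh; the real script may differ.

# 1. Start Triton with the sentiment ensembles (model repository path assumed to be /models;
#    in explicit model-control mode the tokenizer/model components may also need listing).
tritonserver --model-repository=/models \
             --model-control-mode=explicit \
             --load-model=sentiment_ar --load-model=sentiment_en --load-model=sentiment_multi &

# 2. Wait until Triton reports ready on its HTTP health endpoint.
until curl -sf http://localhost:8000/v2/health/ready; do
  sleep 2
done

# 3. Start the FastAPI proxy on port 8005.
uvicorn main:app --host 0.0.0.0 --port 8005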

API Usage

Sentiment Analysis Endpoint

POST /models/sentiment

Request body:

{
  "text": "I really enjoyed this movie!",
  "lang": "en"
}

The lang parameter can be:

  • "en" for English text
  • "ar" for Arabic text
  • Any other value will use the multilingual model
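
Inside the proxy, this routing amounts to mapping the lang value to one of the Triton ensemble names; conceptually (the actual code in main.py may differ):

# Conceptual sketch of the language-to-model routing; main.py may differ.
def pick_model(lang: str) -> str:
    if lang == "en":
        return "sentiment_en"
    if lang == "ar":
        return "sentiment_ar"
    return "sentiment_multi"  # any other language code falls back to the multilingual model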

Example response:

{
  "predicted_class_label": "positive",
  "predicted_probability": 0.95,
  "class_probabilities": {
    "positive": 0.95,
    "negative": 0.02,
    "neutral": 0.03
  },
  "model_used": "sentiment_en",
  "raw_text_length": 28
}
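
For example, calling the endpoint from Python (the proxy is assumed to be reachable on localhost:8005, as configured above):

# Minimal client example for the sentiment endpoint (proxy assumed at localhost:8005).
import requests

resp = requests.post(
    "http://localhost:8005/models/sentiment",
    json={"text": "I really enjoyed this movie!", "lang": "en"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # e.g. {"predicted_class_label": "positive", ...}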

Health Check

GET /health

Returns the status of the server and loaded models.

Model Management Endpoints

POST /models/{model_key}/load
POST /models/{model_key}/unload

Where model_key can be:

  • ar_sentiment - Arabic sentiment model
  • en_sentiment - English sentiment model
  • multi_sentiment - Multilingual sentiment model
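
For example, to keep the English model resident during a batch job and release it afterwards (proxy assumed at localhost:8005):

# Explicitly load and later unload the English model via the proxy.
import requests

BASE_URL = "http://localhost:8005"

requests.post(f"{BASE_URL}/models/en_sentiment/load", timeout=60).raise_for_status()
# ... send /models/sentiment requests while the model is resident ...
requests.post(f"{BASE_URL}/models/en_sentiment/unload", timeout=60).raise_for_status()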

Configuration

Model Management

The FastAPI proxy includes a model management system that automatically loads and unloads models based on usage. The configuration in main.py includes:

MODEL_MANAGEMENT = {
    "enabled": True,         # Enable/disable model management
    "idle_threshold": 3600,  # Unload models after 3600 seconds (1 hour) of inactivity
    "check_interval": 300,   # Check for idle models every 300 seconds (5 minutes)
    "always_loaded": [],     # Models that should never be unloaded
    "load_timeout": 30       # Timeout for model loading in seconds
}
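
One way such idle-based unloading can be implemented is to record a timestamp per model on every request and periodically unload anything idle for longer than idle_threshold, using Triton's model control API via the tritonclient package. The sketch below is simplified (no background scheduling, no always_loaded handling) and is not the actual logic in main.py:

# Simplified sketch of idle-based unloading; app/main.py may differ.
import time
import tritonclient.http as triton_http

triton = triton_http.InferenceServerClient(url="localhost:8000")
last_used = {}  # model name -> timestamp of the most recent request

def mark_used(model_name):
    last_used[model_name] = time.time()

def unload_idle_models(idle_threshold=3600):
    """Unload any model idle for longer than idle_threshold seconds."""
    now = time.time()
    for model_name, ts in list(last_used.items()):
        # Models listed in MODEL_MANAGEMENT["always_loaded"] would be skipped here.
        if now - ts > idle_threshold:
            triton.unload_model(model_name)
            del last_used[model_name]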

Testing

You can test the API using the provided test script:

python notebooks/test_fastapi_proxy.py

This script can test:

  • Server health
  • Sentiment analysis for different languages
  • Model loading/unloading
  • Performance metrics

Troubleshooting

  1. GPU not detected: Make sure the NVIDIA Container Toolkit is installed and your Docker Compose file includes the correct GPU configuration.

  2. Model loading errors: Check the Triton server logs:

    docker exec -it triton cat /var/log/triton/server.log

  3. API errors: Check the FastAPI logs:

    docker exec -it triton cat /var/log/fastapi.log

  4. Model management issues: The model management system automatically unloads models after a period of inactivity. If you need a model to stay loaded, add its key to the always_loaded list in the MODEL_MANAGEMENT configuration.

Performance Optimization

  1. GPU Batch Size: Adjust the max_batch_size parameter in the model's config.pbtxt to optimize throughput (a dynamic batching sketch follows this list).

  2. Instance Count: Increase or decrease the number of model instances in the instance_group section of the model's config.pbtxt:

    instance_group [
      {
        kind: KIND_GPU
        count: 2  # Adjust based on your GPU memory
      }
    ]
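
For the batch-size tuning mentioned in point 1, max_batch_size is commonly paired with Triton's dynamic batcher so that individual requests are grouped into larger batches on the server. An illustrative config.pbtxt fragment (the values are examples, not tuned for this project):

max_batch_size: 8
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}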

Adding New Language Models

To add a new language model:

  1. Add the model files to the models directory following the existing structure
  2. Update the MODEL_CONFIG dictionary in main.py to include your new model (an illustrative entry is sketched after this list)
  3. Update the start.sh script to load your new model
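
The exact schema of MODEL_CONFIG is defined in main.py; an entry typically needs to tie a model key to the new Triton ensemble, roughly along these lines (the keys and values below are purely illustrative, not the actual schema):

# Purely illustrative; the real MODEL_CONFIG schema in app/main.py may differ.
MODEL_CONFIG = {
    "fr_sentiment": {
        "triton_model": "sentiment_fr",  # name of the new ensemble under models/
        "languages": ["fr"],             # lang values routed to this model
    },
}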

Conclusion

This project provides a scalable and efficient solution for multilingual sentiment analysis using NVIDIA's Triton Inference Server. The FastAPI proxy makes it easy to integrate with other applications through a simple REST API.