NVIDIA Triton Inference Server

(For the self-hostable, S3-compatible object storage, see the Triton Object Storage entry.)

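NVIDIA Triton™ Inference Server is open-source inference-serving software that helps standardize model deployment and execution and delivers fast, scalable AI in production.

As part of the NVIDIA AI platform, Triton Inference Server streamlines and standardizes AI inference by letting teams deploy, run, and scale trained AI models from any framework on GPU- or CPU-based infrastructure. With Triton, AI researchers and data scientists can freely choose the right framework for their projects without affecting production deployment, and developers can deliver high-performance inference across cloud, on-premises, edge, and embedded devices.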

Run on System with GPUs

docker run --gpus=1 --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/full/path/to/docs/examples/model_repository:/models nvcr.io/nvidia/tritonserver:<xx.yy>-py3 tritonserver --model-repository=/models

For reference, this is how I ran it:

docker pull nvcr.io/nvidia/tritonserver:24.05-py3
docker run --rm -it \
  --gpus=all \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  -v "$(pwd)/model":/models \
  nvcr.io/nvidia/tritonserver:24.05-py3 \
  tritonserver \
  --model-repository=/models

8000 - HTTP
8001 - gRPC
8002 - Metrics
-v {local model path}:/models - the local model path given here is where the trained models to be served are placed.
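The directory mounted at /models is expected to follow Triton's model repository layout: one subdirectory per model, each containing numeric version subdirectories and usually a config.pbtxt. The following is a minimal sketch for checking what a local directory would expose before starting the container. It uses only the standard library; the function name list_models is hypothetical and not part of any Triton tooling.

```python
from pathlib import Path


def list_models(repository):
    """Map each model directory name to its sorted numeric version directories."""
    models = {}
    for model_dir in sorted(Path(repository).iterdir()):
        if not model_dir.is_dir():
            continue  # skip stray files at the repository root
        versions = sorted(
            int(p.name)
            for p in model_dir.iterdir()
            if p.is_dir() and p.name.isdigit()  # version directories are named 1, 2, ...
        )
        models[model_dir.name] = versions
    return models
```

A model that maps to an empty version list has nothing to load, which is worth checking before looking at server logs.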

Verify Triton Is Running Correctly

Use Triton’s ready endpoint to verify that the server and the models are ready for inference. From the host system use curl to access the HTTP endpoint that indicates server status.

curl -v localhost:8000/v2/health/ready
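The same readiness check can be scripted, for example to wait for the server in a startup script. This is a minimal sketch using only the standard library; the helper names ready_url and is_ready are hypothetical, and the host and port defaults assume the port mapping shown above.

```python
import urllib.error
import urllib.request


def ready_url(host="localhost", port=8000):
    """Build the URL of Triton's HTTP readiness endpoint."""
    return f"http://{host}:{port}/v2/health/ready"


def is_ready(host="localhost", port=8000, timeout=2.0):
    """Return True only if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(ready_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status all count as "not ready".
        return False
```

Calling is_ready() in a loop with a short sleep is one simple way to block until the models have loaded.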

Docker images

Triton Inference Server | NVIDIA NGC

Docker images are available:

The xx.yy-py3 image contains the Triton inference server with support for TensorFlow, PyTorch, TensorRT, ONNX and OpenVINO models. <- if you are not sure, pull this one
The xx.yy-py3-sdk image contains Python and C++ client libraries, client examples, and the Model Analyzer.
The xx.yy-py3-min image is used as the base for creating custom Triton server containers as described in Customize Triton Container.
The xx.yy-pyt-python-py3 image contains the Triton Inference Server with support for PyTorch and Python backends only.
The xx.yy-tf2-python-py3 image contains the Triton Inference Server with support for TensorFlow 2.x and Python backends only.

Checking the NVIDIA driver version

(MLOps) Building a Triton Inference Server, Part 1 - Installation

The Triton container version you can use depends on the NVIDIA driver version. In my case the driver version was 515.86.01, so I built the Docker image from Triton Inference Server container version 22.08. If your driver is older than the minimum driver version the container requires, errors may occur, so be sure to check!

Look up the container version that matches your driver version in the official documentation below.

Triton Inference Server release notes: https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html

Alternatively, check https://docs.nvidia.com/deeplearning/frameworks/support-matrix/index.html.
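The comparison described above can be sketched in a few lines. This assumes nvidia-smi is on the PATH; the helper names are hypothetical, and the minimum version passed to meets_minimum is a placeholder that should be taken from the release notes for the container you plan to use.

```python
import subprocess


def parse_version(text):
    """Turn a driver string such as '515.86.01' into a comparable tuple (515, 86, 1)."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_version():
    """Ask nvidia-smi for the driver version of the first GPU."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return parse_version(out.splitlines()[0])


def meets_minimum(installed, minimum):
    """Tuple comparison: True if the installed driver is at least the minimum."""
    return installed >= minimum
```

For example, meets_minimum(driver_version(), parse_version("<minimum from the release notes>")) would be evaluated on the GPU host before pulling an image.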

Model serving

tritonclient

Github - triton-inference-server/client - Triton Python, C++ and Java client libraries, and GRPC-generated client examples for go, java and scala.

pip install tritonclient[all]

Using all installs both the HTTP/REST and GRPC client libraries. There are two optional packages available, grpc and http that can be used to install support specifically for the protocol. For example, to install only the HTTP/REST client library use,

pip install tritonclient[http]

There is another optional package namely cuda, that must be installed in order to use cuda_shared_memory utilities. all specification will install the cuda package by default but in other cases cuda needs to be explicitly specified for installing client with cuda_shared_memory support.

pip install tritonclient[http,cuda]
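After installing, it can be useful to confirm which client modules actually import in the current environment, since a missing extra typically shows up only as an import error at use time. A minimal sketch, assuming the module paths tritonclient.http, tritonclient.grpc and tritonclient.utils.cuda_shared_memory; the helper names can_import and available_clients are hypothetical.

```python
import importlib


def can_import(module_name):
    """Return True if the module imports cleanly in this environment."""
    try:
        importlib.import_module(module_name)
        return True
    except Exception:
        # A missing package or a missing native dependency both mean "not usable".
        return False


def available_clients():
    """Report which Triton client modules are usable."""
    names = (
        "tritonclient.http",
        "tritonclient.grpc",
        "tritonclient.utils.cuda_shared_memory",
    )
    return {name: can_import(name) for name in names}
```

Printing available_clients() right after installation shows at a glance whether the extras you asked for are in place.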

Connecting over HTTP

import tritonclient.http as httpclient
import numpy as np

# Triton Inference Server URL (HTTP port)
url = "localhost:8000"

# Create the client
client = httpclient.InferenceServerClient(url=url)

# Prepare the input data
inputs = []
outputs = []
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)  # example input

# "input_name" must match the input tensor name in the model configuration
inputs.append(httpclient.InferInput("input_name", input_data.shape, "FP32"))
inputs[-1].set_data_from_numpy(input_data)

# "output_name" must match the output tensor name in the model configuration
outputs.append(httpclient.InferRequestedOutput("output_name"))

# Run inference
results = client.infer("my_model", inputs, outputs=outputs)

# Inspect the result
output_data = results.as_numpy("output_name")
print(output_data)
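Under the hood the HTTP client sends a request that follows the KServe v2 inference protocol. The sketch below builds the plain JSON form of such a request with the standard library, to show what the call above carries; the client library may encode tensors in binary form instead. The helper names are hypothetical, and "my_model", "input_name" and "output_name" are the same placeholders used above.

```python
import json


def infer_url(model_name, host="localhost", port=8000):
    """URL that an inference request for the given model is POSTed to."""
    return f"http://{host}:{port}/v2/models/{model_name}/infer"


def build_infer_request(input_name, shape, datatype, flat_data, output_name):
    """Build the JSON body: one input tensor (flattened data) and one requested output."""
    return {
        "inputs": [
            {
                "name": input_name,
                "shape": list(shape),
                "datatype": datatype,
                "data": list(flat_data),
            }
        ],
        "outputs": [{"name": output_name}],
    }


# A tiny 1x2x2 FP32 tensor keeps the example readable.
body = build_infer_request("input_name", (1, 2, 2), "FP32", [0.0, 0.25, 0.5, 1.0], "output_name")
payload = json.dumps(body).encode("utf-8")
```

The payload would be sent as the body of a POST to infer_url("my_model"); the response lists the output tensors under an "outputs" key in the same name, shape, datatype, data form.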

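Using shared memory

When starting the server, add the ulimit options and the options related to shared memory (/dev/shm) access. If the client runs on the host rather than inside the container, the shared memory region also has to be visible to both sides, for example by adding --ipc=host.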

docker run --gpus=1 --rm --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 \
  -p8000:8000 -p8001:8001 -p8002:8002 \
  -v /path/to/model/repository:/models nvcr.io/nvidia/tritonserver:23.04-py3 \
  tritonserver --model-repository=/models

Client:

import tritonclient.http as httpclient
import tritonclient.utils.shared_memory as shm
import numpy as np

# Triton Inference Server URL (HTTP port)
url = "localhost:8000"

# Create the client
client = httpclient.InferenceServerClient(url=url)

# Model information
model_name = "my_model"
input_name = "input_name"
output_name = "output_name"

# Prepare the input data
input_data = np.random.rand(1, 3, 224, 224).astype(np.float32)

# Shared memory sizes
input_byte_size = input_data.nbytes
output_byte_size = input_data.nbytes  # this example assumes the output is the same size as the input

# Create the shared memory regions on the client side
input_shm_region = shm.create_shared_memory_region("input_data", "/input", input_byte_size)
output_shm_region = shm.create_shared_memory_region("output_data", "/output", output_byte_size)

# Write the input data into shared memory
shm.set_shared_memory_region(input_shm_region, [input_data])

# Register the regions with the Triton server
client.register_system_shared_memory("input_data", "/input", input_byte_size)
client.register_system_shared_memory("output_data", "/output", output_byte_size)

# Set up inputs and outputs
inputs = []
outputs = []

inputs.append(httpclient.InferInput(input_name, input_data.shape, "FP32"))
inputs[-1].set_shared_memory("input_data", input_byte_size)

outputs.append(httpclient.InferRequestedOutput(output_name, binary_data=True))
outputs[-1].set_shared_memory("output_data", output_byte_size)

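# Run inference
results = client.infer(model_name, inputs, outputs=outputs)

# Read the result back from shared memory, using the shape reported in the response
output = results.get_output(output_name)
output_data = shm.get_contents_as_numpy(output_shm_region, np.float32, output["shape"])
print(output_data)

# Unregister and release the shared memory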
client.unregister_system_shared_memory("input_data")
client.unregister_system_shared_memory("output_data")
shm.destroy_shared_memory_region(input_shm_region)
shm.destroy_shared_memory_region(output_shm_region)
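The client above sizes the output region by assuming the output is as large as the input, which does not hold for most models. A region's size follows from the tensor's shape and datatype, as in the minimal sketch below. The helper name tensor_byte_size is hypothetical, the table covers only fixed-size datatypes (variable-length BYTES tensors are excluded), and the (1, 1000) output shape in the usage note is an illustrative assumption.

```python
from math import prod

# Bytes per element for Triton's fixed-size datatypes.
DTYPE_BYTES = {
    "BOOL": 1, "UINT8": 1, "INT8": 1,
    "UINT16": 2, "INT16": 2, "FP16": 2, "BF16": 2,
    "UINT32": 4, "INT32": 4, "FP32": 4,
    "UINT64": 8, "INT64": 8, "FP64": 8,
}


def tensor_byte_size(shape, datatype):
    """Number of bytes needed to hold a dense tensor of the given shape and datatype."""
    return prod(shape) * DTYPE_BYTES[datatype]
```

For the (1, 3, 224, 224) FP32 input this gives 602112 bytes, while a hypothetical (1, 1000) FP32 classifier output would need only 4000 bytes.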

Shared-Memory Extension

System Shared Memory

Passing tensors between the client library and Triton through system shared memory can improve performance.

Because Python has no standard way to allocate and access shared memory¹, a simple example module is provided.

That module lets the Python client library create, set, and destroy system shared memory.

https://github.com/triton-inference-server/client/tree/main/src/python/examples
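To illustrate the mechanism itself, independent of Triton, here is a minimal sketch of two handles exchanging float32 values through a named shared memory region. It uses multiprocessing.shared_memory from the Python 3.8+ standard library rather than Triton's example module, and both handles live in one process only to keep the example self-contained.

```python
from array import array
from multiprocessing import shared_memory

values = array("f", [1.0, 2.5, -3.0])  # float32 payload
nbytes = len(values) * values.itemsize

# "Producer" side: create a named region and copy the payload into it.
region = shared_memory.SharedMemory(create=True, size=nbytes)
region.buf[:nbytes] = values.tobytes()

# "Consumer" side: attach to the same region by name, as another process would.
view = shared_memory.SharedMemory(name=region.name)
received = array("f")
received.frombytes(bytes(view.buf[:nbytes]))  # the region may be larger than requested

# Detach both handles, then remove the region itself.
view.close()
region.close()
region.unlink()
```

No tensor bytes travel over a socket here; only the region name has to be communicated, which is the source of the performance benefit described above.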

CUDA Shared Memory

Quick Start

Setting up a k8s server

Notes on deploying NVIDIA Triton Inference Server to Kubernetes and testing parallel GPU use (Kubernetes Kubeflow KFServing)
Building a k8s Triton Inference Server cluster
Github - NVIDIA/k8s-device-plugin - NVIDIA device plugin for Kubernetes

Python Client Examples

client/src/python/examples at main · triton-inference-server/client

List of client example files

ensemble_image_client.py
grpc_client.py
grpc_explicit_byte_content_client.py
grpc_explicit_int8_content_client.py
grpc_explicit_int_content_client.py
grpc_image_client.py
image_client.py
memory_growth_test.py
reuse_infer_objects_client.py
simple_grpc_aio_infer_client.py
simple_grpc_aio_sequence_stream_infer_client.py
simple_grpc_async_infer_client.py
simple_grpc_cudashm_client.py
simple_grpc_custom_args_client.py
simple_grpc_custom_repeat.py
simple_grpc_health_metadata.py
simple_grpc_infer_client.py
simple_grpc_keepalive_client.py
simple_grpc_model_control.py
simple_grpc_sequence_stream_infer_client.py
simple_grpc_sequence_sync_infer_client.py
simple_grpc_shm_client.py
simple_grpc_shm_string_client.py
simple_grpc_string_infer_client.py
simple_http_aio_infer_client.py
simple_http_async_infer_client.py
simple_http_cudashm_client.py
simple_http_health_metadata.py
simple_http_infer_client.py
simple_http_model_control.py
simple_http_sequence_sync_infer_client.py
simple_http_shm_client.py
simple_http_shm_string_client.py
simple_http_string_infer_client.py

ONNX Runtime Backend

Github - triton-inference-server/onnxruntime_backend - The Triton backend for the ONNX Runtime.

Favorite site

Triton Inference Server | NVIDIA Developer
- (KO) Triton Inference Server | NVIDIA Developer
~~Github - tensorrt-inference-server~~
Github - triton-inference-server/server - The Triton Inference Server provides an optimized cloud and edge inferencing solution
NVIDIA TensorRT Inference Server - documentation
Triton Inference Server | NVIDIA NGC - Container Repository
NVIDIA's decision! TensorRT Inference Server released as open source
[Recommended] NVIDIA Triton at a glance
[Recommended] Using the Triton Python Backend

Getting started with Triton Inference Server

References

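¹ Needs verification; note that Python 3.8 and later include multiprocessing.shared_memory in the standard library.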