FunASR/runtime/triton_gpu/README_ONLINE.md

Steps:

  1. Prepare model repo files
├── README.md
└── model_repo_paraformer_large_online
    ├── cif_search
    │   ├── 1
    │   │   └── model.py
    │   └── config.pbtxt
    ├── decoder
    │   ├── 1
    │   │   └── decoder.onnx
    │   └── config.pbtxt
    ├── encoder
    │   ├── 1
    │   │   └── model.onnx
    │   └── config.pbtxt
    ├── feature_extractor
    │   ├── 1
    │   │   └── model.py
    │   ├── config.pbtxt
    │   └── config.yaml
    ├── lfr_cmvn_pe
    │   ├── 1
    │   │   └── lfr_cmvn_pe.onnx
    │   ├── am.mvn
    │   ├── config.pbtxt
    │   └── export_lfr_cmvn_pe_onnx.py
    └── streaming_paraformer
        ├── 1
        └── config.pbtxt
  2. Follow the instructions below to launch the Triton server
# build the server image from Dockerfile/Dockerfile.server, then start a container
docker build . -f Dockerfile/Dockerfile.server -t triton-paraformer:23.01 
docker run -it --rm --name "paraformer_triton_server" --gpus all -v <path_host/model_repo_paraformer_large_online>:/workspace/ --shm-size 1g --net host triton-paraformer:23.01 

# launch the service 
cd /workspace
tritonserver --model-repository model_repo_paraformer_large_online \
             --pinned-memory-pool-byte-size=512000000 \
             --cuda-memory-pool-byte-size=0:1024000000
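
Once `tritonserver` is running, it can help to confirm that it is ready before sending audio. The sketch below polls Triton's HTTP readiness endpoint (`/v2/health/ready`). It assumes the server listens on Triton's default HTTP port 8000 on `localhost`, which is reachable from the host because the container is started with `--net host`. The helper names `ready_url` and `is_server_ready` are illustrative and not part of this repository.

```python
import urllib.error
import urllib.request


def ready_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the URL of Triton's HTTP readiness endpoint.

    Assumption: default HTTP port 8000; change it if tritonserver
    was started with a different --http-port.
    """
    return f"http://{host}:{port}/v2/health/ready"


def is_server_ready(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready.
        return False


if __name__ == "__main__":
    url = ready_url()
    print(url, "ready" if is_server_ready(url) else "not ready")
```

If the check keeps reporting "not ready", the `tritonserver` log usually shows which model in the repository failed to load.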

Performance benchmark with a single A10

  • FP32, ONNX, Paraformer large online. The chunk size is 10 * 960 / 16000 = 0.6 s, so latency must stay below 0.6 s for the service to run in real time.
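
| Concurrency | Throughput | Latency_p50 (ms) | Latency_p90 (ms) | Latency_p95 (ms) | Latency_p99 (ms) |
|-------------|------------|------------------|------------------|------------------|------------------|
| 20          | 309.252    | 56.913           | 76.267           | 85.598           | 138.462          |
| 40          | 391.058    | 97.911           | 145.509          | 150.545          | 185.399          |
| 60          | 426.269    | 138.244          | 185.855          | 201.016          | 236.528          |
| 80          | 431.781    | 170.991          | 227.983          | 252.453          | 412.273          |
| 100         | 473.351    | 206.205          | 262.612          | 288.964          | 463.337          |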
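
The real-time criterion can be checked with a few lines of arithmetic. The sketch below recomputes the chunk duration from the expression 10 * 960 / 16000 and compares it with the p99 latencies listed above. Reading 960 as samples per frame, 10 as frames per chunk, and 16000 as the sample rate in Hz is an assumption about what the factors mean. The names are illustrative.

```python
# Assumed meaning of the factors in "10 * 960 / 16000".
FRAMES_PER_CHUNK = 10
SAMPLES_PER_FRAME = 960
SAMPLE_RATE_HZ = 16000

# Concurrency -> p99 latency in ms, copied from the benchmark above.
P99_MS = {20: 138.462, 40: 185.399, 60: 236.528, 80: 412.273, 100: 463.337}


def chunk_budget_ms() -> float:
    """Duration of one audio chunk in milliseconds."""
    return FRAMES_PER_CHUNK * SAMPLES_PER_FRAME * 1000 / SAMPLE_RATE_HZ


def headroom_ms(latency_ms: float) -> float:
    """Time left in the chunk budget after a request with this latency."""
    return chunk_budget_ms() - latency_ms


if __name__ == "__main__":
    print(f"chunk budget: {chunk_budget_ms():.0f} ms")
    for concurrency, p99 in sorted(P99_MS.items()):
        status = "real time" if headroom_ms(p99) > 0 else "too slow"
        print(f"concurrency {concurrency:>3}: p99 {p99:.3f} ms, "
              f"headroom {headroom_ms(p99):.3f} ms ({status})")
```

Under these assumptions every measured concurrency level stays within the 0.6 s budget at p99, with the smallest margin at concurrency 100.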