Embedding Models#
RTP-LLM supports deploying mainstream Embedding, Reranker, and Classifier models, with dedicated handling for multi-embedding architectures such as BGE-M3, so that a single service instance can process hybrid requests. Built on Sentence Transformers, it lets users customize post-processing workflows for standard model architectures.
At the model layer, RTP-LLM leverages high-performance compute kernels to accelerate inference. The engine optimizes both intra- and inter-request sequence batching according to user configuration, eliminating redundant computation and improving GPU utilization.
Example Launch Command#
# TASK_TYPE in ["DENSE_EMBEDDING", "CLASSIFIER", "RERANKER", "BGE_M3"]
# For models with a SentenceTransformer config, TASK_TYPE can be inferred automatically as DENSE_EMBEDDING
/opt/conda310/bin/python3 -m rtp_llm.start_server \
--checkpoint_path /models/bert \
--model_type bert \
--act_type fp16 \
--start_port 8088 \
--TASK_TYPE DENSE_EMBEDDING \
--MAX_CONTEXT_BATCH_SIZE 20
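Model loading can take a while, so a client script may want to wait until the server port is reachable before sending requests. The following is a minimal sketch using only the Python standard library; `wait_for_port` is a hypothetical helper, not part of RTP-LLM, and an open TCP port only indicates that the process is listening, not necessarily that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0, interval_s: float = 0.5) -> bool:
    """Return True once a TCP connection to (host, port) succeeds, False on timeout.

    Hypothetical helper for illustration; RTP-LLM does not ship this function.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            time.sleep(interval_s)
    return False
```

For the launch command above, `wait_for_port("localhost", 8088)` would poll the port given by `--start_port`.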
Example Client Request#
Dense Embedding#
import requests

# The port must match --start_port in the launch command
url = "http://localhost:8088"
text_input = "Hello, what's your name"
request = {
    "input": [text_input]  # list of texts to embed
}
response = requests.post(url + "/v1/embeddings", json=request).json()
print(response)
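A common next step is to compare the returned vectors. The sketch below assumes the response follows the OpenAI-style layout `{"data": [{"embedding": [...]}, ...]}`; this schema is an assumption based on the endpoint name and is not confirmed here, so inspect the printed response first. Both `extract_embeddings` and `cosine_similarity` are hypothetical helper names.

```python
import math


def extract_embeddings(response: dict) -> list:
    """Collect vectors from an OpenAI-style payload (ASSUMED schema, verify against real output)."""
    return [item["embedding"] for item in response["data"]]


def cosine_similarity(a, b) -> float:
    """Cosine similarity of two equal-length vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

With two input texts in one request, `cosine_similarity(*extract_embeddings(response))` would give their similarity under the assumed schema.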
Reranker#
import requests

# The port must match --start_port in the launch command
url = "http://localhost:8088"
request = {
    "query": "coffee",
    "documents": [
        "Starbucks",
        "Luckin",
        "Peet's Coffee",
        "One point point"
    ]
}
response = requests.post(url + "/v1/reranker", json=request).json()
print(response)
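Reranker output is usually consumed by ordering the documents from most to least relevant. The sketch below assumes you have already extracted one score per document, in the same order as the `documents` list; how the scores appear in the response is not specified here, so that extraction step is left out. `rank_documents` is a hypothetical helper.

```python
def rank_documents(documents, scores):
    """Pair each document with its score and sort by descending score.

    Assumes scores[i] belongs to documents[i]; the response field that
    holds the scores is not shown and must be checked against real output.
    """
    if len(documents) != len(scores):
        raise ValueError("documents and scores must have the same length")
    return sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
```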
Classifier#
import requests

# The port must match --start_port in the launch command
url = "http://localhost:8088"
request = {
    "input": [
        [
            "what is panda?",
            "hi"
        ],
    ]
}
response = requests.post(url + "/v1/classifier", json=request).json()
print(response)
BGE_M3#
import requests

# The port must match --start_port in the launch command
url = "http://localhost:8088"
text_input = "Hello, what's your name"
request = {
    "input": [text_input]  # list of texts to embed
}
# One endpoint per BGE-M3 output type
endpoints = ["/v1/embeddings/dense", "/v1/embeddings/sparse", "/v1/embeddings/colbert"]
for endpoint in endpoints:
    response = requests.post(url + endpoint, json=request).json()
    print(response)
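When all three BGE-M3 outputs are needed for the same texts, it can be convenient to collect them in one dictionary keyed by output type. The sketch below reuses the three endpoint paths from the example above; `fetch_bge_m3` and the `post_json` callback are hypothetical names, and the callback is injected so the HTTP client can be swapped or mocked.

```python
from typing import Callable, Dict, List

# Endpoint paths taken from the BGE_M3 example; the dict keys are illustrative labels.
BGE_M3_ENDPOINTS = {
    "dense": "/v1/embeddings/dense",
    "sparse": "/v1/embeddings/sparse",
    "colbert": "/v1/embeddings/colbert",
}


def fetch_bge_m3(
    base_url: str,
    texts: List[str],
    post_json: Callable[[str, dict], dict],
) -> Dict[str, dict]:
    """Send the same payload to each BGE-M3 endpoint and return the raw responses by label."""
    payload = {"input": texts}
    return {
        label: post_json(base_url + path, payload)
        for label, path in BGE_M3_ENDPOINTS.items()
    }
```

With `requests` installed, the callback could be `lambda url, body: requests.post(url, json=body).json()`.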
Supported models#
| Model Family (Embedding) | Example HuggingFace Identifier | Chat Template | Description |
|---|---|---|---|
| Qwen3 Embedding/Reranker | | N/A | Supports all sizes of Qwen3 Embedding/Reranker models |
| BGE (BgeEmbeddingModel) | | N/A | Only supports the BGE family with model_type= |