RTP-LLM Native APIs#
Apart from the OpenAI-compatible APIs, the RTP-LLM Runtime also provides its own native server APIs. We introduce the following APIs:
| method_name | example_request | is_post | is_get | desc |
|---|---|---|---|---|
| `/` | | ✅ | ❌ | Basic text-generation endpoint (backward-compatible with early versions). |
| `/chat/render` | | ✅ | ✅ | Render the chat template into the final prompt that will be sent to the model. |
| `/v1/chat/render` | | ✅ | ❌ | v1 path for `/chat/render`. |
| `/tokenizer/encode` | | ✅ | ❌ | Encode text into a list of token IDs using the internal tokenizer. |
| `/tokenize` | | ✅ | ❌ | Lightweight tokenization endpoint that returns an array of tokens. |
| `/rtp_llm/worker_status` | | ✅ | ✅ | Detailed status of a worker in the RTP-LLM framework. |
| `/worker_status` | | ✅ | ✅ | Query runtime status of the inference worker. |
| `/health` | | ✅ | ✅ | Generic health check; returns whether the service is alive. |
| `/status` | | ✅ | ✅ | Retrieve comprehensive status information for the current service instance. |
| | | ✅ | ✅ | Deep health check that includes a cache version number. |
| `/update` | | ✅ | ❌ | Hot-reload LoRA info into the running service. |
| `/v1/models` | | ❌ | ✅ | List currently deployed models (OpenAI-compatible). |
| `/set_log_level` | | ✅ | ❌ | Dynamically adjust the service log level. |
| `/update_eplb_config` | | ✅ | ❌ | Update the EPLB (expert parallelism load balancing) configuration for MoE models. |
| `/v1/embeddings` | | ✅ | ❌ | OpenAI-compatible dense-vector embedding endpoint. |
| `/v1/embeddings/dense` | | ✅ | ❌ | Return dense embeddings only. |
| `/v1/embeddings/sparse` | | ✅ | ❌ | Return sparse embeddings only (e.g., BM25/TF-IDF). |
| `/v1/embeddings/colbert` | | ✅ | ❌ | Return ColBERT late-interaction multi-vector representations for high-accuracy semantic retrieval. |
| `/v1/embeddings/similarity` | | ✅ | ❌ | Accept query–doc pairs and return pairwise similarities (cosine/dot) directly, skipping the separate embedding step. |
| `/v1/classifier` | | ✅ | ❌ | Generic text-classification endpoint supporting tasks such as sentiment or topic classification. |
| `/v1/rerank` | | ✅ | ❌ | Rerank a list of retrieved documents by relevance and return the reordered results. |
We mainly use the Python requests library to test these APIs in the following examples. You can also use curl.
Launch A Server#
[ ]:
import socket
import subprocess
import time
import logging
import psutil
import requests
import json
from rtp_llm.utils.util import wait_sever_done, stop_server
port=8090
# start an RTP-LLM server for the Qwen1.5-0.5B-Chat model and wait until it is ready
server_process = subprocess.Popen(
["/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
"--checkpoint_path=/mnt/nas1/hf/models--Qwen--Qwen1.5-0.5B-Chat/snapshots/6114e9c18dac0042fa90925f03b046734369472f/",
"--model_type=qwen_2",
f"--start_port={port}"
]
)
wait_sever_done(server_process, port)
Generate (text generation model)#
Generate completions. This is similar to /v1/completions in the OpenAI API. Detailed parameters can be found in the sampling parameters documentation.
[ ]:
# the native generate endpoint is served at the root path "/"
url = f"http://localhost:{port}"
json_data = {
"prompt": "who are you",
"generate_config": {"max_new_tokens": 32, "temperature": 0}
}
response = requests.post(url, json=json_data)
print(f"Output 0: {response.json()}")
Chat Render / Tokenizer#
/chat/render, /v1/chat/render: render the chat template into the final prompt.
/tokenizer/encode, /tokenize: tokenize a raw prompt into token IDs.
[ ]:
url = f"http://localhost:{port}/v1/chat/render"
data = {"messages": [{"role": "user","content": "hello?"}]}
response = requests.post(url, json=data)
response_json = response.json()
print(f"Render Result: {response_json}")
[ ]:
url = f"http://localhost:{port}/tokenizer/encode"
data = {"prompt": "hello"}
response = requests.post(url, json=data)
response_json = response.json()
print(f"Encode Result: {response_json}")
Worker Status#
/rtp_llm/worker_status, /worker_status: snapshot of the serving worker, including RunningTask, FinishedTask, and CacheStatus.
[ ]:
url = f"http://localhost:{port}/rtp_llm/worker_status"
data = { "latest_cache_version": -1}
response = requests.post(url, json=data)
response_json = response.json()
print(f"Worker Status: {response_json}")
Health Check#
/health, /status: check the health of the server.
[ ]:
url = f"http://localhost:{port}/health"
response = requests.get(url)
print(response.text)
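/status can be probed the same way; a minimal sketch, assuming a parameterless GET:
[ ]:
# hedged sketch: assume /status answers a plain GET with no parameters
url = f"http://localhost:{port}/status"
response = requests.get(url)
print(response.text)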
Update Lora Info#
/update: update the full LoRA info.
[ ]:
url = f"http://localhost:{port}/update"
data = {"peft_info": {"lora_info": {"lora_0": "/lora/llama-lora-test/"}}}
response = requests.post(url, json=data)
response_json = response.json()
print(f"Update Result: {response_json}")
Get Model Info#
/v1/models
[ ]:
url = f"http://localhost:{port}/v1/models"
response = requests.get(url)
print(response.text)
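Since the endpoint is OpenAI-compatible, the model entries should sit under a "data" list; a hedged sketch that prints each model id (field names assumed from the OpenAI schema):
[ ]:
# hedged sketch: assumes the OpenAI-style {"object": "list", "data": [{"id": ...}, ...]} layout
url = f"http://localhost:{port}/v1/models"
model_list = requests.get(url).json()
for model in model_list.get("data", []):
    print(model.get("id"))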
Update Log Level#
/set_log_level
[ ]:
url = f"http://localhost:{port}/set_log_level"
data = { "log_level": "INFO"}
response = requests.post(url, json=data)
response_json = response.json()
print(f"Update Result: {response_json}")
Update EPLB Config for MoE#
/update_eplb_config
[ ]:
url = f"http://localhost:{port}/update_eplb_config"
data = {"model": "EPLB", "update_time":1000}
response = requests.post(url, json=data)
response_json = response.json()
print(f"Update Result: {response_json}")
Encode (embedding model)#
Encode text into embeddings. Note that this API is only available for embedding models and will raise an error for generation models. Therefore, we launch a new server to serve an embedding model.
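The embedding server below reuses port 8090, so stop the text-generation server first (the same stop_server pattern used later on this page):
[ ]:
# free port 8090 before starting the embedding server
stop_server(server_process)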
[ ]:
import socket
import subprocess
import time
import logging
import psutil
import requests
import json
from rtp_llm.utils.util import wait_sever_done, stop_server
port=8090
# start an RTP-LLM server for the bge-large-en-v1.5 embedding model
server_process = subprocess.Popen(
["/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
"--checkpoint_path=/mnt/nas1/hf/bge-large-en-v1.5",
"--model_type=bert",
"--embedding_model=1",
f"--start_port={port}"
]
)
wait_sever_done(server_process, port)
[ ]:
# /v1/embeddings: OpenAI-compatible embedding endpoint
url = f"http://localhost:{port}/v1/embeddings"
data = {"input": "who are u", "model": "bge-large-en-v1.5"}
response = requests.post(url, json=data)
print(f"Text embedding: {response.json()}")

# /v1/embeddings/dense: dense embeddings only (same payload)
url = f"http://localhost:{port}/v1/embeddings/dense"
response = requests.post(url, json=data)
print(f"Dense embedding: {response.json()}")

# /v1/embeddings/sparse: sparse embeddings only (same payload)
url = f"http://localhost:{port}/v1/embeddings/sparse"
response = requests.post(url, json=data)
print(f"Sparse embedding: {response.json()}")

# /v1/embeddings/colbert: ColBERT multi-vector embeddings (same payload)
url = f"http://localhost:{port}/v1/embeddings/colbert"
response = requests.post(url, json=data)
print(f"ColBERT embedding: {response.json()}")

# /v1/embeddings/similarity: pairwise similarity between each "left" and each "right" text
url = f"http://localhost:{port}/v1/embeddings/similarity"
data = {
    "left": ["hello, what is your name?"],
    "right": ["hello", "what is your name"],
    "model": "xx"
}
response = requests.post(url, json=data)
print(f"Similarity: {response.json()}")
[ ]:
stop_server(server_process)
v1/rerank (cross-encoder rerank model)#
Rerank a list of documents given a query using a cross-encoder model. Note that this API is only available for cross-encoder models such as BAAI/bge-reranker-v2-m3, with the triton or torch_native attention backend.
[ ]:
import socket
import subprocess
import time
import logging
import psutil
import requests
import json
from rtp_llm.utils.util import wait_sever_done, stop_server
port=8090
# NOTE: /v1/rerank expects a cross-encoder checkpoint (e.g. BAAI/bge-reranker-v2-m3);
# point checkpoint_path at a reranker model available in your environment
server_process = subprocess.Popen(
["/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
"--checkpoint_path=/mnt/nas1/hf/bge-large-en-v1.5",
"--model_type=bert",
"--embedding_model=1",
f"--start_port={port}"
]
)
wait_sever_done(server_process, port)
[ ]:
# compute rerank scores for query and documents
url = f"http://localhost:{port}/v1/rerank"
data = {
"query": "what is panda? ",
"documents": [
"hi",
"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.",
"gg"
]
}
response = requests.post(url, json=data)
response_json = response.json()
for item in response_json.get('results'):
print(f"Score: {item['relevance_score']:.2f} - Document: '{item['document']}'")
[ ]:
stop_server(server_process)
Classify#
The RTP-LLM Runtime also supports classification models. Here we use a classification model to classify the quality of pairwise generations.
[ ]:
import socket
import subprocess
import time
import logging
import psutil
import requests
import json
from rtp_llm.utils.util import wait_sever_done, stop_server
port=8090
# start an RTP-LLM server for the unitary/toxic-bert classification model
server_process = subprocess.Popen(
["/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
"--checkpoint_path=/mnt/nas1/hf/models--unitary--toxic-bert",
"--model_type=bert",
"--embedding_model=1",
f"--start_port={port}"
]
)
wait_sever_done(server_process, port)
[ ]:
url = f"http://localhost:{port}/v1/classifier"
data = {
"input": [
[
"what is panda?",
"hi"
],
[
"what is panda?",
"The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China."
]
],
"model": "xx"
}
response = requests.post(url, json=data)
response_json = response.json()
print(f"Classify Result: {response_json}")
[ ]:
stop_server(server_process)