RTP-LLM Native APIs#

Apart from the OpenAI-compatible APIs, the RTP-LLM Runtime also provides its own native server APIs. The following endpoints are covered:

  • /
    Basic text-generation endpoint (backward-compatible with early versions).
    Example request: {"prompt": "Hello", "generate_config": {"max_new_tokens": 10, "top_k": 1, "top_p": 0}}

  • /chat/render
    Render the chat template into the final prompt that will be sent to the model.
    Example request: {"messages": [{"role": "user", "content": "hello?"}]}

  • /v1/chat/render
    v1 path for /chat/render (POST only).
    Example request: {"messages": [{"role": "user", "content": "hello?"}]}

  • /tokenizer/encode
    Encode text into a list of token IDs using the internal tokenizer.
    Example request: {"prompt": "hello"}

  • /tokenize
    Lightweight tokenization endpoint that returns an array of tokens.
    Example request: {"prompt": "hello"}

  • /rtp_llm/worker_status
    Detailed status of a worker in the RTP-LLM framework.
    Example request: {"latest_cache_version": -1}

  • /worker_status
    Query the runtime status of the inference worker.
    Example request: {"latest_cache_version": -1}

  • /health
    Generic health check; returns whether the service is alive.
    Example request: {}

  • /status
    Retrieve comprehensive status information for the current service instance.
    Example request: {}

  • /health_check
    Deep health check that includes a cache version number.
    Example request: {"latest_cache_version": -1}

  • /update
    Hot-reload LoRA info into the running service.
    Example request: {"peft_info": {"lora_info": {"lora_0": "/lora/llama-lora-test/"}}}

  • /v1/models
    List currently deployed models (OpenAI-compatible).

  • /set_log_level
    Dynamically adjust the service log level.
    Example request: {"log_level": "INFO"}

  • /update_eplb_config
    Update the EPLB (Elastic Load Balancer) configuration.
    Example request: {"model": "EPLB", "update_time": 1000}

  • /v1/embeddings
    OpenAI-compatible dense-vector embedding endpoint.
    Example request: {"input": "who are u", "model": "text-embedding-ada-002"}

  • /v1/embeddings/dense
    Return dense embeddings only.
    Example request: {"input": "who are u"}

  • /v1/embeddings/sparse
    Return sparse embeddings only (e.g., BM25/TF-IDF).
    Example request: {"input": "who are u"}

  • /v1/embeddings/colbert
    Return ColBERT late-interaction multi-vector representations for high-accuracy semantic retrieval.
    Example request: {"input": ["hello, what is your name?", "hello"], "model": "xx"}

  • /v1/embeddings/similarity
    Accept query–doc pairs and return pairwise similarities (cosine/dot) directly, skipping the separate embedding step.
    Example request: {"left": ["hello, what is your name?"], "right": ["hello", "what is your name"], "embedding_config": {"type": "sparse"}, "model": "xx"}

  • /v1/classifier
    Generic text-classification endpoint supporting tasks such as sentiment or topic classification.
    Example request: {"input": [["what is panda?", "hi"], ["what is panda?", "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China."]], "model": "xx"}

  • /v1/reranker
    Rerank a list of retrieved documents by relevance and return the reordered results.
    Example request: {"query": "what is panda? ", "documents": ["hi", "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.", "gg"]}

We mainly use the Python requests library to test these APIs in the following examples. You can also use curl.

Launch A Server#

[ ]:
import socket
import subprocess
import time
import logging
import psutil
import requests
import json
from rtp_llm.utils.util import wait_sever_done, stop_server
# Launch an RTP-LLM generation server (Qwen1.5-0.5B-Chat) and wait until it is ready.
port = 8090
server_process = subprocess.Popen(
        ["/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
         "--checkpoint_path=/mnt/nas1/hf/models--Qwen--Qwen1.5-0.5B-Chat/snapshots/6114e9c18dac0042fa90925f03b046734369472f/",
         "--model_type=qwen_2",
         f"--start_port={port}"
         ]
    )
wait_sever_done(server_process, port)
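
Once wait_sever_done returns, you can optionally confirm the server is reachable with a quick call to the /health endpoint listed in the table above (a minimal sketch):

[ ]:
# Optional sanity check: /health should respond once the server is up.
resp = requests.get(f"http://localhost:{port}/health")
print(resp.status_code, resp.text)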

Generate (text generation model)#

Generate completions. This is similar to /v1/completions in the OpenAI API. Detailed parameters can be found in the sampling parameters documentation.

[ ]:
url = f"http://localhost:{port}"
json_data = {
     "prompt": "who are you",
     "generate_config": {"max_new_tokens": 32, "temperature": 0}
}

response = requests.post(url, json=json_data)
print(f"Output 0: {response.json()}")

Chat Render / Tokenizer#

  • /chat/render, /v1/chat/render: chat template rendering

  • /tokenizer/encode, /tokenize: raw prompt tokenization

[ ]:
url = f"http://localhost:{port}/v1/chat/render"
data = {"messages": [{"role": "user","content": "hello?"}]}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Render Result: {response_json}")
[ ]:
url = f"http://localhost:{port}/tokenizer/encode"
data = {"prompt": "hello"}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Encode Result: {response_json}")

Worker Status#

  • /rtp_llm/worker_status, /worker_status: a snapshot of the worker's processing state, including RunningTask, FinishedTask, and CacheStatus.

[ ]:
url = f"http://localhost:{port}/rtp_llm/worker_status"
data = { "latest_cache_version": -1}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Worker Status: {response_json}")

Health Check#

  • /health, /status: check the health of the server.

[ ]:
url = f"http://localhost:{port}/health"

response = requests.get(url)
print(response.text)
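
The /status and /health_check endpoints from the table above can be queried in the same way. The sketch below assumes /status accepts a GET like /health, and that /health_check takes the POST body shown in the table:

[ ]:
# /status: comprehensive status of the current service instance (assumed GET, like /health).
response = requests.get(f"http://localhost:{port}/status")
print(response.text)

# /health_check: deeper health check that includes a cache version number.
response = requests.post(f"http://localhost:{port}/health_check", json={"latest_cache_version": -1})
print(response.text)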

Update Lora Info#

  • /update: Update full LoRA Info

[ ]:
url = f"http://localhost:{port}/rtp_llm/worker_status"
data = {"peft_info": {"lora_info": {"lora_0": "/lora/llama-lora-test/"}}}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Update Result: {response_json}")

Get Model Info#

  • /v1/models

[ ]:
url = f"http://localhost:{port}//v1/models"

response = requests.get(url)
print(response.text)
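
Since /v1/models is OpenAI-compatible, the response can be parsed like an OpenAI model list. The sketch below assumes the usual format with a data array of objects carrying an id field:

[ ]:
# Assumes the OpenAI-style list format: {"data": [{"id": ...}, ...]}.
models = requests.get(f"http://localhost:{port}/v1/models").json()
for m in models.get("data", []):
    print(m.get("id"))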

Update Log Level#

  • /set_log_level

[ ]:
url = f"http://localhost:{port}/set_log_level"
data = { "log_level": "INFO"}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Update Result: {response_json}")

Update EPLB Config for MoE#

  • /update_eplb_config

[ ]:
url = f"http://localhost:{port}/update_eplb_config"
data = {"model": "EPLB", "update_time":1000}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Update Result: {response_json}")

Encode (embedding model)#

Encode text into embeddings. Note that this API is only available for embedding models and will raise an error for generation models. Therefore, we stop the previous server and launch a new one to serve an embedding model.

[ ]:
import socket
import subprocess
import time
import logging
import psutil
import requests
import json
from rtp_llm.utils.util import wait_sever_done, stop_server
# Stop the generation server from the previous sections before reusing the same port,
# then launch an embedding server for bge-large-en-v1.5.
stop_server(server_process)

port = 8090
server_process = subprocess.Popen(
        ["/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
         "--checkpoint_path=/mnt/nas1/hf/bge-large-en-v1.5",
         "--model_type=bert",
         "--embedding_model=1",
         f"--start_port={port}"
         ]
    )
wait_sever_done(server_process, port)
[ ]:
# successful encode for embedding model

url = f"http://localhost:{port}/v1/embeddings"
data = {"input": "who are u", "model": "bge-large-en-v1.5"}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Text embedding: {response_json}")

url = f"http://localhost:{port}/v1/embeddings/dense"
data = {"input": "who are u", "model": "bge-large-en-v1.5"}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Text embedding: {response_json}")

url = f"http://localhost:{port}/v1/embeddings/sparse"
data = {"input": "who are u", "model": "bge-large-en-v1.5"}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Text embedding: {response_json}")

url = f"http://localhost:{port}/v1/embeddings/colbert"
data = {"input": "who are u", "model": "bge-large-en-v1.5"}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Text embedding: {response_json}")

url = f"http://localhost:{port}/v1/embeddings/similarity"
data = {
    "left": [
        "hello, what is your name?"
    ],
    "right": [
        "hello",
        "what is your name"
    ],
    "model": "xx"
}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Text embedding: {response_json}")
[ ]:
stop_server(server_process)

v1/rerank (cross encoder rerank model)#

Rerank a list of documents against a query using a cross-encoder model. Note that this API is only available for cross-encoder models such as BAAI/bge-reranker-v2-m3, and requires the triton or torch_native attention backend.

[ ]:
import socket
import subprocess
import time
import logging
import psutil
import requests
import json
from rtp_llm.utils.util import wait_sever_done, stop_server
port=8090
server_process = subprocess.Popen(
        ["/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
         "--checkpoint_path=/mnt/nas1/hf/bge-large-en-v1.5",
         "--model_type=bert",
         "--embedding_model=1",
         f"--start_port={port}"
         ]
    )
wait_sever_done(server_process, port)
[ ]:
# compute rerank scores for query and documents

url = f"http://localhost:{port}/v1/rerank"
data = {
    "query": "what is panda? ",
    "documents": [
        "hi",
        "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.",
        "gg"
    ]
}

response = requests.post(url, json=data)
response_json = response.json()
for item in response_json.get('results'):
    print(f"Score: {item['relevance_score']:.2f} - Document: '{item['document']}'")
[ ]:
stop_server(server_process)

Classify#

The RTP-LLM Runtime also supports classification models. Here we use a classification model to score the quality of pairwise inputs.

[ ]:
import socket
import subprocess
import time
import logging
import psutil
import requests
import json
from rtp_llm.utils.util import wait_sever_done, stop_server
port=8090
server_process = subprocess.Popen(
        ["/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
         "--checkpoint_path=/mnt/nas1/hf/models--unitary--toxic-bert",
         "--model_type=bert",
         "--embedding_model=1",
         f"--start_port={port}"
         ]
    )
wait_sever_done(server_process, port)
[ ]:

url = f"http://localhost:{port}/v1/classifier" data = { "input": [ [ "what is panda?", "hi" ], [ "what is panda?", "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China." ] ], "model": "xx" } response = requests.post(url, json=data) response_json = response.json() for item in response_json.get('results'): print(f"Score: {item['relevance_score']:.2f} - Document: '{item['document']}'")
[ ]:
stop_server(server_process)