RTP-LLM Native APIs#

Apart from the OpenAI-compatible APIs, the RTP-LLM Runtime also provides its own native server APIs. The following endpoints are covered:

  • /
    Basic text-generation endpoint (backward-compatible with early versions).
    Example request: {"prompt": "Hello", "generate_config": {"max_new_tokens": 10, "top_k": 1, "top_p": 0}}

  • /chat/render
    Render the chat template into the final prompt that will be sent to the model.
    Example request: {"messages": [{"role": "user", "content": "hello?"}]}

  • /v1/chat/render
    v1 path for /chat/render (POST only).
    Example request: {"messages": [{"role": "user", "content": "hello?"}]}

  • /tokenizer/encode
    Encode text into a list of token IDs using the internal tokenizer.
    Example request: {"prompt": "hello"}

  • /tokenize
    Lightweight tokenization endpoint that returns an array of tokens.
    Example request: {"prompt": "hello"}

  • /rtp_llm/worker_status
    Detailed status of a worker in the RTP-LLM framework.
    Example request: {"latest_cache_version": -1}

  • /worker_status
    Query the runtime status of the inference worker.
    Example request: {"latest_cache_version": -1}

  • /health
    Generic health check; returns whether the service is alive.
    Example request: {}

  • /status
    Retrieve comprehensive status information for the current service instance.
    Example request: {}

  • /health_check
    Deep health check that includes a cache version number.
    Example request: {"latest_cache_version": -1}

  • /update
    Hot-reload LoRA info into the running service.
    Example request: {"peft_info": {"lora_info": {"lora_0": "/lora/llama-lora-test/"}}}

  • /v1/models
    List currently deployed models (OpenAI-compatible).

  • /set_log_level
    Dynamically adjust the service log level.
    Example request: {"log_level": "INFO"}

  • /update_eplb_config
    Update the EPLB (Elastic Load Balancer) configuration.
    Example request: {"model": "EPLB", "update_time": 1000}

  • /v1/embeddings
    OpenAI-compatible dense-vector embedding endpoint.
    Example request: {"input": "who are u", "model": "text-embedding-ada-002"}

  • /v1/embeddings/dense
    Return dense embeddings only.
    Example request: {"input": "who are u"}

  • /v1/embeddings/sparse
    Return sparse embeddings only (e.g., BM25/TF-IDF).
    Example request: {"input": "who are u"}

  • /v1/embeddings/colbert
    Return ColBERT late-interaction multi-vector representations for high-accuracy semantic retrieval.
    Example request: {"input": ["hello, what is your name?", "hello"], "model": "xx"}

  • /v1/embeddings/similarity
    Accept query–doc pairs and return pairwise similarities (cosine/dot) directly, skipping the separate embedding step.
    Example request: {"left": ["hello, what is your name?"], "right": ["hello", "what is your name"], "embedding_config": {"type": "sparse"}, "model": "xx"}

  • /v1/classifier
    Generic text-classification endpoint supporting tasks such as sentiment or topic classification.
    Example request: {"input": [["what is panda?", "hi"], ["what is panda?", "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China."]], "model": "xx"}

  • /v1/reranker
    Rerank a list of retrieved documents by relevance and return the reordered results.
    Example request: {"query": "what is panda? ", "documents": ["hi", "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.", "gg"]}

We mainly use the Python requests library to test these APIs in the following examples. You can also use curl.

Launch A Server#

[ ]:
import socket
import subprocess
import time
import logging
import psutil
import requests
import json
from rtp_llm.utils.util import wait_sever_done, stop_server
# Launch an RTP-LLM generation server (Qwen1.5-0.5B-Chat) and wait until it is ready.
port = 8090
server_process = subprocess.Popen(
        ["/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
         "--checkpoint_path=/mnt/nas1/hf/models--Qwen--Qwen1.5-0.5B-Chat/snapshots/6114e9c18dac0042fa90925f03b046734369472f/",
         "--model_type=qwen_2",
         f"--start_port={port}"
         ]
    )
wait_sever_done(server_process, port)
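
Once wait_sever_done returns, you can optionally confirm the server is reachable with a quick call to the /health endpoint listed in the table above (a minimal sketch):

[ ]:
# Optional sanity check: /health should respond once the server is up.
resp = requests.get(f"http://localhost:{port}/health")
print(resp.status_code, resp.text)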

Generate (text generation model)#

Generate completions. This is similar to /v1/completions in the OpenAI API. Detailed parameters can be found in the sampling parameters documentation.

[ ]:
url = f"http://localhost:{port}"
json_data = {
     "prompt": "who are you",
     "generate_config": {"max_new_tokens": 32, "temperature": 0}
}

response = requests.post(url, json=json_data)
print(f"Output 0: {response.json()}")

Chat Render / Tokenizer#

  • /chat/render, /v1/chat/render: chat template rendering

  • /tokenizer/encode, /tokenize: raw prompt tokenization

[ ]:
url = f"http://localhost:{port}/v1/chat/render"
data = {"messages": [{"role": "user","content": "hello?"}]}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Render Result: {response_json}")
[ ]:
url = f"http://localhost:{port}/tokenizer/encode"
data = {"prompt": "hello"}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Encode Result: {response_json}")

Worker Status#

  • /rtp_llm/worker_status, /worker_status: a snapshot of the worker's processing state, including RunningTask, FinishedTask, and CacheStatus.

[ ]:
url = f"http://localhost:{port}/rtp_llm/worker_status"
data = { "latest_cache_version": -1}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Worker Status: {response_json}")

Health Check#

  • /health, /status: check the health of the server.

[ ]:
url = f"http://localhost:{port}/health"

response = requests.get(url)
print(response.text)
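
The /status and /health_check endpoints from the table above can be queried in the same way. The sketch below assumes /status accepts a GET like /health, and that /health_check takes the POST body shown in the table:

[ ]:
# /status: comprehensive status of the current service instance (assumed GET, like /health).
response = requests.get(f"http://localhost:{port}/status")
print(response.text)

# /health_check: deeper health check that includes a cache version number.
response = requests.post(f"http://localhost:{port}/health_check", json={"latest_cache_version": -1})
print(response.text)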

Update Lora Info#

  • /update: Update full LoRA Info

[ ]:
url = f"http://localhost:{port}/rtp_llm/worker_status"
data = {"peft_info": {"lora_info": {"lora_0": "/lora/llama-lora-test/"}}}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Update Result: {response_json}")

Get Model Info#

  • /v1/models

[ ]:
url = f"http://localhost:{port}//v1/models"

response = requests.get(url)
print(response.text)
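
Since /v1/models is OpenAI-compatible, the response can be parsed like an OpenAI model list. The sketch below assumes the usual format with a data array of objects carrying an id field:

[ ]:
# Assumes the OpenAI-style list format: {"data": [{"id": ...}, ...]}.
models = requests.get(f"http://localhost:{port}/v1/models").json()
for m in models.get("data", []):
    print(m.get("id"))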

Update Log Level#

  • /set_log_level

[ ]:
url = f"http://localhost:{port}/set_log_level"
data = { "log_level": "INFO"}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Update Result: {response_json}")

Update EPLB Config for MoE#

  • /update_eplb_config

[ ]:
url = f"http://localhost:{port}/update_eplb_config"
data = {"model": "EPLB", "update_time":1000}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Update Result: {response_json}")

Encode (embedding model)#

Encode text into embeddings. Note that this API is only available for embedding models and will raise an error for generation models. Therefore, we stop the previous server and launch a new one to serve an embedding model.

[ ]:
import socket
import subprocess
import time
import logging
import psutil
import requests
import json
from rtp_llm.utils.util import wait_sever_done, stop_server
# Stop the generation server from the previous sections before reusing the same port,
# then launch an embedding server for bge-large-en-v1.5.
stop_server(server_process)

port = 8090
server_process = subprocess.Popen(
        ["/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
         "--checkpoint_path=/mnt/nas1/hf/bge-large-en-v1.5",
         "--model_type=bert",
         "--embedding_model=1",
         f"--start_port={port}"
         ]
    )
wait_sever_done(server_process, port)
[ ]:
# successful encode for embedding model

url = f"http://localhost:{port}/v1/embeddings"
data = {"input": "who are u", "model": "bge-large-en-v1.5"}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Text embedding: {response_json}")

url = f"http://localhost:{port}/v1/embeddings/dense"
data = {"input": "who are u", "model": "bge-large-en-v1.5"}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Text embedding: {response_json}")

url = f"http://localhost:{port}/v1/embeddings/sparse"
data = {"input": "who are u", "model": "bge-large-en-v1.5"}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Text embedding: {response_json}")

url = f"http://localhost:{port}/v1/embeddings/colbert"
data = {"input": "who are u", "model": "bge-large-en-v1.5"}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Text embedding: {response_json}")

url = f"http://localhost:{port}/v1/embeddings/similarity"
data = {
    "left": [
        "hello, what is your name?"
    ],
    "right": [
        "hello",
        "what is your name"
    ],
    "model": "xx"
}

response = requests.post(url, json=data)
response_json = response.json()
print(f"Text embedding: {response_json}")
[ ]:
stop_server(server_process)

v1/rerank (cross encoder rerank model)#

Rerank a list of documents against a query using a cross-encoder model. Note that this API is only available for cross-encoder models such as BAAI/bge-reranker-v2-m3, and requires the triton or torch_native attention backend.

[ ]:
import socket
import subprocess
import time
import logging
import psutil
import requests
import json
from rtp_llm.utils.util import wait_sever_done, stop_server
port=8090
server_process = subprocess.Popen(
        ["/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
         "--checkpoint_path=/mnt/nas1/hf/bge-large-en-v1.5",
         "--model_type=bert",
         "--embedding_model=1",
         f"--start_port={port}"
         ]
    )
wait_sever_done(server_process, port)
[ ]:
# compute rerank scores for query and documents

url = f"http://localhost:{port}/v1/rerank"
data = {
    "query": "what is panda? ",
    "documents": [
        "hi",
        "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China.",
        "gg"
    ]
}

response = requests.post(url, json=data)
response_json = response.json()
for item in response_json.get('results'):
    print(f"Score: {item['relevance_score']:.2f} - Document: '{item['document']}'")
[ ]:
stop_server(server_process)

Classify#

The RTP-LLM Runtime also supports classification models. Here we use a classification model to score the quality of pairwise inputs.

[ ]:
import socket
import subprocess
import time
import logging
import psutil
import requests
import json
from rtp_llm.utils.util import wait_sever_done, stop_server
port=8090
server_process = subprocess.Popen(
        ["/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
         "--checkpoint_path=/mnt/nas1/hf/models--unitary--toxic-bert",
         "--model_type=bert",
         "--embedding_model=1",
         f"--start_port={port}"
         ]
    )
wait_sever_done(server_process, port)
[ ]:

url = f"http://localhost:{port}/v1/classifier" data = { "input": [ [ "what is panda?", "hi" ], [ "what is panda?", "The giant panda (Ailuropoda melanoleuca), sometimes called a panda bear or simply panda, is a bear species endemic to China." ] ], "model": "xx" } response = requests.post(url, json=data) response_json = response.json() for item in response_json.get('results'): print(f"Score: {item['relevance_score']:.2f} - Document: '{item['document']}'")
[ ]:
stop_server(server_process)