OpenAI APIs - Completions#
RTP-LLM provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. This tutorial covers the following popular APIs:
/v1/chat/completions
/v1/completions
Check out other tutorials to learn about vision APIs for vision-language models and embedding APIs for embedding models.
Launch A Server#
Launch the server in your terminal and wait for it to initialize. You can pin it to a specific GPU by setting CUDA_VISIBLE_DEVICES first:
export CUDA_VISIBLE_DEVICES=1
[ ]:
import subprocess
from rtp_llm.utils.util import wait_sever_done, stop_server

port = 8090
server_process = subprocess.Popen(
    [
        "/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
        "--checkpoint_path=/mnt/nas1/hf/models--Qwen--Qwen1.5-0.5B-Chat/snapshots/6114e9c18dac0042fa90925f03b046734369472f/",
        "--model_type=qwen_2",
        f"--start_port={port}",
    ]
)
# Block until the server is ready to accept requests.
wait_sever_done(server_process, port)
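Optionally, you can sanity-check the endpoint with a raw HTTP request before switching to the OpenAI SDK. A minimal sketch, assuming the server exposes the standard OpenAI-compatible /v1/chat/completions route listed above:
[ ]:
import requests

# Send one raw chat completion request to the OpenAI-compatible endpoint.
response = requests.post(
    f"http://127.0.0.1:{port}/v1/chat/completions",
    json={
        "model": "qwen/qwen2.5-0.5b-instruct",
        "messages": [{"role": "user", "content": "Say hello."}],
        "max_tokens": 16,
    },
)
print(response.json())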
Chat Completions#
Usage#
The server fully implements the OpenAI API. It will automatically apply the chat template specified in the Hugging Face tokenizer, if one is available.
[ ]:
import openai

port = 8090
# Point the client at the server's OpenAI-compatible base URL;
# the SDK appends /chat/completions itself.
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(f"Response: {response}")
Parameters#
The chat completions API accepts OpenAI Chat Completions API’s parameters. Refer to OpenAI Chat Completions API for more details.
RTP-LLM extends the standard API with the extra_configs parameter, which allows additional customization. One key option within extra_configs is chat_template_kwargs, which can be used to pass arguments to the chat template processor.
Enabling Model Thinking/Reasoning#
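Reasoning models expose an enable_thinking switch through chat_template_kwargs; when it is set, the model's reasoning is returned separately in the reasoning_content field of the message: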
[ ]:
# Ensure the server is launched with a compatible reasoning parser
from openai import OpenAI

port = 8090
client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

model = "QwQ/Qwen3-32B-250415"  # Use the model loaded by the server
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "extra_configs": {
            "chat_template_kwargs": {"enable_thinking": True}
        },
    },
)
print("response.choices[0].message.reasoning_content: \n", response.choices[0].message.reasoning_content)
print("response.choices[0].message.content: \n", response.choices[0].message.content)
[ ]:
response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a knowledgeable historian who provides concise responses.",
        },
        {"role": "user", "content": "Tell me about ancient Rome"},
        {
            "role": "assistant",
            "content": "Ancient Rome was a civilization centered in Italy.",
        },
        {"role": "user", "content": "What were their major achievements?"},
    ],
    temperature=0.3,  # Lower temperature for more focused responses
    max_tokens=128,  # Reasonable length for a concise response
    top_p=0.95,  # Slightly higher for better fluency
)
print(response.choices[0].message.content)
Completions#
Usage#
The Completions API is similar to the Chat Completions API, but it takes a plain prompt instead of the messages parameter and applies no chat template.
[ ]:
response = client.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    prompt="List 3 countries and their capitals.",
    temperature=0,
    max_tokens=64,
    n=1,
    stop=None,
)
print(f"Response: {response}")
Parameters#
The completions API accepts OpenAI Completions API’s parameters. Refer to OpenAI Completions API for more details.
Here is an example of a detailed completions request:
[ ]:
response = client.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    prompt="Write a short story about a space explorer.",
    temperature=0.7,  # Moderate temperature for creative writing
    max_tokens=150,  # Longer response for a story
    top_p=0.9,  # Balanced diversity in word choice
)
print(f"Response: {response}")