OpenAI APIs - Completions#

RTP-LLM provides OpenAI-compatible APIs to enable a smooth transition from OpenAI services to self-hosted local models. This tutorial covers the following popular APIs:

  • v1/chat/completions

  • v1/completions

Check out other tutorials to learn about vision APIs for vision-language models and embedding APIs for embedding models.

Launch A Server#

Launch the server in your terminal and wait for it to initialize. Optionally select which GPU to use first:

export CUDA_VISIBLE_DEVICES=1

[ ]:
import subprocess
from rtp_llm.utils.util import wait_sever_done, stop_server

port = 8090
# Start the RTP-LLM server as a subprocess and block until it is ready to serve.
server_process = subprocess.Popen(
    [
        "/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
        "--checkpoint_path=/mnt/nas1/hf/models--Qwen--Qwen1.5-0.5B-Chat/snapshots/6114e9c18dac0042fa90925f03b046734369472f/",
        "--model_type=qwen_2",
        f"--start_port={port}",
    ]
)
wait_sever_done(server_process, port)

Chat Completions#

Usage#

The server implements the OpenAI Chat Completions API. It automatically applies the chat template specified in the Hugging Face tokenizer, if one is available.

[ ]:
import openai

port = 8090
client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print(f"Response: {response}")

Parameters#

The chat completions API accepts the OpenAI Chat Completions API's parameters. Refer to the OpenAI Chat Completions API documentation for more details.

RTP-LLM extends the standard API with the extra_configs parameter, allowing for additional customization. One key option within extra_configs is chat_template_kwargs, which can be used to pass arguments to the chat template processor.
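
For reference, the same extension can be exercised at the raw HTTP level. Below is a minimal sketch of where extra_configs sits in the request body (hypothetical values; which chat_template_kwargs keys are accepted depends on the model's chat template):

[ ]:
import requests

port = 8090
payload = {
    "model": "qwen/qwen2.5-0.5b-instruct",
    "messages": [{"role": "user", "content": "List 3 countries and their capitals."}],
    # RTP-LLM extension: forwarded to the chat template processor
    "extra_configs": {"chat_template_kwargs": {"enable_thinking": False}},
}
response = requests.post(f"http://127.0.0.1:{port}/v1/chat/completions", json=payload)
print(response.json())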

Enabling Model Thinking/Reasoning#

[ ]:
# Ensure the server is launched with a compatible reasoning parser

from openai import OpenAI

port = 8090
client = OpenAI(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

model = "QwQ/Qwen3-32B-250415" # Use the model loaded by the server
messages = [{"role": "user", "content": "9.11 and 9.8, which is greater?"}]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "extra_configs": {
            "chat_template_kwargs": {
                "enable_thinking": True
            }
        },
    }
)

print("response.choices[0].message.reasoning_content: \n", response.choices[0].message.reasoning_content)
print("response.choices[0].message.content: \n", response.choices[0].message.content)

Here is an example of a detailed chat completions request:

[ ]:
response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a knowledgeable historian who provides concise responses.",
        },
        {"role": "user", "content": "Tell me about ancient Rome"},
        {
            "role": "assistant",
            "content": "Ancient Rome was a civilization centered in Italy.",
        },
        {"role": "user", "content": "What were their major achievements?"},
    ],
    temperature=0.3,  # Lower temperature for more focused responses
    max_tokens=128,  # Reasonable length for a concise response
    top_p=0.95,  # Slightly higher for better fluency
)

print(response.choices[0].message.content)
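
Streaming is also supported through the standard stream parameter. Below is a minimal sketch, assuming the server follows the OpenAI streaming format (where a chunk's delta content can be None on role and stop chunks):

[ ]:
stream = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[{"role": "user", "content": "Count from 1 to 5."}],
    temperature=0,
    max_tokens=64,
    stream=True,  # receive the response incrementally
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:  # content is None on role/stop chunks
        print(delta.content, end="", flush=True)
print()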

Completions#

Usage#

The Completions API is similar to the Chat Completions API, but it takes a raw prompt instead of the messages parameter and does not apply a chat template.

[ ]:
response = client.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    prompt="List 3 countries and their capitals.",
    temperature=0,
    max_tokens=64,
    n=1,
    stop=None,
)

print(f"Response: {response}")

Parameters#

The completions API accepts the OpenAI Completions API's parameters. Refer to the OpenAI Completions API documentation for more details.

Here is an example of a detailed completions request:

[ ]:
response = client.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    prompt="Write a short story about a space explorer.",
    temperature=0.7,  # Moderate temperature for creative writing
    max_tokens=150,  # Longer response for a story
    top_p=0.9,  # Balanced diversity in word choice
)

print(f"Response: {response}")