PD Disaggregation#
What Is PD Disaggregation, and Why Use It?#
Large Language Models (LLMs) separate Prefill (context preparation) and Decode (token generation) phases to:
- **Boost Efficiency:** Prefill precomputes attention keys/values (KV caching) for the input sequence, enabling fast autoregressive decoding by reusing cached data and reducing compute from O(n²) to O(n) per token.
- **Optimize Memory:** Caching KV matrices during Prefill avoids redundant recomputation, slashing memory overhead during long-sequence generation.
- **Leverage Hardware:** Prefill exploits parallel processing over known inputs (full-sequence batched compute), while Decode optimizes latency-critical step-by-step generation.
- **Scale Applications:** Separation allows dynamic resource allocation (e.g., high-throughput Prefill for prompts plus low-latency Decode for streaming outputs), vital for real-time use cases like chatbots.
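The O(n²)-to-O(n) saving from KV caching can be illustrated with a toy operation count. This is a sketch for intuition only, not rtp_llm code; it counts key/value projections rather than real FLOPs:

```python
# Toy illustration: why caching K/V makes decode O(n) per token.
# Without a cache, generating token t re-projects K/V for the whole
# prefix; with a cache, each decode step adds exactly one K/V pair.

def decode_without_cache(prompt_len, new_tokens):
    """Count K/V projections when nothing is cached."""
    ops = 0
    for step in range(new_tokens):
        # every step re-projects K/V for the full sequence so far
        ops += prompt_len + step + 1
    return ops

def decode_with_cache(prompt_len, new_tokens):
    """Count K/V projections when Prefill fills the cache once."""
    ops = prompt_len   # prefill: project K/V for the whole prompt once
    ops += new_tokens  # decode: one new K/V projection per token
    return ops

print(decode_without_cache(1000, 100))  # → 105050 projections
print(decode_with_cache(1000, 100))     # → 1100 projections
```

For a 1000-token prompt and 100 generated tokens, caching cuts the projection count by roughly two orders of magnitude, which is the gap PD Disaggregation exploits by giving each phase its own hardware.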
Start#
Start Prefill#
[ ]:
import subprocess
from rtp_llm.utils.util import wait_sever_done, stop_server
prefill_port = 8090
decode_port = 27001
server_process = subprocess.Popen(
    [
        "/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
        "--checkpoint_path=/mnt/nas1/hf/models--Qwen--Qwen1.5-0.5B-Chat/snapshots/6114e9c18dac0042fa90925f03b046734369472f/",
        "--model_type=qwen_2",
        "--role_type=PREFILL",
        f"--start_port={prefill_port}",
        "--use_local=1",
        f"--remote_rpc_server_ip=127.0.0.1:{decode_port}",
    ]
)
wait_sever_done(server_process, prefill_port)
Start Decode#
[ ]:
import subprocess
from rtp_llm.utils.util import wait_sever_done, stop_server
prefill_port = 8090
decode_port = 27001
server_process = subprocess.Popen(
    [
        "/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
        "--checkpoint_path=/mnt/nas1/hf/models--Qwen--Qwen1.5-0.5B-Chat/snapshots/6114e9c18dac0042fa90925f03b046734369472f/",
        "--model_type=qwen_2",
        "--role_type=DECODE",
        f"--start_port={decode_port}",
        "--use_local=1",
        f"--remote_rpc_server_ip=127.0.0.1:{prefill_port}",
    ]
)
wait_sever_done(server_process, decode_port)
[ ]:
import openai
prefill_port = 8090
client = openai.Client(
    base_url=f"http://127.0.0.1:{prefill_port}/v1/chat/completions",
    api_key="None",
)
response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)
print(f"Response: {response}")
Advanced Configuration#
PD Disaggregation supports the following environment variables.
Prefill Server Configuration#
| Variable | Description | Default |
|---|---|---|
| PREFILL_RETRY_TIMES | Number of retries for the prefill process; 0 disables retries | |
| PREFILL_RETRY_TIMEOUT_MS | Total timeout for prefill retries (milliseconds) | |
| PREFILL_MAX_WAIT_TIMEOUT_MS | Maximum wait timeout for prefill execution (milliseconds) | |
| LOAD_CACHE_TIMEOUT_MS | Timeout for remote KVCache loading (milliseconds) | |
| DECODE_RETRY_TIMES | Number of retries for the decode process; 0 disables retries | |
| DECODE_RETRY_TIMEOUT_MS | Total timeout for decode process retries (milliseconds) | |
| RDMA_CONNECT_RETRY_TIMES | Number of retries for RDMA connection establishment | |
| DECODE_POLLING_KV_CACHE_STEP_MS | Polling interval for KV cache loading status (milliseconds) | |
| DECODE_ENTRANCE | Whether Decode serves as the traffic entry point | |
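As a sketch of how these variables might be applied, they can be placed in the environment of the server process before launch. The values below are illustrative assumptions for this example, not recommended defaults:

```python
import os
import subprocess  # same launch pattern as the cells above

# Illustrative values only -- tune them for your actual deployment.
pd_env = {
    "PREFILL_RETRY_TIMES": "2",          # retry prefill twice; "0" disables retries
    "PREFILL_RETRY_TIMEOUT_MS": "5000",  # total prefill retry budget: 5 s
    "LOAD_CACHE_TIMEOUT_MS": "3000",     # abort remote KVCache loads after 3 s
    "DECODE_ENTRANCE": "1",              # let Decode serve as the traffic entry point
}

# Inherit the current environment, then layer the PD settings on top.
env = {**os.environ, **pd_env}
# server_process = subprocess.Popen([...], env=env)  # same command as in "Start Prefill"
```

Passing `env=` to `subprocess.Popen` keeps the settings scoped to the launched server rather than mutating the notebook's own environment.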