LoRA Serving#
RTP-LLM supports two LoRA inference modes: Static LoRA and Dynamic LoRA. For background on how LoRA works, refer to the official Hugging Face documentation.
Serving a Single Adapter (Static LoRA)#
Feature | Description
---|---
Fusion Method | Before inference, the base model and the specified LoRA weights are fused permanently; the fusion is irreversible.
Applicable Scenarios | Serving a single LoRA where the best possible performance is required.
Limitations | After fusion the base model cannot be restored, so the base model's original (non-LoRA) output can no longer be obtained alongside the fused output.
Usage#
When the startup parameter --lora_info contains only one element, the server automatically enters static mode. Example:
--lora_info {"test0":"/mnt/nas1/lora/taoshang_qwen_lora_18000/lora"}
[ ]:
import logging
import socket
import subprocess
import time

import psutil
import requests


def wait_sever_done(server_process, port: int, timeout: int = 1600):
    host = "localhost"
    retry_interval = 1  # Retry interval (seconds)
    start_time = time.time()
    port = str(port)
    logging.info(f"Waiting for pid[{server_process.pid}] to start...\nPort {port}")
    while True:
        try:
            # Try to connect to the specified host and port
            sock = socket.create_connection((host, port), timeout=timeout)
            sock.close()
            logging.info(f"Port {port} started successfully")
            return True
        except (socket.error, ConnectionRefusedError):
            # If the connection fails, wait a while before retrying
            time.sleep(retry_interval)
            if (
                not psutil.pid_exists(server_process.pid)
                or server_process.poll() is not None
            ):
                logging.warning(
                    f"Process [{server_process.pid}] exited unexpectedly, service startup failed, please check the log files"
                )
                return False
            # If the waiting time exceeds the preset timeout, give up
            if time.time() - start_time > timeout:
                logging.warning(
                    f"Waiting for port {port} to start timed out, please check the log files"
                )
                return False
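The cells below also call stop_server, which is not defined in this section. A minimal sketch, assuming it only needs to terminate the launched server process, could look like this (the real helper may perform additional cleanup):
[ ]:
def stop_server(server_process):
    # Terminate the launched server process and wait for it to exit.
    # Minimal sketch; the helper used elsewhere may do extra cleanup.
    if psutil.pid_exists(server_process.pid):
        server_process.terminate()
        try:
            server_process.wait(timeout=30)
        except subprocess.TimeoutExpired:
            server_process.kill()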
[ ]:
port = 8090
server_process = subprocess.Popen(
    [
        "/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
        "--checkpoint_path=/mnt/nas1/hf/models--Qwen--Qwen1.5-0.5B-Chat/snapshots/6114e9c18dac0042fa90925f03b046734369472f/",
        "--model_type=qwen_2",
        # Popen receives an argv list (no shell), so the JSON needs no extra quoting.
        '--lora_info={"test1": "/mnt/nas1/lora/qwen_lora_18000/lora"}',
        f"--start_port={port}",
    ]
)
wait_sever_done(server_process, port)
[ ]:
url = f"http://localhost:{port}"
json_data = {
"prompt": "who are you",
"generate_config": {"max_new_tokens": 32, "temperature": 0}
}
response = requests.post(url, json=json_data)
print(f"Output 0: {response.json()}")
[ ]:
stop_server(server_process)
Serving Multiple Adapters (Dynamic LoRA)#
Feature | Description
---|---
Fusion Method | LoRA weights are loaded dynamically as plugins at inference time; the base weights are not modified.
Applicable Scenarios | Switching between multiple LoRAs, or selecting a LoRA per request.
Limitations | Only some models support LoRA.
Usage#
Specify the LoRA to use for a request through generate_config.adapter_name.
Field rules:
- Type: str or list[str]
- The number and order of elements must exactly match the prompts in the request
- Leave it empty or omit it to run the request without LoRA
A single-string adapter_name example is shown after the batch request below.
When the startup parameter --lora_info contains multiple elements, the server automatically enters dynamic mode. Example:
--lora_info {"test0":"/mnt/nas1/lora/qwen_lora_18000/lora", "test1":"/mnt/nas1/lora/qwen_lora_18000/lora"}
[ ]:
server_process = subprocess.Popen(
    [
        "/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
        "--checkpoint_path=/mnt/nas1/hf/models--Qwen--Qwen1.5-0.5B-Chat/snapshots/6114e9c18dac0042fa90925f03b046734369472f/",
        "--model_type=qwen_2",
        # Two adapters are registered, so the server enters dynamic mode.
        '--lora_info={"test1": "/mnt/nas1/lora/qwen_lora_18000/lora", "test0": "/mnt/nas1/lora/qwen_lora_18000/lora"}',
        f"--start_port={port}",
    ]
)
wait_sever_done(server_process, port)
[ ]:
url = f"http://localhost:{port}"
json_data = {
"prompt_batch": [
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nhello|im_end|>\n<|im_start|>assistant",
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nhello|im_end|>\n<|im_start|>assistant",
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nhello|im_end|>\n<|im_start|>assistant",
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nhello|im_end|>\n<|im_start|>assistant",
],
"generate_config": {
"max_new_tokens": 500,
"top_k": 1,
"top_p": 0,
"adapter_name": ["test0", "", "test0", ""]
}
}
response = requests.post(
url + "/generate",
json=json_data,
)
print(f"Output 0: {response.json()['response_batch'][0]}")
print(f"Output 1: {response.json()['response_batch'][1]}")
[ ]:
stop_server(server_process)