LoRA Serving#

RTP-LLM supports Static LoRA and Dynamic LoRA inference modes. For background on how LoRA works, refer to the official Hugging Face documentation.

Serving a Single Adapter (Static LoRA)#

Feature                Description
Fusion Method          The base model and the specified LoRA weights are permanently fused before inference (irreversible).
Applicable Scenarios   A single LoRA is needed and optimal performance is the priority.
Limitations            Once fused, the base model cannot be restored, so the original (non-LoRA) output can no longer be obtained alongside the LoRA output.

Usage#

  • When the startup parameter --lora_info contains only one entry, the server automatically enters static mode.

  • Example: --lora_info {"test0":"/mnt/nas1/lora/taoshang_qwen_lora_18000/lora"} (the value is a JSON object; the sketch below shows one way to build it from Python).
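
Because --lora_info takes a JSON object, it can be convenient to build the value with json.dumps when launching the server from Python, as the cells below do. A minimal sketch; the adapter name and path here are placeholders, not files shipped with RTP-LLM:

[ ]:
import json

# Hypothetical adapter name and path; substitute your own LoRA checkpoint directory.
lora_info = json.dumps({"test0": "/path/to/your/lora"})
print(f"--lora_info={lora_info}")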

[ ]:
import logging
import socket
import subprocess
import time

import psutil
import requests


def wait_server_done(server_process, port: int, timeout: int = 1600):
    """Block until the server accepts connections on `port`, or return False on failure."""
    host = "localhost"
    retry_interval = 1  # Retry interval (seconds)
    start_time = time.time()

    port = str(port)

    logging.info(f"Waiting for pid[{server_process.pid}] to start...\nPort {port}")
    while True:
        try:
            # Try to connect to the specified host and port
            sock = socket.create_connection((host, port), timeout=timeout)
            sock.close()
            logging.info(f"Port {port} started successfully")
            return True
        except (socket.error, ConnectionRefusedError):
            # If the connection fails, wait a while before retrying
            time.sleep(retry_interval)

            # Give up if the server process has already exited
            if (
                not psutil.pid_exists(server_process.pid)
                or server_process.poll() is not None
            ):
                logging.warning(
                    f"Process [{server_process.pid}] is in an abnormal state; "
                    "server startup failed, please check the log files"
                )
                return False
            # Give up if the waiting time exceeds the preset timeout
            if time.time() - start_time > timeout:
                logging.warning(
                    f"Waiting for port {port} timed out, please check the log files"
                )
                return False
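
Later cells also call stop_server, which is not defined in this notebook. A minimal sketch, assuming it only needs to terminate the launched server subprocess:

[ ]:
def stop_server(server_process, timeout: int = 60):
    """Terminate the launched server process; force-kill if it does not exit in time."""
    server_process.terminate()
    try:
        server_process.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        server_process.kill()
        server_process.wait()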
[ ]:
port = 8090
server_process = subprocess.Popen(
    [
        "/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
        "--checkpoint_path=/mnt/nas1/hf/models--Qwen--Qwen1.5-0.5B-Chat/snapshots/6114e9c18dac0042fa90925f03b046734369472f/",
        "--model_type=qwen_2",
        # A single entry in --lora_info selects static mode.
        '--lora_info={"test1": "/mnt/nas1/lora/qwen_lora_18000/lora"}',
        f"--start_port={port}",
    ]
)
wait_server_done(server_process, port)
[ ]:
url = f"http://localhost:{port}"
json_data = {
     "prompt": "who are you",
     "generate_config": {"max_new_tokens": 32, "temperature": 0}
}

response = requests.post(url, json=json_data)
print(f"Output 0: {response.json()}")
[ ]:
stop_server(server_process)

Serving Multiple Adapters (Dynamic LoRA)#

Feature                Description
Fusion Method          LoRA weights are loaded dynamically as plugins at inference time; the base model weights are not modified.
Applicable Scenarios   Multiple LoRAs need to be switched between or selected per request.
Limitations            Only some model types support LoRA.

Usage#

  • Specify the LoRA to use for a request through generate_config.adapter_name.

  • Field rules (see the batch example below and the single-prompt sketch that follows it):

      • Type: str or list[str]

      • When a list is passed, the number and order of its entries must match the prompts in the request (one adapter name per prompt).

      • An empty string or an omitted field means no LoRA is applied.

  • When the startup parameter --lora_info contains multiple entries, the server automatically enters dynamic mode.

  • Example: --lora_info {"test0":"/mnt/nas1/lora/qwen_lora_18000/lora", "test1":"/mnt/nas1/lora/qwen_lora_18000/lora"}

[ ]:
server_process = subprocess.Popen(
    [
        "/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
        "--checkpoint_path=/mnt/nas1/hf/models--Qwen--Qwen1.5-0.5B-Chat/snapshots/6114e9c18dac0042fa90925f03b046734369472f/",
        "--model_type=qwen_2",
        # Multiple entries in --lora_info select dynamic mode.
        '--lora_info={"test1": "/mnt/nas1/lora/qwen_lora_18000/lora", "test0": "/mnt/nas1/lora/qwen_lora_18000/lora"}',
        f"--start_port={port}",
    ]
)
wait_server_done(server_process, port)
[ ]:
url = f"http://localhost:{port}"
json_data = {
    "prompt_batch": [
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nhello|im_end|>\n<|im_start|>assistant",
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nhello|im_end|>\n<|im_start|>assistant",
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nhello|im_end|>\n<|im_start|>assistant",
        "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nhello|im_end|>\n<|im_start|>assistant",
    ],
    "generate_config": {
        "max_new_tokens": 500,
        "top_k": 1,
        "top_p": 0,
        "adapter_name": ["test0", "", "test0", ""]
    }
}
response = requests.post(
    url + "/generate",
    json=json_data,
)
print(f"Output 0: {response.json()['response_batch'][0]}")
print(f"Output 1: {response.json()['response_batch'][1]}")
[ ]:

stop_server(server_process)