LoRA Serving#
RTP-LLM supports two LoRA inference modes: Static LoRA and Dynamic LoRA. For background on how LoRA works, refer to the official Hugging Face documentation.
Serving a Single Adapter (Static LoRA)#
Feature | Description
---|---
Fusion Method | Before inference, the base model and the specified LoRA weights are fused permanently; the fusion is irreversible.
Applicable Scenarios | Serving a single LoRA where the best possible performance is required.
Limitations | After fusion the base model cannot be restored, so the base model's original (non-LoRA) output can no longer be obtained alongside the fused output.
Usage#
When the startup parameter --lora_info contains only one element, the server automatically enters static mode. Example:
--lora_info {"test0":"/mnt/nas1/lora/taoshang_qwen_lora_18000/lora"}
[ ]:
import logging
import socket
import subprocess
import time

import psutil
import requests


def wait_sever_done(server_process, port: int, timeout: int = 1600):
    host = "localhost"
    retry_interval = 1  # Retry interval (seconds)
    start_time = time.time()
    port = str(port)
    logging.info(f"Waiting for pid[{server_process.pid}] to start...\nPort {port}")
    while True:
        try:
            # Try to connect to the specified host and port
            sock = socket.create_connection((host, port), timeout=timeout)
            sock.close()
            logging.info(f"Port {port} started successfully")
            return True
        except (socket.error, ConnectionRefusedError):
            # If the connection fails, wait a while before retrying
            time.sleep(retry_interval)
            if (
                not psutil.pid_exists(server_process.pid)
                or server_process.poll() is not None
            ):
                logging.warning(
                    f"Process [{server_process.pid}] exited unexpectedly, service startup failed, please check the log files"
                )
                return False
            # If the waiting time exceeds the preset timeout, give up
            if time.time() - start_time > timeout:
                logging.warning(
                    f"Waiting for port {port} to start timed out, please check the log files"
                )
                return False
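The cells below also call stop_server, which is not defined in this section. A minimal sketch, assuming it only needs to terminate the launched server process, could look like this (the real helper may perform additional cleanup):
[ ]:
def stop_server(server_process):
    # Terminate the launched server process and wait for it to exit.
    # Minimal sketch; the helper used elsewhere may do extra cleanup.
    if psutil.pid_exists(server_process.pid):
        server_process.terminate()
        try:
            server_process.wait(timeout=30)
        except subprocess.TimeoutExpired:
            server_process.kill()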
[ ]:
port = 8090
server_process = subprocess.Popen(
    [
        "/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
        "--checkpoint_path=/mnt/nas1/hf/models--Qwen--Qwen1.5-0.5B-Chat/snapshots/6114e9c18dac0042fa90925f03b046734369472f/",
        "--model_type=qwen_2",
        # Popen receives an argv list (no shell), so the JSON needs no extra quoting.
        '--lora_info={"test1": "/mnt/nas1/lora/qwen_lora_18000/lora"}',
        f"--start_port={port}",
    ]
)
wait_sever_done(server_process, port)
[ ]:
url = f"http://localhost:{port}"
json_data = {
"prompt": "who are you",
"generate_config": {"max_new_tokens": 32, "temperature": 0}
}
response = requests.post(url, json=json_data)
print(f"Output 0: {response.json()}")
[ ]:
stop_server(server_process)
Serving Multiple Adapters (Dynamic LoRA)#
Feature | Description
---|---
Fusion Method | LoRA weights are loaded dynamically as plugins at inference time; the base weights are not modified.
Applicable Scenarios | Switching between multiple LoRAs, or selecting a LoRA per request.
Limitations | Only some models support LoRA.
Usage#
Specify the LoRA to use for a request through generate_config.adapter_name.
Field rules:
- Type: str or list[str]
- The number and order of elements must exactly match the prompts in the request
- Leave it empty or omit it to run the request without LoRA
A single-string adapter_name example is shown after the batch request below.
When the startup parameter --lora_info contains multiple elements, the server automatically enters dynamic mode. Example:
--lora_info {"test0":"/mnt/nas1/lora/qwen_lora_18000/lora", "test1":"/mnt/nas1/lora/qwen_lora_18000/lora"}
[ ]:
server_process = subprocess.Popen(
    [
        "/opt/conda310/bin/python", "-m", "rtp_llm.start_server",
        "--checkpoint_path=/mnt/nas1/hf/models--Qwen--Qwen1.5-0.5B-Chat/snapshots/6114e9c18dac0042fa90925f03b046734369472f/",
        "--model_type=qwen_2",
        # Two adapters are registered, so the server enters dynamic mode.
        '--lora_info={"test1": "/mnt/nas1/lora/qwen_lora_18000/lora", "test0": "/mnt/nas1/lora/qwen_lora_18000/lora"}',
        f"--start_port={port}",
    ]
)
wait_sever_done(server_process, port)
[ ]:
url = f"http://localhost:{port}"
json_data = {
"prompt_batch": [
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nhello|im_end|>\n<|im_start|>assistant",
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nhello|im_end|>\n<|im_start|>assistant",
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nhello|im_end|>\n<|im_start|>assistant",
"<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nhello|im_end|>\n<|im_start|>assistant",
],
"generate_config": {
"max_new_tokens": 500,
"top_k": 1,
"top_p": 0,
"adapter_name": ["test0", "", "test0", ""]
}
}
response = requests.post(
url + "/generate",
json=json_data,
)
print(f"Output 0: {response.json()['response_batch'][0]}")
print(f"Output 1: {response.json()['response_batch'][1]}")
[ ]:
stop_server(server_process)