ReuseCache#
In multi-turn conversation scenarios, multiple prompts often share a common prefix. The KV cache for these prefix tokens is identical, so reusing it avoids recomputing the repeated part and lowers First Token Latency. Enable KV cache reuse by setting the environment variable REUSE_CACHE=1; the startup log will show "reuse_cache: True" when it is enabled.
The environment variable SEQ_SIZE_PER_BLOCK specifies the sequence length (number of tokens) that each KV cache block covers.
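As a minimal sketch of this configuration (the SEQ_SIZE_PER_BLOCK value below is only an illustrative placeholder, and it assumes both variables are read when the model is constructed), the two variables can also be set from Python before building the model:

import os

# Enable KV cache reuse; the startup log should then show "reuse_cache: True".
os.environ["REUSE_CACHE"] = "1"
# Tokens covered by each KV cache block; the value 8 is only an illustrative placeholder.
os.environ["SEQ_SIZE_PER_BLOCK"] = "8"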
Note: ReuseCache cannot currently use flash attention because the lengths of Q and KV do not match; --reuse_cache true must be added to the CMD.
import os

from rtp_llm.pipeline import Pipeline
from rtp_llm.model_factory import ModelFactory, ModelConfig

model_name = "Qwen/Qwen-7B-Chat"
model_config = ModelConfig()
model = ModelFactory.from_huggingface(model_name, model_config)
pipeline = Pipeline(model, model.tokenizer)

generate_config = {
    "top_k": 1,
    "max_new_tokens": 100,
}

for res in pipeline("hello, what's your name", generate_config=generate_config):
    print(res.batch_response)

# This query can reuse the KV cache corresponding to the previous query.
# The 'reuse len' in the response's aux info indicates how much KV cache was reused.
# The reuse length is always an integer multiple of SEQ_SIZE_PER_BLOCK.
for res in pipeline("hello, what's your name? how old are you?", generate_config=generate_config):
    print(res.batch_response)

pipeline.stop()
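To illustrate the block-granularity point in the comment above (all numbers here are made up for the example), the reuse length reported in the aux info is expected to be the shared prefix rounded down to whole blocks of SEQ_SIZE_PER_BLOCK tokens:

# Hypothetical numbers, purely for illustration.
seq_size_per_block = 8      # value of SEQ_SIZE_PER_BLOCK
shared_prefix_tokens = 21   # tokens shared between the two prompts above

# Only whole KV cache blocks can be reused, so the reported reuse length is
# an integer multiple of the block size.
reuse_len = (shared_prefix_tokens // seq_size_per_block) * seq_size_per_block
print(reuse_len)  # 16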
MultiTaskPrompt#
MultiTaskPrompt builds a static cache for long system prompts: each request reads the prompt's KV cache directly from the static cache instead of recomputing it. This can significantly reduce the model's First Token Latency.
Usage#
MultiTaskPrompt#
rtp-llm specifies the file containing the system prompt information to be statically cached through the --multi_task_prompt parameter. The file format looks like the following:
[
    {"task_id": 1, "prompt": " <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>"},
    {"task_id": 2, "prompt": "You are a rigorous programmer. You need to think very carefully and answer the following question: "}
]
You can also pass the above JSON directly through the MULTI_TASK_PROMPT_STR environment variable.
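As a minimal sketch (the file name below is only a placeholder), the prompt list can be written to the file referenced by --multi_task_prompt, or serialized into the MULTI_TASK_PROMPT_STR environment variable:

import json
import os

task_prompts = [
    {"task_id": 1, "prompt": " <|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>"},
    {"task_id": 2, "prompt": "You are a rigorous programmer. You need to think very carefully and answer the following question: "},
]

# Option 1: write the JSON file referenced by --multi_task_prompt / MULTI_TASK_PROMPT.
with open("multi_task_prompt.json", "w") as f:  # placeholder file name
    json.dump(task_prompts, f, ensure_ascii=False)

# Option 2: pass the JSON string directly via the environment variable.
os.environ["MULTI_TASK_PROMPT_STR"] = json.dumps(task_prompts, ensure_ascii=False)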
After startup, the model runs the system prompts above and keeps their KV cache in GPU memory. In subsequent requests, specifying a task_id lets the request reuse the corresponding prefix. A demo follows:
Note: MultiTaskPrompt cannot currently use flash attention because the lengths of Q and KV do not match; the environment variable ENABLE_FMHA=OFF (e.g. export ENABLE_FMHA=OFF) must be set before running the code.
import os

from rtp_llm.pipeline import Pipeline
from rtp_llm.model_factory import ModelFactory, ModelConfig

# Per the note above, flash attention must be disabled; this assumes ENABLE_FMHA
# is read when the model is constructed (it can also be exported in the shell instead).
os.environ["ENABLE_FMHA"] = "OFF"

os.environ["MULTI_TASK_PROMPT"] = "/path/to/file"
# os.environ["MULTI_TASK_PROMPT_STR"] = "{json str}"

model_name = "Qwen/Qwen-7B-Chat"
model_config = ModelConfig()
model = ModelFactory.from_huggingface(model_name, model_config)
pipeline = Pipeline(model, model.tokenizer)

# Use the system prompt with task_id=1 to prepend to the request.
generate_config = {
    "top_k": 1,
    "max_new_tokens": 100,
    "task_id": "1"
}
for res in pipeline("hello, what's your name", generate_config=generate_config):
    print(res.batch_response)

# Without a system prompt.
generate_config = {
    "top_k": 1,
    "max_new_tokens": 100,
}
for res in pipeline("hello, what's your name", generate_config=generate_config):
    print(res.batch_response)

pipeline.stop()
Note:#
When MULTI_TASK_PROMPT is used and the REUSE_CACHE function is also enabled, KV cache can still be reused (see the ReuseCache section above). When a task_id is specified, the system prompt for that task_id is prepended to the request, and the longest matching historical request in the KV cache is found and reused. When no task_id is specified, the user's prompt alone is matched against the KV cache to find the longest matching historical request to reuse.
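As a minimal sketch of this combined usage (the prompt file path is a placeholder, and it assumes the environment variables are read when the model is constructed), REUSE_CACHE and MULTI_TASK_PROMPT can be enabled together and a task_id passed per request:

import os

from rtp_llm.pipeline import Pipeline
from rtp_llm.model_factory import ModelFactory, ModelConfig

os.environ["REUSE_CACHE"] = "1"                             # enable KV cache reuse
os.environ["ENABLE_FMHA"] = "OFF"                           # per the MultiTaskPrompt note above
os.environ["MULTI_TASK_PROMPT"] = "multi_task_prompt.json"  # placeholder path to the prompt file

model = ModelFactory.from_huggingface("Qwen/Qwen-7B-Chat", ModelConfig())
pipeline = Pipeline(model, model.tokenizer)

# With task_id: the task's system prompt is prepended, then the longest matching
# historical request in the KV cache is reused.
for res in pipeline("hello, what's your name", generate_config={"top_k": 1, "max_new_tokens": 100, "task_id": "1"}):
    print(res.batch_response)

# Without task_id: the user's prompt alone is matched against the KV cache.
for res in pipeline("hello, what's your name", generate_config={"top_k": 1, "max_new_tokens": 100}):
    print(res.batch_response)

pipeline.stop()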