# Multimodal Part Debug and Separate Deployment
## Multimodal Development
A multimodal model is an LLM that accepts multimodal inputs. Currently, the primary input format is URLs, marked by specific placeholders in the OpenAI message format. Each multimodal model has its own preprocessing pipeline, but every model must implement the following interfaces:

The multimodal model must inherit from `MultimodalMixin` in `rtp_llm/models/multimodal/multimodal_mixin.py` and instantiate `mm_part` as the processing class for multimodal inputs. `mm_part` has different interface implementations depending on the input type (images, videos, or audio). The logic must be self-consistent, and the most important interfaces are `mm_embedding`, `_mm_preprocess`, and `mm_process`:

- `mm_embedding` has a default implementation that calls `_mm_preprocess` and `mm_process`, converting a multimodal input URL into an embedding tensor plus auxiliary information (e.g., position IDs).
- `_mm_preprocess` also has default implementations for specific modalities; it preprocesses the byte data fetched from the input URL and prepares the inputs for `mm_process`. This separation is necessary because preprocessing is CPU-bound, while the subsequent processing is GPU-bound.
- `mm_process` performs the GPU-based transformation of the preprocessed inputs into outputs.
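The default flow can be sketched as follows. This is an illustrative stand-in only: the real interfaces live in `rtp_llm/models/multimodal/multimodal_mixin.py` and operate on torch tensors, while the class name and bodies below are simplified fakes that only mirror the call structure.

```python
# Illustrative sketch of the mm_part interface contract; names mirror the
# real interfaces but the bodies are toy stand-ins (no torch, no real decoding).
class ImageMMPart:
    def _mm_preprocess(self, url):
        # CPU-bound stage: fetch/decode bytes from the URL and build model
        # inputs (faked here by using the URL's length as "pixel data").
        return [float(len(url))]

    def mm_process(self, preprocessed):
        # GPU-bound stage in the real implementation: transform preprocessed
        # inputs into an embedding (faked here as an element-wise doubling).
        return [x * 2 for x in preprocessed]

    def mm_embedding(self, url):
        # Default flow: _mm_preprocess (CPU) -> mm_process (GPU), returning
        # the embedding plus auxiliary information such as position IDs.
        preprocessed = self._mm_preprocess(url)
        embedding = self.mm_process(preprocessed)
        pos_ids = list(range(len(embedding)))
        return embedding, pos_ids
```

The point of the split is scheduling: `_mm_preprocess` can run on CPU workers while `mm_process` batches work on the GPU.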
For model weights, the required weights must be registered in `GptInitModelParameters` under `mm_related_params.vit_weights`. Refer to `BaseVitWeights` for the specific implementation logic.
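The shape of that registration can be sketched with a toy stand-in. This is a hypothetical sketch only: check the actual `BaseVitWeights` constructor and naming rules in the repository, since the class below merely mimics the idea of recording which module holds the ViT weights so the loader can route them to the vision tower rather than the LLM.

```python
class BaseVitWeightsSketch:
    """Toy stand-in for BaseVitWeights: records which module owns the ViT
    weights. The real class lives in rtp_llm and has additional logic."""

    def __init__(self, weights):
        # e.g. {"vit": vit_module}; key names here are hypothetical.
        self.weights = weights


vit_module = object()  # stands in for the actual vision tower module
vit_weights = BaseVitWeightsSketch({"vit": vit_module})
# In a real model this object would be assigned to
# config.mm_related_params.vit_weights during weight registration.
```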
## Debug
Start the multimodal part:
```bash
START_PORT=12345 \
VIT_SEPARATION=1 \
ACT_TYPE=bf16 \
MODEL_TYPE=qwen2_5_vl \
CHECKPOINT_PATH=/home/xieshui.yyx/Qwen2.5-VL-3B \
/opt/conda310/bin/python -m rtp_llm.start_server
```
Start a gRPC client:
```python
import grpc
import torch

from rtp_llm.cpp.proto.model_rpc_service_pb2_grpc import (
    MultimodalRpcServiceStub,
)
from rtp_llm.cpp.proto.model_rpc_service_pb2 import (
    MultimodalInputPB,
    MultimodalInputsPB,
)
from rtp_llm.utils.grpc_util import trans_tensor


def trans_multimodal_input(urls):
    input_pb = MultimodalInputsPB()
    for url in urls:
        mm_input_pb = MultimodalInputPB()
        mm_input_pb.multimodal_url = url
        mm_input_pb.multimodal_type = 1
        # -1 means "use the model's default" for each preprocess option
        mm_input_pb.mm_preprocess_config.width = -1
        mm_input_pb.mm_preprocess_config.height = -1
        mm_input_pb.mm_preprocess_config.min_pixels = -1
        mm_input_pb.mm_preprocess_config.max_pixels = -1
        mm_input_pb.mm_preprocess_config.fps = -1
        mm_input_pb.mm_preprocess_config.min_frames = -1
        mm_input_pb.mm_preprocess_config.max_frames = -1
        input_pb.multimodal_inputs.append(mm_input_pb)
    return input_pb


def main():
    # gRPC port is START_PORT + 1 (12345 + 1 = 12346)
    with grpc.insecure_channel(
        'localhost:12346',
        options=[
            ('grpc.max_receive_message_length', 1024 * 1024 * 1024),
            ('grpc.max_send_message_length', 1024 * 1024 * 1024),
        ],
    ) as channel:
        stub = MultimodalRpcServiceStub(channel)
        response = stub.RemoteMultimodalEmbedding(
            trans_multimodal_input(['/mnt/nas1/hf/llava-v1.5-7b/1.jpg'])
        )
        for res in response.multimodal_outputs:
            print(trans_tensor(res.multimodal_embedding))
            print(trans_tensor(res.multimodal_pos_id))


if __name__ == '__main__':
    main()
```
Hint: the gRPC port is `START_PORT + 1`.