# Multimodal Part Debug and Separate Deployment

## Multimodal Development
A multimodal model refers to an LLM that incorporates multimodal inputs. Currently, the primary input format is URLs, which are marked by specific placeholders using the OpenAI request format (an example request follows the interface list below). Each multimodal model has its own preprocessing pipeline, but every model must implement the following interfaces:
The multimodal model must inherit from `MultimodalMixin` in `rtp_llm/models/multimodal/multimodal_mixin.py` and instantiate `mm_part` as the processing class for multimodal inputs. `mm_part` has different interface implementations depending on the input type, such as images, videos, or audio. The logic must be self-consistent, and the most critical interfaces are `mm_embedding`, `_mm_preprocess`, and `mm_process`:

- `mm_embedding` has a default implementation that calls `_mm_preprocess` and `mm_process`, converting a multimodal input URL into an embedding tensor and other information (e.g., position IDs).
- `_mm_preprocess` also has default implementations for specific modalities; it preprocesses the byte data fetched from the input URL and prepares the inputs for `mm_process`. This separation is necessary because preprocessing is CPU-bound, while the subsequent processing is GPU-bound.
- `mm_process` handles the GPU-based transformation of preprocessed inputs into outputs.
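As referenced above, multimodal URLs typically arrive as structured content parts in an OpenAI-style chat request. The following request body is purely illustrative (the model name and URL are placeholders, and the exact placeholder substitution is model-specific):

```python
# Illustrative OpenAI-style request body. Each "image_url" content part is
# what gets mapped to a model-specific placeholder in the prompt, with the
# URL handed to the multimodal preprocessing pipeline.
request_body = {
    'model': 'qwen2_5_vl',
    'messages': [
        {
            'role': 'user',
            'content': [
                {'type': 'text', 'text': 'Describe this image.'},
                {
                    'type': 'image_url',
                    'image_url': {'url': 'https://example.com/demo.jpg'},
                },
            ],
        }
    ],
}
```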
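To make the `mm_part` contract concrete, here is a minimal sketch of an image-only part. It is a sketch under assumptions: the class name, constructor, and method signatures are illustrative, not the actual interfaces defined in `multimodal_mixin.py`.

```python
# Minimal, illustrative sketch of an image mm_part. Class name, constructor,
# and method signatures are assumptions, not the real interfaces in
# rtp_llm/models/multimodal/multimodal_mixin.py.
from io import BytesIO

import numpy as np
import requests  # assumed choice for fetching URL bytes
import torch
from PIL import Image


class SketchImagePart:
    def __init__(self, vision_tower: torch.nn.Module, device: str = 'cuda'):
        self.vision_tower = vision_tower.to(device)
        self.device = device

    def _mm_preprocess(self, url: str) -> torch.Tensor:
        # CPU-bound stage: fetch the bytes behind the URL and decode them
        # into a normalized NCHW pixel tensor.
        image = Image.open(BytesIO(requests.get(url).content)).convert('RGB')
        pixels = np.asarray(image.resize((336, 336)), dtype=np.float32) / 255.0
        return torch.from_numpy(pixels).permute(2, 0, 1).unsqueeze(0)

    @torch.inference_mode()
    def mm_process(self, pixels: torch.Tensor) -> torch.Tensor:
        # GPU-bound stage: run the vision tower on the preprocessed inputs.
        return self.vision_tower(pixels.to(self.device))

    def mm_embedding(self, url: str) -> torch.Tensor:
        # Mirrors the default flow: _mm_preprocess (CPU), then mm_process (GPU).
        return self.mm_process(self._mm_preprocess(url))
```

A model class inheriting `MultimodalMixin` would assign an instance like this to `mm_part` during initialization.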
For model weights, the required weights must be registered in `GptInitModelParameters` under `mm_related_params.vit_weights`. Refer to `BaseVitWeights` for the specific implementation logic.
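As a hedged sketch of that registration step (the `BaseVitWeights` constructor arguments and the exact `mm_related_params` layout shown here are assumptions; the class itself is the authoritative reference):

```python
# Hypothetical sketch: constructor arguments and attribute layout are assumed.
# Import paths are omitted on purpose -- locate GptInitModelParameters and
# BaseVitWeights in the rtp_llm codebase.
def register_vit_weights(config, vision_tower):
    # config: a GptInitModelParameters instance for this model.
    # Map a weight-name prefix to the module whose parameters belong to the
    # multimodal (ViT) part, so the loader can pick them up separately.
    config.mm_related_params.vit_weights = BaseVitWeights({'visual': vision_tower})
```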
## Debug
Start the multimodal part as a separate server (`VIT_SEPARATION=1`):
```bash
START_PORT=12345 \
VIT_SEPARATION=1 \
ACT_TYPE=bf16 \
MODEL_TYPE=qwen2_5_vl \
CHECKPOINT_PATH=/home/xieshui.yyx/Qwen2.5-VL-3B \
/opt/conda310/bin/python -m rtp_llm.start_server
```
Then start a gRPC client:
```python
import grpc
import torch

from rtp_llm.cpp.proto.model_rpc_service_pb2 import (
    MultimodalInputPB,
    MultimodalInputsPB,
)
from rtp_llm.cpp.proto.model_rpc_service_pb2_grpc import (
    MultimodalRpcServiceStub,
)
from rtp_llm.utils.grpc_util import trans_tensor


def trans_multimodal_input(urls):
    # Build the request protobuf; -1 appears to leave each preprocess
    # option at its server-side default.
    input_pb = MultimodalInputsPB()
    for url in urls:
        mm_input_pb = MultimodalInputPB()
        mm_input_pb.multimodal_url = url
        mm_input_pb.multimodal_type = 1  # type value used here for an image URL
        mm_input_pb.mm_preprocess_config.width = -1
        mm_input_pb.mm_preprocess_config.height = -1
        mm_input_pb.mm_preprocess_config.min_pixels = -1
        mm_input_pb.mm_preprocess_config.max_pixels = -1
        mm_input_pb.mm_preprocess_config.fps = -1
        mm_input_pb.mm_preprocess_config.min_frames = -1
        mm_input_pb.mm_preprocess_config.max_frames = -1
        input_pb.multimodal_inputs.append(mm_input_pb)
    return input_pb


def main():
    # The gRPC port is START_PORT + 1, i.e. 12346 for the server above.
    options = [
        ('grpc.max_receive_message_length', 1024 * 1024 * 1024),
        ('grpc.max_send_message_length', 1024 * 1024 * 1024),
    ]
    with grpc.insecure_channel('localhost:12346', options=options) as channel:
        stub = MultimodalRpcServiceStub(channel)
        response = stub.RemoteMultimodalEmbedding(
            trans_multimodal_input(['/mnt/nas1/hf/llava-v1.5-7b/1.jpg'])
        )
        for res in response.multimodal_outputs:
            print(trans_tensor(res.multimodal_embedding))
            print(trans_tensor(res.multimodal_pos_id))


if __name__ == '__main__':
    main()
```
Hint: the gRPC port is `START_PORT + 1`, i.e. 12346 for the server started above.