However, existing Contrastive Language-Image Pre-training (CLIP)-based multimodal networks often suffer from incomplete fusion of the two modalities and a lack of multi-scale contextual information. To ...
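Since the abstract is truncated, the paper's actual fusion design is not shown here. As a rough illustration of what "multi-scale fusion of the two modalities" can mean in this setting, the sketch below conditions image feature maps at several spatial scales on a pooled text embedding. All names (`MultiScaleTextImageFusion`, the channel widths, the gating scheme) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class MultiScaleTextImageFusion(nn.Module):
    """Hypothetical sketch: fuse a global text embedding with image feature
    maps at several spatial scales (e.g., taken from a CLIP image backbone),
    using a simple text-conditioned gate per scale. Not the paper's method."""

    def __init__(self, text_dim=512, img_dims=(256, 512, 1024)):
        super().__init__()
        # One projection per scale so the text embedding matches each
        # image feature map's channel width.
        self.text_proj = nn.ModuleList([nn.Linear(text_dim, c) for c in img_dims])
        self.gates = nn.ModuleList([nn.Conv2d(c, c, kernel_size=1) for c in img_dims])

    def forward(self, text_emb, img_feats):
        # text_emb:  (B, text_dim) pooled text embedding
        # img_feats: list of (B, C_i, H_i, W_i) feature maps, fine to coarse
        fused = []
        for proj, gate, feat in zip(self.text_proj, self.gates, img_feats):
            t = proj(text_emb)[:, :, None, None]      # (B, C_i, 1, 1)
            attn = torch.sigmoid(gate(feat) * t)      # text-conditioned gate
            fused.append(feat + feat * attn)          # residual fusion per scale
        return fused

if __name__ == "__main__":
    B = 2
    text = torch.randn(B, 512)                        # dummy text embedding
    feats = [torch.randn(B, 256, 28, 28),             # dummy multi-scale features
             torch.randn(B, 512, 14, 14),
             torch.randn(B, 1024, 7, 7)]
    out = MultiScaleTextImageFusion()(text, feats)
    print([f.shape for f in out])
```

Gating every scale separately, rather than fusing only the final pooled features, is one way to address the two issues the abstract names: the text signal reaches all levels of the image hierarchy, and the fused output retains multi-scale spatial context.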