Abstract: Directly applying CLIP ... path encodes the candidate labels with a text encoder. Finally, we fuse the output of the three paths to obtain the predicted action label. Extensive experiments ...