Open-vocabulary object detection (OVD) aims to detect arbitrary objects with user-provided text labels. Although recent progress has enhanced zero-shot detection ability, current techniques handicap themselves with three important challenges. They heavily depend on expensive and large-scale region-level annotations, which are hard to scale. Their captions are typically short and not contextually rich, which makes them…
