YOLO-Count: Differentiable Object Counting for Text-to-Image Generation

Kavli Affiliate: Xiang Zhang

| First 5 Authors: Guanning Zeng, Guanning Zeng, , ,

| Summary:

We propose YOLO-Count, a differentiable open-vocabulary object counting model
that tackles both general counting challenges and enables precise quantity
control for text-to-image (T2I) generation. A core contribution is the
‘cardinality’ map, a novel regression target that accounts for variations in
object size and spatial distribution. Leveraging representation alignment and a
hybrid strong-weak supervision scheme, YOLO-Count bridges the gap between
open-vocabulary counting and T2I generation control. Its fully differentiable
architecture facilitates gradient-based optimization, enabling accurate object
count estimation and fine-grained guidance for generative models. Extensive
experiments demonstrate that YOLO-Count achieves state-of-the-art counting
accuracy while providing robust and effective quantity control for T2I systems.

| Search Query: ArXiv Query: search_query=au:”Xiang Zhang”&id_list=&start=0&max_results=3