Publications

You can also find my articles on my Google Scholar profile.

Journal Papers


Attention Overlap Is Responsible for The Entity Missing Problem in Text-to-Image Diffusion Models!

A. Marioriyad, M.A. Banayeeanzade, R. Abbasi, M.H. Rohban, M. Soleymani Baghshah. · TMLR 2025

We investigate the root cause of the entity missing problem in text-to-image diffusion models, where certain objects described in prompts fail to appear in the generated images. Through detailed analysis, we demonstrate that overlapping attention maps between entities suppress their independent representation, leading to object omission. To address this, we propose simple attention separation techniques that reduce attention overlap and significantly improve entity inclusion rates across various diffusion models and datasets, without compromising image quality.

Paper | Code
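The core mechanism lends itself to a compact illustration. Below is a minimal PyTorch sketch of the attention-separation idea: it measures the soft overlap between the cross-attention maps of two entity tokens and defines a pairwise penalty whose minimization pulls the maps apart. The function names and the gradient-based update hinted at in the comments are illustrative assumptions, not the paper's actual implementation.

```python
import torch

def attention_overlap(attn_a: torch.Tensor, attn_b: torch.Tensor) -> torch.Tensor:
    """Soft IoU-style overlap between two cross-attention maps.

    attn_a, attn_b: non-negative (H, W) maps for two entity tokens.
    Returns a scalar in [0, 1]; higher means the entities compete
    for the same spatial regions.
    """
    a = attn_a / (attn_a.sum() + 1e-8)
    b = attn_b / (attn_b.sum() + 1e-8)
    intersection = torch.minimum(a, b).sum()
    union = torch.maximum(a, b).sum()
    return intersection / (union + 1e-8)

def separation_loss(entity_maps: list[torch.Tensor]) -> torch.Tensor:
    """Average pairwise overlap; minimizing it pushes each entity's
    attention toward its own spatial region."""
    loss = entity_maps[0].new_zeros(())
    pairs = 0
    for i in range(len(entity_maps)):
        for j in range(i + 1, len(entity_maps)):
            loss = loss + attention_overlap(entity_maps[i], entity_maps[j])
            pairs += 1
    return loss / max(pairs, 1)

# During sampling, one could extract the cross-attention maps for the
# entity tokens at each denoising step and nudge the latent along the
# negative gradient of separation_loss(maps) to pull the maps apart.
```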

Conference Papers


Visual Structures Help Visual Reasoning: Addressing the Binding Problem in VLMs

A. Izadi*, M.A. Banayeeanzade*, F. Askari, A. Rahimiakbar, M.M. Vahedi, H. Hasani, M. Soleymani Baghshah. · NeurIPS 2025

We propose a lightweight, model-agnostic method that improves visual reasoning in vision-language models by adding low-level spatial structures (e.g., horizontal lines) to the input image, together with guided prompts. The method yields substantial gains on tasks such as visual search, counting, and spatial reasoning without retraining or added computation, outperforming standard Chain-of-Thought prompting and matching fine-tuned models on several benchmarks.

Paper | Code
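As a rough illustration of the idea, the Pillow sketch below overlays evenly spaced horizontal lines on an image before it is passed to a VLM. The function name, default line count, and prompt wording are illustrative choices, not the paper's exact setup.

```python
from PIL import Image, ImageDraw

def add_horizontal_guides(image: Image.Image, n_lines: int = 4,
                          color: str = "red", width: int = 2) -> Image.Image:
    """Overlay evenly spaced horizontal lines on a copy of the image,
    giving the VLM a low-level spatial scaffold to reason over."""
    out = image.copy()
    draw = ImageDraw.Draw(out)
    for k in range(1, n_lines + 1):
        y = round(k * out.height / (n_lines + 1))
        draw.line([(0, y), (out.width, y)], fill=color, width=width)
    return out

# structured = add_horizontal_guides(Image.open("scene.png"))
# Paired with a guided prompt such as:
# "The image is divided into horizontal bands; count the objects band by band."
```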

CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation

R. Abbasi, A. Nazari, A. Sefid, M.A. Banayeeanzade, M.H. Rohban, M. Soleymani Baghshah. · CVPR 2025

This paper investigates how the CLIP model encodes and processes scenes containing multiple objects. We conduct a comprehensive analysis of CLIP's internal representations, focusing on its capacity to distinguish, localize, and semantically relate multiple entities within complex visual inputs. Through controlled experiments and probing techniques, we reveal the model's strengths and limitations in multi-object understanding and compositional reasoning. Our findings provide insights for improving multimodal models and guiding future benchmarks for compositional vision-language tasks.

Paper | Code
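A simple probe in this spirit can be run with the open-source CLIP checkpoints on Hugging Face. The sketch below scores a multi-object image against single-object and two-object captions, including a word-order control; the image path and caption set are placeholder choices for illustration, not the paper's evaluation protocol.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("two_objects.png")  # e.g. a dog next to a bicycle
captions = [
    "a photo of a dog",
    "a photo of a bicycle",
    "a photo of a dog and a bicycle",
    "a photo of a bicycle and a dog",  # word-order control
]

with torch.no_grad():
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image.squeeze(0)

# If the model truly binds both entities, the two-object captions should
# dominate and be insensitive to word order.
for caption, score in zip(captions, logits.softmax(dim=-1).tolist()):
    print(f"{score:.3f}  {caption}")
```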

Workshop Papers


Analyzing CLIP's Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study

R. Abbasi, A. Nazari, A. Sefid, M.A. Banayeeanzade, M.H. Rohban, M. Soleymani Baghshah. · ECCV 2024 (EVAL-FoMo Workshop)

In this initial study, we evaluate CLIP's behavior in multi-object scenarios using a high-resolution synthetic dataset designed for controlled experimentation. Our results show that CLIP often struggles to distinguish and localize multiple entities in a scene. Although primarily empirical, the findings raised important questions about CLIP's compositional limitations and motivated our subsequent in-depth analysis presented at CVPR 2025.

Paper | Code
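The controlled setup can be approximated with a small compositing helper: paste known object crops onto a plain high-resolution canvas so that entity identity, count, and position are fixed by construction. The sketch below is a hypothetical generator in this spirit, not the dataset code used in the paper.

```python
from PIL import Image

def compose_scene(obj_a: Image.Image, obj_b: Image.Image,
                  size: int = 1024, background: str = "white") -> Image.Image:
    """Place two object crops at fixed, non-overlapping positions on a
    plain canvas, so every scene attribute is controlled by construction."""
    canvas = Image.new("RGB", (size, size), background)
    canvas.paste(obj_a, (size // 8, size // 2 - obj_a.height // 2))
    canvas.paste(obj_b, (5 * size // 8, size // 2 - obj_b.height // 2))
    return canvas

# scene = compose_scene(Image.open("dog.png"), Image.open("bicycle.png"))
# The scene can then be scored with CLIP as in the probing sketch above.
```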