Dedicated undergraduate at Sharif University of Technology, exploring the intersection of Vision Language Models, generative models, and compositional reasoning.
I am a dedicated undergraduate student in Computer Engineering at Sharif University of Technology, with a parallel academic pursuit in Economics. My primary research interests lie at the intersection of Vision Language Models, Generative Models, and Compositional Reasoning.
With a strong academic record (GPA: 19.36/20 in Computer Engineering and 20.0/20 in Economics), I have actively contributed to the research community through multiple co-authored publications in leading venues such as CVPR, TMLR, and ECCV.
My work explores fundamental limitations in current AI systems and proposes novel, practical solutions. I thrive in research-oriented environments and aim to continue advancing AI models that are more generalizable, interpretable, and capable of deep reasoning.
BS in Computer Engineering
Minor BS in Economics
Iranian Matriculation Exams
NLP and Vision Competition
Student Guidance & Leadership
National Ranking
ML Challenge Winner
Olympiad Mentor
h-index
My research explores fundamental limitations in current AI systems and proposes novel, practical solutions. I focus on advancing AI models that are more generalizable, interpretable, and capable of deep reasoning, with particular emphasis on vision-language understanding and compositional reasoning.
Authors: Amir Mohammad Izadi*, Mohammad Ali Banayeeanzade*, Fatemeh Askari, Ali Rahimiakbar, Mohammad Mahdi Vahedi, Hosein Hasani, Mahdieh Soleymani Baghshah.
We propose a lightweight, model-agnostic method that improves vision-language models' visual reasoning by overlaying low-level spatial structures (e.g., horizontal lines) on input images and pairing them with guided prompts. The method yields substantial gains on tasks such as visual search, counting, and spatial reasoning without retraining or added computation, outperforming standard Chain-of-Thought prompting and matching fine-tuned models on several benchmarks.
Authors: Arash Mari Oriyad, Mohammad Ali Banayeeanzade, Reza Abbasi, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah.
We investigate the root cause of the entity missing problem in text-to-image diffusion models, where certain objects described in prompts fail to appear in the generated images. Through detailed analysis, we demonstrate that overlapping attention maps between entities suppress their independent representation, leading to object omission. To address this, we propose simple attention separation techniques that reduce attention overlap and significantly improve entity inclusion rates across various diffusion models and datasets, without compromising image quality.
Authors: Reza Abbasi, Ali Nazari, Aminreza Sefid, Mohammad Ali Banayeeanzade, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah.
This paper investigates how the CLIP model encodes and processes scenes containing multiple objects. We conduct a comprehensive analysis of CLIP’s internal representations, focusing on its capacity to distinguish, localize, and semantically relate multiple entities within complex visual inputs. Through controlled experiments and probing techniques, we reveal the model’s strengths and limitations in multi-object understanding and compositional reasoning. Our findings provide insights for improving multimodal models and guiding future benchmarks for compositional vision-language tasks.
Authors: Reza Abbasi, Ali Nazari, Aminreza Sefid, Mohammad Ali Banayeeanzade, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah.
In this initial study, we evaluate CLIP’s behavior in multi-object scenarios using a high-resolution synthetic dataset designed for controlled experimentation. Our results show that CLIP often struggles to distinguish and localize multiple entities in a scene. Although primarily empirical, these findings raised important questions about CLIP’s compositional limitations and motivated the in-depth analysis we subsequently presented at CVPR 2025.