Mohammad Ali Banayeeanzade


Dedicated undergraduate at Sharif University of Technology, exploring the intersection of Vision Language Models, Generative Models, and Compositional Reasoning.


About Me

Research Interests

Vision Language Models
Generative Models
Compositional Reasoning
AI Interpretability

I am a dedicated undergraduate student in Computer Engineering at Sharif University of Technology, with a parallel academic pursuit in Economics. My primary research interests lie at the intersection of Vision Language Models, Generative Models, and Compositional Reasoning.

With a strong academic record (GPA: 19.36/20 in Computer Engineering and 20.0/20 in Economics), I have actively contributed to the research community through multiple co-authored publications in leading venues such as CVPR, TMLR, and ECCV.

My work explores fundamental limitations in current AI systems and proposes novel, practical solutions. I thrive in research-oriented environments and aim to continue advancing AI models that are more generalizable, interpretable, and capable of deep reasoning.

Education

Sharif University of Technology

BS in Computer Engineering

Sept 2021 – Present
GPA: 19.36/20.0

Key Coursework:

Engineering Probability & Statistics, Linear Algebra, Artificial Intelligence, Deep Learning, Algorithm Design, Security & Privacy in ML, Deep Reinforcement Learning, Large Language Models

Sharif University of Technology

BS Minor in Economics

Sept 2023 – Present
GPA: 20.0/20.0

Key Coursework:

Game Theory, Econometrics

Awards & Honors

National Excellence

Iranian Matriculation Exams

Ranked 114th out of 200,000+ participants

ML Challenge Winner

NLP and Vision Competition

1st Place among 100+ participants

AI Olympiad Mentor

Student Guidance & Leadership

AI World Olympiad Preparation

Achievement Highlights

National Ranking: 114th
ML Challenge: 1st Place
AI World Olympiad: Mentor
h-index: 3

Research & Publications

Research Focus

My research explores fundamental limitations in current AI systems and proposes novel, practical solutions. I focus on advancing AI models that are more generalizable, interpretable, and capable of deep reasoning, with particular emphasis on vision-language understanding and compositional reasoning.

Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs

Under review at NeurIPS 2025

Authors: Amir Mohammad Izadi*, Mohammad Ali Banayeeanzade*, Fatemeh Askari, Ali Rahimiakbar, Mohammad Mahdi Vahedi, Hosein Hasani, Mahdieh Soleymani Baghshah.

Proposed a lightweight, model-agnostic method that improves vision-language models' visual reasoning by overlaying low-level spatial structures (e.g., horizontal lines) on the input image and pairing them with guided prompts. Demonstrated substantial gains on tasks such as visual search, counting, and spatial reasoning without retraining or added computation. Outperformed standard Chain-of-Thought prompting and matched fine-tuned models on several benchmarks.

Attention Overlap Is Responsible for The Entity Missing Problem in Text-to-Image Diffusion Models!

TMLR 2025

Authors: Arash Mari Oriyad, Mohammad Ali Banayeeanzade, Reza Abbasi, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah.

We investigate the root cause of the entity missing problem in text-to-image diffusion models, where certain objects described in prompts fail to appear in the generated images. Through detailed analysis, we demonstrate that overlapping attention maps between entities suppress their independent representation, leading to object omission. To address this, we propose simple attention separation techniques that reduce attention overlap and significantly improve entity inclusion rates across various diffusion models and datasets, without compromising image quality.

CLIP Under the Microscope: A Fine-Grained Analysis of Multi-Object Representation

CVPR 2025

Authors: Reza Abbasi, Ali Nazari, Aminreza Sefid, Mohammad Ali Banayeeanzade, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah.

This paper investigates how the CLIP model encodes and processes scenes containing multiple objects. We conduct a comprehensive analysis of CLIP’s internal representations, focusing on its capacity to distinguish, localize, and semantically relate multiple entities within complex visual inputs. Through controlled experiments and probing techniques, we reveal the model’s strengths and limitations in multi-object understanding and compositional reasoning. Our findings provide insights for improving multimodal models and guiding future benchmarks for compositional vision-language tasks.

Analyzing CLIP’s Performance Limitations in Multi-Object Scenarios: A Controlled High-Resolution Study

ECCV 2024 Workshop (Eval of FOMO)

Authors: Reza Abbasi, Ali Nazari, Aminreza Sefid, Mohammad Ali Banayeeanzade, Mohammad Hossein Rohban, Mahdieh Soleymani Baghshah.

In this initial study, we evaluate CLIP's behavior in multi-object scenarios using a high-resolution synthetic dataset designed for controlled experimentation. Our results show that CLIP often struggles to distinguish and localize multiple entities in a scene. Although primarily empirical, the study raised important questions about CLIP's compositional limitations and motivated the subsequent in-depth analysis presented at CVPR 2025.