
Integrated Multimodal Analysis
Building Façade Perception Research
A comprehensive eye-tracking study examining how people perceive and describe building façades, integrating Weighted Voronoi tessellation, group attention heatmaps, AI-based AOI segmentation, and verbal description analysis.
Abstract
This study investigates visual perception of building façades through a multimodal approach combining eye-tracking, verbal descriptions, and similarity judgements. Fifty-two participants viewed line-drawing façade stimuli across three task conditions (Preview, Response, Compare). Analysis reveals that attention to architectural elements is highly disproportionate to their area: windows (7% area → 30% attention) and decorative features (12% area → 43% attention) are systematically over-attended, whilst walls (63% area → 29% attention) are under-attended. Task type significantly modulates gaze strategy, with Response tasks eliciting concentrated attention (TCI = 0.234) and Compare tasks producing distributed scanning (TCI = 0.180). Cross-modal analysis identifies nine significant correlations between gaze metrics and verbal features, suggesting that attention allocation directly shapes descriptive content.
Research Methodology
This study employed a within-subjects experimental design in which 52 participants viewed line-drawing façade stimuli across three distinct task conditions. Eye movements were recorded using Tobii Pro eye-tracking technology at 60 Hz, capturing fixation coordinates, durations, and saccade patterns throughout each trial.
The analytical pipeline integrates four complementary modalities: Time-Weighted Voronoi tessellation for individual attention mapping, group-level attention heatmaps for aggregate fixation density, AI-based AOI segmentation using GPT-4 Vision for architectural element classification, and verbal description analysis for linguistic feature extraction.
Preview
10 sec: Initial scanning of a 3×3 building grid to form first impressions
Response
35 sec: Detailed verbal description of individual building façade features
Compare
10 sec: Side-by-side comparison and similarity judgement of building pairs


Weighted Voronoi Tessellation
Time-Weighted Voronoi diagrams partition the visual field into regions proportional to fixation duration. Each fixation point generates a Voronoi cell whose size is inversely proportional to dwell time, yielding the Temporal Concentration Index (TCI) as a summary metric. Over 1,851 individual Voronoi images were generated across all participants and experiments.
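The partitioning idea can be illustrated with a small raster sketch. The source does not state the exact weighting scheme or the TCI formula, so this assumes a multiplicatively weighted variant in which distance is divided by fixation duration (longer fixations claim larger cells); inverting the weight would give the opposite behaviour. The function names and the sample fixations are hypothetical, and TCI is not computed here.

```python
import math

def weighted_voronoi_labels(fixations, width, height):
    """Label every pixel with the index of the fixation that owns it.

    fixations: list of (x, y, duration_ms) tuples.
    Assumption: multiplicative weighting, i.e. Euclidean distance divided by
    duration, so a longer fixation claims a larger cell.
    """
    labels = []
    for py in range(height):
        row = []
        for px in range(width):
            best_index, best_cost = -1, math.inf
            for i, (fx, fy, duration_ms) in enumerate(fixations):
                cost = math.hypot(px - fx, py - fy) / duration_ms
                if cost < best_cost:
                    best_index, best_cost = i, cost
            row.append(best_index)
        labels.append(row)
    return labels

def cell_areas(labels, n_fixations):
    """Count the pixels assigned to each fixation."""
    areas = [0] * n_fixations
    for row in labels:
        for index in row:
            areas[index] += 1
    return areas

# Hypothetical data: one long fixation (400 ms) and one short fixation (100 ms).
fixations = [(10, 10, 400), (30, 10, 100)]
labels = weighted_voronoi_labels(fixations, width=40, height=20)
areas = cell_areas(labels, len(fixations))
print(areas)
```

The brute-force loop is quadratic in image size and is only meant to make the weighting explicit; a production pipeline would likely vectorise it.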




Task-Type Comparison
Figure 5. Mean fixation count and TCI by task type. All differences significant at p < 0.001 (Kruskal-Wallis).
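For readers unfamiliar with the Kruskal-Wallis test named in the caption, the following is a minimal standard-library sketch of the H statistic with tie correction. It is not the study's analysis code. The p-value uses the chi-square approximation, which for three groups (two degrees of freedom) reduces to exp(-H/2). The fixation counts are made-up placeholders.

```python
import math

def kruskal_wallis(*groups):
    """Return (H, p) for exactly three groups, with tie correction."""
    if len(groups) != 3:
        raise ValueError("this sketch handles exactly three groups (df = 2)")
    pooled = sorted((value, g) for g, group in enumerate(groups) for value in group)
    n = len(pooled)
    rank_sums = [0.0] * len(groups)
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            rank_sums[pooled[k][1]] += average_rank
        t = j - i
        tie_term += t**3 - t
        i = j
    weighted = sum(r * r / len(g) for r, g in zip(rank_sums, groups))
    h = 12 / (n * (n + 1)) * weighted - 3 * (n + 1)
    h /= 1 - tie_term / (n**3 - n)  # tie correction (equals 1 when no ties)
    p = math.exp(-h / 2)  # chi-square survival function for df = 2
    return h, p

# Placeholder fixation counts per trial, not data from the study.
preview = [12, 15, 11]
response = [40, 38, 45]
compare = [20, 22, 19]
h_stat, p_value = kruskal_wallis(preview, response, compare)
print(h_stat, p_value)
```

With samples this small the chi-square approximation is rough; it is shown only to make the computation concrete.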


Group Attention Heatmaps
Group heatmaps aggregate fixation data across all participants to reveal collective attention patterns. A total of 403 group heatmaps were generated across three experiments. Each pixel's intensity represents cumulative dwell time, normalised across the participant pool, which makes it immediately visible which façade elements attract the most attention.
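The aggregation step can be sketched as follows. This is a simplified illustration, not the study's pipeline: it bins dwell time on a coarse grid and assumes that "normalised across the participant pool" means dividing by the number of participants. The function name, the cell size, and the sample data are hypothetical, and the smoothing typically applied to heatmaps is omitted.

```python
def group_heatmap(fixations_by_participant, width, height, cell=10):
    """Accumulate dwell time per grid cell and average over participants.

    fixations_by_participant: dict mapping participant id to a list of
    (x, y, duration_ms) tuples in stimulus pixel coordinates.
    """
    cols = (width + cell - 1) // cell
    rows = (height + cell - 1) // cell
    grid = [[0.0] * cols for _ in range(rows)]
    for fixations in fixations_by_participant.values():
        for x, y, duration_ms in fixations:
            if 0 <= x < width and 0 <= y < height:  # ignore off-stimulus gaze
                grid[int(y) // cell][int(x) // cell] += duration_ms
    n_participants = len(fixations_by_participant)
    return [[value / n_participants for value in row] for row in grid]

# Hypothetical fixations for two participants on a 100 x 50 stimulus.
data = {
    "P01": [(15, 5, 200), (15, 5, 100)],
    "P02": [(15, 5, 300), (95, 45, 50)],
}
heatmap = group_heatmap(data, width=100, height=50)
print(heatmap[0][1], heatmap[4][9])
```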

Experiment 1 — Response Tasks



Experiment 1 — Compare Task

Key Observations
01 — Central bias is consistently observed, with fixation density peaking at the geometric centre of each façade.
02 — Windows and decorative elements act as primary attention attractors, receiving disproportionate fixation relative to their spatial extent.
03 — Response tasks produce more concentrated heatmaps than Preview tasks, reflecting deeper engagement during verbal description.
AOI Segmentation Analysis
Area of Interest (AOI) analysis classifies each façade into architectural elements (walls, windows, roof, entrance, and decorative features) to quantify how attention is distributed across functional building components. The segmentation uses GPT-4 Vision for grid-based classification, producing an element map for each stimulus image. A total of 216 segmentation maps were generated.
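Once an element map exists, attributing gaze to elements is a grid lookup. The sketch below assumes the map is a 2D list of element labels with square cells of a known pixel size. That representation, the function name, and the sample values are assumptions made for illustration; the GPT-4 Vision classification step itself is not reproduced.

```python
def attention_by_element(label_grid, fixations, cell):
    """Return each element's share of total dwell time.

    label_grid: 2D list of element names, one per grid cell.
    fixations: list of (x, y, duration_ms) tuples in pixel coordinates.
    cell: side length of a grid cell in pixels.
    """
    totals = {}
    for x, y, duration_ms in fixations:
        row, col = int(y) // cell, int(x) // cell
        if 0 <= row < len(label_grid) and 0 <= col < len(label_grid[0]):
            element = label_grid[row][col]
            totals[element] = totals.get(element, 0.0) + duration_ms
    total = sum(totals.values())
    if total == 0:
        return {}
    return {element: dwell / total for element, dwell in totals.items()}

# Hypothetical 3 x 2 element map with 50-pixel cells.
label_grid = [
    ["roof", "roof"],
    ["window", "wall"],
    ["entrance", "wall"],
]
fixations = [(25, 75, 300), (75, 75, 100), (25, 125, 100)]
shares = attention_by_element(label_grid, fixations, cell=50)
print(shares)
```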




Attention Efficiency
Figure 14a. Attention distribution by architectural element.
Figure 14b. Area vs. attention scatter. Points above diagonal indicate over-representation.
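The over- and under-representation described above can be expressed as a simple ratio of attention share to area share. The percentages below are the ones reported in the abstract. The ratio itself is an illustrative summary and is not necessarily the efficiency metric used in the study.

```python
# (area %, attention %) as reported in the abstract.
reported = {
    "windows": (7, 30),
    "decorative": (12, 43),
    "walls": (63, 29),
}

# A ratio above 1 means the element draws more attention than its area
# alone would predict (a point above the diagonal in an area-vs-attention plot).
ratios = {
    element: attention / area
    for element, (area, attention) in reported.items()
}

for element, ratio in ratios.items():
    status = "over-attended" if ratio > 1 else "under-attended"
    print(f"{element}: {ratio:.2f} ({status})")
```

On these figures, windows attract roughly four times the attention their area would predict, whereas walls attract less than half.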




Verbal Description Analysis
Participants provided verbal descriptions during Response tasks. These were transcribed and analysed for architectural feature mentions. The average description contained 5.97 distinct features, with windows and storey count being the most frequently mentioned attributes across all experiments.
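Counting feature mentions in a transcript can be sketched with simple keyword matching. The source does not describe how mentions were coded, so the lexicon, the function name, and the sample transcript below are hypothetical, and whole-word matching is an assumption.

```python
import re

# Hypothetical keyword lexicon; the study's coding scheme is not given.
FEATURE_KEYWORDS = {
    "windows": ["window", "windows"],
    "storeys": ["storey", "storeys", "floor", "floors"],
    "roof": ["roof"],
    "entrance": ["door", "entrance"],
}

def count_features(transcript):
    """Return {feature: mention count} for features found in the transcript."""
    tokens = re.findall(r"[a-z]+", transcript.lower())
    counts = {}
    for feature, keywords in FEATURE_KEYWORDS.items():
        hits = sum(tokens.count(word) for word in keywords)
        if hits:
            counts[feature] = hits
    return counts

transcript = (
    "A two-storey building with three windows and a pitched roof. "
    "The windows are arched."
)
counts = count_features(transcript)
distinct_features = len(counts)
print(counts, distinct_features)
```

Averaging `distinct_features` over all descriptions would give a figure comparable in kind to the mean number of distinct features per description.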
Figure 19. Architectural feature mention frequency. Windows (3,047) and storeys (2,216) dominate.



Cross-Modal Integration
The cross-modal analysis examines relationships between visual attention (Voronoi metrics) and verbal descriptions (speech features). Nine statistically significant correlations were identified, suggesting that attention allocation directly influences which features participants choose to describe.
Figure 23. Attention distribution across elements by task type. Response tasks show stronger window focus.



Significant Correlations
| Speech Feature | Gaze Metric | r | p-value |
|---|---|---|---|
| Shape mentions | TCI | −0.76 | < 0.001 |
| Window mentions | Fixation count | −0.45 | 0.03 |
| Storey mentions | Mean duration | +0.52 | 0.01 |
| Decorative mentions | Gaze dispersion | +0.61 | 0.005 |
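A correlation coefficient of the kind tabulated above can be computed as follows. The source does not say whether r is Pearson or Spearman, so this sketch assumes Pearson. The paired values are placeholders chosen only to show a negative relationship and are not study data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

# Placeholder per-participant values: TCI and number of shape mentions.
tci = [0.10, 0.15, 0.20, 0.25, 0.30]
shape_mentions = [5, 4, 4, 2, 1]
r = pearson_r(tci, shape_mentions)
print(round(r, 2))
```

A negative r, as in the first table row, means that participants with higher TCI tended to mention shape less often.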
