Zero-shot object navigation (ZSON) in large-scale outdoor environments faces many challenges; we specifically address a coupled pair: long-range targets that reduce to tiny image projections, and intermittent visibility caused by partial or complete occlusion. We present a unified, lightweight closed-loop system built on an aligned multi-scale image tile hierarchy. Through hierarchical target–saliency fusion, it summarizes localized semantic contrast into a stable coarse-layer regional saliency that provides the target direction and indicates target visibility. This regional saliency supports visibility-aware heading maintenance through keyframe memory, saliency-weighted fusion of historical headings, and active search during temporary invisibility. The system avoids whole-image rescaling, enables deterministic bottom-up aggregation, supports zero-shot navigation, and runs efficiently on a mobile robot. Across simulated and real-world outdoor trials, the system detects semantic targets beyond 150 m, maintains a correct heading through visibility changes with 82.6% probability, and improves overall task success by 17.5% compared with SOTA methods, demonstrating robust ZSON toward distant and intermittently observable targets.
System overview: To tackle the unified challenge of outdoor ZSON—pursuing distant targets that appear as tiny projections while remaining robust to dynamic visibility changes—we design a lightweight perception–navigation system that integrates multi-scale semantic amplification with saliency-driven heading maintenance.
As illustrated in the figure, the system forms a closed perception–decision loop. The perception module analyzes RGB observations through a multi-scale tile pyramid, amplifying weak semantic cues from tiny, long-range targets and estimating both target direction and visibility status. These outputs are consumed by the navigation module, which adapts robot behavior accordingly: when the target is visible, the estimated direction directly guides frontier selection and forward progress; when the target becomes partially visible or fully occluded, the system leverages stored keyframes to sustain orientation, while triggering active search or fallback strategies for reliable re-identification.
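To make the bottom-up aggregation concrete, the short Python sketch below pools per-tile semantic scores from the finest layer up an aligned hierarchy; the 2x2 alignment factor, the max-pooling rule, and the function name are illustrative assumptions rather than the exact fusion used in the system.

import numpy as np

def aggregate_saliency(fine_scores, levels=3):
    # fine_scores: (H, W) per-tile semantic scores at the finest layer.
    # Each coarser layer pools its aligned 2x2 children directly, so no
    # whole-image rescaling is needed and the aggregation is deterministic.
    pyramid = [np.asarray(fine_scores, dtype=float)]
    for _ in range(levels - 1):
        s = pyramid[-1]
        h, w = (s.shape[0] // 2) * 2, (s.shape[1] // 2) * 2
        blocks = s[:h, :w].reshape(h // 2, 2, w // 2, 2)
        pyramid.append(blocks.max(axis=(1, 3)))   # assumed pooling rule
    return pyramid                                # fine-to-coarse score maps

# Example: a weak, localized response at the finest layer survives as a
# coarse-layer regional saliency peak.
fine = np.zeros((8, 8))
fine[2, 5] = 0.7
print(aggregate_saliency(fine)[-1])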
Together, the two modules close the loop for outdoor ZSON: the aligned multi-scale tile hierarchy makes regional semantic saliency reliable for far-field perception, and the same saliency sustains the heading as visibility varies over time.
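A minimal sketch of the resulting visibility-aware decision step is shown below; the saliency threshold, the circular-mean fusion of keyframe headings, and the frontier-scoring rules are assumptions chosen for illustration, not the released implementation.

import math

VIS_THRESH = 0.6  # assumed saliency level above which the target counts as visible

def fuse_headings(keyframes):
    # Saliency-weighted circular mean of stored headings (radians).
    sx = sum(w * math.cos(h) for h, w in keyframes)
    sy = sum(w * math.sin(h) for h, w in keyframes)
    return math.atan2(sy, sx)

def angular_gap(a, b):
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)

def decision_step(saliency, direction, frontiers, keyframes):
    # frontiers: list of dicts with 'bearing' (rad) and 'openness' in [0, 1].
    if saliency >= VIS_THRESH and direction is not None:
        # Target visible: record a keyframe and steer straight toward it.
        keyframes.append((direction, saliency))
        goal = direction
    elif keyframes:
        # Partially visible or occluded: rely on saliency-weighted history.
        goal = fuse_headings(keyframes)
    else:
        # No usable memory yet: active search toward the most open frontier.
        return max(frontiers, key=lambda f: f["openness"])
    return min(frontiers, key=lambda f: angular_gap(f["bearing"], goal))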
To measure detection accuracy and failure cases with a single number, we use a penalized angular error metric. This metric computes the average angular difference between the estimated and ground-truth target directions, and assigns a maximum penalty of π radians whenever the target is not detected. The radar chart illustrates directional perception accuracy across distances from 10 m to 150 m, where smaller values indicate better performance. At close ranges (10–25 m), most methods achieve comparable accuracy. Beyond 50 m, however, the baselines drop sharply due to target shrinkage and unstable visibility: box- and mask-based models (GroundingDINO, Mobile-SAM) fail to provide reliable direction estimates, RADIO produces blurred heatmaps, YOLO-World shows moderate stability but reduced recall, and DyFo's focus tree struggles with tiny or occluded targets. In contrast, our method sustains consistently lower errors, clearly outperforming all baselines up to 150 m by leveraging multi-scale tiles and saliency-driven scoring for robust long-range perception.
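For reference, the penalized angular error can be computed as in the sketch below; the wrap-around handling of angles is the only detail assumed beyond the stated π-radian penalty for missed detections.

import math

def penalized_angular_error(estimates, ground_truth):
    # estimates: per-frame estimated target directions in radians, or None
    # when the target is not detected; ground_truth: matching directions.
    errors = []
    for est, gt in zip(estimates, ground_truth):
        if est is None:
            errors.append(math.pi)                  # maximum penalty for a miss
        else:
            d = abs(est - gt) % (2 * math.pi)
            errors.append(min(d, 2 * math.pi - d))  # wrapped angular difference
    return sum(errors) / len(errors)

# Example: a perfect estimate, a 30-degree error, and one missed detection
print(penalized_angular_error([0.0, math.radians(30), None],
                              [0.0, 0.0, math.radians(10)]))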
The table compares different methods on three metrics—Recovery Success Rate (RSR), Recovery Path Length (RPL), and overall Success Rate (SR)—across short, long, and mixed invisibility scenarios. Our method consistently achieves the highest recovery success rates, especially in long and mixed occlusions where conventional approaches fail to maintain heading. While the look-around behavior introduces slight path redundancy, the recovery path length remains reasonable, resulting in a strong balance between robustness and efficiency. These results highlight our method’s ability to reliably sustain navigation even under extended target invisibility.
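For completeness, the three metrics can be computed from per-trial logs as in the sketch below, under assumed definitions: RSR as the fraction of invisibility episodes from which the target is re-acquired, RPL as the average path length traveled during successful recoveries, and SR as the fraction of trials that reach the goal. These readings are inferred from the metric names and may differ in detail from the evaluation protocol.

def compute_metrics(trials):
    # trials: list of dicts with 'reached' (bool) and 'recoveries', a list of
    # dicts with 'success' (bool) and 'path_length' (metres).
    recoveries = [r for t in trials for r in t["recoveries"]]
    rsr = sum(r["success"] for r in recoveries) / max(len(recoveries), 1)
    recovered = [r["path_length"] for r in recoveries if r["success"]]
    rpl = sum(recovered) / max(len(recovered), 1)
    sr = sum(t["reached"] for t in trials) / max(len(trials), 1)
    return {"RSR": rsr, "RPL": rpl, "SR": sr}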
The top row shows third-person views as the robot progresses toward the target, which transitions from fully visible to partially visible and then fully occluded. The second row depicts egocentric observations, where the multi-scale tile pyramid assigns semantic scores: green boxes mark confident target perception, while orange boxes indicate low-scoring background tiles. The third row visualizes decision-making in RViz: blue dots denote frontier candidates, arrows provide directional guidance (green for current target direction, red for recorded keyframe, blue for fused direction under occlusion), and the selected frontier is shown as a green dot. Together, these visualizations highlight how the system maintains orientation and exploration progress across visibility transitions.
(a) The navigation trajectory projected onto a satellite map, showing the robot’s path from the starting point (red circle) to the target (red star), along with the projected LiDAR map. (b) The mapping result during navigation, with red rectangles and arrows marking the robot’s observations along the route. (c–e) Examples of target visibility states: fully visible (red bounding box), fully occluded, and partially visible. (f) The ground robot platform used in the experiment. Together, these results demonstrate that the system can sustain stable heading and perception across varying visibility conditions, ultimately achieving successful long-range navigation in a real-world setting.