Abstract: Visual scene understanding encompasses detecting and recognizing objects, reasoning about the visual relationships among the detected objects, and describing image regions with sentences. To achieve a more comprehensive and accurate understanding of a scene image, we treat object detection, visual relationship detection, and image captioning as three visual tasks at different semantic levels of scene understanding, and propose an image understanding model based on multi-level semantic features that leverages the mutual connections across the three semantic levels to solve these tasks jointly. The model simultaneously iterates and updates the semantic features of objects, relationship phrases, and image captions through a message passing graph. The updated semantic features are used to classify objects and visual relationships and to generate scene graphs and captions, and a fusion attention mechanism is introduced to improve caption accuracy. Experimental results on the Visual Genome and COCO datasets show that the proposed method outperforms existing methods on both scene graph generation and image captioning.
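The abstract does not give the model's equations, so the following is only a rough sketch of the kind of cross-level message passing described: object, relationship-phrase, and caption features iteratively refine one another. The feature dimensions, the dot-product attention, and the residual update rule are all assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (hypothetical; the paper's actual dimensions are not stated in the abstract).
d = 8              # shared feature dimension for all three semantic levels
n_obj, n_rel = 4, 3

# Semantic features at three levels: objects, relationship phrases, caption context.
obj = rng.normal(size=(n_obj, d))
rel = rng.normal(size=(n_rel, d))
cap = rng.normal(size=(1, d))

def attend(queries, keys):
    """Simple dot-product attention: each query aggregates messages from keys."""
    scores = queries @ keys.T                              # (nq, nk)
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ keys                                  # (nq, d)

# A few iterations of message passing: each semantic level is updated with
# attention-pooled messages from the other two levels plus a residual term.
for _ in range(3):
    obj_new = obj + attend(obj, rel) + attend(obj, cap)
    rel_new = rel + attend(rel, obj) + attend(rel, cap)
    cap_new = cap + attend(cap, obj) + attend(cap, rel)
    obj, rel, cap = obj_new, rel_new, cap_new

# The refined features would then feed object/relationship classifiers,
# the scene graph, and the caption decoder.
print(obj.shape, rel.shape, cap.shape)
```

In a real implementation the messages would be produced by learned projections rather than raw dot-product attention, but the pattern of mutual updates across the three semantic levels is the same.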