Abstract: Multi-object tracking is a crucial technique in video surveillance. Over the past decade, convolutional neural networks (CNNs) and especially graph neural networks (GNNs) have driven great progress in multi-object tracking, with GNNs showing a significant advantage because they model the relationships between targets and trajectories. Most of these GNN models, however, build a global relationship model between targets and trajectories only across two neighboring frames, neglecting the interactions between an object and the other objects within the same frame. In response, we propose an intra-frame relationship modeling and self-attention fusion method for multi-object tracking. The method considers both intra- and inter-frame relationships and integrates the resulting features through a module based on a self-attention mechanism. To validate the effectiveness of the proposed method, we run various experiments on the MOTChallenge benchmark datasets. The experimental results show that our method outperforms existing GNN-based multi-object tracking methods by 1.9% in MOTA and 3.6% in IDF1.