Abstract: This paper combines an improved Inception-ResNet-V2 (IRV2) network with a Local-Global-Local (LGL) module to design SiamLGL (Siamese Local-Global-Local Network), a Siamese object-tracking network built on a CNN and a Transformer encoder structure. First, because the improved IRV2 network is deep, the image features it extracts are better than those extracted by a shallow network. The information in the feature maps is then fused through depth-wise cross-correlation. Second, the fused feature map is passed to the LGL module to obtain both global and local information about the object. The module connects two encoder layers in series: the first uses depthwise separable convolution to capture local information about the object, and the second uses self-attention to capture global features of the image. To reduce the time complexity of self-attention, the attention is computed with a sparse attention approach, which lowers the cost while maintaining the accuracy of the network. Finally, the feature map is fed to the classification and regression networks to generate the object location. The classification network uses the binary cross-entropy loss, and the regression network uses Distance-IoU (DIoU) as its loss function. The algorithm is evaluated on six public datasets: GOT-10k, LaSOT, TrackingNet, UAV123, OTB100, and VOT2019. The experimental results verify the effectiveness of the algorithm.
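
As an illustration of the regression loss named above, the following is a minimal sketch of the standard DIoU loss for a single pair of axis-aligned boxes, 1 - IoU + rho^2 / c^2, where rho is the distance between the box centers and c is the diagonal of the smallest enclosing box. The function name `diou_loss`, the `(x1, y1, x2, y2)` box format, and the `eps` stabilizer are assumptions made for this example; they are not taken from the SiamLGL implementation, which may differ in detail.

```python
def diou_loss(pred, target, eps=1e-9):
    """Distance-IoU loss for two boxes given as (x1, y1, x2, y2).

    Hypothetical helper for illustration only; box format and eps are
    assumptions, not details from the paper.
    """
    px1, py1, px2, py2 = pred
    tx1, ty1, tx2, ty2 = target

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(px2, tx2) - max(px1, tx1))
    inter_h = max(0.0, min(py2, ty2) - max(py1, ty1))
    inter = inter_w * inter_h

    # Union area and IoU.
    area_p = (px2 - px1) * (py2 - py1)
    area_t = (tx2 - tx1) * (ty2 - ty1)
    iou = inter / (area_p + area_t - inter + eps)

    # Squared distance between the two box centers (rho^2).
    rho2 = ((px1 + px2) / 2 - (tx1 + tx2) / 2) ** 2 \
         + ((py1 + py2) / 2 - (ty1 + ty2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both boxes (c^2).
    c2 = (max(px2, tx2) - min(px1, tx1)) ** 2 \
       + (max(py2, ty2) - min(py1, ty1)) ** 2

    return 1.0 - iou + rho2 / (c2 + eps)
```

Unlike a plain IoU loss, the center-distance term still yields a useful signal when the predicted and ground-truth boxes do not overlap, which is the usual motivation for choosing DIoU in box regression.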