Abstract: Visual features play a crucial role in zero-shot image classification. Although deep features extracted by networks such as VGG, GoogLeNet, and ResNet are widely used in image classification, their performance in zero-shot image classification is not ideal. In addition, because the training and test classes are disjoint in the zero-shot learning setting, the classification network inevitably suffers from domain shift. Therefore, a transductive zero-shot image classification framework based on self-supervised enhanced features is proposed. The main idea is as follows: first, pseudo-labels are constructed via an auxiliary task, self-supervised features of the images are obtained through self-supervised learning, and these are fused with the unsupervised deep features; then, the fused features are embedded in the semantic space for zero-shot image classification, yielding initial predicted labels for the unseen classes; finally, the features and predicted labels of the unseen classes are used to iteratively optimize the visual-semantic mapping. The components of the framework are interchangeable; in this work, the self-supervised network, backbone network, and dimensionality-reduction method are CFN, VGG16, and PCA, respectively. Experiments on the CUB, SUN, and AwA2 datasets show that the proposed framework enhances the discriminative capability of the features and performs well on zero-shot image classification tasks.
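
The three stages outlined in the abstract (feature fusion, embedding into the semantic space, and iterative transductive refinement) can be illustrated with a minimal NumPy sketch. It rests on several assumptions that the abstract does not confirm: the CFN and VGG16 features are taken as precomputed arrays, fusion is done by L2-normalizing and concatenating, the visual-semantic mapping is a ridge regression, and labels are assigned by cosine similarity to class attribute vectors. All function and parameter names are hypothetical.

```python
import numpy as np


def fuse_features(deep_feats, ssl_feats):
    """Assumed fusion: L2-normalize each feature block, then concatenate."""
    def l2(x):
        return x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)
    return np.hstack([l2(deep_feats), l2(ssl_feats)])


def pca_fit(x, n_components):
    """Fit PCA via SVD; returns the mean and a (dim, n_components) basis."""
    mean = x.mean(axis=0)
    _, _, vt = np.linalg.svd(x - mean, full_matrices=False)
    return mean, vt[:n_components].T


def pca_reduce(x, mean, components):
    """Project centered features onto the PCA basis."""
    return (x - mean) @ components


def fit_mapping(x, s, reg=1.0):
    """Assumed mapping: ridge regression W = (X^T X + reg*I)^-1 X^T S."""
    d = x.shape[1]
    return np.linalg.solve(x.T @ x + reg * np.eye(d), x.T @ s)


def predict(x, w, class_attrs):
    """Label each sample with the class whose attributes are most cosine-similar."""
    proj = x @ w
    proj = proj / (np.linalg.norm(proj, axis=1, keepdims=True) + 1e-12)
    attrs = class_attrs / (np.linalg.norm(class_attrs, axis=1, keepdims=True) + 1e-12)
    return np.argmax(proj @ attrs.T, axis=1)


def transductive_zsl(x_seen, y_seen, attrs_seen, x_unseen, attrs_unseen,
                     n_iters=5, reg=1.0):
    """Fit on seen classes, then refit using pseudo-labeled unseen samples."""
    s_seen = attrs_seen[y_seen]
    w = fit_mapping(x_seen, s_seen, reg)
    pseudo = predict(x_unseen, w, attrs_unseen)  # initial predicted labels
    for _ in range(n_iters):
        x_all = np.vstack([x_seen, x_unseen])
        s_all = np.vstack([s_seen, attrs_unseen[pseudo]])
        w = fit_mapping(x_all, s_all, reg)
        new_pseudo = predict(x_unseen, w, attrs_unseen)
        if np.array_equal(new_pseudo, pseudo):  # labels stopped changing
            break
        pseudo = new_pseudo
    return pseudo, w
```

In this sketch the unseen-class samples influence the mapping only through their current pseudo-labels, which is one simple way to realize the transductive step; the refinement rule actually used in the paper may differ.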