The traditional sparse filtering network (SFN) lacks the ability to extract multi-scale features, so it fails to capture adequate fault information. To address this problem, we propose a multi-scale sparse filtering network (MSSFN) consisting of five layers. The multi-scale coarse-grained layer obtains multi-scale signals. The sample segmentation layer divides each sample at each scale into several segments. The local feature extraction layer calculates the feature vector of each segment. The feature averaging layer averages the feature vectors of all segments to form the feature representation of the input signal at that scale. The feature stacking layer stacks the feature representations at all scales into one long vector, which serves as the final feature vector of the input signal. Three gear datasets are collected to validate the effectiveness of the method. Visualization and clustering results show that the MSSFN learns more discriminative features from gear vibration signals than the SFN does. A softmax classifier is used to classify the features extracted by these two networks and by three traditional multi-scale approaches, and the results show that the MSSFN achieves the highest recognition accuracy for each type of gear fault. In addition, the proposed MSSFN achieves highly competitive diagnosis results in comparison with two other types of multi-scale networks under different architectures. The proposed MSSFN can be widely applied to the feature extraction stage of machinery fault diagnosis, where it can automatically discover useful fault information from massive unlabelled samples.
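The five-layer data flow described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: coarse-graining is taken to be the mean over non-overlapping windows, segments are non-overlapping, the local features use a soft-absolute activation with per-segment L2 normalisation, and the weight matrices are random placeholders standing in for weights that would be learned by sparse filtering. All function and parameter names (`coarse_grain`, `segment`, `local_features`, `mssfn_features`, `seg_len`, `weights_per_scale`) are hypothetical.

```python
import numpy as np

def coarse_grain(signal, scale):
    """Average consecutive non-overlapping windows of length `scale` (assumed scheme)."""
    n = len(signal) // scale
    return signal[: n * scale].reshape(n, scale).mean(axis=1)

def segment(signal, seg_len):
    """Split a signal into non-overlapping segments of length `seg_len` (assumed scheme)."""
    n = len(signal) // seg_len
    return signal[: n * seg_len].reshape(n, seg_len)

def local_features(segments, weights, eps=1e-8):
    """Soft-absolute activation of a linear map, L2-normalised per segment (assumed form)."""
    feats = np.sqrt((segments @ weights.T) ** 2 + eps)
    return feats / np.linalg.norm(feats, axis=1, keepdims=True)

def mssfn_features(signal, weights_per_scale, seg_len):
    """Run the five stages and return one stacked feature vector."""
    per_scale = []
    for scale, weights in weights_per_scale.items():
        coarse = coarse_grain(signal, scale)    # multi-scale coarse-grained layer
        segs = segment(coarse, seg_len)         # sample segmentation layer
        feats = local_features(segs, weights)   # local feature extraction layer
        per_scale.append(feats.mean(axis=0))    # feature averaging layer
    return np.concatenate(per_scale)            # feature stacking layer

# Synthetic stand-in for a vibration sample; real weights would be trained, not random.
rng = np.random.default_rng(0)
signal = rng.standard_normal(2400)
n_features, seg_len = 20, 100
weights_per_scale = {s: rng.standard_normal((n_features, seg_len)) for s in (1, 2, 3)}

feature_vector = mssfn_features(signal, weights_per_scale, seg_len)
print(feature_vector.shape)  # (60,): 3 scales x 20 features per scale
```

In this sketch each scale contributes a fixed-length vector regardless of how many segments the coarse-grained signal yields, which is what lets the stacking step concatenate the scales into a single input for a downstream classifier.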