(College of Computer Science and Engineering, Northeastern University, Shenyang 110169, China)
As a classical classification algorithm, the decision tree is widely used in medical data analysis because its classification rules are easy to understand. However, the class imbalance common in medical data degrades the classification performance of decision trees. Data resampling is a common remedy for class imbalance: it improves classification performance on minority-class samples by changing the sample distribution. Existing resampling methods, however, are often independent of the subsequent learning algorithm, so the sampled data may not be effective for constructing weak classifiers. Based on these observations, we propose a hybrid sampling algorithm based on C4.5. Specifically, the algorithm uses a C4.5-based evaluation criterion to control the iterative process of oversampling and undersampling. In addition, we dynamically update the oversampling ratio according to the imbalance ratio of the data, and finally combine multiple weak classifiers through a voting mechanism to produce the prediction. Comparison experiments on nine UCI datasets demonstrate the effectiveness of the proposed algorithm, which also achieves accurate predictions on missed abortion data.
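The abstract does not give the sampling or voting rules, so the following is only a minimal sketch of the general idea, under stated assumptions. The names `imbalance_ratio`, `hybrid_resample`, `majority_vote`, and the `step` parameter are hypothetical. Random duplication and random removal stand in for whatever oversampling and undersampling procedures the paper actually uses, and the C4.5-based evaluation criterion that controls the iteration is omitted.

```python
import random
from collections import Counter


def imbalance_ratio(labels, minority_label):
    """Return (#majority samples) / (#minority samples)."""
    n_min = sum(1 for y in labels if y == minority_label)
    if n_min == 0:
        raise ValueError("no minority samples present")
    return (len(labels) - n_min) / n_min


def hybrid_resample(samples, labels, minority_label, rng, step=0.5):
    """One round of hybrid sampling (illustrative rule, not the paper's).

    The amount of resampling is tied to the current class gap, so it
    shrinks as the data become more balanced. Half of the change comes
    from duplicating random minority samples (oversampling) and half
    from dropping random majority samples (undersampling).
    """
    minority = [(x, y) for x, y in zip(samples, labels) if y == minority_label]
    majority = [(x, y) for x, y in zip(samples, labels) if y != minority_label]
    if not minority:
        raise ValueError("no minority samples present")
    gap = len(majority) - len(minority)
    if gap <= 0:
        # Already balanced: return copies unchanged.
        return list(samples), list(labels)
    change = int(step * gap / 2)  # assumed update rule
    added = [rng.choice(minority) for _ in range(change)]
    kept = rng.sample(majority, len(majority) - change)
    pairs = minority + added + kept
    return [x for x, _ in pairs], [y for _, y in pairs]


def majority_vote(votes):
    """Combine the predictions of several weak classifiers for one sample."""
    return Counter(votes).most_common(1)[0][0]


# Toy data: 90 majority samples (label 0) and 10 minority samples (label 1).
samples = list(range(100))
labels = [0] * 90 + [1] * 10
rng = random.Random(0)
new_samples, new_labels = hybrid_resample(samples, labels, 1, rng)
```

In a full pipeline, a decision tree would be trained on each resampled set, resampling would be repeated while an evaluation score keeps improving, and `majority_vote` would be applied to the per-tree predictions for each test sample.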