Abstract: In recent years, graph convolutional networks (GCNs) have achieved outstanding performance in skeleton-based action recognition. Nevertheless, existing GCN-based methods are limited in their ability to model complex node correlations and make insufficient use of the complementary information between modalities. To address these issues, this paper proposes a multi-semantic dynamic GCN (MSD-GCN). The network adopts a dual-stream architecture that processes joint and bone modality data in parallel and fuses the two. Each stream consists of multiple MSD-GC operators and multiple multi-scale temporal convolution (MS-TC) operators, and the two streams are coupled by a joint-bone cross-modal contrastive learning (JB-CMCL) module. Specifically, the MSD-GC operator reconstructs partitions of high semantic granularity through a semantic-aware hierarchical graph (SH-Graph) and runs two modules in parallel: a cross-semantic space modeling (CSSM) module, which captures global joint correlations, and a local geometry modeling (LGM) module, which captures subtle motion features. The JB-CMCL module guides feature fusion and enhancement between the joint and bone modalities through cross-modal feature alignment and hard sample discrimination mechanisms, thereby improving the model’s fine-grained recognition capability. Extensive experiments are conducted on the NTU RGB+D, NTU RGB+D 120, and Northwestern-UCLA datasets. The results show that both the proposed components and the overall network perform well and recognize ambiguous actions effectively, and that the proposed model is competitive with state-of-the-art methods.
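To make the joint-bone dual-stream idea concrete, the following is a minimal sketch in plain Python rather than the paper's implementation. It rests on two assumptions not stated in the abstract: that a bone is the vector from a parent joint to its child joint (a common convention in two-stream skeleton models), and that cross-modal feature alignment can be approximated by a standard InfoNCE loss over a batch. The skeleton `TOY_PARENTS`, the function names, and the `temperature` value are all hypothetical, and the hard sample discrimination mechanism is not modeled.

```python
import math

# Hypothetical 5-joint toy skeleton as (child, parent) pairs; joint 0 is the root
# and is paired with itself so that its bone vector is zero.
TOY_PARENTS = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 0)]


def joints_to_bones(joints, parents=TOY_PARENTS):
    """Assumed bone definition: child joint coordinates minus parent joint coordinates."""
    bones = [[0.0] * len(joints[0]) for _ in joints]
    for child, parent in parents:
        bones[child] = [c - p for c, p in zip(joints[child], joints[parent])]
    return bones


def _normalize(vector):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def cross_modal_info_nce(joint_emb, bone_emb, temperature=0.1):
    """Standard InfoNCE used here as a stand-in for cross-modal alignment.

    The joint and bone embeddings of sample i form the positive pair;
    the bone embeddings of all other samples in the batch act as negatives.
    """
    joint_unit = [_normalize(v) for v in joint_emb]
    bone_unit = [_normalize(v) for v in bone_emb]
    total = 0.0
    for i, anchor in enumerate(joint_unit):
        # Cosine similarity between this joint embedding and every bone embedding.
        logits = [sum(a * b for a, b in zip(anchor, other)) / temperature
                  for other in bone_unit]
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(joint_unit)


if __name__ == "__main__":
    joints = [[0, 0, 0], [0, 1, 0], [1, 1, 0], [-1, 1, 0], [0, -1, 0]]
    print(joints_to_bones(joints))
    print(cross_modal_info_nce([[1, 0], [0, 1]], [[1, 0], [0, 1]]))
```

Under these assumptions, the loss is small when each sample's joint and bone embeddings point in the same direction and grows when a joint embedding is closer to another sample's bone embedding, which is the behavior a cross-modal alignment objective is meant to encourage.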