Abstract: The path model of Extensible Markup Language (XML) documents is extended by adding the frequency of each path.
Based on this frequency-path model, a similarity calculation algorithm with position weights, the weighted longest common subsequence (WLCS), is proposed, together with a new method of constructing a vector that represents the structure of an XML document. Experiments on a real data set show that WLCS is suitable for comparing the similarity of XML files from different DTDs, and that its recall ratio and accuracy are higher than those of existing similarity calculation methods.
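
As an illustration only, the two ideas named above can be sketched in Python: a frequency-path model that counts how often each root-to-element tag path occurs, and a longest-common-subsequence score in which every matched tag contributes a position-dependent weight. The abstract does not give the weighting formula, so the function `position_weight` (matches nearer the root count more), the normalization, and all function names below are assumptions made for this sketch, not the method evaluated in the paper.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def frequency_paths(xml_text):
    """Map each root-to-element tag path (a tuple of tags) to its frequency."""
    counts = Counter()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        counts[path] += 1
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return counts


def position_weight(i, j):
    # ASSUMPTION: a match at 0-based positions i and j is worth less the
    # deeper it lies; the paper's actual weight is not stated in the abstract.
    return 1.0 / (1 + max(i, j))


def wlcs(a, b):
    """Position-weighted LCS score of two tag sequences, scaled to [0, 1]."""
    m, n = len(a), len(b)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Either skip a tag from one sequence ...
            best = max(dp[i - 1][j], dp[i][j - 1])
            # ... or, if the tags agree, extend the common subsequence.
            if a[i - 1] == b[j - 1]:
                best = max(best, dp[i - 1][j - 1] + position_weight(i - 1, j - 1))
            dp[i][j] = best
    # ASSUMPTION: normalize by the score the longer sequence gets against itself.
    full = sum(position_weight(k, k) for k in range(max(m, n)))
    return dp[m][n] / full if full else 0.0
```

Under these assumptions, `frequency_paths` yields the path frequencies of one document, and `wlcs` can be applied to pairs of paths taken from two documents, so that paths sharing tags near the root score higher than paths that agree only near the leaves.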