A Semi-Structured Document Model for Text Mining

(Whole-issue priority publication) Online publication date: 2002-05-15
A semi-structured document has more structured information compared to an ordinary document, and the relation among semi-structured documents can be fully utilized. In order to take advantage of the structure and link information in a semi-structured document for better mining, a structured link vector model (SLVM) is presented in this paper, where a vector represents a document, and vectors' elements are determined by terms, document structure and neighboring documents. Text mining based on SLVM is described in the procedure of K-means for briefness and clarity: calculating document similarity and calculating cluster center. The clustering based on SLVM performs significantly better than that based on a conventional vector space model in the experiments, and its F value increases from 0.65-0.73 to 0.82-0.86.
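The two K-means steps named above (document similarity and cluster center) can be illustrated with a minimal sketch. The abstract does not give the SLVM formulas, so everything here is an assumption made for illustration: the function names, the per-field weights (`field_weights`), the neighbor weight (`link_weight`), the averaging of neighbor contributions, and the use of cosine similarity are all hypothetical and are not the paper's definitions.

```python
import math
from collections import Counter


def term_vector(text, vocab):
    # Raw term counts over a fixed vocabulary (whitespace tokenization).
    counts = Counter(text.lower().split())
    return [float(counts[t]) for t in vocab]


def own_vector(doc, vocab, field_weights):
    # Assumption: each structural field (e.g. title, body) contributes
    # its term counts scaled by a hypothetical per-field weight.
    vec = [0.0] * len(vocab)
    for name, text in doc["fields"].items():
        w = field_weights.get(name, 1.0)
        for i, x in enumerate(term_vector(text, vocab)):
            vec[i] += w * x
    return vec


def slvm_vector(doc, docs, vocab, field_weights, link_weight):
    # Assumption: linked (neighboring) documents add the average of
    # their own vectors, scaled by a hypothetical link_weight.
    vec = own_vector(doc, vocab, field_weights)
    neighbors = [docs[j] for j in doc["links"] if j in docs]
    for n in neighbors:
        for i, x in enumerate(own_vector(n, vocab, field_weights)):
            vec[i] += link_weight * x / len(neighbors)
    return vec


def cosine(u, v):
    # Step 1 of K-means: similarity between two document vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def centroid(vectors):
    # Step 2 of K-means: cluster center as the component-wise mean.
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


vocab = ["mining", "text", "sports"]
docs = {
    "a": {"fields": {"title": "text mining", "body": "mining"}, "links": ["b"]},
    "b": {"fields": {"title": "sports", "body": ""}, "links": []},
}
weights = {"title": 2.0, "body": 1.0}
vec_a = slvm_vector(docs["a"], docs, vocab, weights, link_weight=0.5)
vec_b = slvm_vector(docs["b"], docs, vocab, weights, link_weight=0.5)
```

In this toy example, document "a" picks up a nonzero "sports" component only through its link to "b", which is the kind of effect a link-aware vector is meant to capture. A full K-means loop would alternate between assigning each vector to its most similar center with `cosine` and recomputing the centers with `centroid`.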