With the advent of the big data era, the volume of sampled data and the dimensionality of data features are growing rapidly. It is highly desirable to enable fast and efficient clustering of unlabeled samples based on feature similarities. As a fundamental primitive for data clustering, the k-means operation is receiving increasing attention. To achieve high-performance k-means computations on modern multi-core/many-core systems, we propose a matrix-based fused framework that achieves high performance by conducting computations on a distance matrix and, at the same time, improves memory reuse by fusing the distance-matrix computation with the nearest-centroid reduction. We implement and optimize the parallel k-means algorithm on the SW26010 many-core processor, which is the main computing engine of Sunway TaihuLight. In particular, we design a task mapping strategy for load-balanced task distribution, a data sharing scheme to reduce the memory footprint, and a register blocking strategy to increase data locality. Optimization techniques such as instruction reordering and double buffering are further applied to improve the sustained performance. Discussions on block-size tuning and performance modeling are also presented. We show by experiments on both randomly generated and real-world datasets that our parallel implementation of k-means on SW26010 can sustain a double-precision performance of over 348.1 Gflops, which is 46.9% of the peak performance and 84% of the theoretical performance upper bound on a single core group, and can achieve nearly ideal scalability to the whole SW26010 processor of four core groups. Performance comparisons with the previous state of the art on both CPU and GPU are also provided to show the superiority of our optimized k-means kernel.
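
To make the idea of fusing the distance-matrix computation with the nearest-centroid reduction concrete, the following is a minimal sketch in plain Python, not the SW26010 kernel itself. It assumes the common expansion ||x - c||^2 = ||x||^2 - 2 x·c + ||c||^2, in which the ||x||^2 term can be dropped because it does not change which centroid is nearest. The function name `assign_fused`, the `block` parameter, and the toy data are hypothetical and chosen only for illustration.

```python
def assign_fused(samples, centroids, block=2):
    """Return the index of the nearest centroid for each sample.

    Distances are computed tile by tile and reduced immediately, so the
    full n-by-k distance matrix is never stored.
    """
    n, k = len(samples), len(centroids)
    # ||c||^2 for each centroid, computed once and reused by every tile.
    c_norm = [sum(v * v for v in c) for c in centroids]
    best_dist = [float("inf")] * n
    best_idx = [-1] * n
    for i0 in range(0, n, block):          # tile over samples
        for j0 in range(0, k, block):      # tile over centroids
            for i in range(i0, min(i0 + block, n)):
                x = samples[i]
                for j in range(j0, min(j0 + block, k)):
                    # ||x - c||^2 minus the constant ||x||^2 term, which
                    # does not affect the argmin.
                    dot = sum(a * b for a, b in zip(x, centroids[j]))
                    d = c_norm[j] - 2.0 * dot
                    # Fused reduction: update the running minimum now
                    # instead of writing d to a distance matrix.
                    if d < best_dist[i]:
                        best_dist[i] = d
                        best_idx[i] = j
    return best_idx


# Hypothetical toy data: four 2-D samples and three centroids.
samples = [[0.0, 0.0], [0.9, 1.1], [5.0, 5.0], [4.8, 5.3]]
centroids = [[0.0, 0.0], [5.0, 5.0], [1.0, 1.0]]
labels = assign_fused(samples, centroids)
```

In this sketch each tile of partial distances is consumed as soon as it is produced, which is the memory-reuse benefit the fusion is meant to provide. In an optimized kernel the inner dot products would be a blocked matrix multiplication, and the tile size would be tuned to the available registers and local memory.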