明白了多标签分类的概念,我们怎么实现呢?其实说到底,多类分类,多标签分类,其实底层都是二分类组合。我们当然可以自己手动实现。但有没有已有的库来帮助我们实现呢?当然有!!!sklearn中已经有实现多标签分类的方法了,它的底层支持可以是Decision Trees, Random Forests以及Nearest Neighbors. 不过我们这次不使用我们优雅的sklearn,要跳到JAVA的坑去(好桑心)。JAVA中比较有名的多标签分类包是mulan以及meka,我们今天只讲meka,至于mulan的实现方法其实也跟meka很像,甚至在meka中都已经集成了mulan的jar包。
meka也是用arff文件作为输入。官方文档中有这么一句话:”A suitable dataset is the only requirement to begin running experiments with Meka.”,足见meka使用之方便~~
meka中已经自带了几个用于实现多标签分类的数据集在data目录下,这个数据集是描述某段音频的特征,以及它们属于哪些分类(是悲伤类的还是安静类的亦或是疯狂摇滚的…)我们来看看大概看看,每行%后面是我自己加的注释:
% 'Music' dataset; normalised version.
@relation 'Music: -C 6' %代表标签一共有6类
@attribute amazed-suprised {0,1} %第一个分类标签,取值是0或者1,代表是或者不是
@attribute happy-pleased {0,1} %第二个分类标签
@attribute relaxing-clam {0,1} %第三个分类标签
@attribute quiet-still {0,1} %第四个分类标签
@attribute sad-lonely {0,1} %第五个分类标签
@attribute angry-aggresive {0,1} %第六个分类标签
@attribute Mean_Acc1298_Mean_Mem40_Centroid numeric %第一个特征,numeric类型
@attribute Mean_Acc1298_Mean_Mem40_Rolloff numeric %第二个特征,numeric类型
@attribute Mean_Acc1298_Mean_Mem40_Flux numeric %第三个特征,numeric类型
@attribute Mean_Acc1298_Mean_Mem40_MFCC_0 numeric %第四个特征,numeric类型
@attribute Mean_Acc1298_Mean_Mem40_MFCC_1 numeric %第五个特征,numeric类型
@attribute Mean_Acc1298_Mean_Mem40_MFCC_2 numeric %第六个特征,numeric类型
@attribute Mean_Acc1298_Mean_Mem40_MFCC_3 numeric %第七个特征,numeric类型
@attribute Mean_Acc1298_Mean_Mem40_MFCC_4 numeric %第八个特征,numeric类型
@attribute Mean_Acc1298_Mean_Mem40_MFCC_5 numeric %第九个特征,numeric类型
....
@data
0,1,1,0,0,0,0.132498,0.077848,0.229227,0.602629,0.512861,0.467404,0.529733,0.573498,0.592831,0.520031,0.598853,0.537699,0.780658,0.462982,0.407108,0.684364,0.135824,0.245631,0.157515,0.301285,0.350107,0.459476,0.583274,0.430053,0.416198,0.581916,0.342758,0.309345,0.388929,0.323521,0.455207,0.26139,0.027559,0.149077,0.195433,0.571354,0.326404,0.246745,0.524645,0.354798,0.240244,0.239788,0.128689,0.173252,0.204863,0.131632,0.245653,0.144607,0.258203,0.470051,0.259909,0.61364,0.458314,0.434716,0.448941,0.370609,0.285647,0.663082,0.29708,0.273671,0.286411,0.197026,0.196244,0.164323,0.030017,0.253968,0.008473,0.240602,0.136735,0.058442,0.107594 %第一条数据
1,0,0,0,0,1,0.384281,0.355249,0.16719,0.853089,0.260577,0.332757,0.15393,0.519381,0.268043,0.251955,0.459922,0.430814,0.654323,0.641021,0.356511,0.647367,0.367659,0.539078,0.100569,0.133502,0.337194,0.319752,0.349012,0.171182,0.191357,0.390569,0.289253,0.208641,0.341328,0.265669,0.273736,0.181791,0.028513,0.252827,0.25819,0.011351,0.236247,0.069285,0.192754,0.154258,0.128671,0.116726,0.059704,0.073697,0.080341,0.062701,0.075672,0.041256,0.207782,0.300735,0.888274,0.444,0.294673,0.210429,0.132036,0.167474,0.205996,0.155514,0.086631,0.071462,0.067492,0.093526,0.085649,0.025101,0.182955,0.285714,0.156764,0.270677,0.191377,0.153728,0.197951 %第二条数据
与weka有点区别的是,对于标签在每个样本集中的位置,weka一般是放在每行末尾,虽然可以放在每行的任意位置,可以通过Instances.setClassIndex(n)方法,来指定每行的第n位是属于分类标签。
先来上代码吧: