书签分享收藏举报版权申诉 / 26

立即下载加入VIP,免费下载

当前位置：首页 > 人文社科 > Weka24 Apriori源代码分析1.docx

Weka24 Apriori源代码分析1.docx

文档编号：70063
上传时间：2023-04-28
格式：DOCX
页数：26
大小：24.51KB

《Weka24 Apriori源代码分析1.docx》由会员分享，可在线阅读，更多相关《Weka24 Apriori源代码分析1.docx（26页珍藏版）》请在冰点文库上搜索。

Weka24 Apriori源代码分析1.docx

Weka24Apriori源代码分析1

Weka[24]Apriori源代码分析

作者：

Koala++/屈伟

曾经卖过一个Apriori的程序，那个程序大约有50%的正确率（当然结果是正确的，是只

实现上很不一样），数据挖掘课上写了一个Apriori，一部分懒地按书上的算法，大约对了80%

（当然结果仍然是正确的），记得邱强有一次要用Apriori算法时说：

weka的太慢了，还好上

次数据挖掘课实现了一下，还挺快的，注意的一点是关联规则不属于机器学习，这里我不想

再分出来一个数据挖掘的组了。

从buildAssociations函数开始：

double[]confidences,supports;

int[]indices;

FastVector[]sortedRuleSet;

intnecSupport=0;

instances=newInstances（instances）;

if（m_removeMissingCols）{

instances=removeMissingColumns（instances）;

}

看一下removeMissingColumns，虽然它是如此的不重要：

protectedInstancesremoveMissingColumns（Instancesinstances）

throwsException{

intnumInstances=instances.numInstances（）;

StringBufferdeleteString=newStringBuffer（）;

intremoveCount=0;

booleanfirst=true;

intmaxCount=0;

for（inti=0;i

AttributeStatsas=instances.attributeStats（i）;

if（m_upperBoundMinSupport==1.0&&maxCount!

=numInstances）{

//seeifwecandecreasethisbylookingforthemostfrequent

//value

int[]counts=as.nominalCounts;

if（counts[Utils.maxIndex（counts）]>maxCount）{

maxCount=counts[Utils.maxIndex（counts）];

}

if（as.missingCount==numInstances）{

if（first）{

deleteString.append（（i+1））;

first=false;

}else{

deleteString.append（","+（i+1））;

}

removeCount++;

}

if（m_verbose）{

System.err.println（"Removed:

"+removeCount

+"columnswithallmissing"+"values."）;

}

if（m_upperBoundMinSupport==1.0&&maxCount!

=numInstances）{

m_upperBoundMinSupport=（double）maxCount

/（double）numInstances;

if（m_verbose）{

System.err.println（"Settingupperboundminsupportto:

"

+m_upperBoundMinSupport）;

}

if（deleteString.toString（）.length（）>0）{

Removeaf=newRemove（）;

af.setAttributeIndices（deleteString.toString（））;

af.setInvertSelection（false）;

af.setInputFormat（instances）;

InstancesnewInst=Filter.useFilter（instances,af）;

returnnewInst;

}

returninstances;

}

For循环中的第一个if不重要，不要理睬。

它的作用是想在后面也就是maxCount/

numInstances那看一下support的下界设多少。

第二个if中是看是不是缺失值等于样本数，

也就是在这个属性上所有的值都是缺失的，那么就用deleteString把这些都是缺失值的属性

下标记下来，连成一个字符串，最后一个if就是标准的删除特征的代码，也就是把那么都是

缺失值的特征给删除了。

if（m_car&&m_metricType!

=CONFIDENCE）

thrownewException（

"ForCAR-Miningmetrictypehastobeconfidence!

"）;

如果想得到与类别有关的规则，又没有设成CONFIDENCE是不可以的。

//onlysetclassindexifCARisrequested

if（m_car）{

if（m_classIndex==-1）{

instances.setClassIndex（instances.numAttributes（）-1）;

}elseif（m_classIndex<=instances.numAttributes（）

&&m_classIndex>0）{

instances.setClassIndex（m_classIndex-1）;

}else{

thrownewException（"Invalidclassindex."）;

}

如果用用户想得到与类别有关的规则又忘了设类别索引，就帮他设一下。

//canassociatorhandlethedata?

getCapabilities（）.testWithFail（instances）;

看能不能外理这种数据。

m_cycles=0;

if（m_car）{

//m_instancesdoesnotcontaintheclassattribute

m_instances=LabeledItemSet.divide（instances,false）;

//m_onlyClasscontainsonlytheclassattribute

m_onlyClass=LabeledItemSet.divide（instances,true）;

}else

m_instances=instances;

if（m_car&&m_numRules==Integer.MAX_VALUE）{

//Setdesiredminimumsupport

m_minSupport=m_lowerBoundMinSupport;

}else{

//Decreaseminimumsupportuntildesirednumberofrulesfound.

m_minSupport=m_upperBoundMinSupport-m_delta;

m_minSupport=（m_minSupport

m_lowerBoundMinSupport:

m_minSupport;

}

不再看LabeledItemSet.devide，只看一下注释，m_instances不包含类别属性，而

m_onlyClass只包含类别属性。

下面的就不去管它了。

do{

//Reservespaceforvariables

m_Ls=newFastVector（）;

m_hashtables=newFastVector（）;

m_allTheRules=newFastVector[6];

m_allTheRules[0]=newFastVector（）;

m_allTheRules[1]=newFastVector（）;

m_allTheRules[2]=newFastVector（）;

if（m_metricType!

=CONFIDENCE||m_significanceLevel!

=-1）{

m_allTheRules[3]=newFastVector（）;

m_allTheRules[4]=newFastVector（）;

m_allTheRules[5]=newFastVector（）;

}

sortedRuleSet=newFastVector[6];

sortedRuleSet[0]=newFastVector（）;

sortedRuleSet[1]=newFastVector（）;

sortedRuleSet[2]=newFastVector（）;

if（m_metricType!

=CONFIDENCE||m_significanceLevel!

=-1）{

sortedRuleSet[3]=newFastVector（）;

sortedRuleSet[4]=newFastVector（）;

sortedRuleSet[5]=newFastVector（）;

}

if（!

m_car）{

//Findlargeitemsetsandrules

findLargeItemSets（）;

if（m_significanceLevel!

=-1||m_metricType!

=CONFIDENCE）

findRulesBruteForce（）;

else

findRulesQuickly（）;

}else{

findLargeCarItemSets（）;

findCarRulesQuickly（）;

}

上面都是一些初始化的代码，不讲了，如果想知道，可以看一下我以前写的那一篇Weka

开发［17］——关联规则之Apriori。

再下来，如果不是挖掘与类别相关的规则，那么先执行

findLargeItermSets：

privatevoidfindLargeItemSets（）throwsException{

FastVectorkMinusOneSets,kSets;

Hashtablehashtable;

intnecSupport,necMaxSupport,i=0;

//Findlargeitemsets

//minimumsupport

necSupport=（int）（m_minSupport

*（double）m_instances.numInstances（）+0.5）;

necMaxSupport=（int）（m_upperBoundMinSupport

*（double）m_instances.numInstances（）+0.5）;

kSets=AprioriItemSet.singletons（m_instances）;

AprioriItemSet.upDateCounters（kSets,m_instances）;

kSets=AprioriItemSet.deleteItemSets（kSets,necSupport,

necMaxSupport）;

if（kSets.size（）==0）

return;

do{

m_Ls.addElement（kSets）;

kMinusOneSets=kSets;

kSets=AprioriItemSet.mergeAllItemSets（kMinusOneSets,i,

m_instances.numInstances（））;

hashtable=AprioriItemSet.getHashtable（kMinusOneSets,

kMinusOneSets.size（））;

m_hashtables.addElement（hashtable）;

kSets=AprioriItemSet.pruneItemSets（kSets,hashtable）;

AprioriItemSet.upDateCounters（kSets,m_instances）;

kSets=AprioriItemSet.deleteItemSets（kSets,necSupport,

necMaxSupport）;

i++;

}while（kSets.size（）>0）;

}

NecSupprot是最小support的另一种衡量，就是计数，原来的m_minSupport是用百分

比逄的，+0.5当然就是四舍五入了。

necMaxSupport也是相同的意思。

下面看AprioriItermSet.singletons，AprioriItermSet是继承自ItemSet的：

/**

*Convertstheheaderinfoofthegivensetofinstancesintoasetof

*itemsets（singletons）.Theorderingofvaluesintheheaderfile

*determinesthelexicographicorder.

*/

publicstaticFastVectorsingletons（Instancesinstances）

throwsException{

FastVectorsetOfItemSets=newFastVector（）;

ItemSetcurrent;

for（inti=0;i

if（instances.attribute（i）.isNumeric（））

thrownewException（"Can'thandlenumericattributes!

"）;

for（intj=0;j

current=newAprioriItemSet（instances.numInstances（））;

current.m_items=newint[instances.numAttributes（）];

for（intk=0;k

current.m_items[k]=-1;

current.m_items[i]=j;

setOfItemSets.addElement（current）;

}

returnsetOfItemSets;

}

这段代码的作用现在还看不出来，可以看一下注释，意思是：

将给定数据集的头信息转

换成一个项集的集合，头信息中的值的顺序是按字典序。

如果你也是用weather.nominal数

据集，那么它产生的结果应该是这个：

0,-1,-1,-1,-1

1,-1,-1,-1,-1

2,-1,-1,-1,-1

-1,0,-1,-1,-1

-1,1,-1,-1,-1

-1,2,-1,-1,-1

-1,-1,0,-1,-1

-1,-1,1,-1,-1

-1,-1,-1,0,-1

-1,-1,-1,1,-1

-1,-1,-1,-1,0

-1,-1,-1,-1,1

一共有12组，因为有2个属性有3种属性值，3个属性有2种属性值，2*3+3*2=12。

而且m_iterm的第i个元素值设为j。

接下来看AprioriItemSet.upDateCounters：

publicstaticvoidupDateCounters（FastVectoritemSets,Instances

instances）{

for（inti=0;i

Enumerationenu=itemSets.elements（）;

while（enu.hasMoreElements（））

（（ItemSet）enu.nextElement（））.upDateCounter（instances

.instance（i））;

}

外层循环是样本的个数，内层循环是itermSets的数量。

里面的upDateCounter：

publicvoidupDateCounter（Instanceinstance）{

if（containedBy（instance））

m_counter++;

}

也就是如果满足containBy那么就计数加1，看一下containBy：

publicbooleancontainedBy（Instanceinstance）{

for（inti=0;i

if（m_items[i]>-1）{

if（instance.isMissing（i））

returnfalse;

if（m_items[i]!

=（int）instance.value（i））

returnfalse;

}

returntrue;

}

这段代码就相对容易一点了，如果在相应的特征值上为缺失值，或是与我们要找的特征

词不相同，那么就返回false。

那么在upDateCounters执行完之后，结果大概是这样的，最后的是记数：

0,-1,-1,-1,-1

1,-1,-1,-1,-1

2,-1,-1,-1,-1

-1,0,-1,-1,-1

-1,1,-1,-1,-1

-1,2,-1,-1,-1

-1,-1,0,-1,-1

-1,-1,1,-1,-1

-1,-1,-1,0,-1

-1,-1,-1,1,-1

:

5

4

5

4

6

4

7

6

8

-1,-1,-1,-1,0:

9

-1,-1,-1,-1,1:

5

这样也就理解了，m_item的意义，也就是现在是1项集。

接下来看deleteItemSets：

publicstaticFastVectordeleteItemSets（FastVectoritemSets,

intminSupport,intmaxSupport）{

FastVectornewVector=newFastVector（itemSets.size（））;

for（inti=0;i

ItemSetcurrent=（ItemSet）itemSets.elementAt（i）;

if（（current.m_counter>=minSupport）

&&（current.m_counter<=maxSupport））

newVector.addElement（current）;

}

returnnewVector;

}

这里可以看到，只有m_counter在minSupport和maxSupport之间的项我们才要。

接下来就要看findLargeItemSets这个函数中的do/while循环了，m_Ls是保存所有项的

一个集合，那kMinusOneSets就是上一个项集，接下来是AprioriItemSet.mergeAllItemSets，

但是这个函数在执行第一次的时候，它并看不出来什么作用：

publicstaticFastVectormergeAllItemSets（FastVectoritemSets,intsize,

inttotalTrans）{

FastVectornewVector=newFastVector（）;

ItemSetresult;

intnumFound,k;

for（inti=0;i

ItemSetfirst=（ItemSet）itemSets.elementAt（i）;

out:

for（intj=i+1;j

ItemSetsecond=（ItemSet）itemSets.elementAt（j）;

result=newAprioriItemSet（totalTrans）;

result.m_items=newint[first.m_items.length];

//Findandcopycommonprefixofsize'size'

numFound=0;

k=0;

while（numFound

if（first.m_items[k]==second.m_items[k]）{

if（first.m_items[k]!

=-1）

numFound++;

result.m_items[k]=first.m_items[k];

}else

breakout;

k++;

}

//Checkdifference

while（k

if（（first.m_items[k]!

=-1）&&（second.m_items[k]!

=-1））

break;

else{

if（first.m_items[k]!

=-1）

result.m_items[k]=first.m_items[k];

else

result.m_items[k]=second.m_items[k];

}

k++;

}

if（k==first.m_items.length）{

result.m_counter=0;

newVector.addElement（result）;

}

returnnewVector;

}

这里有一个out标签，也就是breakout执行后会跳到out那里接着执行，很象goto。

这

文档加载中……请稍候！
如果长时间未打开，您也可以点击刷新试试。

下载文档到电脑，查找使用更方便

下载	加入VIP,免费下载

版权申诉 word格式文档无特别注明外均可编辑修改；预览文档经过压缩，下载后原文更清晰！ 立即下载

配套讲稿：: 如PPT文件的首页显示word图标，表示该PPT已包含配套word讲稿。双击word图标可打开word文档。
特殊限制：: 部分文档作品中含有的国旗、国徽等图片，仅作为作品整体效果示例展示，禁止商用。设计者仅对作品中独创性部分享有著作权。
关键词：: Weka24 Apriori源代码分析1 Apriori 源代码分析

冰点文库所有资源均是用户自行上传分享，仅供网友学习交流，未经上传用户书面授权，请勿作他用。

关于本文

本文标题：Weka24 Apriori源代码分析1.docx
链接地址：https://www.bingdoc.com/p-70063.html

Weka24 Apriori源代码分析1.docx

热门标签