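Hadoop Text Word-Frequency Sorting: Lab Report

Introduction to Big Data Technology, Lab Report

Text Word-Frequency Sorting

*******

Major: Engineering Management

Student ID: 2015E**********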

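1. Experiment Requirements

Write a WordCount program in Eclipse that counts every word occurring more than k times and outputs the results sorted by frequency from highest to lowest.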

2. Environment

2.1 Hardware

Processor: Intel Core i3-2350M ********** × 4
Memory: 2 GB
Disk: 60 GB

2.2 Software

Operating system: Ubuntu 14.04 LTS
Operating system type: 32-bit
Java version: 1.7.0_85
Eclipse version: 3.8
Hadoop plugin: hadoop-eclipse-plugin-2.6.0.jar
Hadoop: 2.6.1

2.3 Installation and Configuration

1. Hadoop configuration

1) core-site.xml

hadoop.tmp.dir: (value not shown in the source)
    A base for other temporary directories.
fs.defaultFS: hdfs://inspiron:9000

2) hdfs-site.xml

dfs.replication: 1
dfs.namenode.name.dir: (value not shown in the source)
dfs.datanode.data.dir: (value not shown in the source)
dfs.namenode.secondary.http-address: 127.0.0.1:50090
    The secondary namenode http server address and port.
dfs.webhdfs.enabled: true
    Enable WebHDFS (REST API) in Namenodes and Datanodes.

3) mapred-site.xml

mapreduce.framework.name: yarn
mapreduce.jobhistory.address: 127.0.0.1:10020
    MapReduce JobHistory Server IPC host:port
mapreduce.jobhistory.webapp.address: 127.0.0.1:19888
    MapReduce JobHistory Server Web UI host:port
mapreduce.jobtracker.http.address: 127.0.0.1:50030
    The job tracker http server address and port the server will listen on. If the port is 0 then the server will start on a free port.

4) yarn-site.xml

yarn.resourcemanager.hostname: inspiron
    The hostname of the RM.
yarn.nodemanager.aux-services: mapreduce_shuffle
yarn.nodemanager.aux-services.mapreduce_shuffle.class: org.apache.hadoop.mapred.ShuffleHandler
yarn.resourcemanager.address: inspiron:8032
    The address of the applications manager interface in the RM.
yarn.resourcemanager.scheduler.address: inspiron:8030
    The address of the scheduler interface.
yarn.resourcemanager.resource-tracker.address: inspiron:8031
yarn.resourcemanager.scheduler.class: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
    The class to use as the resource scheduler.

5) slaves

inspiron

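2. Eclipse configuration

1) Install the Hadoop development plugin.

The Hadoop version used in this experiment is relatively new, and building the plugin from source ran into too many problems, so the officially released 2.6.0 plugin was used directly. Testing showed that it works correctly.

Copy the plugin into the plugins directory under the Eclipse installation directory.

2) Open Eclipse -> Window -> Preferences and set the Hadoop installation path.

3. Create a new Hadoop Location.

4. Once configuration is complete, the new location is visible in the Project Explorer and Map/Reduce Locations views.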

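3. Experiment Design

3.1 Design Approach

The program is built on the MapReduce framework. The Map phase splits the input text into individual words and performs a preliminary count. Each word and its count are combined into one string that becomes the Map output value, and the Map output key is set to a single fixed value shared by every record.

The Reduce phase receives the Map output, splits each value to recover the word and its count, and sums the counts for every word. It then filters by the threshold k: each word whose frequency is higher than k is stored, together with its frequency, as a key-value pair in a container. The container is sorted by frequency from highest to lowest and written out as word-frequency pairs.

With this design, a single MapReduce job is enough to count the words, filter them, sort them and output the result. Because every record shares the same key, all Map output is handled by one reduce call.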

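3.2 Algorithm Design

1. In the Map phase, override the map method. Use the StringTokenizer class to split the text held in the map method's value parameter into individual words, count them, and store the result in a Map collection.

Iterate over the Map collection, join each word and its count into one string to use as the Map output value, and emit it as a <key, value> pair.

2. In the Reduce phase, override the setup method to read the configured frequency threshold.

3. For the <key, values> pairs emitted by the Map phase, iterate over values, split each value, and accumulate the frequency of each word, storing the results as key-value pairs in a Map collection.

4. Iterate over the Map collection of word-frequency pairs and put every word whose frequency is greater than k, with its frequency, into a List collection.

5. Sort the List collection by frequency from highest to lowest, using the Collections.sort() overload that takes a Comparator.

6. Iterate over the sorted List collection and write the stored words and frequencies to the reduce method's context variable, so that they are output as word-frequency key-value pairs.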
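The six steps can be tried out without a cluster. The following is a minimal plain-Java sketch, not the report's Hadoop code: the class name WordFrequencySketch and its methods are hypothetical, ordinary collections stand in for the Map and Reduce phases, and tokens that contain no letters are skipped as an assumption.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

/** Plain-Java walk-through of the six steps; no Hadoop classes are used. */
final class WordFrequencySketch {

    /** Step 1: count the words of one input line, keeping letters only. */
    static Map<String, Integer> countLine(String line) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            StringBuilder word = new StringBuilder();
            for (char c : itr.nextToken().toCharArray()) {
                if (Character.isLetter(c)) {
                    word.append(c);
                }
            }
            if (word.length() == 0) {
                continue; // assumption: ignore tokens without any letter
            }
            Integer old = counts.get(word.toString());
            counts.put(word.toString(), old == null ? 1 : old + 1);
        }
        return counts;
    }

    /** Steps 3 to 5: merge per-line counts, keep counts > k, sort descending. */
    static List<Map.Entry<String, Integer>> topWords(List<String> lines, int k) {
        Map<String, Integer> total = new HashMap<String, Integer>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> e : countLine(line).entrySet()) {
                Integer old = total.get(e.getKey());
                total.put(e.getKey(), old == null ? e.getValue() : old + e.getValue());
            }
        }
        List<Map.Entry<String, Integer>> kept = new ArrayList<Map.Entry<String, Integer>>();
        for (Map.Entry<String, Integer> e : total.entrySet()) {
            if (e.getValue() > k) {
                kept.add(new AbstractMap.SimpleEntry<String, Integer>(e.getKey(), e.getValue()));
            }
        }
        Collections.sort(kept, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                return b.getValue().compareTo(a.getValue()); // highest frequency first
            }
        });
        return kept;
    }

    public static void main(String[] args) {
        // Step 6: print the surviving words; with k = 1 this prints a 3, then b 2.
        for (Map.Entry<String, Integer> e : topWords(Arrays.asList("a b a c", "b a d"), 1)) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

Each input line plays the role of one map call, and topWords plays the role of the single reduce call that sees every record.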

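3.3 Program and Class Design

1. Define the TokenizerMapper class, which extends the Mapper class from the org.apache.hadoop.mapreduce package, and override its map method.

The map method uses the StringTokenizer class to split the text held in its value parameter into individual words, counts them, and stores the counts in a Map collection. It then iterates over the Map collection, joins each word with its count, and emits the result as a <key, value> pair. The MapReduce framework handles everything else.

    // Requires java.io.IOException, java.util.HashMap, java.util.Map,
    // java.util.StringTokenizer, org.apache.hadoop.io.Text and
    // org.apache.hadoop.mapreduce.Mapper.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, Text> {

        private final static Text mapValue = new Text();
        // Every record shares this key, so all of them reach the same reduce call.
        private Text mapKey = new Text("key");

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            Map<String, Integer> word2count = new HashMap<String, Integer>();
            while (itr.hasMoreTokens()) {
                String nextToken = removeNonLetters(itr.nextToken());
                if (nextToken.isEmpty())
                    continue; // the token held no letters, e.g. "123"
                if (!word2count.containsKey(nextToken))
                    word2count.put(nextToken, 0);
                word2count.put(nextToken, word2count.get(nextToken) + 1);
            }
            // Emit one "word\001count" record per distinct word in this input value.
            for (Map.Entry<String, Integer> entry : word2count.entrySet()) {
                mapValue.set(entry.getKey() + "\001" + entry.getValue());
                context.write(mapKey, mapValue);
            }
        }

        // Remove the non-letter characters from a token
        public static String removeNonLetters(String original) {
            StringBuffer aBuffer = new StringBuffer(original.length());
            char aCharacter;
            for (int i = 0; i < original.length(); i++) {
                aCharacter = original.charAt(i);
                if (Character.isLetter(aCharacter)) {
                    aBuffer.append(aCharacter);
                }
            }
            return new String(aBuffer);
        }
    }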
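To see what the tokenizing and cleaning step does to real input, here is a small stand-alone sketch. TokenCleaningDemo and its method names are hypothetical; only StringTokenizer and Character.isLetter come from the listing above. It shows two consequences worth knowing: a token made only of digits or punctuation becomes an empty string, and case is preserved, so "Hello" and "hello" would be counted as different words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

final class TokenCleaningDemo {

    /** Keep only the characters for which Character.isLetter is true. */
    static String lettersOnly(String original) {
        StringBuilder sb = new StringBuilder(original.length());
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            if (Character.isLetter(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Split on whitespace (StringTokenizer's default) and clean each token. */
    static List<String> cleanTokens(String line) {
        List<String> out = new ArrayList<String>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(lettersOnly(itr.nextToken()));
        }
        return out;
    }

    public static void main(String[] args) {
        // "123" has no letters, so the third element is the empty string.
        System.out.println(cleanTokens("Hello, world! 123 it's"));
    }
}
```

The cleaned tokens for that sample line are "Hello", "world", the empty string, and "its", which is why a mapper built this way may want to skip empty results before counting.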

2. Define the IntSumReducer class, which extends the Reducer class from the org.apache.hadoop.mapreduce package. For the key-value pairs sent by the Map phase, it splits each value to recover the word and its count, totals the frequencies, keeps the words whose frequency is higher than the threshold, and outputs them sorted by frequency from highest to lowest.

    public static class IntSumReducer extends Reducer<Text, Text, Text, IntWritable> {

        private IntWritable outputValue = new IntWritable();
        private Text outputKey = new Text();
        private int k = 0;

        @Override
        protected void setup(Context context)
                throws IOException, InterruptedException {
            super.setup(context);
            this.k = Integer.parseInt(context.getConfiguration().get("k"));
        }

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Total the per-record counts for every word.
            Map<String, Integer> word2count = new HashMap<String, Integer>();
            for (Text val : values) {
                String valStr = val.toString();
                String[] records = valStr.split("\001");
                String word = records[0];
                int cnt = Integer.parseInt(records[1]);
                if (!word2count.containsKey(word))
                    word2count.put(word, 0);
                word2count.put(word, word2count.get(word) + cnt);
            }
            // Keep only the words whose frequency is greater than k.
            List<Pair> list = new ArrayList<Pair>();
            for (Map.Entry<String, Integer> entry : word2count.entrySet()) {
                if (entry.getValue() > this.k) {
                    Pair p = new Pair(entry.getKey(), entry.getValue());
                    list.add(p);
                }
            }
            // Sort by frequency, highest first.
            Collections.sort(list, new Comparator<Pair>() {
                @Override
                public int compare(Pair o1, Pair o2) {
                    return o2.getV().compareTo(o1.getV());
                }
            });
            for (Pair p : list) {
                outputKey.set(p.getK());
                outputValue.set(p.getV());
                context.write(outputKey, outputValue);
            }
        }
    }

1) Override the setup method to read the configured frequency threshold and assign it to the declared field k.

    protected void setup(Context context)
            throws IOException, InterruptedException {
        super.setup(context);
        this.k = Integer.parseInt(context.getConfiguration().get("k"));
    }

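2) In the Map output, the key is the shared fixed value and is discarded, while values is a collection of strings of the form [word + separator + count]. Override the reduce method to iterate over values, split each one on the separator, and total the counts to obtain the frequency of each word.

Declare a Map variable word2count and store each word and its frequency in it as a key-value pair.

    Map<String, Integer> word2count = new HashMap<String, Integer>();
    for (Text val : values) {
        String valStr = val.toString();
        String[] records = valStr.split("\001");
        String word = records[0];
        int cnt = Integer.parseInt(records[1]);
        if (!word2count.containsKey(word))
            word2count.put(word, 0);
        word2count.put(word, word2count.get(word) + cnt);
    }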
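The record format used between the two phases can be exercised in isolation. The sketch below is plain Java with hypothetical names (RecordMergeDemo, encode, merge); it reuses the "\001" separator from the listing. Skipping records that do not split into exactly two fields is an added assumption, since the report's loop indexes records[1] without checking.

```java
import java.util.HashMap;
import java.util.Map;

final class RecordMergeDemo {

    // Same separator as the report: the control character U+0001.
    static final String SEP = "\001";

    /** Build one "word<SEP>count" record, as the map side does. */
    static String encode(String word, int count) {
        return word + SEP + count;
    }

    /** Parse one record and add its count to the running totals. */
    static void merge(Map<String, Integer> totals, String record) {
        String[] parts = record.split(SEP);
        if (parts.length != 2) {
            return; // assumption: skip records that do not have exactly two fields
        }
        int count = Integer.parseInt(parts[1]);
        Integer old = totals.get(parts[0]);
        totals.put(parts[0], old == null ? count : old + count);
    }

    public static void main(String[] args) {
        Map<String, Integer> totals = new HashMap<String, Integer>();
        merge(totals, encode("a", 2));
        merge(totals, encode("b", 1));
        merge(totals, encode("a", 3));
        merge(totals, "malformed"); // ignored: no separator
        System.out.println(totals.get("a") + " " + totals.get("b")); // prints "5 1"
    }
}
```

A control character is a reasonable separator here because removeNonLetters guarantees that words contain only letters, so the separator cannot occur inside a word.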

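3) Declare a List variable list and iterate over word2count, filtering by frequency. For each word whose frequency is greater than k, initialize an object of the newly defined class Pair with the word and its frequency, and add the object to list.

Sort list using the Collections.sort() overload that takes a Comparator.

    List<Pair> list = new ArrayList<Pair>();
    for (Map.Entry<String, Integer> entry : word2count.entrySet()) {
        if (entry.getValue() > this.k) {
            Pair p = new Pair(entry.getKey(), entry.getValue());
            list.add(p);
        }
    }
    Collections.sort(list, new Comparator<Pair>() {
        @Override
        public int compare(Pair o1, Pair o2) {
            return o2.getV().compareTo(o1.getV());
        }
    });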
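The Pair class itself does not appear in this part of the report. The following is a hypothetical stand-in, inferred only from the calls new Pair(word, count), getK() and getV(); the real class may differ. The comparator also adds an alphabetical tie-break, which is an assumption that goes beyond the report's comparator and is included only to make the output order deterministic when two words share a frequency.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

/** Hypothetical stand-in for the report's Pair class: a word and its frequency. */
final class Pair {
    private final String k;
    private final Integer v;

    Pair(String k, Integer v) {
        this.k = k;
        this.v = v;
    }

    String getK() { return k; }

    Integer getV() { return v; }
}

final class PairSortDemo {

    /** Return a copy of the input sorted by frequency, highest first. */
    static List<Pair> sortedDescending(List<Pair> input) {
        List<Pair> list = new ArrayList<Pair>(input);
        Collections.sort(list, new Comparator<Pair>() {
            @Override
            public int compare(Pair o1, Pair o2) {
                int byCount = o2.getV().compareTo(o1.getV()); // reversed operands: descending
                // Assumption: break ties alphabetically so the order is repeatable.
                return byCount != 0 ? byCount : o1.getK().compareTo(o2.getK());
            }
        });
        return list;
    }

    public static void main(String[] args) {
        List<Pair> words = Arrays.asList(new Pair("the", 5), new Pair("a", 9), new Pair("of", 5));
        for (Pair p : sortedDescending(words)) {
            System.out.println(p.getK() + "\t" + p.getV()); // a 9, of 5, the 5
        }
    }
}
```

Swapping the operands in compareTo, rather than negating the result, is the usual way to reverse an ordering without risking integer overflow.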

4) Iterate over list and write the sorted words and frequencies to the reduce method's context variable.

    for (Pair p : list) {
        outputKey.set(p.getK());
        outputValue.set(p.getV());
        context.write(outputKey, outputValue);
    }

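3. The main method defines a Job object, which manages and runs one computation task, and uses the Job's methods to set the task's parameters.

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        String master = "127.0.0.1";
        conf.set("fs.defaultFS", "hdfs://127.0.0.1:9000");
        conf.set("hadoop.job.user", "hadoop");
        conf.set("mapreduce.framework.name", "yarn");
        conf.set("yarn.resourcemanager.address", master + ":8032");
        conf.set("yarn.resourcemanager.scheduler.address", master + ":8030");
        conf.set("mapred.jar", "wordcount.jar");
        String[] otherArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        if (otherArgs.length < 3) {
            System.err.println("Usage: wordcount [...]");
            System.exit(2);
        }
        // Job.getInstance replaces the deprecated new Job(conf, name) constructor.
        Job job = Job.getInstance(conf, "wordcount");
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        // The mapper emits <Text, Text>; the reducer emits <Text, IntWritable>.
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Read the threshold k
        for (int i = 0; i

(The listing breaks off at this point in the source.)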
