Hadoop Distributed Storage Platform: Translated Foreign Literature
(Includes the English original and a Chinese translation)
Source:
Borthakur D. The Hadoop Distributed File System: Architecture and Design [J]. Hadoop Project Website, 2007, 11(11): 1-10.
English Original
Hadoop Distributed File System: Architecture and Design
Dhruba Borthakur
Introduction
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on general-purpose (commodity) hardware. It has much in common with existing distributed file systems, but it also differs from them in significant ways. HDFS is highly fault-tolerant and is designed to be deployed on inexpensive machines. It provides high-throughput data access and is well suited to applications with large data sets. HDFS relaxes some POSIX constraints to enable streaming access to file system data. HDFS was originally developed as infrastructure for the Apache Nutch search engine project and is part of the Apache Hadoop Core project.
Assumptions and Design Goals
Hardware Failure
Hardware failure is the norm, not the exception. An HDFS instance may consist of hundreds of servers, each storing part of the file system's data. Because the system is made up of a huge number of components and any component can fail, some part of HDFS is always non-functional. Error detection and rapid, automatic recovery are therefore core architectural goals of HDFS.
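A rough calculation shows why failure has to be treated as the normal case. Assume, purely for illustration, that each of n components fails independently with probability p during some time window; neither the independence assumption nor the numbers come from the original text.

```latex
P(\text{at least one failure}) = 1 - (1 - p)^{n}
\qquad \text{e.g. } n = 1000,\ p = 0.001:\quad 1 - 0.999^{1000} \approx 0.63
```

Under these assumed values, a failure somewhere in the cluster is more likely than not, even though each individual component is very reliable.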
Streaming Data Access
Applications that run on HDFS differ from general-purpose applications in that they need streaming access to their data sets. HDFS is designed for batch processing rather than interactive use. The emphasis is on high throughput of data access rather than low latency. Many of the hard constraints imposed by the POSIX standard are not needed by HDFS applications, so some POSIX semantics have been changed to improve data throughput.
Large Data Sets
Applications that run on HDFS have large data sets. A typical file in HDFS is gigabytes to terabytes in size, so HDFS is tuned to support large files. It should provide high aggregate data bandwidth and scale to hundreds of nodes in a single cluster. A single HDFS instance should support tens of millions of files.
Simple Consistency Model
HDFS applications need a "write once, read many" access model for files. Once a file has been created, written, and closed, it does not need to be changed. This assumption simplifies data consistency issues and makes high-throughput data access possible. MapReduce applications and web crawlers fit this model well. There are plans to extend the model in the future to support appending writes to files.
"Moving Computation Is Cheaper Than Moving Data"
A computation requested by an application is more efficient when it runs close to the data it operates on, especially when the data set is massive, because this reduces network congestion and increases overall system throughput. Moving the computation closer to the data is generally better than moving the data to the application. HDFS provides interfaces that let applications move themselves closer to where the data is located.
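The idea of moving computation to the data can be sketched as a scheduling choice. The example below is a minimal illustration, not HDFS code: the function name `pick_worker` and the node names are hypothetical, and it assumes the scheduler already knows which nodes hold a copy of the block.

```python
def pick_worker(block_holders, idle_workers):
    """Prefer an idle worker that already stores the block (no network copy).

    block_holders: set of node names holding a copy of the block (assumed known).
    idle_workers:  list of node names that can currently run a task.
    Returns a node name, or None if no worker is idle.
    """
    for worker in idle_workers:
        if worker in block_holders:
            return worker  # data-local: the computation moves to the data
    # No data-local worker is idle, so fall back to any idle worker;
    # in that case the block would have to be read over the network.
    return idle_workers[0] if idle_workers else None
```

For example, with copies on dn2 and dn3 and idle workers dn1 and dn3, the sketch picks dn3, so the block never crosses the network.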
Portability Across Heterogeneous Hardware and Software Platforms
HDFS was designed to be portable from one platform to another. This portability facilitates the adoption of HDFS as a platform for large-scale data applications.
Namenode and Datanode
HDFS uses a master/slave architecture. An HDFS cluster consists of a single Namenode and a number of Datanodes. The Namenode is a central server that manages the file system namespace and regulates client access to files. There is usually one Datanode per node in the cluster, and each Datanode manages the storage on the node where it runs. HDFS exposes a file system namespace and lets users store data in it as files. Internally, a file is split into one or more blocks, which are stored on a set of Datanodes. The Namenode performs namespace operations such as opening, closing, and renaming files and directories, and it determines the mapping of blocks to Datanodes. The Datanodes serve read and write requests from file system clients, and they create, delete, and replicate blocks under the direction of the Namenode.
The Namenode and Datanode are designed to run on commodity machines, which typically run a GNU/Linux operating system (OS). HDFS is written in Java, so any machine that supports Java can run a Namenode or a Datanode, and Java's portability means HDFS can be deployed on a wide range of machines. In a typical deployment, one machine runs only the Namenode, and each of the other machines in the cluster runs one Datanode instance. The architecture does not rule out running multiple Datanodes on the same machine, but this is relatively rare.
Having a single Namenode in a cluster greatly simplifies the architecture of the system. The Namenode is the arbiter and repository of all HDFS metadata, and the system is designed so that user data never flows through the Namenode.
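The division of labor described in this section can be sketched as follows. This is a toy model under stated assumptions, not the HDFS implementation: the class names `ToyNamenode` and `ToyDatanode`, the function `read_file`, and all method names are hypothetical. The point it illustrates is that the Namenode holds only metadata, while block contents are read directly from Datanodes.

```python
class ToyNamenode:
    """Metadata only: file path -> block IDs, block ID -> Datanode names."""

    def __init__(self):
        self.file_blocks = {}      # path -> ordered list of block IDs
        self.block_holders = {}    # block ID -> list of Datanode names

    def register_file(self, path, block_ids, holders):
        self.file_blocks[path] = list(block_ids)
        for block_id in block_ids:
            self.block_holders[block_id] = list(holders[block_id])

    def get_block_locations(self, path):
        # Returns metadata only; no file contents pass through here.
        return [(b, self.block_holders[b]) for b in self.file_blocks[path]]


class ToyDatanode:
    """Stores block contents on the node where it runs."""

    def __init__(self):
        self.blocks = {}

    def write_block(self, block_id, data):
        self.blocks[block_id] = data

    def read_block(self, block_id):
        return self.blocks[block_id]


def read_file(namenode, datanodes, path):
    """Client read: ask the Namenode where blocks live, then fetch from Datanodes."""
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        data += datanodes[holders[0]].read_block(block_id)
    return data
```

A client would call `read_file` with a dictionary that maps Datanode names to `ToyDatanode` objects. The Namenode object is only ever asked for locations, which mirrors the design choice that user data does not flow through it.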
File System Namespace
HDFS supports a traditional hierarchical file organization. Users and applications can create directories and store files in them. The namespace hierarchy is similar to that of most existing file systems: users can create, delete, move, and rename files. HDFS does not currently support user disk quotas or access control, nor does it support hard links or soft links. However, the HDFS architecture does not preclude implementing these features.
The Namenode maintains the file system namespace and records any change to the namespace or its properties. An application can specify the number of replicas of a file that HDFS should keep. This number is called the replication factor of the file, and it is also stored by the Namenode.
Data Replication
HDFS is designed to reliably store very large files across the machines of a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. An application can specify the number of replicas of a file, either when the file is created or later. Files in HDFS are write-once, and there is strictly one writer at any time.
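The block layout described above (equal-sized blocks, with only the last one allowed to be shorter) can be sketched as below. The function name `split_into_blocks` and the sizes used in the example are illustrative assumptions, not values taken from the text.

```python
def split_into_blocks(file_size, block_size):
    """Return the list of block sizes for a file of file_size bytes.

    Every block is block_size bytes except possibly the last one,
    which holds whatever remains.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block may be shorter
    return sizes
```

With an assumed block size of 64 units, a 200-unit file becomes three full blocks and one 8-unit block. With a replication factor of 3, the cluster would store 3 × 200 = 600 units in total for that file.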
The Namenode makes all decisions about block replication. It periodically receives a heartbeat and a block report from each Datanode in the cluster. Receipt of a heartbeat implies that the Datanode is working properly. A block report lists all the blocks on that Datanode.
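The bookkeeping behind heartbeats and block reports can be sketched as follows. This is a simplified model, not HDFS code: the class `HeartbeatMonitor`, its methods, and the timeout value are hypothetical, and time is passed in explicitly so the behavior is easy to follow.

```python
class HeartbeatMonitor:
    """Tracks which Datanodes are alive and which blocks each one reports."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds   # assumed threshold, not from the text
        self.last_seen = {}              # Datanode name -> time of last heartbeat
        self.block_reports = {}          # Datanode name -> set of block IDs

    def heartbeat(self, datanode, now):
        self.last_seen[datanode] = now

    def block_report(self, datanode, block_ids):
        self.block_reports[datanode] = set(block_ids)

    def live_datanodes(self, now):
        # A Datanode counts as alive if its last heartbeat is recent enough.
        return sorted(d for d, t in self.last_seen.items()
                      if now - t <= self.timeout)

    def live_replica_count(self, block_id, now):
        # Count only replicas held by Datanodes that are still alive.
        return sum(1 for d in self.live_datanodes(now)
                   if block_id in self.block_reports.get(d, set()))
```

Comparing `live_replica_count` with a file's replication factor is one way a Namenode-like component could notice that a block needs another copy after a Datanode stops sending heartbeats.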
Replica Placement: The First Step
The placement of replicas is critical to HDFS reliability and performance. Optimized replica placement is an important feature that distinguishes HDFS from most other distributed file systems. It requires a lot of tuning and accumulated experience. HDFS uses a strategy called rack awareness to improve data reliability, availability, and network bandwidth utilization. The current replica placement strategy is only a first step in this direction. Its short-term goals are to validate the approach in production, observe its behavior, and lay a foundation for testing and researching more sophisticated strategies.
Large HDFS instances typically run on clusters of computers that span multiple racks. Communication between two machines on different racks has to go through switches. In most cases, the bandwidth between two machines in the same rack is greater than the bandwidth between two machines in different racks.
Through a rack-awareness process, the Namenode determines the ID of the rack to which each Datanode belongs. A simple but non-optimal strategy is to place the replicas on different racks. This prevents data loss when an entire rack fails and allows the bandwidth of multiple racks to be used when reading data. It also distributes replicas evenly across the cluster, which makes it easier to balance load when a component fails. However, because each write has to transfer blocks to multiple racks, this strategy increases the cost of writes.
In the common case, where the replication factor is 3, the HDFS placement strategy is to put one replica on a node in the local rack, another on a different node in the same rack, and the last on a node in a different rack. This strategy reduces the amount of data sent between racks, which makes writes more efficient. Rack failures are far less common than node failures, so the strategy does not affect data reliability or availability. Because the blocks are placed on only two racks rather than three, it also reduces the aggregate network bandwidth needed when reading data. Under this strategy, replicas are not evenly distributed across racks: one third of the replicas are on one node, two thirds are on one rack, and the remaining third are evenly distributed across the remaining racks. The strategy improves write performance without compromising data reliability or read performance.
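The three-replica placement described above can be sketched as a small function. This is an illustration only: the name `place_replicas`, the topology format, and the deterministic choice of "first available node" are assumptions made for readability, and a real system would likely choose among candidates differently.

```python
def place_replicas(writer_node, topology):
    """Choose nodes for three replicas, following the strategy in the text.

    topology: dict mapping rack ID -> list of node names (assumed format).
    Returns [local node, another node on the same rack, node on a different rack].
    """
    local_rack = None
    for rack, nodes in topology.items():
        if writer_node in nodes:
            local_rack = rack
            break
    if local_rack is None:
        raise ValueError("writer_node is not in the topology")

    same_rack = [n for n in topology[local_rack] if n != writer_node]
    other_racks = [n for rack in sorted(topology) if rack != local_rack
                   for n in topology[rack]]
    if not same_rack or not other_racks:
        raise ValueError("need a second local-rack node and a node on another rack")

    # Two replicas stay on the local rack; only one crosses to another rack.
    return [writer_node, same_rack[0], other_racks[0]]
```

For a writer on node a with racks {rack1: [a, b], rack2: [c, d]}, the sketch returns a, b, and c: only one copy crosses the rack boundary during the write, yet the data survives the loss of either rack.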
The default replica placement strategy described here is currently still under development.
Replica Selection
In order to reduc