




版權(quán)說(shuō)明:本文檔由用戶(hù)提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)
文檔簡(jiǎn)介
1、Pyramid:the distributed infrastructure陽(yáng)振坤2009-06-17 Why distributed system?Massive data sizeTrillions of web pages indexed by Baidu/GoogleThousands of terabytes of logs.Challenges from machine failurePractically no fault-free machinesInflexibility of manual partitioning data among machinesWhy NOT Ha
2、doop?Baidu goes toe to toe with GoogleHuge gaps between Hadoop and Google on availability (fault tolerance), performance, scalability, featuresSingle point failureReal time fault tolerance, data/task migrationConcurrent appendingAutomatic rebalanceMaintenanceCopy-on-write + instant snapshot/checkpoi
3、ntB-tree + pressed file namespaceWhy will Pyramid beat Google?Standing on the shoulders of Google etcNo legacy burdensBetter infrastructureBetter design to fit modern machine metadataHigher localization rateWhat can Pyramid do for you?Scaling up to thousands of machines30GB/s and 3GB/s read/write pe
4、rformance in a 100-machine clusterAutomatic data partition among machinesTransparent machine failure handlingFailed machine is automatically detected and removedData and task are migrated to other machines automatically and quicklyMachine failure is transparent to appsAutomatic parallelizationNo par
5、allel or distributed system background are requiredDynamic load balance among machinesLow operation and maintenance costPyramids infrastructureDCS: Distributed Computing SystemDTS: Distributed Table SystemDFS: Distributed File SystemTo simplify design and implementationFor stage resultsDFSDTSDCSDFSs
6、 goalCapable of storing thousands of TB of dataHigh and sustainable aggregate IO bandwidthHundreds of GB/s read performanceTens of GB/s write performanceUninterruptible serviceBuilt-in fault tolerance and high availabilityAutomatic machine managementPlug and playDFSs infrastructureSingle master + ma
7、ny workersFiles: divided into fixed-size chunks (256MB) stored by workersChunks: replicated on multiple workersDFSDTSDCSDM333444555666777888000222111Infrastructure of DTSSingle DTS master + many workersSorted and partitioned by row keyEach partition is about 256MBPartitions can be split or merged du
8、e to insertion or deletionB+ tree hierarchyDTSs goalUp to trillions of rows and billions of columnsSorted and partitioned by row keyDistributed among hundreds or thousands of machinesEach machine serves a few to a few thousand partitions Queries are done by corresponding machinesDFSDTSDCSDCSs goalA
9、restricted programming model (Map/Reduce)Widely adaptableAutomatically parallel and transparently fault-tolerantA distributed execution frameworkPartitioning job to tasksScheduling tasks among machinesDealing with machine failureBenefits to engineersAutomatic parallelization Transparent fault-tolera
10、nceBoth native C+ and scriptNo experience on parallel or distributed system requiredDFSDTSDCSMap & reduce prototypeMap: (k1,v1) = list (k2, v2)k1, v1 and k2, v2 are all binary stringsUsually (k2,v2) are drawn from a different domain than the inputWritten by the userReduce: (k2, v2) = list(v)Both k2
11、and v2 are stringsUsually v is draw from the same domain as the inputWritten by the userHow MapReduce worksMapReduce transforms and rearranges dataMap 1:123RMap 2:123RMap 3:123R123R123R123R123RMap M:123RInput data for map #1Input data for map #2Input data for map #3Input data for map #MMapReduce dia
12、gramRead dataMap: extract something from each recordShuffle and SortReduce: aggregate, summarize, filter, or transformWrite the resultsScheduling & load balancingJob splittingM map tasksR Reduce tasksTask assignmentThe master obtains a machine pool from the underlying systemA task is assigned to a w
13、orker whenever it es idleData shuffling begins before all map tasks are doneReduce tasks will not start until all map tasks are doneDealing with stragglersA few stragglers are commonBad disksResource consumed by other jobsBackup taskStarting pointResult arbitrationCommutative and associative propert
14、yTask control policyEnable/disable backup tasksLimit map tasks number per machine during reduce timeCountering against failed workersDone/undergoing map tasks on failed worker Done/undergoing reduce tasks on failed workerMap 1:123RMap 2:123RMap 3:123R123R123R123R123RMap M:123RMap worker of word coun
15、tingRetrieve input data from DFS fileInterpret input data (text)Call map() to retrieve each word and emit(word, “1”)Deliver the emitted pair to corresponding reduce buckets: md5(key) mod R, R: number of reduce tasksSort each bucket and combine duplicated to and emitKeep final results to local disk a
16、s temporary fileReduce worker of word countingRetrieve corresponding bucket (set of ) from each map worker through network (shuffle) Sort whole retrieved data by keyExternal sort is often invokedCall reduce() to sum up under the same “word” and emit Process the emitted content by output_format_tDeli
17、ver final results to output_target_t (DFS file)Map worker of distributed sortRetrieve input data through input_source_t: e.g., DFS fileInterpret input data according to input_format_t: e.g., textCall user defined map() functionMap() retrieve every and emit themCollect pairs emitted by map()Deliver t
18、hese pairs to corresponding reduce bucket according to the partitioner_te.g., R-1 separator key strings, R: the number of reduce tasksKeep final results to local disk as temporary filestr1str2str(R-1)123RReduce worker of distributed sortRetrieve corresponding bucket (set of pairs) from each map work
19、er through network (shuffle) Sort whole retrieved data by keyExternal sort is often invokedCall user defined reduce() function: emit(key, value)Collect results emitted by reduce() Process the above results according to output_format_tDeliver final results to output_target_tstr1str2str(R-1)123RHow ma
20、p worksMap collects its output by bucketMulti-way mergingKept as local temporary files123R213R31232R123R123R512MB512MB512MBHow reduce worksReduce shuffles data from each map workerEach shuffle implies a disk read on a map machineReduce sorts its input dataMulti-way mergingOften assign reduce worker
21、more memoryMap 1:3Map 2:3Map 3:3333Map M:3A3,F3,B3,U3C3,K3,A3,A3,V3U3,C3,B3,L3V3,K3,L3,A31GBA3,F3,B3,U3,C3,K31GBA3,A3,V3,U3,C3,B31GBL3,1GBBatch lookup from a tableAssumptionsA large table (e.g., tens of GBs or more)A relatively large volume of source keys (a few GBs or more)TargetFor every key in th
22、e source, retrieve its value in the tableCan MapReduce offer a solution to it?One solutionInput sources: the table and the source keysThe map() functionFor the table input, emit(key+0, value)For the source keys input, emit(key+1, null)The reduce() functionFor , store key and value to member variable
23、s of the classFor , compare current key with stored key, emit if equal, emit otherwiseHow to make these two items in the same bucketPartition(key+x) = md5(key) mod RA few more MapReduce examplesSobar log analysis at 2008.11First DCS appRetrieve source data by another DCS app firstLinkuniq at 2008.12
24、1TB+ single reduce task input before applying combinerWebinfodb backup: retrieve and sortMap worker: retrieve webpage from webinfodbReduce worker: gunzip and press webpage by lsrPartitioner: a list of urlsASP log: retrieve and split by products PLSA30M docs, 700K words and 300/1000 topicsChallenges
25、during developmentComplexity500,000+ C+ code linesAsynchronous RPC callsFault toleranceFailed/slow machinesLarge clock skewTransient failed DNSMomentary network breakUndetected network transferring errorLost、disordered、delayed network packetsRoadmap to PyramidPreparationDFS designDFS development & t
26、estDCS designDCS development & testDTS designDTS development & test200720082009Q3Q4Q1Q2Q3Q4Q1Q2Q3Q4TodayDFS trial appDCS trial appDTS trial appReferencesReference:Pyramid Wiki: DCS SDK: Q & ADFS WorkerData are divided into chunks (256MB)Chunks are distributed among workersChunks are replicated on mu
27、ltiple workers for reliability as well as performanceClients communicates with workers for data directlyDFS MasterKeeps all metadata in memoryFile namespaceChunk namespaceLogs all metadata mutationsSends/receives heartbeat/response to/from all workersDetects machine failure and migrates chunks autom
28、aticallyBalances load between workers automaticallyShadow masters to counter against master failureClients communicate with the master for metadata onlyDFS Master failurePrimary master + shadow master(s)A shadow master keeps itself synchronized by replaying commit log produced by the primary masterA
29、 shadow master lags the primary master, typically fractions of a secondA shadow master es the primary after the latter failsDFS: Performance considerationsLarge chunk sizeChunk replicas distribution policyConcurrent appendingNearest network data transferringMasters batch commit logMasters in-memory
30、metadata policyMasters copy-on-write metadata structure10:1 read/write performance gapDFS: 3 replicasReliability and PerformanceDCS: The partition functionDefault partitionermd5(key) mod RMetamorphosis: Hash(f(key)Hash(url) = f(site(url)Hash(url) = f(domain(url)Hash(key+x) = f(key)Other partitioners
31、SeparatorsHow a worker performs a mutationA worker always writes a mutation to a memory patchReads and writes can be proceeded in parallel by adopting copy-on-write techniqueThe in-memory portion is written to disk and reset to empty when its size reaches a thresholdThe original tablet and its disk patches are immutable to accelerate reading, writing and splittingThe tablet is rew
溫馨提示
- 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶(hù)所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶(hù)上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶(hù)上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶(hù)因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。
最新文檔
- 抗菌藥物處方管理
- 城市交通規(guī)劃合同變更咨詢(xún)重點(diǎn)基礎(chǔ)知識(shí)點(diǎn)
- 《績(jī)效管理研究》課件
- 過(guò)節(jié)福利采購(gòu)合同協(xié)議
- 道具超市采購(gòu)合同協(xié)議
- 車(chē)貼廣告模板合同協(xié)議
- 有償出資協(xié)議書(shū)
- 車(chē)輛質(zhì)押合同借款協(xié)議
- 個(gè)人金融服務(wù)數(shù)字化協(xié)議
- 明星對(duì)賭協(xié)議書(shū)
- 2023年上海市數(shù)學(xué)中考真題(含答案)
- 2024年中國(guó)環(huán)境保護(hù)法律法規(guī)
- C139模型-大項(xiàng)目控單力
- 撫州潤(rùn)泰藥業(yè)有限公司年產(chǎn)500噸4,6-二甲氧基-2-甲磺?;奏ぜ碍h(huán)保設(shè)備設(shè)施升級(jí)技改項(xiàng)目環(huán)評(píng)報(bào)告
- 藥品追溯系統(tǒng)培訓(xùn)課件模板
- 2024信息安全意識(shí)培訓(xùn)ppt課件完整版含內(nèi)容
- 供水行業(yè)操作規(guī)范培訓(xùn)課件
- 軟件系統(tǒng)需求調(diào)研方案
- 《人體骨骼構(gòu)成圖解》課件
- 【復(fù)習(xí)資料】14237手機(jī)媒體概論(復(fù)習(xí)要點(diǎn))
- 電線電纜載流量及其計(jì)算常用數(shù)據(jù)
評(píng)論
0/150
提交評(píng)論