




Semi-Structured Text Mining Methods
Yang Jian…
Institute of Computer Science and Technology, Peking University

Text-centric XML retrieval
- Documents marked up as XML
  - E.g., assembly manuals, journal issues
- Queries are user information needs
  - E.g., give me the Section (element) of the document that tells me how to change a brake light
- Different from well-structured XML queries, where you tightly specify what you're looking for

Vector spaces and XML
- Vector spaces: a tried and tested framework for keyword retrieval
  - Other "bag of words" applications in classification, clustering, …
- For text-centric XML retrieval, can we make use of vector space ideas?
- Challenge: capture the structure of an XML document in the vector space.

Vector spaces and XML
- For instance, distinguish between the following two cases
[Figure: two document trees; surviving labels: "The Pearly", "Microsoft", "Bill", "Bill"]

Content-rich XML
[Figure: document tree; surviving labels: "Microsoft", "The", "Lexicon"]

Encoding the Gates
- What are the axes of the vector space?
- In text retrieval, there would be a single axis for Gates
- Here we must separate out the two occurrences, under Author and Title
- Thus, axes must represent not only terms, but something about their position in an XML tree

Queries
- Before addressing this, let us consider the kinds of queries we want to handle
[Figure: query trees; surviving label: "Microsoft"]

Query types
- The preceding examples can be viewed as subtrees of the document
- But what about a query such as (Gates somewhere underneath …)?
- This is harder, and we will return to it

Subtrees and structure
- Consider all subtrees of the document that include at least one lexicon term
[Figure: enumerated subtrees; surviving label: "Microsoft"]

Structural terms
- Call each of the resulting (8+, in previous slide) subtrees a structural term
- Note that structural terms might occur multiple times in a document
- Create one axis in the vector space for each distinct structural term
- Weights based on frequencies for number of occurrences (just as we had tf)
- All the usual issues with terms (stemming? case folding?) remain

Example of tf weighting
[Figure: document tree; surviving text: "To be or not to be"]
- Exercise: how many axes are there in this example?
- Here the structural terms containing "to" or "be" would have more weight than those that don't

Down-weighting
- For the doc on the left: in a structural term rooted at the node Play, shouldn't Hamlet have a higher tf weight?
- Idea: multiply the tf contribution of a term to a node k levels up by γ^k, for some γ < 1

Down-weighting example, γ = 0.8
- For the doc on the previous slide, the tf of Hamlet is multiplied by …, and that of Yorick is multiplied by …, in any structural term rooted at …

The number of structural terms
- Can be huge
- Alright, how huge, really?
- Impractical to build a vector index with so many dimensions
- Will examine pragmatic solutions to this shortly; for now, continue to believe …

Structural terms: docs and queries
- The notion of structural terms is independent of any schema/DTD for the XML documents
- Well-suited to a heterogeneous collection of XML documents
- Each document becomes a vector in the space of structural terms
- A query tree can likewise be factored into structural terms
  - And represented as a vector
  - Allows weighting portions of the query

Example
[Figure: weighted query tree; no labels survive]

Weight propagation
- The assignment of the weights 0.6 and 0.4 in the previous example to subtrees was …
  - Can be more …
  - Think of it as generated by an application, not necessarily an end-user
- Queries, documents become normalized vectors
- Retrieval score computation is "just" a matter of cosine similarity computation

Restrict structural terms?
- Depending on the application, we may restrict the structural terms
- E.g., may never want to return a Title node, only Book or Play nodes
- So don't enumerate/index/retrieve/score structural terms rooted at some nodes

The catch
- This is all very promising, but …
- How big is this vector space?
- Can be exponentially large in the size of the document
- Cannot hope to build such an index
- And in any case, still fails to answer queries like …

Two solutions
- Query-time materialization of subtrees
- Restrict the kinds of subtrees to a manageable set

Query-time materialization
- Instead of enumerating all structural terms of all docs (and the query), enumerate only for the query
  - The latter is hopefully a small set
- Now, we're reduced to checking which structural term(s) from the query match a subtree of any document
- This is tree pattern matching: given a text tree and a pattern tree, find matches
  - Except we have many text trees
  - Our trees are labeled and …

Example
- Here we seek a doc with Hamlet in the title
- On finding the match we compute the cosine similarity score
- After all matches are found, rank by sorting
[Figure: text tree and query tree; surviving labels: "Hamlet", "Query", "Hamlet"]

(Still …)
- A doc with Yorick somewhere in it
- Will get to it
[Figure: query tree; surviving label: "Query"]

Restricting the subtrees
- Enumerating all structural terms (subtrees) is prohibitive, for indexing
- Most subtrees may never be used in processing any query
- Can we get away with indexing a restricted class of subtrees?
  - Ideally, focus on subtrees likely to arise in queries

JuruXML (IBM …)
- Only paths including a lexicon term
- In this example there are only 14 (why?) such paths
- Thus we have 14 structural terms in the index
[Figure: document tree; surviving text: "Hamlet", "To be or not to be"]
- Why is this far more manageable?
- How big can the index be as a function of the …?

- Could have used other subtrees, e.g., all subtrees with two siblings under a node
- Which subtrees get used: depends on the likely queries in the application
- Could be specified at index time: an area with little research so far
[Figure: subtrees with two siblings; surviving labels: "Microsoft", "2", "Microsoft"]

- Why would this be any different from just paths?
- Because we preserve more of the structure that a query may seek
[Figure: surviving label: "Microsoft"]

Descendants
- Return to the descendant examples
- No known …
- Query seeks Gates under …

Handling descendants in the vector space
- Devise a match function that yields a score in [0,1] between structural terms
- E.g., when the structural terms are paths, measure overlap
- The greater the overlap, the higher the match score
  - Can adjust match for where the overlap occurs

How do we use this in retrieval?
- First enumerate structural terms in the query
- Measure each for match against the dictionary of structural terms
  - Just like a postings lookup, except not Boolean (does the term exist?)
  - Instead, produce a score that says "80% close to this structural term", etc.
- Then, retrieve docs with that structural term, compute cosine similarities, etc.

Example of a retrieval step
[Figure: match scores between query and index structural terms; ST = structural term]
- Now rank the Docs by cosine similarity; e.g., Doc9 scores 0.578.

Closing technicalities
- But what exactly is a Doc?
- In a sense, an entire corpus can be viewed as an XML document

What are the Docs in the index?
- Anything we are prepared to return as an answer
- Could be nodes, some of their children, …

What are the limits?
- What are queries we can't handle using vector spaces?
- Find figures that describe the Corba architecture and the paragraphs that refer to those figures
  - Requires JOIN between 2 tables
- Retrieve the titles of articles published in the Special Feature section of the journal IEEE Micro
  - Depends on order of sibling nodes

Can we do IDF?
- Yes, but it doesn't make sense to do it corpus-wide
- Can do it, for instance, within all text under a certain element name, say Chapter
- Yields a tf-idf weight for each lexicon term under an element
- Issue: how do we propagate contributions to higher-level nodes?

Example
- Say Gates has high IDF under the Author element
- How should it be tf-idf weighted for the Book element?
- Should we use the idf for Gates in Author or that in Book?

XQuery
- SQL for XML
- Usage scenarios
  - Human-readable documents
  - Data-oriented documents
  - Mixed documents (e.g., patient records)
- Relies on … XML Schema …
- Turing complete
- XQuery is still a working draft

XQuery
- The principal forms of XQuery expressions:
  - path
  - element
  - FLWR ("flower")
  - list
  - datatype
- Evaluated with respect to a context

FLWR

    FOR $p IN document("bib.xml")//publisher
    LET $b := document("bib.xml")//book[publisher = $p]
    WHERE count($b) > 100
    RETURN $p

- FOR generates an ordered list of bindings of publisher names to $p
- LET associates to each binding a further binding of the list of book elements with that publisher to $b
  - At this stage, we have an ordered list of tuples of bindings
- WHERE filters that list to retain only the desired tuples
- RETURN constructs for each tuple a resulting value

Queries supported by XQuery
- Location/position ("chapter …")
- Simple …: /play/title contains …
- Path …: title contains …; /play//title contains …
- Complex …: employees with two …
- Subsumes: …
- What about relevance ranking?

How XQuery makes ranking difficult
- All documents in set A must be ranked above all documents in set B.
- Fragments must be ordered in depth-first, left-to-right order.

XQuery: Order By clause

    for $d in …
    let $e := document("emps.xml")//emp[deptno = $d]
    where count($e) >= 10
    order by avg($e/salary) descending
    return <big-dept>{ $d, …

XQuery: Order By clause
- The order by clause only allows ordering by … criterion
  - Say, by an attribute value
- Relevance ranking
  - Is often …
  - Can't be expressed easily as a function of the set to be ranked
  - Is better abstracted out of query formulation (cf. …)

XIRQL
- University of …
- Goal: open source XML search engine
- "Returnable" fragments are special
  - E.g., don't return a <bold>some text</bold> fragment
- Structured Document Retrieval Principle
- Empower users who don't know the schema
  - Enable search for any person no matter how the schema encodes the data
  - Don't worry about …

Atomic units
- Specified in the schema
- Only atomic units can be returned as the result of a search (unless a unit is specified)
- Tf.idf weighting is applied to atomic units
- Probabilistic combination of "evidence" from atomic units

XIRQL
- Structured Document Retrieval Principle: a system should always retrieve the most specific part of a document answering a query.
- Example query: …
- Document fragment: <chapter> 0.3 … <section> 0.8 XQL 0.7 syntax …
- Return the section, not the chapter

Augmentation weights
- Ensure that the Structured Document Retrieval Principle is respected.
- Assume different query conditions are disjoint events -> independence.
- P(chapter, XQL) = P(XQL|chapter) + P(section|chapter) * P(XQL|section) - P(XQL|chapter) * P(section|chapter) * P(XQL|section) = 0.3 + 0.6*0.8 - 0.3*0.6*0.8 = 0.636
- Section (0.8) ranked ahead of chapter (0.636)

Datatypes
- Example: …
- Assign all elements and attributes with person semantics to this datatype
- Allow the user to search for … without specifying …

XIRQL: summary
- Relevance …
- Fragment/context …
- Datatypes
- Semantic …

Native XML database
- Uses the XML document as logical unit
- Should support …, PCDATA (parsed character data), document …
- Contrast with
  - DB modified for XML
  - Generic IR system modified for XML

XML indexing and search
- Most native XML databases have taken a DB approach
  - No IR-type relevance ranking
- Only a few focus on relevance ranking

Data vs. text-centric XML
- Data-centric XML: used for messaging between enterprise applications
  - Mainly a recasting of relational data
- Content-centric XML: used for annotating content
  - Rich in text
  - Demands good integration of text retrieval functionality
  - E.g., find me the ISBN #s of Books with at least three Chapters discussing cocoa production, ranked by Price

Data structures for XML retrieval
- A very basic introduction

Data structures for XML retrieval
- What are the primitives we need?
- Inverted index: give me all elements matching text query Q
  - We know how to do this: treat each element as a document
- Give me all elements (immediately) below any instance of the Book element
- Combination of the above

Parent/child links
- Number each element
- Maintain a list of parent-child relationships
  - E.g., Chapter:21 …
- Enables immediate parent
- But what about "the word Hamlet under a Scene element under a Play element"?

General positional indexes
- View the XML document as a text document
- Build a positional index for each element
  - Mark the beginning and end for each element, …

Positional containment
[Figure: positional postings; surviving caption: "droppeth under Verse under Play"]

Summary of data structures
- Path containment etc. can essentially be solved by positional inverted indexes
- Retrieval consists of "merging" postings
- All the compression tricks etc. from 276A are still applicable
- Complications arise from insertion/deletion of elements, text within elements
  - Beyond the scope of this course

INEX: a benchmark for text-centric XML retrieval

INEX
- Benchmark for the evaluation of XML retrieval
  - Analog of TREC (recall …)
- Consists of:
  - Set of XML documents
  - Collection of retrieval tasks

INEX
- Each engine indexes docs
- Engine team converts retrieval tasks into queries
  - In the XML query language understood by the engine
- In response, the engine retrieves not docs, but elements within docs
  - Engine ranks retrieved elements

INEX assessment
- For each query, each retrieved element is human-assessed on two measures:
  - Relevance: how relevant is the retrieved element?
  - Coverage: is the retrieved element too specific, too general, or just right?
    - E.g., if the query seeks a definition of the Fast Fourier Transform, do I get the equation (too specific), the chapter containing the definition (too general), or the definition itself?
- These assessments are turned into composite precision/recall measures

INEX corpus
- 12,107 articles from IEEE Computer Society publications
- 494 …
- Average article: 1,532 XML nodes
- Average node depth = …

INEX topics
- Each topic is an information need, one of two kinds:
  - Content Only (CO): free text queries
  - Content and Structure (CAS): structural constraints, e.g., containment conditions

Sample INEX CO topic
<Title> computational biology
<Keywords> computational biology, bioinformatics, genome, genomics, proteomics, sequencing, protein folding
<Description> Challenges that arise, and approaches being explored, in the interdisciplinary field of computational biology
<Narrative> To be relevant, a document/component must either …
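
The structural-term slides (paths that include a lexicon term, tf weights, down-weighting by a factor per level, cosine scoring) can be illustrated with a small sketch. It is not the JuruXML implementation. The function names `structural_terms` and `cosine`, the `Tag/Tag/word` string encoding, and the two sample documents are assumptions made for illustration. A path is taken to start at any enclosing element and end at a word, which is one reading that reproduces the count of 14 quoted for the Hamlet example. The damping factor `gamma` is applied once per level between the root of the path and the element that directly holds the word.

```python
import math
import xml.etree.ElementTree as ET
from collections import defaultdict


def structural_terms(xml_text, gamma=1.0):
    """Map each path-shaped structural term to its (optionally damped) tf weight.

    Only text directly inside an element (node.text) is read; tail text is
    ignored to keep the sketch short.
    """
    root = ET.fromstring(xml_text)
    weights = defaultdict(float)

    def walk(node, ancestors):
        path = ancestors + [node.tag]
        for word in (node.text or "").lower().split():
            # One term per enclosing element chosen as the root of the path.
            for start in range(len(path)):
                levels_up = len(path) - 1 - start  # distance from word's element to root
                term = "/".join(path[start:] + [word])
                weights[term] += gamma ** levels_up
        for child in node:
            walk(child, path)

    walk(root, [])
    return dict(weights)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


hamlet = "<Play><Title>Hamlet</Title><Act><Scene>To be or not to be</Scene></Act></Play>"
other = "<Play><Title>Macbeth</Title><Act><Scene>Hamlet</Scene></Act></Play>"
query = "<Title>Hamlet</Title>"

doc_vec = structural_terms(hamlet)
other_vec = structural_terms(other)
query_vec = structural_terms(query)
damped = structural_terms(hamlet, gamma=0.8)

print(len(doc_vec))                                # number of distinct path terms
print(cosine(query_vec, doc_vec), cosine(query_vec, other_vec))
```

The query "Hamlet in the title" scores above zero only against the first document, because the second contains Hamlet under Scene and so shares no structural term with the query.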
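
The slides on handling descendants ask for a match function that returns a score in [0,1] between structural terms and suggest measuring overlap when the terms are paths. The slides do not define the function, so the following is only one possible choice: the longest common subsequence of the element names divided by the longer path length, with a hard requirement that the final word is identical. `path_match`, `lcs_length`, and the sample dictionary are hypothetical names and data.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two lists."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[len(a)][len(b)]


def path_match(query_term, index_term):
    """Score in [0, 1]: how closely two 'Tag/Tag/word' path terms overlap."""
    *q_tags, q_word = query_term.split("/")
    *d_tags, d_word = index_term.split("/")
    if q_word != d_word:
        return 0.0          # different lexicon word: no match at all
    if not q_tags and not d_tags:
        return 1.0          # two bare words that agree
    return lcs_length(q_tags, d_tags) / max(len(q_tags), len(d_tags))


# Non-Boolean "postings lookup": score the query term against every dictionary entry.
dictionary = ["Book/Author/gates", "Book/Title/gates", "Author/gates", "Book/Title/microsoft"]
scores = {term: path_match("Book/gates", term) for term in dictionary}
for term, score in scores.items():
    print(f"{term}: {score:.2f}")
```

A query for Gates under Book then partially matches both `Book/Author/gates` and `Book/Title/gates`, and the matched structural terms can be fed into the usual cosine computation with the match score as an extra factor.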
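
The augmentation-weight arithmetic from the XIRQL slides (0.3 + 0.6*0.8 - 0.3*0.6*0.8 = 0.636) can be checked with a few lines. The function name `augmented_score` and its parameter names are illustrative; the numbers are the ones on the slide, with 0.6 read as the weight applied when evidence is propagated from the section up to the chapter.

```python
def augmented_score(p_parent, augmentation, p_child):
    """Combine a parent's own evidence with damped evidence from a child,
    treating the two as independent: P(A or B) = P(A) + P(B) - P(A) * P(B)."""
    propagated = augmentation * p_child
    return p_parent + propagated - p_parent * propagated


chapter_score = augmented_score(p_parent=0.3, augmentation=0.6, p_child=0.8)
section_score = 0.8  # the section's own weight for the query term

ranking = sorted(
    [("chapter", chapter_score), ("section", section_score)],
    key=lambda pair: pair[1],
    reverse=True,
)
print(round(chapter_score, 3), ranking[0][0])
```

Because the augmentation weight is below 1, the chapter's combined score stays under the section's own score, so the more specific element is returned first.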
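
The positional-index slides (view the document as text, mark the beginning and end of each element, answer "droppeth under Verse under Play") can be sketched as below. This is a toy version under stated assumptions: positions are assigned by a single counter shared by tags and words, containment is tested by comparing position intervals, and the query means "somewhere under", not "immediately under". The names `build_positional_index` and `contained_in` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def build_positional_index(xml_text):
    """Return (spans, positions): tag -> [(begin, end)], word -> [position]."""
    root = ET.fromstring(xml_text)
    spans = defaultdict(list)
    positions = defaultdict(list)
    counter = 0

    def next_pos():
        nonlocal counter
        counter += 1
        return counter

    def add_words(text):
        for word in (text or "").lower().split():
            positions[word].append(next_pos())

    def walk(node):
        begin = next_pos()              # position of the start tag
        add_words(node.text)
        for child in node:
            walk(child)
            add_words(child.tail)       # text that follows the child element
        spans[node.tag].append((begin, next_pos()))  # position of the end tag

    walk(root)
    return spans, positions


def contained_in(word, tags, spans, positions):
    """Positions of `word` nested under tags[0], then tags[1], ... (outermost first)."""
    hits = []
    for pos in positions.get(word, []):
        lo, hi = 0, float("inf")
        for tag in tags:
            enclosing = [(b, e) for b, e in spans.get(tag, []) if lo < b < pos < e < hi]
            if not enclosing:
                break
            lo, hi = min(enclosing)     # keep the outermost candidate span
        else:
            hits.append(pos)
    return hits


doc = ("<Play><Title>Merchant</Title><Scene>"
       "<Verse>it droppeth as the gentle rain</Verse>"
       "<Prose>droppeth again</Prose></Scene></Play>")
spans, positions = build_positional_index(doc)
print(contained_in("droppeth", ["Play", "Verse"], spans, positions))
```

Only the occurrence inside Verse satisfies the nested condition; the occurrence inside Prose is still found by the looser query `["Play"]`.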