
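Recommender System Setup: A Complete Illustrated Guide

I. Recommender System Architecture Overview

Overall recommendation architecture diagram:

1. The pipeline starts with data processing. By default, data is imported incrementally every day from a relational database into Hive, where intermediate tables and calls to Python files turn it into the input data for mathematical modeling. Because this walkthrough is only a simulation, a single Scala file generates all of the prepared data, which is then loaded directly into Hive for processing.
2. Once the data is processed, modeling begins. The recommend.scala file calls the logistic regression algorithm and produces the model files. Copy the three model files to the corresponding directory of the dubbox project, start the project, and test it by visiting it.

The whole process assumes that a Hive environment and an IntelliJ IDEA environment are already in place and that Scala files can be executed.

The flow is as follows:

A Scala file generates the data → load it into Hive and process it → recommond.scala calls the logistic regression algorithm to compute the model and generate the model files → copy the model files to the designated project directory and run the project → test by visiting it in a browser

II. Data Preprocessing

1. Create the test data

The DataGenerator class creates the data; see the attached DataGenerator.scala file. It takes two arguments, the number of records and the output directory, for example:

    100000 E:\推薦系統\資料\hitop

It writes three files.

2. Create the Hive tables

A real production scenario involves fields from roughly fifty tables. The process is simplified here by giving the final three tables directly:

    App dictionary table
    User download history table
    Positive/negative sample table

Table creation statements:

App dictionary table:

    CREATE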
    EXTERNAL TABLE IF NOT EXISTS dim_rcm_hitop_id_list_ds (
        hitop_id STRING, name STRING, author STRING, sversion STRING,
        ischarge SMALLINT, designer STRING, font STRING, icon_count INT,
        stars DOUBLE, price INT, file_size INT, comment_num INT,
        screen STRING, dlnum INT
    )
    row format delimited fields terminated by '\t';

User download history table:

    CREATE EXTERNAL TABLE IF NOT EXISTS dw_rcm_hitop_userapps_dm (
        device_id STRING, devid_applist STRING, device_name STRING,
        pay_ability STRING
    )
    row format delimited fields terminated by '\t';

Positive/negative sample table:

    CREATE EXTERNAL TABLE IF NOT EXISTS dw_rcm_hitop_sample2learn_dm (
        label STRING, device_id STRING, hitop_id STRING, screen STRING,
        en_name STRING, ch_name STRING, author STRING, sversion STRING,
        mnc STRING, event_local_time STRING, interface STRING, designer STRING,
        is_safe INT, icon_count INT, update_time STRING, stars DOUBLE,
        comment_num INT, font STRING, price INT, file_size INT,
        ischarge SMALLINT, dlnum INT
    )
    row format delimited fields terminated by '\t';

3. Load the data

Load data into each of the three tables.

App dictionary table:

    load data local inpath '/opt/sxt/recommender/script/applist.txt' into table dim_rcm_hitop_id_list_ds;

User download history table:

    load data local inpath '/opt/sxt/recommender/script/userdownload.txt' into table dw_rcm_hitop_userapps_dm;

Positive/negative sample table:

    load data local inpath '/opt/sxt/recommender/script/sample.txt' into table dw_rcm_hitop_sample2learn_dm;

4. Build the training data

1. Create the temporary tables

    CREATE TABLE IF NOT EXISTS tmp_dw_rcm_hitop_prepare2train_dm (
        device_id STRING, label STRING, hitop_id STRING, screen STRING,
        ch_name STRING, author STRING, sversion STRING, mnc STRING,
        interface STRING, designer STRING, is_safe INT, icon_count INT,
        update_date STRING, stars DOUBLE, comment_num INT, font STRING,
        price INT, file_size INT, ischarge SMALLINT, dlnum INT,
        idlist STRING, device_name STRING, pay_ability STRING
    )
    row format delimited fields terminated by '\t';

    CREATE TABLE IF NOT EXISTS dw_rcm_hitop_prepare2train_dm (
        label STRING, features STRING
    )
    row format delimited fields terminated by '\t';

2. Training data preprocessing

First, load the data from the positive/negative sample table and the user download history table into the temporary table. The subquery on the sample table also buckets several raw values (mnc, icon_count, comment_num, file_size, dlnum) into a small set of discrete levels:

    INSERT OVERWRITE TABLE tmp_dw_rcm_hitop_prepare2train_dm
    SELECT
        t2.device_id, t2.label, t2.hitop_id, t2.screen, t2.ch_name, t2.author,
        t2.sversion, t2.mnc, t2.interface, t2.designer, t2.is_safe, t2.icon_count,
        to_date(t2.update_time), t2.stars, t2.comment_num, t2.font, t2.price,
        t2.file_size, t2.ischarge, t2.dlnum, t1.devid_applist, t1.device_name,
        t1.pay_ability
    FROM (
        SELECT device_id, devid_applist, device_name, pay_ability
        FROM dw_rcm_hitop_userapps_dm
    ) t1
    RIGHT OUTER JOIN (
        SELECT
            device_id, label, hitop_id, screen, ch_name, author, sversion,
            IF (mnc IN ('00','01','02','03','04','05','06','07'), mnc, 'x') AS mnc,
            interface, designer, is_safe,
            IF (icon_count <= 5, icon_count, 6) AS icon_count,
            update_time, stars,
            IF (comment_num IS NULL, 0,
            IF (comment_num <= 10, comment_num, 11)) AS comment_num,
            font, price,
            IF (file_size <= 2*1024*1024, 2,
            IF (file_size <= 4*1024*1024, 4,
            IF (file_size <= 6*1024*1024, 6,
            IF (file_size <= 8*1024*1024, 8,
            IF (file_size <= 10*1024*1024, 10,
            IF (file_size <= 12*1024*1024, 12,
            IF (file_size <= 14*1024*1024, 14,
            IF (file_size <= 16*1024*1024, 16,
            IF (file_size <= 18*1024*1024, 18,
            IF (file_size <= 20*1024*1024, 20, 21)))))))))) AS file_size,
            ischarge,
            IF (dlnum IS NULL, 0,
            IF (dlnum <= 50, 50,
            IF (dlnum <= 100, 100,
            IF (dlnum <= 500, 500,
            IF (dlnum <= 1000, 1000,
            IF (dlnum <= 5000, 5000,
            IF (dlnum <= 10000, 10000,
            IF (dlnum <= 20000, 20000, 20001)))))))) AS dlnum
        FROM dw_rcm_hitop_sample2learn_dm
    ) t2
    ON (t1.device_id = t2.device_id);

Next, use a Python script to put the data into the training format. The Python script must first be loaded into Hive:

    ADD FILE /opt/sxt/recommender/script/dw_rcm_hitop_prepare2train_dm.py;

Run list files; to check whether the Python file has been loaded into Hive.

Python file: dw_rcm_hitop_prepare2train_dm.py

Call the Python script from a Hive statement:

    INSERT OVERWRITE TABLE dw_rcm_hitop_prepare2train_dm
    SELECT TRANSFORM (t.*)
    USING 'python dw_rcm_hitop_prepare2train_dm.py'
    AS (label, features)
    FROM (
        SELECT
            label, hitop_id, screen, ch_name, author, sversion, mnc, interface,
            designer, icon_count, update_date, stars, comment_num, font, price,
            file_size, ischarge, dlnum, idlist, device_name, pay_ability
        FROM tmp_dw_rcm_hitop_prepare2train_dm
    ) t;

3. Export the training data

Export the processed training data to serve as the source data for offline training:

    insert overwrite local directory '/opt/data/traindata' row format delimited fields terminated by '\t' select * from dw_rcm_hitop_prepare2train_dm;

Note: the data is exported to the local file system here so that it can later be run in local mode to produce the model data. This is only for ease of demonstration. In a real production environment, a script submits the Spark job directly, the data is read from HDFS, and the results stay on HDFS. An ETL tool then writes the trained model result files to the web project's file directory to serve as the new model. The web project is configured to refresh the model files on a schedule and reads the new model files at a fixed time every day.

III. Model Training

Feed the exported data as input to the recommend class; see the attached recommond.scala file. It takes four arguments: the Spark execution mode, the input data file path, the separator, and the output data path. Note that the separator here is a tab or a comma, because the separators in the source data are not uniform. The input file is the training data exported earlier, at the local Linux path /opt/data/traindata/000000_0.

For example:

    local E:/推薦系統/資料/hitop/000000_0 "\t|;" E:/推薦系統/資料/hitop/model.csv

The result file contains features and weights, as shown in the figure. The decimal weights are written in scientific notation.

IV. Using the Model Online

1. Copy the model files

Two issues need attention here: 1. all m
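
As a rough illustration of how a feature/weight file like the one produced in section III could be used for scoring, the sketch below loads such a file and computes a logistic regression probability. The two-column feature,weight layout, the sample feature names, the zero intercept, and the load_weights and score helpers are all assumptions made for illustration. This preview does not show the real file layout, the contents of the three model files, or the loading code in the dubbox project.

```python
import csv
import io
import math


def load_weights(lines, delimiter=","):
    """Parse 'feature<delimiter>weight' rows into a dict (assumed layout).

    float() accepts scientific notation such as 1.2E-1 directly.
    """
    weights = {}
    for row in csv.reader(lines, delimiter=delimiter):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        feature, weight = row
        weights[feature.strip()] = float(weight)
    return weights


def score(active_features, weights, intercept=0.0):
    """Logistic regression probability for a set of active (one-hot) features.

    Features missing from the model contribute a weight of 0.
    """
    z = intercept + sum(weights.get(f, 0.0) for f in active_features)
    # Numerically stable sigmoid: never exponentiate a large positive number.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical model rows; real feature names depend on the training script.
sample_model = io.StringIO("price*0,1.2E-1\nischarge*1,-3.4E-2\n")
weights = load_weights(sample_model)
probability = score(["price*0", "ischarge*1", "unseen*feature"], weights)
```

With a real model file, load_weights could be given an open file handle instead of the in-memory sample, and candidate apps could be ranked for a device by sorting on score.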

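
The TRANSFORM step in section II relies on dw_rcm_hitop_prepare2train_dm.py, whose contents are not reproduced in this preview. The following is only a sketch of what a Hive streaming script of that kind could look like, not the author's script. Hive passes each input row to the script as tab-separated text on standard input, with NULL written as \N, and reads tab-separated output rows back. The name*value feature encoding, the ";" joiner, and the helper names transform_line and run are assumptions.

```python
import io

# Column order assumed to match the SELECT that feeds TRANSFORM:
# the label first, then the feature columns.
COLUMNS = [
    "label", "hitop_id", "screen", "ch_name", "author", "sversion", "mnc",
    "interface", "designer", "icon_count", "update_date", "stars",
    "comment_num", "font", "price", "file_size", "ischarge", "dlnum",
    "idlist", "device_name", "pay_ability",
]


def transform_line(line):
    """Turn one tab-separated input row into 'label<TAB>features'."""
    values = line.rstrip("\n").split("\t")
    if len(values) != len(COLUMNS):
        return None  # skip rows that do not match the expected layout
    label = values[0]
    # Drop empty values and Hive's NULL marker (the literal string \N).
    features = [
        "{}*{}".format(name, value)
        for name, value in zip(COLUMNS[1:], values[1:])
        if value not in ("", "\\N")
    ]
    return "{}\t{}".format(label, ";".join(features))


def run(stdin, stdout):
    """Stream rows from stdin to stdout, one output row per valid input row."""
    for line in stdin:
        out = transform_line(line)
        if out is not None:
            stdout.write(out + "\n")


# Demo on an in-memory row instead of real Hive input.
demo_in = io.StringIO("\t".join(["1"] + ["v%d" % i for i in range(20)]) + "\n")
demo_out = io.StringIO()
run(demo_in, demo_out)
```

When used as an actual Hive script, the demo lines would be replaced by a call to run(sys.stdin, sys.stdout). Joining the features with ";" is chosen here only because it would split cleanly under the "\t|;" separator shown in the section III example.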