R and Data Mining (R code for learning decision trees and random forests)

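Data Mining Report: Analysis of Breast Cancer

Abstract

The purpose of this experiment is to study classification: breast tumors are classified as benign or malignant, and several methods are compared to see which performs better. The data set contains 699 observations of 11 variables each and has missing values. The analysis was carried out in R and SAS. The methods used in R are typical data mining methods; in SAS, discriminant analysis and cluster analysis were used. The class labels are already given in the raw data, so the methods are compared by how accurately they recover those labels.

Keywords: data mining methods, discriminant analysis, clustering

I. Data description

a) There are 699 observations and 11 variables.
b) Variables:

    "id"
    "clump_thickness"              thickness (density) of the clump, values 1-10
    "uniformity_cell_size"         uniformity of cell size, values 1-10
    "uniformity_cell_shape"        uniformity of cell shape, values 1-10
    "marginal_adhesion"            adhesion at the margins, values 1-10
    "single_epithelialcell_size"   single epithelial cell size, values 1-10
    "bare_nuclei"                  bare nuclei, values 1-10
    "bland_chromatin"              chromatin, values 1-10
    "normal_nucleoli"              normal nucleoli, values 1-10
    "mitoses"                      mitoses, values 1-10
    "btype"                        type: 2 = benign, 4 = malignant

c) There are 16 missing values, all in "bare_nuclei".
d) Three treatments of the missing values were tried: listwise deletion, mean imputation, and median imputation.
e) All methods below use the median-imputed data as their base data set.

II. Methods used in R

Five methods are used: decision tree, neural network, support vector machine, random forest, and nearest neighbor.

A) Basic data processing

1) Read the txt data and make btype a factor

    breast_cancer <- read.delim("breast_cancer.txt")
    breast_cancer$btype <- factor(breast_cancer$btype)

2) Show the rows that contain the 16 missing values

    which(!complete.cases(breast_cancer))
    [1]  24  41 140 146 159 165 236 250 276 293 295 298 316 322 412 618

3) Handling the missing values

a) Listwise deletion

    breast_cancer_delete <- na.omit(breast_cancer)

b) Mean imputation

    breast_cancer_mean <- breast_cancer
    for (r in which(!complete.cases(breast_cancer))) {
        na_cols <- which(is.na(breast_cancer[r, ]))    # columns missing in row r
        breast_cancer_mean[r, na_cols] <-
            apply(data.frame(breast_cancer[, na_cols]), 2, mean, na.rm = TRUE)
    }

c) Median imputation

    breast_cancer_median <- breast_cancer
    for (r in which(!complete.cases(breast_cancer))) {
        na_cols <- which(is.na(breast_cancer[r, ]))    # columns missing in row r
        breast_cancer_median[r, na_cols] <-
            apply(data.frame(breast_cancer[, na_cols]), 2, median, na.rm = TRUE)
    }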
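The row-by-row imputation loops can also be written column by column. The following is a minimal sketch in base R, not the report's own code: the helper name `impute_median` and the toy data frame `toy` (with made-up values) are assumptions introduced only for illustration.

```r
# Toy stand-in for breast_cancer; the values are invented for illustration.
toy <- data.frame(
  id          = 1:6,
  bare_nuclei = c(1, NA, 10, 3, NA, 2),
  mitoses     = c(1, 1, 2, 1, 3, 1)
)

# Hypothetical helper: fill the NAs of every numeric column
# with that column's median over the observed values.
impute_median <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]]) && anyNA(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- median(df[[col]], na.rm = TRUE)
    }
  }
  df
}

toy_median <- impute_median(toy)
print(toy_median)
```

Swapping `median` for `mean` inside the helper gives the mean-imputation variant. Columns without missing values are left untouched.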

B) Methods

1) Classification tree. Packages used: rpart, rpart.plot

a) Fit the model on the median-imputed data and compute the misclassification rate

    # Classification tree; install the rpart package first
    library(rpart)
    set.seed(100)
    breast.part <- rpart(factor(btype) ~ ., data = breast_cancer_median, method = "class")
    tab <- table(predict(breast.part, breast_cancer_median, type = "class"),
                 breast_cancer_median$btype)
    # Misclassification rate on the training data
    pError <- 1 - sum(diag(tab)) / nrow(breast_cancer_median)
    cat("Misclassification rate pError:", "\n", pError, "\n")

Result:

    Misclassification rate pError:
     0.03576538

    # Plots; install the rpart.plot package first
    library(rpart.plot)
    rpart.plot(breast.part)                  # draw the fitted tree
    plotcp(breast.part, minline = TRUE)      # cross-validated error against tree size
    plot(breast.part, uniform = TRUE, branch = 0.4, compress = TRUE)
    text(breast.part, use.n = TRUE)          # tree annotated with class counts
    printcp(breast.part)                     # complexity parameter table of the tree

Complexity parameter table:

             CP nsplit rel error  xerror     xstd
    1  0.780083      0   1.00000 1.00000 0.052142
    2  0.053942      1   0.21992 0.26141 0.031415
    3  0.024896      2   0.16598 0.18672 0.026924
    4  0.012448      3   0.14108 0.17427 0.026071
    5  0.010000      6   0.10373 0.17427 0.026071

Annotations in the source: cross-validation error rate (xerror), number of leaf nodes minus one (nsplit), prediction error.

Error criterion: see the references.

    # Pruning
    breast.part2 <- prune(breast.part, cp = 0.016)
    rpart.plot(breast.part2)                 # tree after pruning
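The report does not spell out its error criterion, so the following is only a sketch of two common ways to read a cp table, applied to the numbers printed by `printcp`. The data frame `cp_table` is retyped from that output; the variable names `best`, `threshold`, and `one_se` are assumptions made for this example.

```r
# Complexity parameter table, copied from the printcp() output in the report.
cp_table <- data.frame(
  CP        = c(0.780083, 0.053942, 0.024896, 0.012448, 0.010000),
  nsplit    = c(0, 1, 2, 3, 6),
  rel_error = c(1.00000, 0.21992, 0.16598, 0.14108, 0.10373),
  xerror    = c(1.00000, 0.26141, 0.18672, 0.17427, 0.17427),
  xstd      = c(0.052142, 0.031415, 0.026924, 0.026071, 0.026071)
)

# Rule 1: smallest tree that attains the minimum cross-validated error.
best <- which.min(cp_table$xerror)      # which.min returns the first minimum

# Rule 2 (one-standard-error rule): smallest tree whose xerror is within
# one xstd of the minimum.
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se    <- which(cp_table$xerror <= threshold)[1]

cat("minimum-xerror tree: ", cp_table$nsplit[best],   "splits\n")
cat("one-SE-rule tree:    ", cp_table$nsplit[one_se], "splits\n")
```

On these numbers the minimum-xerror rule selects the 3-split tree and the one-standard-error rule the 2-split tree. The value cp = 0.016 used for pruning lies between 0.012448 and 0.024896, which corresponds to the 3-split row of the table.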

b) Cross-validation: because the data set is not very large (699 observations), 3-fold cross-validation is used.

    n <- 699
    zz2 <- rep(1:3, ceiling(n / 3))[1:n]
    set.seed(100)
    zz2 <- sample(zz2, n)                    # fold label (1-3) for every row
    nmse1 <- numeric(3)                      # training error per fold
    nmse2 <- numeric(3)                      # test error per fold
    dat <- breast_cancer_median
    for (i in 1:3) {
        data.train <- dat[zz2 != i, ]
        data.test  <- dat[zz2 == i, ]
        d.train <- rpart(factor(btype) ~ ., data = data.train, method = "class")
        table1 <- table(predict(d.train, data.train, type = "class"), data.train$btype)
        table2 <- table(predict(d.train, data.test,  type = "class"), data.test$btype)
        nmse1[i] <- 1 - sum(diag(table1)) / nrow(data.train)
        nmse2[i] <- 1 - sum(diag(table2)) / sum(table2)
        cat("rpart method, fold", i, ":", "\n")
        cat("Training error:", nmse1[i], "\n")
        cat("Test error:", nmse2[i], "\n", "\n")
    }
    NMSE1 <- sum(nmse1) / 3
    NMSE2 <- sum(nmse2) / 3
    cat("rpart method, mean training error:", "\n", NMSE1, "\n")
    cat("rpart method, mean test error:", "\n", NMSE2, "\n")

Results:

    rpart method, fold 1 :
    Training error: 0.04935622
    Test error: 0.05579399

    rpart method, fold 2 :
    Training error: 0.03433476
    Test error: 0.05150215

    rpart method, fold 3 :
    Training error: 0.04077253
    Test error: 0.04291845

    rpart method, mean training error: 0.04148784
    rpart method, mean test error: 0.05007153
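The same three-fold loop is repeated for every method in this report. One way to avoid the repetition is a small helper that takes the fitting and prediction steps as functions. This is a sketch under stated assumptions: `make_folds`, `cv_error`, and the majority-class baseline used to exercise them are hypothetical names and do not appear in the report.

```r
# Assign each of n rows to one of k folds of (almost) equal size.
make_folds <- function(n, k = 3, seed = 100) {
  set.seed(seed)
  sample(rep(seq_len(k), length.out = n))
}

# Test-set misclassification rate per fold for an arbitrary classifier.
#   fit_fun(train)        -> fitted model
#   pred_fun(model, test) -> predicted class labels
cv_error <- function(data, target, fit_fun, pred_fun, k = 3, seed = 100) {
  folds <- make_folds(nrow(data), k, seed)
  sapply(seq_len(k), function(i) {
    train <- data[folds != i, , drop = FALSE]
    test  <- data[folds == i, , drop = FALSE]
    model <- fit_fun(train)
    mean(pred_fun(model, test) != test[[target]])
  })
}

# Demonstration with a trivial majority-class "model" on made-up data.
toy <- data.frame(x = 1:12, btype = factor(rep(c(2, 2, 4), 4)))
fit_majority  <- function(train) names(which.max(table(train$btype)))
pred_majority <- function(model, test) rep(model, nrow(test))

errs <- cv_error(toy, "btype", fit_majority, pred_majority)
print(errs)
print(table(make_folds(699)))   # fold sizes for the 699 observations
```

To use it with the tree above, one would pass `function(d) rpart(btype ~ ., data = d, method = "class")` as `fit_fun` and `function(m, d) predict(m, d, type = "class")` as `pred_fun`.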

2) Neural network. Package used: nnet

a) Fit the model on the median-imputed data and compute the misclassification rate

    # Install the nnet package first
    library(nnet)
    a <- nnet(factor(btype) ~ ., data = breast_cancer_median,
              size = 6, rang = 0.1, decay = 5e-4, maxit = 1000)
    a.predict <- predict(a, newdata = breast_cancer_median, type = "class")
    tab <- table(a.predict, breast_cancer_median$btype)
    # Misclassification rate on the training data
    pError <- 1 - sum(diag(tab)) / nrow(breast_cancer_median)
    cat("nnet misclassification rate pError:", "\n", pError, "\n")

The result shows that every observation is classified correctly:

    a.predict   2   4
            2 458   0
            4   0 241
    nnet misclassification rate pError: 0

b) 3-fold cross-validation

    n <- 699
    zz2 <- rep(1:3, ceiling(n / 3))[1:n]
    set.seed(100)
    zz2 <- sample(zz2, n)                    # fold label (1-3) for every row
    nmse1 <- numeric(3)                      # training error per fold
    nmse2 <- numeric(3)                      # test error per fold
    dat <- breast_cancer_median
    for (i in 1:3) {
        data.train <- dat[zz2 != i, ]
        data.test  <- dat[zz2 == i, ]
        d.train <- nnet(factor(btype) ~ ., data = data.train,
                        size = 6, rang = 0.1, decay = 5e-4, maxit = 1000)
        table1 <- table(predict(d.train, data.train, type = "class"), data.train$btype)
        table2 <- table(predict(d.train, data.test,  type = "class"), data.test$btype)
        nmse1[i] <- 1 - sum(diag(table1)) / nrow(data.train)
        nmse2[i] <- 1 - sum(diag(table2)) / sum(table2)
        cat("\n", "nnet method, fold", i, ":", "\n")
        cat("Fold", i, "training error:", "\n", nmse1[i], "\n")
        cat("Fold", i, "test error:", "\n", nmse2[i], "\n", "\n")
    }
    NMSE1 <- sum(nmse1) / 3
    NMSE2 <- sum(nmse2) / 3
    cat("nnet method, mean training error:", "\n", NMSE1, "\n")
    cat("nnet method, mean test error:", "\n", NMSE2, "\n")

Results:

    nnet method, fold 1 :
    Fold 1 training error: 0
    Fold 1 test error: 0.05150215

    nnet method, fold 2 :
    Fold 2 training error: 0.002145923
    Fold 2 test error: 0.08583691

    nnet method, fold 3 :
    Fold 3 training error: 0
    Fold 3 test error: 0.03862661

    nnet method, mean training error: 0.0007153076
    nnet method, mean test error: 0.05865522

3) Support vector machine. Packages used: e1071, ggplot2

a) The ggplot2 package has powerful plotting facilities and can write figures to PDF and many other formats. Any variable can be plotted to get a visual impression of which variables affect the classification most; only one example is given here.

    # Histogram with bare_nuclei on the horizontal axis and frequency on the
    # vertical axis; install the ggplot2 package first
    library(ggplot2)                         # load the plotting package
    c6 <- qplot(bare_nuclei, data = breast_cancer_median, colour = factor(btype))
    ggsave("svm.pdf", plot = c6, width = 6, height = 6)   # save the figure as PDF

b) Fit the model on the median-imputed data and compute the misclassification rate

    # Install the e1071 package first
    library(e1071)
    s <- svm(factor(btype) ~ ., data = breast_cancer_median)
    pre <- predict(s, breast_cancer_median)
    plot(pre ~ breast_cancer_median$btype)
    tab <- table(pre, breast_cancer_median$btype)
    tab
    # Misclassification rate on the training data
    error <- 1 - sum(diag(tab)) / nrow(breast_cancer_median)
    cat("SVM misclassification rate error:", "\n", error, "\n")

Result:

    pre   2   4
      2 446   5
      4  12 236
    SVM misclassification rate error: 0.02432046
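The overall error rate hides how the errors split between the two classes. As a worked example, the SVM confusion table reported above (rows are predictions, columns are the true class) can be broken down as follows. The names `conf`, `sensitivity`, and `specificity` are introduced here for illustration, and treating class 4 (malignant) as the positive class is an assumption of this sketch.

```r
# SVM confusion table from the report: rows = predicted, columns = actual.
conf <- matrix(c(446, 12, 5, 236), nrow = 2,
               dimnames = list(predicted = c("2", "4"), actual = c("2", "4")))

error_rate  <- 1 - sum(diag(conf)) / sum(conf)
sensitivity <- conf["4", "4"] / sum(conf[, "4"])   # malignant cases flagged as malignant
specificity <- conf["2", "2"] / sum(conf[, "2"])   # benign cases flagged as benign

print(round(c(error = error_rate,
              sensitivity = sensitivity,
              specificity = specificity), 4))
```

The error rate reproduces the reported 0.02432046 (17 of 699). Of the 241 malignant cases, 5 are predicted benign (sensitivity about 0.979); of the 458 benign cases, 12 are predicted malignant (specificity about 0.974).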

c) 3-fold cross-validation

    n <- 699
    zz2 <- rep(1:3, ceiling(n / 3))[1:n]
    set.seed(100)
    zz2 <- sample(zz2, n)                    # fold label (1-3) for every row
    nmse1 <- numeric(3)                      # training error per fold
    nmse2 <- numeric(3)                      # test error per fold
    dat <- breast_cancer_median
    for (i in 1:3) {
        data.train <- dat[zz2 != i, ]
        data.test  <- dat[zz2 == i, ]
        d.train <- svm(factor(btype) ~ ., data = data.train)
        table1 <- table(predict(d.train, data.train), data.train$btype)
        table2 <- table(predict(d.train, data.test),  data.test$btype)
        nmse1[i] <- 1 - sum(diag(table1)) / nrow(data.train)
        nmse2[i] <- 1 - sum(diag(table2)) / sum(table2)
        cat("\n", "svm method, fold", i, ":", "\n")
        cat("Training error:", "\n", nmse1[i], "\n")
        cat("Test error:", "\n", nmse2[i], "\n", "\n")
    }
    NMSE1 <- sum(nmse1) / 3
    NMSE2 <- sum(nmse2) / 3
    cat("svm method, mean training error:", "\n", NMSE1, "\n")
    cat("svm method, mean test error:", "\n", NMSE2, "\n")

Results:

    svm method, fold 1 :
    Training error: 0.02575107
    Test error: 0.03433476

    svm method, fold 2 :
    Training error: 0.027897
    Test error: 0.04291845

    svm method, fold 3 :
    Training error: 0.02145923
    Test error: 0.03862661

    svm method, mean training error: 0.02503577
    svm method, mean test error: 0.03862661

4) Random forest. Package used: randomForest

a) Fit the model on the median-imputed data and report the variable importance

    # Install the randomForest package first
    library(randomForest)
    r.breast <- randomForest(factor(btype) ~ ., data = breast_cancer_median,
                             ntree = 2000, importance = TRUE, replace = TRUE,
                             keep.inbag = TRUE, norm.votes = FALSE,
                             oob.times = TRUE, proximity = TRUE)
    r.breast
    summary(r.breast)
    imp <- importance(r.breast)
    imp
    impvar <- imp[order(imp[, 3], decreasing = TRUE), ]   # sort by MeanDecreaseAccuracy
    impvar
    varImpPlot(r.breast)
    getTree(r.breast, k = 1, labelVar = FALSE)

Results:

    Confusion matrix:
        2   4 class.error
    2 446  12  0.02620087
    4   9 232  0.03734440

Variable importance (sorted by MeanDecreaseAccuracy):

                                       2         4 MeanDecreaseAccuracy
    bare_nuclei                2.3098674 3.9670653            2.2676480
    uniformity_cell_size       1.8742531 2.7070189            1.9318088
    clump_thickness            1.8751611 3.5029239            1.9280315
    bland_chromatin            1.1095789 3.0180401            1.8088237
    uniformity_cell_shape      1.2252772 2.8570378            1.7930345
    normal_nucleoli            1.6026017 1.8652728            1.4735789
    marginal_adhesion          0.9889558 2.0822943            1.3521871
    single_epithelialcell_size 1.1464565 1.0688383            1.0714372
    mitoses                    1.0574280 0.9529673            0.9765416

[Figure: variable importance plot produced by varImpPlot]

b) 3-fold cross-validation

    n <- 699
    zz2 <- rep(1:3, ceiling(n / 3))[1:n]
    set.seed(100)
    zz2 <- sample(zz2, n)                    # fold label (1-3) for every row
    nmse1 <- numeric(3)                      # training error per fold
    nmse2 <- numeric(3)                      # test error per fold
    dat <- breast_cancer_median
    for (i in 1:3) {
        data.train <- dat[zz2 != i, ]
        data.test  <- dat[zz2 == i, ]
        d.train <- randomForest(factor(btype) ~ ., data = data.train,
                                ntree = 2000, importance = TRUE, replace = TRUE,
                                keep.inbag = TRUE, norm.votes = FALSE,
                                oob.times = TRUE, proximity = TRUE)
        table1 <- table(predict(d.train, data.train, type = "class"), data.train$btype)
        table2 <- table(predict(d.train, data.test,  type = "class"), data.test$btype)
        nmse1[i] <- 1 - sum(diag(table1)) / nrow(data.train)
        nmse2[i] <- 1 - sum(diag(table2)) / sum(table2)
        cat("\n", "randomForest method, fold", i, ":", "\n")
        cat("Training error:", "\n", nmse1[i], "\n")
        cat("Test error:", "\n", nmse2[i], "\n", "\n")
    }
    NMSE1 <- sum(nmse1) / 3
    NMSE2 <- sum(nmse2) / 3
    cat("randomForest method, mean training error:", "\n", NMSE1, "\n")
    cat("randomForest method, mean test error:", "\n", NMSE2, "\n")

Results:

    randomForest method, fold 1 :
    Training error: 0
    Test error: 0.02575107

    randomForest method, fold 2 :
    Training error: 0
    Test error: 0.04291845

    randomForest method, fold 3 :
    Training error: 0
    Test error: 0.03433476

    randomForest method, mean training error: 0
    randomForest method, mean test error: 0.03433476
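The four sets of cross-validation results reported so far can be put side by side. The sketch below only re-tabulates the per-fold error rates printed in the report; the object names `cv` and `summary_tab` and the derived `gap` column (mean test error minus mean training error) are additions for illustration.

```r
# Per-fold error rates copied from the results reported for each method.
cv <- list(
  rpart        = list(train = c(0.04935622, 0.03433476, 0.04077253),
                      test  = c(0.05579399, 0.05150215, 0.04291845)),
  nnet         = list(train = c(0, 0.002145923, 0),
                      test  = c(0.05150215, 0.08583691, 0.03862661)),
  svm          = list(train = c(0.02575107, 0.027897, 0.02145923),
                      test  = c(0.03433476, 0.04291845, 0.03862661)),
  randomForest = list(train = c(0, 0, 0),
                      test  = c(0.02575107, 0.04291845, 0.03433476))
)

summary_tab <- data.frame(
  method = names(cv),
  train  = sapply(cv, function(m) mean(m$train)),
  test   = sapply(cv, function(m) mean(m$test)),
  row.names = NULL
)
summary_tab$gap <- summary_tab$test - summary_tab$train   # rough overfitting indicator
summary_tab <- summary_tab[order(summary_tab$test), ]     # best test error first

print(summary_tab, digits = 4)
```

On these figures the random forest has the lowest mean test error, followed by the SVM, the classification tree, and the neural network. The neural network and the random forest both fit the training folds almost perfectly, so their training error says little about how they generalize. With only three folds of 233 observations each, the differences between methods amount to a handful of cases and should be read with caution.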

5) Nearest-neighbor method. Package used: kknn

a) Choose k with a loop, taking the best accuracy on the test set of a single fold as the criterion (the code below holds out fold 3).

    library(igraph); library(kknn)
    corr <- rep(NA_real_, 100)               # accuracy for each k
    m <- 699
    zz2 <- rep(1:3, ceiling(m / 3))[1:m]
    set.seed(100)
    zz2 <- sample(zz2, m)
    data.test  <- breast_cancer_median[zz2 == 3, ]
    data.train <- breast_cancer_median[zz2 != 3, ]
    for (i in 7:100) {
        a <- kknn(factor(btype) ~ ., test = data.test, train = data.train, k = i)
        tab <- table(data.test$btype, a$fit)
        corr[i] <- sum(diag(tab)) / sum(tab)
    }
    plot(7:100, corr[7:100], xlim = c(6, 100), ylim = c(0.5, 1),
         xlab = "k", ylab = "Accuracy")
    identify(7:100, corr[7:100])             # click points to read off k

The result shows that accuracy is highest at k = 11.

b) Cross-validation results

    # 3-fold cross-validation with k = 11
    n <- 699
    zz2 <- rep(1:3, ceiling(n / 3))[1:n]
    set.seed(100)
    zz2 <- sample(zz2, n)                    # randomly split the sample into three parts; zz2 marks the part
    dat <- breast_cancer_median              # shorter name for the data
    for (i in 1:3) {
        data.train <- dat[zz2 != i, ]        # training set
        data.test  <- dat[zz2 == i, ]        # test set
        # ... (the rest of the loop is cut off in the source)
    }
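The k-selection loop in a) can be illustrated without any add-on package. The sketch below is not kknn: it swaps in a plain majority-vote nearest-neighbor rule on Euclidean distance (kknn uses kernel-weighted votes), and it runs on simulated two-cluster data rather than the breast cancer file. The function `knn_predict` and all data in it are hypothetical.

```r
# Plain majority-vote kNN on Euclidean distance (a stand-in, not kknn).
knn_predict <- function(train_x, train_y, test_x, k) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))     # distance to every training row
    nearest <- order(d)[seq_len(k)]              # indices of the k closest rows
    names(which.max(table(train_y[nearest])))    # most frequent label among them
  })
}

# Simulated data: two clusters labelled "2" and "4" (made up for illustration).
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 2), ncol = 2),
           matrix(rnorm(40, mean = 8), ncol = 2))
y <- factor(rep(c("2", "4"), each = 20))

test_idx <- seq(1, 40, by = 4)                   # hold out every fourth row
ks <- seq(1, 15, by = 2)                         # odd k avoids tied votes

acc <- sapply(ks, function(k) {
  pred <- knn_predict(x[-test_idx, ], y[-test_idx], x[test_idx, ], k)
  mean(pred == y[test_idx])
})
best_k <- ks[which.max(acc)]                     # first k with the highest accuracy
print(data.frame(k = ks, accuracy = acc))
```

Choosing k on a fold that is later reused for evaluation tends to give an optimistic error estimate. A separate validation split, or an inner cross-validation, is a common way to avoid that.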
