Missing Value Treatment

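A problem that comes up constantly in data analysis is missing value treatment. Missing values cannot simply be ignored, especially when they occur in important feature variables. For example, when fitting a regression model, any row with an NA (Not Available) in one of its features is dropped entirely, so a great deal of information can be lost.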

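As a rule of thumb, we first check what share of the data is missing. If the share is very small, the affected rows can simply be deleted. If more than 5% of a feature variable's values are missing, however, some form of missing value treatment is needed.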
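As a rough illustration of that first check, the share of missing values per column can be computed in base R. The data frame `toy` and its column names are made up for this sketch, and the 5% cut-off is simply the rule of thumb quoted above.

```r
# Hypothetical toy data; the column names are illustrative only.
toy <- data.frame(
  x = c(1, NA, 3, 4, 5, 6, 7, 8, 9, 10),
  y = c(NA, NA, 3, 4, 5, 6, 7, 8, 9, 10),
  z = 1:10
)

# is.na() gives a logical matrix; its column means are the missing shares.
missing_share <- colMeans(is.na(toy))
print(missing_share)

# Columns above the 5% rule of thumb are candidates for treatment.
needs_treatment <- names(missing_share)[missing_share > 0.05]
print(needs_treatment)
```

The same two lines work unchanged on a real data frame such as BostonHousing once missing values have been inserted.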

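This article introduces several commonly used missing value treatments, each suited to different situations:

- Deleting rows with missing values
- Deleting the feature variable
- Imputing with the mean / median / mode
- Prediction
  1. kNN (k-Nearest Neighbours)
  2. rpart (decision trees)
  3. MICE (Multivariate Imputation by Chained Equations)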

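Preparing the dataset

We will use the BostonHousing dataset from the mlbench package to demonstrate missing value treatment. Because the original BostonHousing data contains no missing values, we insert some at random. At the end we compare the imputed values with the actual values to evaluate how well each method performs.

First, we load the dataset and randomly set 40 rows of each of the variables rad and ptratio to NA.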

data("BostonHousing", package = "mlbench")
original <- BostonHousing

set.seed(100)
BostonHousing[sample(1:nrow(BostonHousing),40),"rad"] <- NA
BostonHousing[sample(1:nrow(BostonHousing),40), "ptratio"] <- NA
head(BostonHousing)

Inspect the first few rows of the data after the random insertion of missing values.

> head(BostonHousing)
crim zn indus chas nox rm age dis rad tax ptratio b lstat medv
1 0.00632 18 2.31 0 0.538 6.58 65.2 4.09 1 296 15.3 397 4.98 24.0
2 0.02731 0 7.07 0 0.469 6.42 78.9 4.97 2 242 17.8 397 9.14 21.6
3 0.02729 0 7.07 0 0.469 7.18 61.1 4.97 2 242 17.8 393 4.03 34.7
4 0.03237 0 2.18 0 0.458 7.00 45.8 6.06 3 222 18.7 395 2.94 33.4
5 0.06905 0 2.18 0 0.458 7.15 54.2 6.06 3 222 18.7 397 5.33 36.2
6 0.02985 0 2.18 0 0.458 6.43 58.7 6.06 3 222 18.7 394 5.21 28.7

Next, we use the mice package to inspect the pattern of missing values in the data.

install.packages("mice")  # only needed once
library(mice)
md.pattern(BostonHousing) # missing data pattern
# Each row of the result is one missingness pattern (1 = observed, 0 = missing);
# the row names give the number of rows that share that pattern.

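1. Deleting rows

If the dataset is large enough, and the event to be predicted is sufficiently well represented in the training data, rows with missing values can simply be ignored. This is done by setting na.action = na.omit when building the model. Make sure that (1) enough data remains that the model does not lose predictive power, and (2) no bias is introduced, i.e. the proportion of the target event is not changed by deleting the incomplete rows.

# example 1
lm(medv ~ ptratio + rad, data = BostonHousing, na.action = na.omit) # na.omit is already the default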
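The two conditions above can be checked before any rows are deleted. The following is a minimal base R sketch on simulated data (the data frame `toy` and its columns are assumptions of this example, not part of BostonHousing). It counts the rows that listwise deletion would drop and compares the target before and after.

```r
set.seed(1)
# Hypothetical data: 200 rows, 10 of which lose their feature value.
toy <- data.frame(target = rnorm(200), feature = rnorm(200))
toy$feature[sample(nrow(toy), 10)] <- NA

# Rows that would survive listwise deletion.
kept <- complete.cases(toy)
n_dropped <- sum(!kept)

# Compare the target before and after deletion to spot obvious bias.
mean_before <- mean(toy$target)
mean_after <- mean(toy$target[kept])
print(c(dropped = n_dropped, before = mean_before, after = mean_after))

# lm() with na.action = na.omit fits on the complete rows only.
fit <- lm(target ~ feature, data = toy, na.action = na.omit)
n_used <- nobs(fit)
print(n_used)
```

If the two means differ noticeably, or too few rows remain, one of the imputation methods below is probably the safer choice.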

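2. Deleting the feature variable

If one feature variable has far more missing values than the others, and removing it would rescue many rows, consider dropping that variable, unless domain experience says it is important. This is a trade-off between the importance of the variable and the number of rows lost to its missing values.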
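To put a number on that trade-off, one can count how many complete rows exist with and without the candidate variable. This is a small sketch on an invented data frame; `toy` and the column `b` are hypothetical.

```r
# Hypothetical data in which column b is mostly missing.
toy <- data.frame(
  a = c(1, 2, 3, 4, 5, 6),
  b = c(NA, NA, NA, 4, 5, NA),
  c = c(1, NA, 3, 4, 5, 6)
)

# Complete rows when every column is required.
complete_with_b <- sum(complete.cases(toy))

# Complete rows once column b has been dropped.
without_b <- toy[, setdiff(names(toy), "b"), drop = FALSE]
complete_without_b <- sum(complete.cases(without_b))

rows_rescued <- complete_without_b - complete_with_b
print(c(with_b = complete_with_b, without_b = complete_without_b,
        rescued = rows_rescued))
```

Whether the rescued rows are worth losing the variable remains a judgement about how important that variable is.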

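3. Imputing with the mean / median / mode

Filling missing values with the mean, median or mode is a crude approach. Whether it is acceptable depends on the situation: if the variable has low variance, or has little influence on the target variable, this rough approximation is acceptable and may even give good results.

# Fill with impute() from the Hmisc package
library(Hmisc)
impute(BostonHousing$ptratio, mean)  # replace with mean
impute(BostonHousing$ptratio, median)  # median
impute(BostonHousing$ptratio, 20)  # replace specific number
# Or fill manually
BostonHousing$ptratio[is.na(BostonHousing$ptratio)] <- mean(BostonHousing$ptratio, na.rm = TRUE)  # not run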
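The examples above cover the mean and the median but not the mode, for which base R has no built-in statistical function. One possible helper is sketched below; the name `impute_mode` is hypothetical, and ties are broken by taking the value that appears first.

```r
# Fill NAs with the most frequent observed value (illustrative helper).
impute_mode <- function(x) {
  observed <- x[!is.na(x)]
  unique_values <- unique(observed)
  # Count how often each distinct value occurs.
  counts <- tabulate(match(observed, unique_values))
  x[is.na(x)] <- unique_values[which.max(counts)]
  x
}

# Works for numeric vectors ...
filled_numeric <- impute_mode(c(4, 24, NA, 24, 5, NA))
print(filled_numeric)

# ... and for factors, which is where the mode is most useful.
filled_factor <- impute_mode(factor(c("a", "b", NA, "b")))
print(filled_factor)
```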

Let us compute the accuracy of mean imputation.

library(DMwR)
actuals <- original$ptratio[is.na(BostonHousing$ptratio)]
predicteds <- rep(mean(BostonHousing$ptratio, na.rm = TRUE), length(actuals))
regr.eval(trues = actuals,preds = predicteds)

# > regr.eval(trues = actuals,preds = predicteds)
#   mae    mse   rmse   mape 
#   1.6232 4.1931 2.0477 0.0955

Several measures of forecast accuracy are reported here:

MAPE is the mean absolute percentage error.

$$MAPE=\frac{1}{n}\sum_{t=1}^n\left|\frac{y_{t}-\hat{y_{t}}}{y_{t}}\right|$$

where $\hat{y_{t}}$ is the predicted value and $y_{t}$ is the actual value.

RMSE is the root mean square error.

$$RMSE=\sqrt{\frac{\sum_{t=1}^n(\hat{y_{t}}-y_{t})^2}{n}}$$

MSE is the mean square error, i.e. the average of the squared errors.

$$MSE=\frac{\sum_{t=1}^n(\hat{y_{t}}-y_{t})^2}{n}$$

MAE is the mean absolute error.

$$MAE=\frac{\sum_{t=1}^n\left|\hat{y_{t}}-y_{t}\right|}{n}$$

We can see that the MAPE of mean imputation is 0.0955.
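For readers who want to see these four measures without a package, they can be written directly in base R. The function name `forecast_errors` and the small input vectors are made up for this sketch. It is meant to mirror the definitions of MAE, MSE, RMSE and MAPE, not to reproduce the internals of regr.eval().

```r
# Compute MAE, MSE, RMSE and MAPE from actual and predicted values.
forecast_errors <- function(actual, predicted) {
  err <- predicted - actual
  c(
    mae  = mean(abs(err)),
    mse  = mean(err^2),
    rmse = sqrt(mean(err^2)),
    mape = mean(abs(err / actual))  # undefined if any actual value is 0
  )
}

# Hypothetical values: every missing entry was filled with 18.
actual <- c(20, 18, 15, 21)
predicted <- rep(18, length(actual))
result <- forecast_errors(actual, predicted)
print(round(result, 4))
```

MAPE is scale-free, which is why it is the measure used for the method comparisons in this article.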

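4. Prediction

4-1. kNN (k-Nearest Neighbours)

The knnImputation function in the DMwR package estimates missing values from the k nearest data points. Put simply, for each observation with a missing value it uses Euclidean distance to find the k closest observations and fills the gap with their distance-weighted average. The advantage is that a single call fills the missing values in every feature variable at once. You do not need to name the variables to impute, because the function takes the whole dataset as its argument. Be careful, though, not to include the target variable among the predictors: in a test or production setting the target is unknown, so it cannot be used to predict missing feature values.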

library(DMwR)
# Remove the target variable, then impute with kNN
knnOutput <- knnImputation(data = BostonHousing[,!names(BostonHousing)%in% 'medv'])
anyNA(knnOutput)

# > anyNA(knnOutput)
#   [1] FALSE
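To make the idea of a distance-weighted average of neighbours concrete, here is a simplified base R sketch for a single numeric column. It is not the DMwR implementation: the function name `knn_impute_column`, the exp(-distance) weighting and the requirement that all other columns are complete and numeric are assumptions of this example.

```r
# Distance-weighted kNN imputation for one numeric column (illustrative only).
knn_impute_column <- function(data, target, k = 3) {
  predictors <- setdiff(names(data), target)
  # Standardise predictors so no single variable dominates the distance.
  scaled <- scale(as.matrix(data[, predictors, drop = FALSE]))
  missing_rows <- which(is.na(data[[target]]))
  donor_rows <- which(!is.na(data[[target]]))
  for (i in missing_rows) {
    # Euclidean distance from row i to every row with an observed target.
    diffs <- sweep(scaled[donor_rows, , drop = FALSE], 2, scaled[i, ])
    dists <- sqrt(rowSums(diffs^2))
    nearest <- order(dists)[seq_len(k)]
    weights <- exp(-dists[nearest])  # closer neighbours weigh more
    donors <- data[donor_rows[nearest], target]
    data[i, target] <- sum(weights * donors) / sum(weights)
  }
  data
}

# Hypothetical data: row 2 sits midway between rows 1 and 3.
toy <- data.frame(
  x1 = c(1, 2, 3, 10, 11),
  x2 = c(1, 2, 3, 10, 11),
  y  = c(10, NA, 12, 50, 52)
)
filled <- knn_impute_column(toy, "y", k = 2)
print(filled)
```

Because rows 1 and 3 are equally far from row 2, the imputed value lands halfway between their y values.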

Compute the accuracy of the kNN method.

actuals <- original$ptratio[is.na(BostonHousing$ptratio)]
predicteds <- knnOutput[is.na(BostonHousing$ptratio), "ptratio"]
regr.eval(actuals, predicteds)

#        mae        mse       rmse       mape 
# 1.00188715 1.97910183 1.40680554 0.05859526

The MAPE is 0.059 (< 0.0955), an improvement (reduction) of about 38% in error over mean imputation.

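4-2. Decision trees (rpart)

A limitation of the kNN method is that it is not appropriate when the variable with missing values is a factor (categorical variable). The functions rpart() and mice() can handle this case. An advantage of rpart() is that it only needs one of the predictor variables to be non-missing. (More on missing values in decision trees: Tree Surrogate in CART)

Let us use rpart() to predict the missing values.

If the variable to be predicted is categorical, set method = "class".
If the variable to be predicted is numeric, set method = "anova".
As before, take care not to use the target variable as a predictor.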

library(rpart)

# Model for the categorical variable
class_mod <- rpart(formula = rad ~ . -medv, data = BostonHousing[!is.na(BostonHousing$rad),], method = "class", na.action = na.omit)
# Model for the numeric variable
anova_mod <- rpart(formula = ptratio ~ . -medv, data = BostonHousing[!is.na(BostonHousing$ptratio),], method = "anova", na.action = na.omit)

rad_predict <- predict(object = class_mod,newdata = BostonHousing[is.na(BostonHousing$rad),])
ptratio_predict <- predict(object = anova_mod, newdata = BostonHousing[is.na(BostonHousing$ptratio),])

Compute the accuracy of the rpart decision tree method for ptratio:

actuals <- original$ptratio[is.na(BostonHousing$ptratio)]
predicteds <- ptratio_predict
regr.eval(trues = actuals, preds = predicteds)

#        mae        mse       rmse       mape 
# 0.71061673 0.99693845 0.99846805 0.04099908

Compared with the kNN method, the MAPE improved (fell) by 30% (from 0.059 to 0.041).

Compute the accuracy of the rpart decision tree method for rad:

actuals <- original$rad[is.na(BostonHousing$rad)]
predicteds <- as.numeric(colnames(rad_predict)[apply(rad_predict, 1, which.max)])
mean(actuals != predicteds)
# > mean(actuals != predicteds)
#   [1] 0.25
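The step that turns the class-probability matrix into predicted labels may be easier to follow on a tiny example. The matrix `prob` below is invented to resemble what predict() returns for a method = "class" tree, with one column per class. The values are not taken from the BostonHousing model.

```r
# Hypothetical class probabilities: rows are observations, columns are classes.
prob <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("obs1", "obs2"), c("4", "5", "24"))
)

# For each row, pick the column with the highest probability,
# then convert that column name back to a number.
best_column <- apply(prob, 1, which.max)
predicted <- as.numeric(colnames(prob)[best_column])

# Misclassification rate against made-up true labels.
actual <- c(4, 5)
error_rate <- mean(actual != predicted)
print(predicted)
print(error_rate)
```

The error rate is simply the share of rows where the most probable class differs from the true one, which is what the 0.25 above measures for rad.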

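4-3. MICE (Multivariate Imputation by Chained Equations)

MICE, short for Multivariate Imputation by Chained Equations, is an R package for more advanced handling of missing values.
Its approach is a little unusual in that it handles missing values in two stages: first mice() builds the imputation models, then complete() produces the completed dataset.
mice(df) generates several differently imputed versions of the dataset, and complete() uses one of them to complete the data, the first by default.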

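MICE has the following features:

- It generates multiple imputations for multivariate data (the argument m of mice(), the number of multiple imputations, sets how many sets of imputed values are produced; the default is 5).
- The multiple imputations for multivariate data are generated by Gibbs sampling.
- MICE imputes by Fully Conditional Specification: each incomplete variable has its missing values predicted by its own separate model.
- MICE can handle continuous, binary, unordered categorical and ordered categorical variables.
- When imputing a target column, MICE by default uses all the other columns as predictors. If a predictor is itself incomplete, its most recently generated imputations are used to complete it before the target column is imputed.
- A different univariate imputation model can be specified for each column, for example: mice(data, meth = c("sample", "pmm", "logreg", "norm")).
- A univariate imputation model can be a built-in method or a user-defined one (a function named mice.impute.myfunc, selected with mice(method = c("xxx", "myfunc"))).
- Commonly used built-in imputation methods include:
  - pmm (Predictive mean matching) : any variable type
  - cart (Classification and regression trees) : any variable type
  - rf (Random forest imputations) : any variable type
  - logreg (Logistic regression) : binary variables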
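The chained-equations idea, where each incomplete variable gets its own model and the models are cycled, can be sketched in base R. This is a deliberately simplified, deterministic version: it plugs in lm() predictions, whereas mice() draws imputations at random and repeats the process m times. The data frame `toy` and all variable names are assumptions of this example.

```r
set.seed(42)
# Hypothetical data: two correlated variables, each with 10 missing values.
n <- 100
a <- rnorm(n)
b <- 2 * a + rnorm(n, sd = 0.5)
toy <- data.frame(a = a, b = b)
toy$a[1:10] <- NA
toy$b[11:20] <- NA
miss <- is.na(toy)  # remember where the gaps were

# Step 1: start from a crude fill (column means).
filled <- toy
for (v in names(filled)) {
  filled[[v]][miss[, v]] <- mean(toy[[v]], na.rm = TRUE)
}

# Step 2: cycle through the variables, re-predicting each one's gaps
# from the current values of the others.
for (iter in 1:5) {
  for (v in names(filled)) {
    others <- setdiff(names(filled), v)
    fit <- lm(reformulate(others, response = v),
              data = filled[!miss[, v], ])
    filled[[v]][miss[, v]] <- predict(fit, newdata = filled[miss[, v], ])
  }
}
print(head(filled, 3))
```

Only the originally missing cells are ever overwritten. Observed values stay untouched throughout the iterations.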

Main functions in the mice package:

- md.pattern(): inspect the missing data pattern
- mice(): impute the missing values *m* times
- with(): analyse the completed datasets; performs a computation on each of the imputed datasets, with(data, expr = formula, ...), i.e. applies the analysis to each of the m sets of imputed values
- pool(): combine parameter estimates; pools the results of the repeated analyses, combining the estimates, standard errors (std.error) and p-values of the models fitted to the m imputed datasets
- pool.compare(): pool.compare(fit1, fit0, method = c("wald", "likelihood"), data = NULL); compares two model formulas fitted to all the imputed datasets, using the "wald" or "likelihood" method
- complete(): export imputed data; stores and returns one of the completed datasets
- ampute(): generate missing data; produces simulated incomplete data
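As a hedged illustration of how these functions fit together, the sketch below runs the impute, analyse and pool sequence on nhanes, a small example dataset shipped with mice. The model formula bmi ~ age and the seed are arbitrary choices for this example, and the mice package must be installed.

```r
library(mice)

# Impute the nhanes example data five times (printFlag = FALSE hides the log).
imp <- mice(nhanes, m = 5, seed = 1, printFlag = FALSE)

# Fit the same regression on each of the five completed datasets.
fit <- with(imp, lm(bmi ~ age))

# Combine the five sets of estimates into one pooled result.
pooled <- pool(fit)
print(summary(pooled))

# Extract the first completed dataset for inspection.
first_completed <- complete(imp, 1)
print(anyNA(first_completed))
```

Pooling reflects the uncertainty of the imputations in the standard errors, which is lost when only a single completed dataset from complete() is analysed.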

Let us see how to use mice() to predict ptratio and rad.

library(mice)
miceMod <- mice(data = BostonHousing[,!names(BostonHousing) %in% "medv"],
                method = "rf", # one method for all columns, or a vector with one method per column; a single string is applied to every block
                m = 5, # number of imputed datasets to generate (default 5)
                maxit = 5 # number of iterations (default 5)
                ) # perform mice imputation, based on random forests.
miceOutput <- complete(miceMod, action = 1) # extract one completed dataset; action = 1 (the default) selects the first imputation

anyNA(miceOutput)
# [1] FALSE

Compute the accuracy of MICE for ptratio:

actuals <- original$ptratio[is.na(BostonHousing$ptratio)]
predicteds <- miceOutput[is.na(BostonHousing$ptratio),"ptratio"]
regr.eval(actuals,predicteds)

#        mae        mse       rmse       mape 
# 0.35000000 0.77700000 0.88147603 0.01965896

With MICE, the MAPE for ptratio improved (fell) by about 52% compared with the rpart decision tree (from 0.041 to 0.020).

Compute the accuracy of MICE for rad:

actuals <- original$rad[is.na(BostonHousing$rad)]
predicteds <- miceOutput[is.na(BostonHousing$rad),"rad"]
mean(actuals != predicteds) 
# [1] 0.225

With MICE, the error for rad improved (fell) by 10% compared with the rpart decision tree (from 0.25 to 0.225).

Further reading on missing value treatment:
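The relative improvements quoted for ptratio can be rechecked from the MAPE values reported in the outputs above. The short sketch below computes, for each method, the percentage reduction in MAPE relative to the previous method. The vector names are chosen only for this example.

```r
# MAPE for ptratio as reported in the outputs above.
mape <- c(mean = 0.0955, knn = 0.05859526, rpart = 0.04099908, mice = 0.01965896)

# Reduction relative to the preceding method: 1 - new / old.
relative_reduction <- 1 - mape[-1] / mape[-length(mape)]
pct <- round(100 * relative_reduction, 1)
print(pct)
```

Note that the reduction is one minus the ratio of the new error to the old one, not the ratio itself.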