Logistic Regression 羅吉斯迴歸 | part1 - 資料探勘與處理

Logistic Regression, 羅吉斯回歸模型，適用於預測二元類別目標變數的發生機率(p)，和線性回歸模型類似，與線性回歸主要不同之處在於：(1) 目標變數是目標事件發生機率P經過log函數轉換成log odds值才進行線性預測，且(2)羅吉斯回歸的各項參數是透過最大概似法(MLE)進行估計的。

Logistic Regression 羅吉斯回歸模型簡介

線性迴歸模型是用來預測連續型變數，而羅吉斯回歸則是用來預測類別型變數。
羅吉斯回歸和線性回歸模型類似，只不過預測類別目標變數是經過log函數轉換才投入線性模型中的。羅吉斯回歸方程式將類別目標變數轉換為事件的log odds值，也就是$\log\lgroup\frac{P_{i}}{1-P_{i}}\rgroup$，來預測Z與預測變數間(X1~Xn)的線性關係。

羅吉斯回歸方程式：

$$Z_{i}=\log\lgroup\frac{P_{i}}{1-P_{i}}\rgroup=\beta_{0}+\beta_{1}*x_{1}+…+\beta_{n}*x_{n}$$

其中：

(1) $P_{i}$為事件發生的機率值。

(2) $\lgroup\frac{P_{i}}{1-P_{i}}\rgroup$為勝算比(Odds Ratio)。

在R語言中，上述模型可透過glm()函式，將參數設定為family=”binomial”來執行。

但因為我們真正關心的是模型預測的事件發生機率值Pi。所以，預測結果則透過plogis()函數將事件的log odds轉換為Pi。轉換公式如下：

$$P_{i}=1-\lgroup\frac{1}{1+e_{i}^z}\rgroup$$

羅吉斯回歸透過log函數轉換，產生了一個臨界遞增的S型函數，適用於分析機率模型。
而不同於線性迴歸，羅吉斯回歸分析的各項參數係數，是透過最大概似法(MLE)進行估計。

分析資料與問題

Problem: 預測薪資大於50K(ABOVE50K)的影響特徵因素有哪些？

Data: adult dataset (snapshot)

目標變數(1)：ABOVE50K

預測變數(13)：AGE, WORKCLASS, EDUCATION, EDUCATIONNUM, MARITALSTATUS, OCCUPATION, RELATIONSHIP, RACE, SEX, CAPITALGAIN, CAPITALLOSS, HOURSPERWEEK, NATIVECOUNTRY

分析步驟

(part 1 篇會說明步驟1~5的部分)

資料載入與檢視
資料探勘
資料前處理
產生訓練資料集與測試資料集
計算特徵變數IV值(Information Value)，篩選變數
訓練模型與預測
模型診斷與調整
模型比較(v.s. Machine Learning Methods)

1. 資料載入與檢視

# download from google drive's shareable link:
# https://drive.google.com/file/d/1L71tumU33xmVa2EtxjTyrZ2F6CaXn9iR/view?usp=sharing
id <- "1L71tumU33xmVa2EtxjTyrZ2F6CaXn9iR" # google file ID
inputData <- read.csv(sprintf("https://docs.google.com/uc?id=%s&export=download", id))
# 摘要資料集
summary(inputData)

# AGE                    WORKCLASS         FNLWGT                EDUCATION    
# Min.   :17.00    Private         :22696   Min.   :  12285    HS-grad     :10501  
# 1st Qu.:28.00    Self-emp-not-inc: 2541   1st Qu.: 117827    Some-college: 7291  
# Median :37.00    Local-gov       : 2093   Median : 178356    Bachelors   : 5355  
# Mean   :38.58    ?               : 1836   Mean   : 189778    Masters     : 1723  
# 3rd Qu.:48.00    State-gov       : 1298   3rd Qu.: 237051    Assoc-voc   : 1382  
# Max.   :90.00    Self-emp-inc    : 1116   Max.   :1484705    11th        : 1175  
#                 (Other)          :  981                     (Other)      : 5134  
# EDUCATIONNUM                  MARITALSTATUS              OCCUPATION  
# Min.   : 1.00    Divorced             : 4443    Prof-specialty :4140  
# 1st Qu.: 9.00    Married-AF-spouse    :   23    Craft-repair   :4099  
# Median :10.00    Married-civ-spouse   :14976    Exec-managerial:4066  
# Mean   :10.08    Married-spouse-absent:  418    Adm-clerical   :3770  
# 3rd Qu.:12.00    Never-married        :10683    Sales          :3650  
# Max.   :16.00    Separated            : 1025    Other-service  :3295  
#                  Widowed              :  993   (Other)         :9541  
# RELATIONSHIP                    RACE            SEX       
# Husband       :13193    Amer-Indian-Eskimo:  311    Female:10771  
# Not-in-family : 8305    Asian-Pac-Islander: 1039    Male  :21790  
# Other-relative:  981    Black             : 3124                  
# Own-child     : 5068    Other             :  271                  
# Unmarried     : 3446    White             :27816                  
# Wife          : 1568                                              
# 
# CAPITALGAIN     CAPITALLOSS      HOURSPERWEEK          NATIVECOUNTRY  
# Min.   :    0   Min.   :   0.0   Min.   : 1.00    United-States:29170  
# 1st Qu.:    0   1st Qu.:   0.0   1st Qu.:40.00    Mexico       :  643  
# Median :    0   Median :   0.0   Median :40.00    ?            :  583  
# Mean   : 1078   Mean   :  87.3   Mean   :40.44    Philippines  :  198  
# 3rd Qu.:    0   3rd Qu.:   0.0   3rd Qu.:45.00    Germany      :  137  
# Max.   :99999   Max.   :4356.0   Max.   :99.00    Canada       :  121  
#                                                  (Other)       : 1709  
# ABOVE50K     
# Min.   :0.0000  
# 1st Qu.:0.0000  
# Median :0.0000  
# Mean   :0.2408  
# 3rd Qu.:0.0000  
# Max.   :1.0000

# download from google drive's shareable link:

# https://drive.google.com/file/d/1L71tumU33xmVa2EtxjTyrZ2F6CaXn9iR/view?usp=sharing

inputData

# 摘要資料集

summary(inputData)

# AGE WORKCLASS FNLWGT EDUCATION

# Min. :17.00 Private :22696 Min. : 12285 HS-grad :10501

# 1st Qu.:28.00 Self-emp-not-inc: 2541 1st Qu.: 117827 Some-college: 7291

# Median :37.00 Local-gov : 2093 Median : 178356 Bachelors : 5355

# Mean :38.58 ? : 1836 Mean : 189778 Masters : 1723

# 3rd Qu.:48.00 State-gov : 1298 3rd Qu.: 237051 Assoc-voc : 1382

# Max. :90.00 Self-emp-inc : 1116 Max. :1484705 11th : 1175

# (Other) : 981 (Other) : 5134

# EDUCATIONNUM MARITALSTATUS OCCUPATION

# Min. : 1.00 Divorced : 4443 Prof-specialty :4140

# 1st Qu.: 9.00 Married-AF-spouse : 23 Craft-repair :4099

# Median :10.00 Married-civ-spouse :14976 Exec-managerial:4066

# Mean :10.08 Married-spouse-absent: 418 Adm-clerical :3770

# 3rd Qu.:12.00 Never-married :10683 Sales :3650

# Max. :16.00 Separated : 1025 Other-service :3295

# Widowed : 993 (Other) :9541

# RELATIONSHIP RACE SEX

# Husband :13193 Amer-Indian-Eskimo: 311 Female:10771

# Not-in-family : 8305 Asian-Pac-Islander: 1039 Male :21790

# Other-relative: 981 Black : 3124

# Own-child : 5068 Other : 271

# Unmarried : 3446 White :27816

# Wife : 1568

# CAPITALGAIN CAPITALLOSS HOURSPERWEEK NATIVECOUNTRY

# Min. : 0 Min. : 0.0 Min. : 1.00 United-States:29170

# 1st Qu.: 0 1st Qu.: 0.0 1st Qu.:40.00 Mexico : 643

# Median : 0 Median : 0.0 Median :40.00 ? : 583

# Mean : 1078 Mean : 87.3 Mean :40.44 Philippines : 198

# 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.:45.00 Germany : 137

# Max. :99999 Max. :4356.0 Max. :99.00 Canada : 121

# (Other) : 1709

# ABOVE50K

# Min. :0.0000

# 1st Qu.:0.0000

# Median :0.0000

# Mean :0.2408

# 3rd Qu.:0.0000

# Max. :1.0000

由於我們的目標變數目前是數值0,1，為了後續分析，我們決定(1)ABOVE50K轉變成(<=50K, >50K)的factor變數，並(2)另外新增ABOVE50K_y(0,1)的factor變數。

# 將ABOVE50K原始0,1轉換成<=50K, > 50K的factor資料類型
inputData$ABOVE50K <- factor(x = inputData$ABOVE50K, levels = c(0,1), labels = c('<=50K', '>50K'))
# 產生新factor變數ABOVE50K_y：
library(dplyr)
inputData$ABOVE50K_y <- as.factor(case_when(inputData$ABOVE50K == "<=50K" ~ 0,
                                             inputData$ABOVE50K == ">50K" ~ 1))
# 確認轉換後的變數沒有問題
summary(inputData$ABOVE50K)
# <=50K  >50K 
# 24720  7841 
summary(inputData$ABOVE50K_y)
# 0     1 
# 24720  7841

# 將ABOVE50K原始0,1轉換成 50K的factor資料類型

inputData$ABOVE50K 50K'))

# 產生新factor變數ABOVE50K_y：

library(dplyr)

inputData$ABOVE50K_y

inputData$ABOVE50K == ">50K" ~ 1))

# 確認轉換後的變數沒有問題

summary(inputData$ABOVE50K)

# 50K

# 24720 7841

summary(inputData$ABOVE50K_y)

# 0 1

# 24720 7841

最後再確認一次資料集所有變數型態。

# 查看資料結構:資料變數、型態
str(inputData)

# 'data.frame':	32561 obs. of  16 variables:
# $ AGE          : int  39 50 38 53 28 37 49 52 31 42 ...
# $ WORKCLASS    : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...
# $ FNLWGT       : int  77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...
# $ EDUCATION    : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...
# $ EDUCATIONNUM : int  13 13 9 7 13 14 5 9 14 13 ...
# $ MARITALSTATUS: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...
# $ OCCUPATION   : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...
# $ RELATIONSHIP : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...
# $ RACE         : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...
# $ SEX          : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...
# $ CAPITALGAIN  : int  2174 0 0 0 0 0 0 0 14084 5178 ...
# $ CAPITALLOSS  : int  0 0 0 0 0 0 0 0 0 0 ...
# $ HOURSPERWEEK : int  40 13 40 40 40 40 16 45 50 40 ...
# $ NATIVECOUNTRY: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...
# $ ABOVE50K     : Factor w/ 2 levels "<=50K",">50K": 1 1 1 1 1 1 1 2 2 2 ...
# $ ABOVE50K_y   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 2 ...

# 查看資料結構:資料變數、型態

str(inputData)

# 'data.frame': 32561 obs. of 16 variables:

# $ AGE : int 39 50 38 53 28 37 49 52 31 42 ...

# $ WORKCLASS : Factor w/ 9 levels " ?"," Federal-gov",..: 8 7 5 5 5 5 5 7 5 5 ...

# $ FNLWGT : int 77516 83311 215646 234721 338409 284582 160187 209642 45781 159449 ...

# $ EDUCATION : Factor w/ 16 levels " 10th"," 11th",..: 10 10 12 2 10 13 7 12 13 10 ...

# $ EDUCATIONNUM : int 13 13 9 7 13 14 5 9 14 13 ...

# $ MARITALSTATUS: Factor w/ 7 levels " Divorced"," Married-AF-spouse",..: 5 3 1 3 3 3 4 3 5 3 ...

# $ OCCUPATION : Factor w/ 15 levels " ?"," Adm-clerical",..: 2 5 7 7 11 5 9 5 11 5 ...

# $ RELATIONSHIP : Factor w/ 6 levels " Husband"," Not-in-family",..: 2 1 2 1 6 6 2 1 2 1 ...

# $ RACE : Factor w/ 5 levels " Amer-Indian-Eskimo",..: 5 5 5 3 3 5 3 5 5 5 ...

# $ SEX : Factor w/ 2 levels " Female"," Male": 2 2 2 2 1 1 1 2 1 2 ...

# $ CAPITALGAIN : int 2174 0 0 0 0 0 0 0 14084 5178 ...

# $ CAPITALLOSS : int 0 0 0 0 0 0 0 0 0 0 ...

# $ HOURSPERWEEK : int 40 13 40 40 40 40 16 45 50 40 ...

# $ NATIVECOUNTRY: Factor w/ 42 levels " ?"," Cambodia",..: 40 40 40 40 6 40 24 40 40 40 ...

# $ ABOVE50K : Factor w/ 2 levels "50K": 1 1 1 1 1 1 1 2 2 2 ...

# $ ABOVE50K_y : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 2 2 ...

幾個光看變數名稱不足以了解的變數說明如下：
(1) fnlwgt : 代表final weight，即該代表個體在母體所佔比例
(2) educatoin_num：代表受教育的時間(年份)
為了簡化分析，我們將忽略fnlwgt每個代表個體在母體中所佔比重。

2. 資料探勘(Exploratory Data Analysis)

# 載入繪圖所需套件
library(ggplot2)
library(scales) # needed for labels=percent

# 載入繪圖所需套件

library(ggplot2)

library(scales) # needed for labels=percent

2-1. Age

查看年齡變數(AGE)各區間次數分佈與每區間薪資水準組成比例。

# 2-1. age
summary(inputData$AGE)
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 17.00   28.00   37.00   38.58   48.00   90.00 

gg <- 
  ggplot(data = inputData, aes(x = as.numeric(AGE))) +
  geom_histogram(color = "black",binwidth = 1, aes(group=ABOVE50K, fill = ABOVE50K)) +
  geom_vline(xintercept = mean(inputData$AGE[inputData$ABOVE50K == 1]), col = "blue", lty = 5, lwd = 0.8) +
  geom_vline(xintercept = mean(inputData$AGE[inputData$ABOVE50K == 0]), col = "black", lty = 5, lwd = 0.8)
gg

# 2-1. age

summary(inputData$AGE)

# Min. 1st Qu. Median Mean 3rd Qu. Max.

# 17.00 28.00 37.00 38.58 48.00 90.00

ggplot(data = inputData, aes(x = as.numeric(AGE))) +

geom_histogram(color = "black",binwidth = 1, aes(group=ABOVE50K, fill = ABOVE50K)) +

geom_vline(xintercept = mean(inputData$AGE[inputData$ABOVE50K == 1]), col = "blue", lty = 5, lwd = 0.8) +

geom_vline(xintercept = mean(inputData$AGE[inputData$ABOVE50K == 0]), col = "black", lty = 5, lwd = 0.8)

可以發現薪資水平高於50K的年齡偏高，平均數約落在40-50歲，而薪資水平小於等於50K的族群偏年輕。

2-2. Workclass

summary(inputData$WORKCLASS)
# ?       Federal-gov         Local-gov      Never-worked 
# 1836               960              2093                 7 
# Private      Self-emp-inc  Self-emp-not-inc         State-gov 
# 22696              1116              2541              1298 
# Without-pay 
# 14

summary(inputData$WORKCLASS)

# ? Federal-gov Local-gov Never-worked

# 1836 960 2093 7

# Private Self-emp-inc Self-emp-not-inc State-gov

# 22696 1116 2541 1298

# Without-pay

# 14

查看workclass各類別次數分佈。

inputData$WORKCLASS <- factor(x = inputData$WORKCLASS,levels = names(sort(table(inputData$WORKCLASS),decreasing = TRUE)))
ggplot(data = inputData, aes(x = WORKCLASS, fill = ABOVE50K)) +
  geom_bar() +
  geom_text(stat = "count", aes(label=..count..),size=3.5,position = position_stack(vjust = 0.5)) +
  scale_fill_brewer(palette="Paired")

inputData$WORKCLASS

ggplot(data = inputData, aes(x = WORKCLASS, fill = ABOVE50K)) +

geom_bar() +

geom_text(stat = "count", aes(label=..count..),size=3.5,position = position_stack(vjust = 0.5)) +

scale_fill_brewer(palette="Paired")

查看workclass各類別薪資水準組成比例。

# 首先，先計算類別變數中，不同類別水準值中的薪資水平分佈比例
df2 <- cbind.data.frame(WORKCLASS = inputData$WORKCLASS,ABOVE50K = inputData$ABOVE50K)
df2.summary <- 
  df2 %>% group_by(WORKCLASS,ABOVE50K) %>% 
  summarise(n = n()) %>% 
  mutate(ratio=scales::percent(n/sum(n)))

# 依據ABOVE50K的比例排序，更可以看出哪些類別水準有較高ABOVE50K比例
ggplot(inputData, aes(x = forcats::fct_reorder(f = WORKCLASS, x = as.numeric(ABOVE50K),fun = mean, .desc = TRUE), fill = ABOVE50K)) +
  geom_bar(position = "fill") + 
  geom_text(data=df2.summary, aes(y=n,label=ratio),
            position=position_fill(vjust=0.5)) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette="Paired")

# 首先，先計算類別變數中，不同類別水準值中的薪資水平分佈比例

df2

df2.summary

df2 %>% group_by(WORKCLASS,ABOVE50K) %>%

summarise(n = n()) %>%

mutate(ratio=scales::percent(n/sum(n)))

# 依據ABOVE50K的比例排序，更可以看出哪些類別水準有較高ABOVE50K比例

ggplot(inputData, aes(x = forcats::fct_reorder(f = WORKCLASS, x = as.numeric(ABOVE50K),fun = mean, .desc = TRUE), fill = ABOVE50K)) +

geom_bar(position = "fill") +

geom_text(data=df2.summary, aes(y=n,label=ratio),

position=position_fill(vjust=0.5)) +

scale_y_continuous(labels = scales::percent) +

scale_fill_brewer(palette="Paired")

2-3. Education

基礎敘述統計

summary(inputData$EDUCATION)
# 10th          11th          12th       1st-4th       5th-6th       7th-8th 
# 933          1175           433           168           333           646 
# 9th    Assoc-acdm     Assoc-voc     Bachelors     Doctorate       HS-grad 
# 514          1067          1382          5355           413         10501 
# Masters     Preschool   Prof-school  Some-college 
# 1723            51           576          7291

summary(inputData$EDUCATION)

# 10th 11th 12th 1st-4th 5th-6th 7th-8th

# 933 1175 433 168 333 646

# 9th Assoc-acdm Assoc-voc Bachelors Doctorate HS-grad

# 514 1067 1382 5355 413 10501

# Masters Preschool Prof-school Some-college

# 1723 51 576 7291

次數分配圖

inputData$EDUCATION <- factor(x = inputData$EDUCATION,levels = names(sort(table(inputData$EDUCATION),decreasing = TRUE)))
ggplot(data = inputData, aes(x = EDUCATION, fill = ABOVE50K)) +
  geom_bar() +
  geom_text(stat = "count", aes(label=..count..),size=3.5,position = position_stack(vjust = 0.5)) +
  scale_fill_brewer(palette="Paired") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

inputData$EDUCATION

ggplot(data = inputData, aes(x = EDUCATION, fill = ABOVE50K)) +

geom_bar() +

geom_text(stat = "count", aes(label=..count..),size=3.5,position = position_stack(vjust = 0.5)) +

scale_fill_brewer(palette="Paired") +

theme(axis.text.x = element_text(angle = 30, hjust = 1))

各類別水準值中目標變數分佈圖

# 首先，先計算類別變數中，不同類別水準值中的薪資水平分佈比例
df2 <- cbind.data.frame(EDUCATION = inputData$EDUCATION,ABOVE50K = inputData$ABOVE50K)
df2.summary <- 
  df2 %>% group_by(EDUCATION) %>% 
  count(ABOVE50K) %>% 
  mutate(ratio=scales::percent(n/sum(n))) %>% 
  ungroup()

# 依據ABOVE50K的比例排序，更可以看出哪些類別水準有較高ABOVE50K比例
ggplot(inputData, aes(x = forcats::fct_reorder(f = EDUCATION, x = as.numeric(ABOVE50K), fun = mean, .desc = TRUE), fill = ABOVE50K)) +
  geom_bar(position = "fill") + 
  geom_text(data=df2.summary, aes(y=n,label=ratio),
            position=position_fill(vjust=0.5)) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette="Paired") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

# 首先，先計算類別變數中，不同類別水準值中的薪資水平分佈比例

df2

df2.summary

df2 %>% group_by(EDUCATION) %>%

count(ABOVE50K) %>%

mutate(ratio=scales::percent(n/sum(n))) %>%

ungroup()

# 依據ABOVE50K的比例排序，更可以看出哪些類別水準有較高ABOVE50K比例

ggplot(inputData, aes(x = forcats::fct_reorder(f = EDUCATION, x = as.numeric(ABOVE50K), fun = mean, .desc = TRUE), fill = ABOVE50K)) +

geom_bar(position = "fill") +

geom_text(data=df2.summary, aes(y=n,label=ratio),

position=position_fill(vjust=0.5)) +

scale_y_continuous(labels = scales::percent) +

scale_fill_brewer(palette="Paired") +

theme(axis.text.x = element_text(angle = 30, hjust = 1))

2-4. Educationnum (教育年份)

基礎敘述統計

summary(inputData$AGE)
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 17.00   28.00   37.00   38.58   48.00   90.00

summary(inputData$AGE)

# Min. 1st Qu. Median Mean 3rd Qu. Max.

# 17.00 28.00 37.00 38.58 48.00 90.00

次數分配圖

inputData$AGE_cut <- inputData$AGE 
inputData$AGE_cut <- cut(inputData$AGE_cut, breaks = seq(15,90,5),include.lowest = TRUE)
ggplot(data = inputData, aes(x = AGE_cut, fill = ABOVE50K)) +
  geom_bar() +
  geom_text(stat = "count", aes(label=..count..),size=3.5,position = position_stack(vjust = 0.5)) +
  scale_fill_brewer(palette="Paired") +
  labs(x = "AGE")

inputData$AGE_cut

ggplot(data = inputData, aes(x = AGE_cut, fill = ABOVE50K)) +

geom_bar() +

geom_text(stat = "count", aes(label=..count..),size=3.5,position = position_stack(vjust = 0.5)) +

scale_fill_brewer(palette="Paired") +

labs(x = "AGE")

各類別水準值中目標變數分佈圖

# 首先，先計算類別變數中，不同類別水準值中的薪資水平分佈比例
df2 <- cbind.data.frame(AGE_cut = inputData$AGE_cut,ABOVE50K = inputData$ABOVE50K)
df2.summary <- 
  df2 %>% group_by(AGE_cut) %>% 
  count(ABOVE50K) %>% 
  mutate(ratio=scales::percent(n/sum(n))) %>% 
  ungroup()

# 依據ABOVE50K的比例排序，更可以看出哪些類別水準有較高ABOVE50K比例
ggplot(inputData, aes(x = forcats::fct_reorder(f = AGE_cut, x = as.numeric(ABOVE50K), fun = mean, .desc = TRUE), fill = ABOVE50K)) +
  geom_bar(position = "fill") + 
  geom_text(data=df2.summary, aes(y=n,label=ratio),
            position=position_fill(vjust=0.5)) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette="Paired") +
  labs(x = "AGE")

# 首先，先計算類別變數中，不同類別水準值中的薪資水平分佈比例

df2

df2.summary

df2 %>% group_by(AGE_cut) %>%

count(ABOVE50K) %>%

mutate(ratio=scales::percent(n/sum(n))) %>%

ungroup()

# 依據ABOVE50K的比例排序，更可以看出哪些類別水準有較高ABOVE50K比例

ggplot(inputData, aes(x = forcats::fct_reorder(f = AGE_cut, x = as.numeric(ABOVE50K), fun = mean, .desc = TRUE), fill = ABOVE50K)) +

geom_bar(position = "fill") +

geom_text(data=df2.summary, aes(y=n,label=ratio),

position=position_fill(vjust=0.5)) +

scale_y_continuous(labels = scales::percent) +

scale_fill_brewer(palette="Paired") +

labs(x = "AGE")

2-5. MaritalStatus

基礎敘述統計

summary(inputData$MARITALSTATUS)
# Divorced      Married-AF-spouse     Married-civ-spouse 
# 4443                     23                  14976 
# Married-spouse-absent          Never-married              Separated 
# 418                  10683                   1025 
# Widowed 
# 993

summary(inputData$MARITALSTATUS)

# Divorced Married-AF-spouse Married-civ-spouse

# 4443 23 14976

# Married-spouse-absent Never-married Separated

# 418 10683 1025

# Widowed

# 993

次數分配圖

inputData$MARITALSTATUS <- factor(x = inputData$MARITALSTATUS,levels = names(sort(table(inputData$MARITALSTATUS),decreasing = TRUE)))
ggplot(data = inputData, aes(x = MARITALSTATUS, fill = ABOVE50K)) +
  geom_bar() +
  geom_text(stat = "count", aes(label=..count..),size=3.5,position = position_stack(vjust = 0.5)) +
  scale_fill_brewer(palette="Paired") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

inputData$MARITALSTATUS

ggplot(data = inputData, aes(x = MARITALSTATUS, fill = ABOVE50K)) +

geom_bar() +

geom_text(stat = "count", aes(label=..count..),size=3.5,position = position_stack(vjust = 0.5)) +

scale_fill_brewer(palette="Paired") +

theme(axis.text.x = element_text(angle = 30, hjust = 1))

各類別水準值中目標變數分佈圖

# 首先，先計算類別變數中，不同類別水準值中的薪資水平分佈比例
df2 <- cbind.data.frame(MARITALSTATUS = inputData$MARITALSTATUS,ABOVE50K = inputData$ABOVE50K)
df2.summary <- 
  df2 %>% group_by(MARITALSTATUS) %>% 
  count(ABOVE50K) %>% 
  mutate(ratio=scales::percent(n/sum(n))) %>% 
  ungroup()

# 依據ABOVE50K的比例排序，更可以看出哪些類別水準有較高ABOVE50K比例
ggplot(inputData, aes(x = forcats::fct_reorder(f = MARITALSTATUS, x = as.numeric(ABOVE50K), fun = mean, .desc = TRUE), fill = ABOVE50K)) +
  geom_bar(position = "fill") + 
  geom_text(data=df2.summary, aes(y=n,label=ratio),
            position=position_fill(vjust=0.5)) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette="Paired") +
  labs(x = "MARITALSTATUS", y = "percentage") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

# 首先，先計算類別變數中，不同類別水準值中的薪資水平分佈比例

df2

df2.summary

df2 %>% group_by(MARITALSTATUS) %>%

count(ABOVE50K) %>%

mutate(ratio=scales::percent(n/sum(n))) %>%

ungroup()

# 依據ABOVE50K的比例排序，更可以看出哪些類別水準有較高ABOVE50K比例

ggplot(inputData, aes(x = forcats::fct_reorder(f = MARITALSTATUS, x = as.numeric(ABOVE50K), fun = mean, .desc = TRUE), fill = ABOVE50K)) +

geom_bar(position = "fill") +

geom_text(data=df2.summary, aes(y=n,label=ratio),

position=position_fill(vjust=0.5)) +

scale_y_continuous(labels = scales::percent) +

scale_fill_brewer(palette="Paired") +

labs(x = "MARITALSTATUS", y = "percentage") +

theme(axis.text.x = element_text(angle = 30, hjust = 1))

如果想要同一張圖y軸可以看到frequency，stacked bar可以看到目標類別水平的比例分佈，可以使用下面方法繪製。

# 如果想要在同一張圖同時看到frequencies & proportion:
# 先新增一欄label位置的欄位
# (依照ABOVE50K的類別水準順序，計算累積數值，並將數值置中於各類別累積高度)
df3.summary <-
  df2.summary %>%
  group_by(MARITALSTATUS) %>%
  #因為繪圖資訊是先畫>50K，再畫<=50K，故要調整累積數值計算的順序
  do( data.frame(with(data=., .[order(desc(ABOVE50K)),] )) ) %>% 
  ungroup() %>% 
  group_by(MARITALSTATUS) %>%
  mutate(pos = (cumsum(n) - (0.5 * n))) %>%
  as.data.frame() %>% 
  ungroup()

ggplot(data = df3.summary,mapping = aes(x = MARITALSTATUS, y = n, fill = ABOVE50K)) +
  geom_bar(stat = "identity") + 
  geom_text(mapping = aes(y = pos, label = ratio), size = 3) +
  scale_fill_brewer(palette="Paired") +
  labs(title = "MARITALSTATUS VS ABOVE50K",
       x = "MARITALSTATUS", y = "Frequencies") + # Add Title and Labels
  theme(plot.title = element_text(hjust = 0.5)) # Change the appearance of the main title

# 如果想要在同一張圖同時看到frequencies & proportion:

# 先新增一欄label位置的欄位

# (依照ABOVE50K的類別水準順序，計算累積數值，並將數值置中於各類別累積高度)

df3.summary

df2.summary %>%

group_by(MARITALSTATUS) %>%

#因為繪圖資訊是先畫>50K，再畫

do( data.frame(with(data=., .[order(desc(ABOVE50K)),] )) ) %>%

ungroup() %>%

group_by(MARITALSTATUS) %>%

mutate(pos = (cumsum(n) - (0.5 * n))) %>%

as.data.frame() %>%

ungroup()

ggplot(data = df3.summary,mapping = aes(x = MARITALSTATUS, y = n, fill = ABOVE50K)) +

geom_bar(stat = "identity") +

geom_text(mapping = aes(y = pos, label = ratio), size = 3) +

scale_fill_brewer(palette="Paired") +

labs(title = "MARITALSTATUS VS ABOVE50K",

x = "MARITALSTATUS", y = "Frequencies") + # Add Title and Labels

theme(plot.title = element_text(hjust = 0.5)) # Change the appearance of the main title

2-6. Occupation職業

基礎敘述統計

summary(inputData$OCCUPATION)
# ?       Adm-clerical       Armed-Forces       Craft-repair 
# 1843               3770                  9               4099 
# Exec-managerial    Farming-fishing  Handlers-cleaners  Machine-op-inspct 
# 4066                994               1370               2002 
# Other-service    Priv-house-serv     Prof-specialty    Protective-serv 
# 3295                149               4140                649 
# Sales       Tech-support   Transport-moving 
# 3650                928               1597

summary(inputData$OCCUPATION)

# ? Adm-clerical Armed-Forces Craft-repair

# 1843 3770 9 4099

# Exec-managerial Farming-fishing Handlers-cleaners Machine-op-inspct

# 4066 994 1370 2002

# Other-service Priv-house-serv Prof-specialty Protective-serv

# 3295 149 4140 649

# Sales Tech-support Transport-moving

# 3650 928 1597

可以發現類別水準值滿複雜的。

次數分配圖

inputData$OCCUPATION <- factor(x = inputData$OCCUPATION,levels = names(sort(table(inputData$OCCUPATION),decreasing = TRUE)))
ggplot(data = inputData, aes(x = OCCUPATION, fill = ABOVE50K)) +
  geom_bar() +
  geom_text(stat = "count", aes(label=..count..),size=3.5,position = position_stack(vjust = 0.5)) +
  scale_fill_brewer(palette="Paired") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

inputData$OCCUPATION

ggplot(data = inputData, aes(x = OCCUPATION, fill = ABOVE50K)) +

geom_bar() +

geom_text(stat = "count", aes(label=..count..),size=3.5,position = position_stack(vjust = 0.5)) +

scale_fill_brewer(palette="Paired") +

theme(axis.text.x = element_text(angle = 30, hjust = 1))

各類別水準值中目標變數分佈圖

# 首先，先計算類別變數中，不同類別水準值中的薪資水平分佈比例
df2 <- cbind.data.frame(OCCUPATION = inputData$OCCUPATION,ABOVE50K = inputData$ABOVE50K)
df2.summary <- 
  df2 %>% group_by(OCCUPATION) %>% 
  count(ABOVE50K) %>% 
  mutate(ratio=scales::percent(n/sum(n))) %>% 
  as.data.frame() %>% 
  ungroup()

# 依據ABOVE50K的比例排序，更可以看出哪些類別水準有較高ABOVE50K比例
ggplot(inputData, aes(x = forcats::fct_reorder(f = OCCUPATION, x = as.numeric(ABOVE50K), fun = mean, .desc = TRUE), fill = ABOVE50K)) +
  geom_bar(position = "fill") + 
  geom_text(data=df2.summary, aes(y=n,label=ratio),
            position=position_fill(vjust=0.5)) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette="Paired") +
  labs(x = "OCCUPATION", y = "percentage") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

# 首先，先計算類別變數中，不同類別水準值中的薪資水平分佈比例

df2

df2.summary

df2 %>% group_by(OCCUPATION) %>%

count(ABOVE50K) %>%

mutate(ratio=scales::percent(n/sum(n))) %>%

as.data.frame() %>%

ungroup()

# 依據ABOVE50K的比例排序，更可以看出哪些類別水準有較高ABOVE50K比例

ggplot(inputData, aes(x = forcats::fct_reorder(f = OCCUPATION, x = as.numeric(ABOVE50K), fun = mean, .desc = TRUE), fill = ABOVE50K)) +

geom_bar(position = "fill") +

geom_text(data=df2.summary, aes(y=n,label=ratio),

position=position_fill(vjust=0.5)) +

scale_y_continuous(labels = scales::percent) +

scale_fill_brewer(palette="Paired") +

labs(x = "OCCUPATION", y = "percentage") +

theme(axis.text.x = element_text(angle = 30, hjust = 1))

如果想要同一張圖y軸可以看到frequency，stacked bar可以看到目標類別水平的比例分佈，可以使用下面方法繪製。

df3.summary <-
  df2.summary %>%
  group_by(OCCUPATION) %>%
  #因為繪圖資訊是先畫>50K，再畫<=50K，故要調整累積數值計算的順序
  do( data.frame(with(data=., .[order(desc(ABOVE50K)),] )) ) %>% 
  ungroup() %>% 
  group_by(OCCUPATION) %>%
  mutate(pos = (cumsum(n) - (0.5 * n))) %>%
  as.data.frame() %>% 
  ungroup()

ggplot(data = df3.summary,mapping = aes(x = OCCUPATION, y = n, fill = ABOVE50K)) +
  geom_bar(stat = "identity") + 
  geom_text(mapping = aes(y = pos, label = ratio), size = 3) +
  scale_fill_brewer(palette="Paired") +
  labs(title = "OCCUPATION VS ABOVE50K",
       x = "OCCUPATION", y = "Frequencies") + # Add Title and Labels
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 30, hjust = 1))

df3.summary

df2.summary %>%

group_by(OCCUPATION) %>%

#因為繪圖資訊是先畫>50K，再畫

do( data.frame(with(data=., .[order(desc(ABOVE50K)),] )) ) %>%

ungroup() %>%

group_by(OCCUPATION) %>%

mutate(pos = (cumsum(n) - (0.5 * n))) %>%

as.data.frame() %>%

ungroup()

ggplot(data = df3.summary,mapping = aes(x = OCCUPATION, y = n, fill = ABOVE50K)) +

geom_bar(stat = "identity") +

geom_text(mapping = aes(y = pos, label = ratio), size = 3) +

scale_fill_brewer(palette="Paired") +

labs(title = "OCCUPATION VS ABOVE50K",

x = "OCCUPATION", y = "Frequencies") + # Add Title and Labels

theme(plot.title = element_text(hjust = 0.5),

axis.text.x = element_text(angle = 30, hjust = 1))

2-7. Relationship

基礎敘述統計

summary(inputData$RELATIONSHIP)
# Husband   Not-in-family  Other-relative       Own-child       Unmarried 
# 13193            8305             981            5068            3446 
# Wife 
# 1568

summary(inputData$RELATIONSHIP)

# Husband Not-in-family Other-relative Own-child Unmarried

# 13193 8305 981 5068 3446

# Wife

# 1568

次數分配圖

inputData$RELATIONSHIP <- factor(x = inputData$RELATIONSHIP,levels = names(sort(table(inputData$RELATIONSHIP),decreasing = TRUE)))
ggplot(data = inputData, aes(x = RELATIONSHIP, fill = ABOVE50K)) +
  geom_bar() +
  geom_text(stat = "count", aes(label=..count..),size=3.5,position = position_stack(vjust = 0.5)) +
  scale_fill_brewer(palette="Paired")

inputData$RELATIONSHIP

ggplot(data = inputData, aes(x = RELATIONSHIP, fill = ABOVE50K)) +

geom_bar() +

geom_text(stat = "count", aes(label=..count..),size=3.5,position = position_stack(vjust = 0.5)) +

scale_fill_brewer(palette="Paired")

各類別水準值中目標變數分佈圖

# 首先，先計算類別變數中，不同類別水準值中的薪資水平分佈比例
df2 <- cbind.data.frame(RELATIONSHIP = inputData$RELATIONSHIP,ABOVE50K = inputData$ABOVE50K)
df2.summary <- 
  df2 %>% group_by(RELATIONSHIP) %>% 
  count(ABOVE50K) %>% 
  mutate(ratio=scales::percent(n/sum(n))) %>% 
  as.data.frame() %>% 
  ungroup()

# 依據ABOVE50K的比例排序，更可以看出哪些類別水準有較高ABOVE50K比例
ggplot(inputData, aes(x = forcats::fct_reorder(f = RELATIONSHIP, x = as.numeric(ABOVE50K), fun = mean, .desc = TRUE), fill = ABOVE50K)) +
  geom_bar(position = "fill") + 
  geom_text(data=df2.summary, aes(y=n,label=ratio),
            position=position_fill(vjust=0.5)) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette="Paired") +
  labs(x = "RELATIONSHIP", y = "percentage") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

# 首先，先計算類別變數中，不同類別水準值中的薪資水平分佈比例

df2

df2.summary

df2 %>% group_by(RELATIONSHIP) %>%

count(ABOVE50K) %>%

mutate(ratio=scales::percent(n/sum(n))) %>%

as.data.frame() %>%

ungroup()

# 依據ABOVE50K的比例排序，更可以看出哪些類別水準有較高ABOVE50K比例

ggplot(inputData, aes(x = forcats::fct_reorder(f = RELATIONSHIP, x = as.numeric(ABOVE50K), fun = mean, .desc = TRUE), fill = ABOVE50K)) +

geom_bar(position = "fill") +

geom_text(data=df2.summary, aes(y=n,label=ratio),

position=position_fill(vjust=0.5)) +

scale_y_continuous(labels = scales::percent) +

scale_fill_brewer(palette="Paired") +

labs(x = "RELATIONSHIP", y = "percentage") +

theme(axis.text.x = element_text(angle = 30, hjust = 1))

繪製同時一張圖呈現frequency與stacked bar目標類別水平的比例分佈圖。

df3.summary <-
  df2.summary %>%
  group_by(RELATIONSHIP) %>%
  #因為繪圖資訊是先畫>50K，再畫<=50K，故要調整累積數值計算的順序
  do( data.frame(with(data=., .[order(desc(ABOVE50K)),] )) ) %>% 
  ungroup() %>% 
  group_by(RELATIONSHIP) %>%
  mutate(pos = (cumsum(n) - (0.5 * n))) %>%
  as.data.frame() %>% 
  ungroup()

ggplot(data = df3.summary,mapping = aes(x = RELATIONSHIP, y = n, fill = ABOVE50K)) +
  geom_bar(stat = "identity") + 
  geom_text(mapping = aes(y = pos, label = ratio), size = 3) +
  scale_fill_brewer(palette="Paired") +
  labs(title = "RELATIONSHIP VS ABOVE50K",
       x = "RELATIONSHIP", y = "Frequencies") + # Add Title and Labels
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 30, hjust = 1)) # Change the appearance of the main title

df3.summary

df2.summary %>%

group_by(RELATIONSHIP) %>%

#因為繪圖資訊是先畫>50K，再畫

do( data.frame(with(data=., .[order(desc(ABOVE50K)),] )) ) %>%

ungroup() %>%

group_by(RELATIONSHIP) %>%

mutate(pos = (cumsum(n) - (0.5 * n))) %>%

as.data.frame() %>%

ungroup()

ggplot(data = df3.summary,mapping = aes(x = RELATIONSHIP, y = n, fill = ABOVE50K)) +

geom_bar(stat = "identity") +

geom_text(mapping = aes(y = pos, label = ratio), size = 3) +

scale_fill_brewer(palette="Paired") +

labs(title = "RELATIONSHIP VS ABOVE50K",

x = "RELATIONSHIP", y = "Frequencies") + # Add Title and Labels

theme(plot.title = element_text(hjust = 0.5),

axis.text.x = element_text(angle = 30, hjust = 1)) # Change the appearance of the main title

2-8. Race種族

基礎敘述統計

summary(inputData$RACE)
# Amer-Indian-Eskimo  Asian-Pac-Islander               Black               Other               White 
# 311                1039                3124                 271               27816

summary(inputData$RACE)

# Amer-Indian-Eskimo Asian-Pac-Islander Black Other White

# 311 1039 3124 271 27816

次數分配圖

inputData$RACE <- factor(x = inputData$RACE,levels = names(sort(table(inputData$RACE),decreasing = TRUE)))
ggplot(data = inputData, aes(x = RACE, fill = ABOVE50K)) +
  geom_bar() +
  geom_text(stat = "count", aes(label=..count..),size=3.5,position = position_stack(vjust = 0.5)) +
  scale_fill_brewer(palette="Paired") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

inputData$RACE

ggplot(data = inputData, aes(x = RACE, fill = ABOVE50K)) +

geom_bar() +

geom_text(stat = "count", aes(label=..count..),size=3.5,position = position_stack(vjust = 0.5)) +

scale_fill_brewer(palette="Paired") +

theme(axis.text.x = element_text(angle = 30, hjust = 1))

各類別水準值中目標變數分佈圖

# 首先，先計算類別變數中，不同類別水準值中的薪資水平分佈比例
df2 <- cbind.data.frame(RACE = inputData$RACE,ABOVE50K = inputData$ABOVE50K)
df2.summary <- 
  df2 %>% group_by(RACE) %>% 
  count(ABOVE50K) %>% 
  mutate(ratio=scales::percent(n/sum(n))) %>% 
  as.data.frame() %>% 
  ungroup()

# 依據ABOVE50K的比例排序，更可以看出哪些類別水準有較高ABOVE50K比例
ggplot(inputData, aes(x = forcats::fct_reorder(f = RACE, x = as.numeric(ABOVE50K), fun = mean, .desc = TRUE), fill = ABOVE50K)) +
  geom_bar(position = "fill") + 
  geom_text(data=df2.summary, aes(y=n,label=ratio),
            position=position_fill(vjust=0.5)) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette="Paired") +
  labs(x = "RACE", y = "percentage") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

# 首先，先計算類別變數中，不同類別水準值中的薪資水平分佈比例

df2

df2.summary

df2 %>% group_by(RACE) %>%

count(ABOVE50K) %>%

mutate(ratio=scales::percent(n/sum(n))) %>%

as.data.frame() %>%

ungroup()

# 依據ABOVE50K的比例排序，更可以看出哪些類別水準有較高ABOVE50K比例

ggplot(inputData, aes(x = forcats::fct_reorder(f = RACE, x = as.numeric(ABOVE50K), fun = mean, .desc = TRUE), fill = ABOVE50K)) +

geom_bar(position = "fill") +

geom_text(data=df2.summary, aes(y=n,label=ratio),

position=position_fill(vjust=0.5)) +

scale_y_continuous(labels = scales::percent) +

scale_fill_brewer(palette="Paired") +

labs(x = "RACE", y = "percentage") +

theme(axis.text.x = element_text(angle = 30, hjust = 1))

繪製同時一張圖呈現frequency與stacked bar目標類別水平的比例分佈圖。

df3.summary <-
  df2.summary %>%
  group_by(RACE) %>%
  #因為繪圖資訊是先畫>50K，再畫<=50K，故要調整累積數值計算的順序
  do( data.frame(with(data=., .[order(desc(ABOVE50K)),] )) ) %>% 
  ungroup() %>% 
  group_by(RACE) %>%
  mutate(pos = (cumsum(n) - (0.5 * n))) %>%
  as.data.frame() %>% 
  ungroup()

ggplot(data = df3.summary,mapping = aes(x = RACE, y = n, fill = ABOVE50K)) +
  geom_bar(stat = "identity") + 
  geom_text(mapping = aes(y = pos, label = ratio), size = 3) +
  scale_fill_brewer(palette="Paired") +
  labs(title = "RACE VS ABOVE50K",
       x = "RACE", y = "Frequencies") + # Add Title and Labels
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 30, hjust = 1)) # Change the appearance of the main title

df3.summary

df2.summary %>%

group_by(RACE) %>%

#因為繪圖資訊是先畫>50K，再畫

do( data.frame(with(data=., .[order(desc(ABOVE50K)),] )) ) %>%

ungroup() %>%

group_by(RACE) %>%

mutate(pos = (cumsum(n) - (0.5 * n))) %>%

as.data.frame() %>%

ungroup()

ggplot(data = df3.summary,mapping = aes(x = RACE, y = n, fill = ABOVE50K)) +

geom_bar(stat = "identity") +

geom_text(mapping = aes(y = pos, label = ratio), size = 3) +

scale_fill_brewer(palette="Paired") +

labs(title = "RACE VS ABOVE50K",

x = "RACE", y = "Frequencies") + # Add Title and Labels

theme(plot.title = element_text(hjust = 0.5),

axis.text.x = element_text(angle = 30, hjust = 1)) # Change the appearance of the main title

2-9. Sex 性別

基礎敘述統計

summary(inputData$SEX)
# Female    Male 
# 10771   21790

summary(inputData$SEX)

# Female Male

# 10771 21790

次數分配圖

inputData$SEX <- factor(x = inputData$SEX,levels = names(sort(table(inputData$SEX),decreasing = TRUE)))
ggplot(data = inputData, aes(x = SEX, fill = ABOVE50K)) +
  geom_bar() +
  geom_text(stat = "count", aes(label=..count..),size=3.5,position = position_stack(vjust = 0.5)) +
  scale_fill_brewer(palette="Paired")

inputData$SEX

ggplot(data = inputData, aes(x = SEX, fill = ABOVE50K)) +

geom_bar() +

geom_text(stat = "count", aes(label=..count..),size=3.5,position = position_stack(vjust = 0.5)) +

scale_fill_brewer(palette="Paired")

各類別水準值中目標變數分佈圖

# 首先，先計算類別變數中，不同類別水準值中的薪資水平分佈比例
df2 <- cbind.data.frame(SEX = inputData$SEX,ABOVE50K = inputData$ABOVE50K)
df2.summary <- 
  df2 %>% group_by(SEX) %>% 
  count(ABOVE50K) %>% 
  mutate(ratio=scales::percent(n/sum(n))) %>% 
  as.data.frame() %>% 
  ungroup()

# 依據ABOVE50K的比例排序，更可以看出哪些類別水準有較高ABOVE50K比例
ggplot(inputData, aes(x = forcats::fct_reorder(f = SEX, x = as.numeric(ABOVE50K), fun = mean, .desc = TRUE), fill = ABOVE50K)) +
  geom_bar(position = "fill") + 
  geom_text(data=df2.summary, aes(y=n,label=ratio),
            position=position_fill(vjust=0.5)) +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_brewer(palette="Paired") +
  labs(x = "SEX", y = "percentage") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

# 首先，先計算類別變數中，不同類別水準值中的薪資水平分佈比例

df2

df2.summary

df2 %>% group_by(SEX) %>%

count(ABOVE50K) %>%

mutate(ratio=scales::percent(n/sum(n))) %>%

as.data.frame() %>%

ungroup()

# 依據ABOVE50K的比例排序，更可以看出哪些類別水準有較高ABOVE50K比例

ggplot(inputData, aes(x = forcats::fct_reorder(f = SEX, x = as.numeric(ABOVE50K), fun = mean, .desc = TRUE), fill = ABOVE50K)) +

geom_bar(position = "fill") +

geom_text(data=df2.summary, aes(y=n,label=ratio),

position=position_fill(vjust=0.5)) +

scale_y_continuous(labels = scales::percent) +

scale_fill_brewer(palette="Paired") +

labs(x = "SEX", y = "percentage") +

theme(axis.text.x = element_text(angle = 30, hjust = 1))

繪製同時一張圖呈現frequency與stacked bar目標類別水平的比例分佈圖。

df3.summary <-
  df2.summary %>%
  group_by(SEX) %>%
  #因為繪圖資訊是先畫>50K，再畫<=50K，故要調整累積數值計算的順序
  do( data.frame(with(data=., .[order(desc(ABOVE50K)),] )) ) %>% 
  ungroup() %>% 
  group_by(SEX) %>%
  mutate(pos = (cumsum(n) - (0.5 * n))) %>%
  as.data.frame() %>% 
  ungroup()

ggplot(data = df3.summary,mapping = aes(x = SEX, y = n, fill = ABOVE50K)) +
  geom_bar(stat = "identity") + 
  geom_text(mapping = aes(y = pos, label = ratio), size = 3) +
  scale_fill_brewer(palette="Paired") +
  labs(title = "SEX VS ABOVE50K",
       x = "SEX", y = "Frequencies") + # Add Title and Labels
  theme(plot.title = element_text(hjust = 0.5),
        axis.text.x = element_text(angle = 30, hjust = 1)) # Change the appearance of the main title

df3.summary

df2.summary %>%

group_by(SEX) %>%

#因為繪圖資訊是先畫>50K，再畫

do( data.frame(with(data=., .[order(desc(ABOVE50K)),] )) ) %>%

ungroup() %>%

group_by(SEX) %>%

mutate(pos = (cumsum(n) - (0.5 * n))) %>%

as.data.frame() %>%

ungroup()

ggplot(data = df3.summary,mapping = aes(x = SEX, y = n, fill = ABOVE50K)) +

geom_bar(stat = "identity") +

geom_text(mapping = aes(y = pos, label = ratio), size = 3) +

scale_fill_brewer(palette="Paired") +

labs(title = "SEX VS ABOVE50K",

x = "SEX", y = "Frequencies") + # Add Title and Labels

theme(plot.title = element_text(hjust = 0.5),

axis.text.x = element_text(angle = 30, hjust = 1)) # Change the appearance of the main title

2-10. CapitalGain 資本獲利

基礎敘述統計

summary(inputData$CAPITALGAIN)
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 0       0       0    1078       0   99999

summary(inputData$CAPITALGAIN)

# Min. 1st Qu. Median Mean 3rd Qu. Max.

# 0 0 0 1078 0 99999

可以發現資料很右偏，至少有75％(第三個四分位數)的資料都是0。

次數分配圖

ggplot(data = inputData,mapping = aes(x = CAPITALGAIN, fill = ABOVE50K, group = ABOVE50K)) +
  geom_histogram(bins = 10,color = 'black',lwd =0.3, binwidth = 10000) +  # default bins = 30
  stat_bin(binwidth = 10000, geom = "text", color = "white", size = 3.5,
           mapping = aes(label = ..count.., group = ABOVE50K),position=position_stack(vjust=0.5)) +
  scale_fill_brewer(palette="Paired") +
  scale_x_continuous(breaks= seq(0,max(inputData$CAPITALGAIN), 10000))

ggplot(data = inputData,mapping = aes(x = CAPITALGAIN, fill = ABOVE50K, group = ABOVE50K)) +

geom_histogram(bins = 10,color = 'black',lwd =0.3, binwidth = 10000) + # default bins = 30

stat_bin(binwidth = 10000, geom = "text", color = "white", size = 3.5,

mapping = aes(label = ..count.., group = ABOVE50K),position=position_stack(vjust=0.5)) +

scale_fill_brewer(palette="Paired") +

scale_x_continuous(breaks= seq(0,max(inputData$CAPITALGAIN), 10000))

我們可以觀察到，視覺化的資料極右偏，幾乎所有資料都是0

看一下capitalgain資料是0的比例：

sum(inputData$CAPITALGAIN == 0)/length(inputData$CAPITALGAIN)
# [1] 0.9167102

1 2	sum(inputData$CAPITALGAIN == 0)/length(inputData$CAPITALGAIN) # [1] 0.9167102

2-11. Capital Loss 資本失利

基礎敘述統計

summary(inputData$CAPITALLOSS)
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 0.0     0.0     0.0    87.3     0.0  4356.0

summary(inputData$CAPITALLOSS)

# Min. 1st Qu. Median Mean 3rd Qu. Max.

# 0.0 0.0 0.0 87.3 0.0 4356.0

次數分配圖

ggplot(data = inputData,mapping = aes(x = CAPITALLOSS, fill = ABOVE50K, group = ABOVE50K)) +
  geom_histogram(color = 'black',lwd =0.3, binwidth = 1000) +  # default bins = 30
  stat_bin(geom = "text", color = "white", size = 3.5, binwidth = 1000,
           mapping = aes(label = ..count.., group = ABOVE50K),position=position_stack(vjust=0.5)) +
  scale_fill_brewer(palette="Paired")

ggplot(data = inputData,mapping = aes(x = CAPITALLOSS, fill = ABOVE50K, group = ABOVE50K)) +

geom_histogram(color = 'black',lwd =0.3, binwidth = 1000) + # default bins = 30

stat_bin(geom = "text", color = "white", size = 3.5, binwidth = 1000,

mapping = aes(label = ..count.., group = ABOVE50K),position=position_stack(vjust=0.5)) +

scale_fill_brewer(palette="Paired")

看一下CAPITALLOSS資料是0的比例：

sum(inputData$CAPITALLOSS == 0)/length(inputData$CAPITALLOSS)
# [1] 0.9533491

1 2	sum(inputData$CAPITALLOSS == 0)/length(inputData$CAPITALLOSS) # [1] 0.9533491

2-12. HOURSPERWEEK

基礎敘述統計

summary(inputData$HOURSPERWEEK)
# Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
# 1.00   40.00   40.00   40.44   45.00   99.00

summary(inputData$HOURSPERWEEK)

# Min. 1st Qu. Median Mean 3rd Qu. Max.

# 1.00 40.00 40.00 40.44 45.00 99.00

次數分配圖

ggplot(data = inputData,mapping = aes(x = HOURSPERWEEK, fill = ABOVE50K, group = ABOVE50K)) +
  geom_histogram(bins = 10, color = 'black',lwd =0.3) +  # default bins = 30
  stat_bin(geom = "text", color = "white", size = 3.5, bins = 10,
           mapping = aes(label = ..count.., group = ABOVE50K),position=position_stack(vjust=0.5)) +
  scale_fill_brewer(palette="Paired")

ggplot(data = inputData,mapping = aes(x = HOURSPERWEEK, fill = ABOVE50K, group = ABOVE50K)) +

geom_histogram(bins = 10, color = 'black',lwd =0.3) + # default bins = 30

stat_bin(geom = "text", color = "white", size = 3.5, bins = 10,

mapping = aes(label = ..count.., group = ABOVE50K),position=position_stack(vjust=0.5)) +

scale_fill_brewer(palette="Paired")

2-13. NATIVECOUNTRY 國籍

次數分配圖

inputData$NATIVECOUNTRY <- factor(x = inputData$NATIVECOUNTRY,levels = names(sort(table(inputData$NATIVECOUNTRY),decreasing = TRUE)))
ggplot(data = inputData, aes(x = NATIVECOUNTRY, fill = ABOVE50K)) +
  geom_bar() +
  geom_text(stat = "count", aes(label=..count..),size=3.5,position = position_stack(vjust = 0.5)) +
  scale_fill_brewer(palette="Paired") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

inputData$NATIVECOUNTRY

ggplot(data = inputData, aes(x = NATIVECOUNTRY, fill = ABOVE50K)) +

geom_bar() +

geom_text(stat = "count", aes(label=..count..),size=3.5,position = position_stack(vjust = 0.5)) +

scale_fill_brewer(palette="Paired") +

theme(axis.text.x = element_text(angle = 30, hjust = 1))

可以發現資料超偏，都集中在美國，因此不適合投入模型。

3. 資料前處理

3- 1. 類別水準值簡化

OCCUPATION的部分，因為水平值偏多，且沒有替代變數欄位，故我們考慮將此變數水平值簡化。我們將之簡化為以下幾群：
Blue-Collar, Professional, Sales, Service, and White-Collar, Other/Unknown。

# 先將?改命名成Unknown
levels(inputData$OCCUPATION)[1] <- "Unknown"

# 新增一個replacement水準值後的欄位：
inputData$OCCUPATION_rep <- inputData$OCCUPATION

# (1) Craft-repair, Farming-fishing, Handlers-cleaners , Machine-op-inspct, Transport-moving => 合併成Blue-Collar
inputData$OCCUPATION_rep <- gsub(pattern = '^ Craft-repair', replacement = 'Blue-Collar', x = inputData$OCCUPATION_rep)
inputData$OCCUPATION_rep <- gsub(pattern = '^ Farming-fishing', replacement = 'Blue-Collar', x = inputData$OCCUPATION_rep)
inputData$OCCUPATION_rep <- gsub(pattern = '^ Handlers-cleaners', replacement = 'Blue-Collar', x = inputData$OCCUPATION_rep)
inputData$OCCUPATION_rep <- gsub(pattern = '^ Machine-op-inspct', replacement = 'Blue-Collar', x = inputData$OCCUPATION_rep)
inputData$OCCUPATION_rep <- gsub(pattern = '^ Transport-moving', replacement = 'Blue-Collar', x = inputData$OCCUPATION_rep)

# (2) Prof-specialty, => Professional
inputData$OCCUPATION_rep <- gsub(pattern = '^ Prof-specialty', replacement = 'Professional', x = inputData$OCCUPATION_rep)

# (3) Sales => Sales不變

# (4) Other-service, Priv-house-serv, Protective-serv, Tech-support, => Service
inputData$OCCUPATION_rep <- gsub(pattern = '^ Other-service', replacement = 'Service', x = inputData$OCCUPATION_rep)
inputData$OCCUPATION_rep <- gsub(pattern = '^ Priv-house-serv', replacement = 'Service', x = inputData$OCCUPATION_rep)
inputData$OCCUPATION_rep <- gsub(pattern = '^ Protective-serv', replacement = 'Service', x = inputData$OCCUPATION_rep)
inputData$OCCUPATION_rep <- gsub(pattern = '^ Tech-support', replacement = 'Service', x = inputData$OCCUPATION_rep)

# (5) Adm-clerical, Exec-managerial => 合併成White-Collar
inputData$OCCUPATION_rep <- gsub(pattern = '^ Adm-clerical', replacement = 'White-Collar', x = inputData$OCCUPATION_rep)
inputData$OCCUPATION_rep <- gsub(pattern = '^ Exec-managerial', replacement = 'White-Collar', x = inputData$OCCUPATION_rep)
# (6) ?,Armed-Forces => Other/Unknown
inputData$OCCUPATION_rep <- gsub(pattern = '^Unknown', replacement = 'Other/Unknown', x = inputData$OCCUPATION_rep)
inputData$OCCUPATION_rep <- gsub(pattern = '^ Armed-Forces', replacement = 'Other/Unknown', x = inputData$OCCUPATION_rep)

inputData$OCCUPATION_rep <- as.factor(inputData$OCCUPATION_rep)

summary(inputData$OCCUPATION_rep)
# Sales   Blue-Collar Other/Unknown  Professional       Service  White-Collar 
# 3650         10062          1852          4140          5021          7836
levels(inputData$OCCUPATION_rep)
# [1] " Sales"        "Blue-Collar"   "Other/Unknown" "Professional"  "Service"       "White-Collar"

# 先將?改命名成Unknown

levels(inputData$OCCUPATION)[1]

# 新增一個replacement水準值後的欄位：

inputData$OCCUPATION_rep

# (1) Craft-repair, Farming-fishing, Handlers-cleaners , Machine-op-inspct, Transport-moving => 合併成Blue-Collar

inputData$OCCUPATION_rep

# (2) Prof-specialty, => Professional

inputData$OCCUPATION_rep

# (3) Sales => Sales不變

# (4) Other-service, Priv-house-serv, Protective-serv, Tech-support, => Service

inputData$OCCUPATION_rep

# (5) Adm-clerical, Exec-managerial => 合併成White-Collar

inputData$OCCUPATION_rep

# (6) ?,Armed-Forces => Other/Unknown

inputData$OCCUPATION_rep

summary(inputData$OCCUPATION_rep)

# Sales Blue-Collar Other/Unknown Professional Service White-Collar

# 3650 10062 1852 4140 5021 7836

levels(inputData$OCCUPATION_rep)

# [1] " Sales" "Blue-Collar" "Other/Unknown" "Professional" "Service" "White-Collar"

3-2. (optional)產生類別變數的WOE值(本分析不會執行此段程式碼）

# 這是一個選擇性的步驟，我們在本分析不會執行
# factor_vars <- c ("WORKCLASS", "EDUCATION", "MARITALSTATUS", "OCCUPATION", "OCCUPATION_rep", "RELATIONSHIP", "RACE", "SEX", "NATIVECOUNTRY")

# not run
# for(factor_var in factor_vars){
#   inputData[[factor_var]] <- WOE(X = inputData[,factor_var], Y = inputData$ABOVE50K_y)
# }

# 這是一個選擇性的步驟，我們在本分析不會執行

# factor_vars

# not run

# for(factor_var in factor_vars){

# inputData[[factor_var]]

# }

4. 產生訓練資料集與測試資料集

# 產生訓練資料集
input_ones <- inputData[which(inputData$ABOVE50K_y == 1),]
input_zeros <- inputData[which(inputData$ABOVE50K_y == 0), ]
set.seed(100)
input_ones_training_row <- sample(1:nrow(input_ones),0.7*nrow(input_ones))
input_zeros_training_row <- sample(1:nrow(input_zeros),0.7*nrow(input_zeros))

training_ones <- input_ones[input_ones_training_row,]
training_zeros <- input_zeros[input_zeros_training_row,]
trainingData <- rbind(training_ones, training_zeros)

# 產生測試資料集
test_ones <- input_ones[-input_ones_training_row,]
test_zeros <- input_zeros[-input_zeros_training_row,]
testData <- rbind(test_ones, test_zeros)

# 產生訓練資料集

input_ones

input_zeros

set.seed(100)

input_ones_training_row

input_zeros_training_row

training_ones

training_zeros

trainingData

# 產生測試資料集

test_ones

test_zeros

testData

5. 計算IV值，篩選變數

# 我們會使用套件中的函數smbinning::smbinning，來將連續變數切割為類別變數
library(smbinning)
# 將連續型變數和類別型變數分開
# 記得加入"OCCUPATION_rep"
factor_vars <- c ("WORKCLASS", "EDUCATION", "MARITALSTATUS", "OCCUPATION","OCCUPATION_rep", "RELATIONSHIP", "RACE", "SEX", "NATIVECOUNTRY")
continuous_vars <- c("AGE", "FNLWGT","EDUCATIONNUM", "HOURSPERWEEK", "CAPITALGAIN", "CAPITALLOSS")

# 建立一個變數IV值表格 : numeric(參數要放變數行數)
iv_df <- data.frame(VARS = c(factor_vars, continuous_vars), IV = numeric(15))

# 計算類別變數的IV值
for(factor_var in factor_vars){
  smb <- smbinning.factor(df = trainingData ,y = "ABOVE50K_y",x = factor_var)
  if(class(smb) != "character"){
    iv_df[iv_df$VARS == factor_var,"IV"] <- smb$iv
  }
}
# 計算連續變數的IV值
for(continuous_var in continuous_vars){
  smb <- smbinning(df = trainingData,y = "ABOVE50K_y",x = continuous_var)
  if(class(smb) != "character"){
    iv_df[iv_df$VARS == continuous_var,"IV"] <- smb$iv
  }
}

# 將變數依據IV值高低排列
iv_df <- iv_df[order(-iv_df$IV),]
iv_df

# VARS     IV
# 6    RELATIONSHIP 1.5435
# 3   MARITALSTATUS 1.3195
# 10            AGE 1.1815
# 12   EDUCATIONNUM 0.7169
# 14    CAPITALGAIN 0.6849
# 13   HOURSPERWEEK 0.4592
# 5  OCCUPATION_rep 0.3594
# 8             SEX 0.3127
# 1       WORKCLASS 0.1673
# 7            RACE 0.0634
# 2       EDUCATION 0.0000
# 4      OCCUPATION 0.0000
# 9   NATIVECOUNTRY 0.0000
# 11         FNLWGT 0.0000
# 15    CAPITALLOSS 0.0000

# 我們會使用套件中的函數smbinning::smbinning，來將連續變數切割為類別變數

library(smbinning)

# 將連續型變數和類別型變數分開

# 記得加入"OCCUPATION_rep"

factor_vars

continuous_vars

# 建立一個變數IV值表格 : numeric(參數要放變數行數)

iv_df

# 計算類別變數的IV值

for(factor_var in factor_vars){

smb

if(class(smb) != "character"){

iv_df[iv_df$VARS == factor_var,"IV"]

}

# 計算連續變數的IV值

for(continuous_var in continuous_vars){

smb

if(class(smb) != "character"){

iv_df[iv_df$VARS == continuous_var,"IV"]

}

# 將變數依據IV值高低排列

iv_df

# VARS IV

# 6 RELATIONSHIP 1.5435

# 3 MARITALSTATUS 1.3195

# 10 AGE 1.1815

# 12 EDUCATIONNUM 0.7169

# 14 CAPITALGAIN 0.6849

# 13 HOURSPERWEEK 0.4592

# 5 OCCUPATION_rep 0.3594

# 8 SEX 0.3127

# 1 WORKCLASS 0.1673

# 7 RACE 0.0634

# 2 EDUCATION 0.0000

# 4 OCCUPATION 0.0000

# 9 NATIVECOUNTRY 0.0000

# 11 FNLWGT 0.0000

# 15 CAPITALLOSS 0.0000

關於IV Table的部分，我們可以得出以下資訊：

根據經驗，IV>=0.3即表示該預測變數與目標變數有較強的關係。
類別變數IV=0表示類別水準值過多。
- EDUCATION: 可以使用IV值較高的EDUCATIONNUM代替。
- NATIVECOUNTRY:如我們一開始在資料探勘時所見，絕大多數資料皆為United States，因此不予進一步討論。
連續變數IV=0表示沒有顯著的切點存在。
- FNLWGT：一開始為簡化分析，即不考慮納入模型分析。
- CAPITALLOSS: 因為該特徵資料分佈極偏，故不納入進一步模型分析。
從IV表數值，我們決定初步篩選出變數RELATIONSHIP, MARITALSTATUS, AGE, EDUCATIONNUM, CAPITALGAIN, OCCUPATION_rep進行模型分析。

Logistic Regression Modeling 羅吉斯回歸建模 part2 的部分請參考: 羅吉斯回歸 – part2 模型建置、診斷與比較。

更多模型建置筆記連結：

參考:

歐萊禮 R資料科學