第9講判別分析

分析の評価

Published

December 5, 2025

準備

以下で利用する共通パッケージを読み込む．

library(conflicted) # 関数名の衝突を警告
conflicts_prefer(   # 優先的に使う関数を指定
    dplyr::filter(),
    dplyr::lag(),
    dplyr::select(),
    scales::discard(),
    recipes::fixed(),
    yardstick::spec(),
    recipes::step(),
)
library(tidyverse)  
library(ggfortify)
library(ggrepel)
library(GGally)
library(tidymodels)
library(MASS)
library(gt)         
library(gtsummary)  
#' macOSのための日本語表示の設定
if(Sys.info()["sysname"] == "Darwin") { # macOSか調べる
    jp_font <- "HiraMaruProN-W4"
    theme_update(text = element_text(family = jp_font))
    update_geom_defaults("text", list(family = theme_get()$text$family))
    update_geom_defaults("label", list(family = theme_get()$text$family))
    update_geom_defaults("text_repel", list(family = theme_get()$text$family))
    update_geom_defaults("label_repel", list(family = theme_get()$text$family))
} else {jp_font <- NULL}

データセット

以下では3つのデータセットを使用する．

tokyo_weather.csv の概要

気象庁より取得した東京の気候データを整理したもの

year,month,day,day_of_week : 年，月，日，曜日
temp : 平均気温(℃)
rain : 降水量の合計(mm)
solar : 合計全天日射量(MJ/㎡)
snow : 降雪量合計(cm)
wdirs : 最多風向(16方位)
wind : 平均風速(m/s)
press : 平均現地気圧(hPa)
humid : 平均湿度(％)
cloud : 平均雲量(10分比)

Wine Quality Data Set

UC Irvaine の Machine Learning Repository で公開されているデータ．物理化学的な試験結果からワインの質を予測する練習問題データとして良く知られる．

詳細は https://archive.ics.uci.edu/dataset/186/wine+quality を参照

データの読み込み方の例．元のデータは10段階の評定(quality)だが，これからA-Dの4段階の評価(grade)を構成している．

wq_data <-
    read_delim("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
               delim = ";") |>
    mutate(grade = factor(
               case_when( # quality を A,B,C,D に割り当てる
                   quality >= 7 ~ "A",
                   quality >= 6 ~ "B",
                   quality >= 5 ~ "C",
                   .default = "D"
               )))

MASS::biopsy の概要

Biopsy Data on Breast Cancer Patients

This breast cancer database was obtained from the University of Wisconsin Hospitals, Madison from Dr. William H. Wolberg. He assessed biopsies of breast tumours for 699 patients up to 15 July 1992; each of nine attributes has been scored on a scale of 1 to 10, and the outcome is also known. There are 699 rows and 11 columns.

詳細は ?MASS::biopsy を参照

データの整理の仕方の例．元のデータは data.frame クラスなので，tibble クラスに変換して整理しておく．

glimpse(MASS::biopsy) # 各変数の属性を確認する．
bio_data <-
    MASS::biopsy |>
    as_tibble() |>    # tibble クラスに変換
    na.omit()         # NA を除く

判別分析に用いる主な関数

分析に用いる主な関数としては以下がある．

MASS::lda(): 線形判別分析の実行
MASS::qda(): 2次判別分析の実行
yardstick::conf_mat(): 混同行列の計算
自作 tidy() : 判別分析の推定結果
自作 augment() : 判別分析のあてはめ値・予測値

自作の関数(再掲)

#' lda クラス用
tidy.lda <- function(x, type = c("statistic","estimate")){
  type <- match.arg(type)
  lda_stat <- \(x){
    bind_cols(x[["prior"]] |> enframe(value = "prior"),
              x[["means"]] |> as_tibble() |> rename_with(\(x)paste0(x,"_ave")))
  }
  lda_est <- \(x){
    x[["scaling"]] |> as.data.frame() |> rownames_to_column() |> as_tibble()
  }
  switch(type,
         statistic = lda_stat(x),
         estimate = lda_est(x))
}
#' qda クラス用
tidy.qda <- function(x, type = c("statistic","estimate")){
  type <- match.arg(type)
  qda_stat <- \(x){
    bind_cols(x[["prior"]] |> enframe(value = "prior"),
              x[["means"]] |> as_tibble() |> rename_with(\(x)paste0(x,"_ave")))
  }
  qda_est <- \(x){
    bind_cols(x[["scaling"]] |> as.data.frame() |> rownames_to_column() |>
              as_tibble() |> rename_with(\(x)paste0("scale_",x)),
              x[["ldet"]] |> as_tibble_col(column_name = "log_det/2"))
  }
  switch(type,
         statistic = qda_stat(x),
         estimate = qda_est(x))
}

#' lda クラス用
augment.lda <- function(x, data = NULL, newdata = NULL) {
  if(!is.null(newdata)) {
    tmp <- predict(x, newdata = newdata)
  } else {
    tmp <- predict(x)
  }
  out <- bind_cols(
    tmp[["class"]] |> as_tibble_col(column_name = ".class"),
    tmp[["posterior"]] |> as_tibble() |> rename_with(\(x)paste0(".posterior",x)),
    tmp[["x"]] |> as_tibble() |> rename_with(\(x)paste0(".x",x))
  )
  if(!is.null(newdata)) {return(bind_cols(newdata, out))}
  if(!is.null(data)) {return(bind_cols(data, out))}
  return(out)
}
#' qda クラス用
augment.qda <- function(x, data = NULL, newdata = NULL) {
  if(!is.null(newdata)) {
    tmp <- predict(x, newdata = newdata)
  } else {
    tmp <- predict(x)
  }
  out <- bind_cols(
    tmp[["class"]] |> as_tibble_col(column_name = ".class"),
    tmp[["posterior"]] |> as_tibble() |> rename_with(\(x)paste0(".posterior",x)),
  )
  if(!is.null(newdata)) {return(bind_cols(newdata, out))}
  if(!is.null(data)) {return(bind_cols(data, out))}
  return(out)
}

典型的な分析の流れは以下のようになる．

#' データフレーム 'toy_data' を分析
formula <- class ~ . # class以外の変数で説明
toy_lda <- lda(formula, data = toy_data)
toy_qda <- qda(formula, data = toy_data)
#' 分析結果の確認
toy_lda   # 関数 print() による分析結果の表示．qda も同様
toy_lda |>
    tidy("statistic") # または "estimate"
#' 混同行列の計算と図示
toy_lda |>
    augment(data = toy_data) |>
    conf_mat(truth = class, estimate = .class) |>
    autoplot(type = "mosaic") # または "heatmap"

MASS::biopsy を用いた例

データフレームを tibble クラスに変換して整理する．

bio_data <-
    MASS::biopsy |>
    as_tibble() |>
    na.omit() # NA を除く

データの散布図を描く．

bio_data |>
    select(!ID) |> # IDを除く
    ggpairs(diag = list(mapping = aes(colour = class)),
            lower = list(mapping = aes(colour = class)))

主成分分析を用いて2次元に縮約した生検と良性・悪性の関係を視覚化する．

bio_data |> select(where(is.numeric)) |> prcomp() |>
    autoplot(data = bio_data,
             colour = "class")

線形判別関数による判別関数値の分布を見る．

bio_lda <- lda(class ~ . -ID, data = bio_data)
bio_lda |>
    augment(data = bio_data) |>
    ggplot(aes(x = .xLD1)) +
    geom_histogram(aes(fill = class), show.legend = FALSE) +
    facet_grid(class ~ .)

判別結果の評価

評価のための関数

評価のための枠組としては

caret : Max Kuhn @Posit による base R のためのパッケージ https://topepo.github.io/caret/
tidymodels : Max Kuhn, Hadley Wickham @Posit による tidyverse 向けに再設計されたパッケージ https://www.tidymodels.org

がある．本講義では tidymodels を中心に説明する．以下で用いる評価のための主な関数を列挙する．

混同行列の計算

conf_mat(data, truth, estimate,
         dnn = c("Prediction", "Truth"), case_weights = NULL, ...)
#' data: 真値と予測値が含まれるデータフレーム
#' truth: 真値(ラベル)の列名
#' estimate: 予測値(ラベル)の列名
#' 詳細は '?yardstick::conf_mat' を参照

集計した評価指標を得るには以下を用いればよい．

summary(object,
        prevalence = NULL, beta = 1, estimator = NULL,
        event_level = yardstick_event_level(), ...)
#' object: conf_mat の出力
#' 様々な評価指標を tibble 形式で出力
#' 詳細は '?yardstick::summary.conf_mat' を参照

混同行列を図示するには以下を用いればよい．

autoplot(object, type = c("mosaic", "heatmap"))
#' object: conf_mat の出力

ROC曲線の計算

roc_curve(data, truth, ...,
          na_rm = TRUE, event_level = yardstick_event_level(),
          case_weights = NULL, options = list())
#' data: 真値と予測値が含まれるデータフレーム
#' truth: 真値(ラベル)の列名
#' ...: 予測値(ラベル)の事後確率を与える列名
#' 詳細は '?yardstick::roc_curve' を参照

ROC曲線を図示するには以下を用いればよい．

autoplot(object)
#' object: roc_curve の出力

AUCの計算

roc_auc(data, truth, ..., estimator = NULL,
        na_rm = TRUE,  event_level = yardstick_event_level(),
        case_weights = NULL, options = list())
#' data: 真値と予測値が含まれるデータフレーム
#' truth: 真値(ラベル)の列名
#' ...: 予測値(ラベル)の事後確率を与える列名
#' 詳細は '?yardstick::roc_auc' を参照

問題

前回と同様に東京の気候データの線形判別を行い，以下を確認しなさい．
- 9月と10月の気温と湿度のデータを抽出する．
- 線形判別関数を構成する．
- 混同行列を用いて構成した判別関数の評価を行う．¹
- ROC曲線を描画しAUCを求める．²
```
#' 必要なデータを抽出して整理
tw_data <- read_csv("data/tokyo_weather.csv") 
tw_subset  <- tw_data |>
    filter(month %in% c(9,10)) |>
    select(temp, humid, month) |>
    mutate(month = as_factor(month)) # 月を因子化
```

¹ 関数 lda() の返値の class は vector 型なので，関数 mutate() を用いてデータフレームの列として加えることができる．

² 関数 lda() の返値の posterior は matrix 型なので，関数 bind_cols() を用いてデータフレームとして結合することができる．

解答欄

解答例

データを整理する．

tw_data <- read_csv("data/tokyo_weather.csv")
tw_subset  <-
    tw_data |> 
    filter(month %in% c(9,10)) |>    # 9,10月のデータ
    select(temp, humid, month) |>    # 気温・湿度・月を選択
    mutate(month = as_factor(month)) # 月を因子化

判別関数を作成する．

tw_lda <- lda(formula = month ~ temp + humid, data = tw_subset)

混同行列を用いて判別結果を評価する．

tw_lda_cm <-
    tw_lda |>
    augment(data = tw_subset) |>
    conf_mat(truth = month, estimate = .class)
tw_lda_cm          # 簡便な表示

          Truth
Prediction  9 10
        9  24  4
        10  6 27

summary(tw_lda_cm) # 詳細な表示

# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             binary         0.836
 2 kap                  binary         0.672
 3 sens                 binary         0.8  
 4 spec                 binary         0.871
 5 ppv                  binary         0.857
 6 npv                  binary         0.818
 7 mcc                  binary         0.673
 8 j_index              binary         0.671
 9 bal_accuracy         binary         0.835
10 detection_prevalence binary         0.459
11 precision            binary         0.857
12 recall               binary         0.8  
13 f_meas               binary         0.828

関数 ggplot2::autoplot() を用いて視覚化する．

autoplot(tw_lda_cm, type = "mosaic")  # モザイクプロット

autoplot(tw_lda_cm, type = "heatmap") # 行列表示

ROC曲線の描画を行う．

tw_lda |>
    augment(data = tw_subset) |>
    roc_curve(truth = month, .posterior9) |> # 9月への判別確率を陽性とする
    autoplot()

AUCを計算する．

tw_lda |>
    augment(data = tw_subset) |>
    roc_auc(truth = month, .posterior9)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.915

以下は12ヶ月分のデータを用いた例で，これを参考に説明変数は適宜選択して検討してほしい．

データを整理する．

tw_subset12  <- tw_data |>
    select(temp, solar, wind, humid, month) |>
    mutate(month = as_factor(month))

Jan, Feb などの文字を扱いたい場合は関数 as_factor() を用いるかわりに month(month, label = TRUE) とすればよい．その場合，列名の指定なども ‘Jan:Dec’ などと変わるので注意する．

判別関数を作成する．

tw_lda12 <- lda(month ~ ., # 右辺の . は month 以外の全てを説明変数として指定
                data = tw_subset12)

判別結果を評価する．

tw_lda12_cm <- 
    tw_lda12 |>
    augment(data = tw_subset12) |>
    conf_mat(truth = month, estimate = .class)
tw_lda12_cm # 表示

          Truth
Prediction  1  2  3  4  5  6  7  8  9 10 11 12
        1  12  6  5  0  0  0  0  0  0  0  0 12
        2   4 10  5  0  0  0  0  0  0  0  3  1
        3   2  5 12  2  0  0  0  0  0  0  0  0
        4   0  2  2 10  6  2  0  0  0  3  4  0
        5   0  1  2  7 15  3  0  0  0  0  0  0
        6   0  0  0  0  3 13  1  0  2  1  0  0
        7   0  0  0  0  0  1 22 17 10  0  0  0
        8   0  0  0  0  0  0  2  9  6  0  0  0
        9   0  0  0  0  1  4  5  5  7  4  0  0
        10  0  0  0  5  6  7  1  0  5 19  3  0
        11  0  1  3  6  0  0  0  0  0  4 14  4
        12 13  4  2  0  0  0  0  0  0  0  6 14

summary(tw_lda12_cm) # 詳細表示

# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             multiclass    0.429 
 2 kap                  multiclass    0.377 
 3 sens                 macro         0.428 
 4 spec                 macro         0.948 
 5 ppv                  macro         0.444 
 6 npv                  macro         0.948 
 7 mcc                  multiclass    0.379 
 8 j_index              macro         0.376 
 9 bal_accuracy         macro         0.688 
10 detection_prevalence macro         0.0833
11 precision            macro         0.444 
12 recall               macro         0.428 
13 f_meas               macro         0.424

autoplot(tw_lda12_cm, type = "mosaic") # モザイクプロット

autoplot(tw_lda12_cm, type = "heatmap") # 行列表示

tw_lda12 |>
    augment(data = tw_subset12) |>
    roc_curve(truth = month, .posterior1:.posterior12) |> 
    autoplot()

tw_lda12 |>
    augment(data = tw_subset12) |>
    roc_auc(truth = month, .posterior1:.posterior12)

# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc hand_till      0.918

訓練・試験データによる評価

訓練・試験データを作成するための関数

以下の関数群でデータセットを適切な条件で分割して利用することができる．

訓練・試験データの作成

以下の関数で分割を行う．

initial_split(data, prop = 3/4,
              strata = NULL, breaks = 4, pool = 0.1, ...)
#' data: データフレーム
#' prop: 訓練データの比率
#' strata: 層別に分割する場合の変数
#' 詳細は '?rsample::initial_split' を参照

訓練データを取得するためには以下の関数を用いる．

training(x, ...)
#' x : initial_split の出力

試験データを取得するためには以下の関数を用いる．

testing(x, ...)
#' x : initial_split の出力

問題

Wine Quality Data Set を用いて，以下を確認しなさい．
- 8:2の比率で訓練データと試験データに分割する．
- 訓練データを用いて線形・2次判別関数を構成する．
- 訓練データを用いて評価を行う．
- 試験データを用いて評価を行う．
```
#' データの分割の例
wq_split <-
    initial_split(wq_data, prop = 0.8, strata = grade)
training(wq_split)
testing(wq_split)
```

解答欄

解答例

データを整理する．

wq_data <-
    read_delim("https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv",
               delim = ";") |>
    mutate(grade = factor(case_when( # quality を A,B,C,D に割り当てる
               quality >= 7 ~ "A",
               quality >= 6 ~ "B",
               quality >= 5 ~ "C",
               .default = "D"
           )))

訓練・試験のためにデータを分割する．

set.seed(987987) # 適宜シード値は設定する
wq_split <- initial_split(wq_data, prop = 0.8,
                          strata = grade)

オプション strata (層別) を指定すると分割したデータの grade の比率が保たれる．

線形および2次判別関数を作成する．ただし，grade のもとになっている quality は除く．

wq_formula <- grade ~ . - quality
wq_lda <- lda(formula = wq_formula, data = training(wq_split))
wq_qda <- qda(formula = wq_formula, data = training(wq_split))

訓練データを用いて判別結果を評価する．

wq_lda_train_cm <- 
    wq_lda |>
    augment(data = training(wq_split)) |>
    conf_mat(truth = grade, estimate = .class)
summary(wq_lda_train_cm)                       # 線形判別の評価

# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             multiclass     0.615
 2 kap                  multiclass     0.380
 3 sens                 macro          0.484
 4 spec                 macro          0.844
 5 ppv                  macro          0.534
 6 npv                  macro          0.848
 7 mcc                  multiclass     0.381
 8 j_index              macro          0.328
 9 bal_accuracy         macro          0.664
10 detection_prevalence macro          0.25 
11 precision            macro          0.534
12 recall               macro          0.484
13 f_meas               macro          0.501

autoplot(wq_lda_train_cm, type = "heatmap") +
    labs(title = "線形判別 (訓練データ)")

wq_qda_train_cm <- 
    wq_qda |>
    augment(data = training(wq_split)) |>
    conf_mat(truth = grade, estimate = .class)
summary(wq_qda_train_cm)                       # 2次判別の評価

# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             multiclass     0.626
 2 kap                  multiclass     0.411
 3 sens                 macro          0.549
 4 spec                 macro          0.852
 5 ppv                  macro          0.568
 6 npv                  macro          0.852
 7 mcc                  multiclass     0.411
 8 j_index              macro          0.401
 9 bal_accuracy         macro          0.700
10 detection_prevalence macro          0.25 
11 precision            macro          0.568
12 recall               macro          0.549
13 f_meas               macro          0.556

autoplot(wq_qda_train_cm, type = "heatmap") +
    labs(title = "2次判別 (訓練データ)")

試験データを用いて判別結果を評価する．

wq_lda_test_cm <-
    wq_lda |>
    augment(newdata = testing(wq_split)) |>
    conf_mat(truth = grade, estimate = .class)
summary(wq_lda_test_cm)                        # 線形判別の評価

# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             multiclass     0.607
 2 kap                  multiclass     0.379
 3 sens                 macro          0.485
 4 spec                 macro          0.843
 5 ppv                  macro          0.569
 6 npv                  macro          0.847
 7 mcc                  multiclass     0.382
 8 j_index              macro          0.329
 9 bal_accuracy         macro          0.664
10 detection_prevalence macro          0.25 
11 precision            macro          0.569
12 recall               macro          0.485
13 f_meas               macro          0.508

autoplot(wq_lda_test_cm, type = "heatmap") +
    labs(title = "線形判別 (試験データ)")

wq_qda_test_cm <-
    wq_qda |>
    augment(newdata = testing(wq_split)) |>
    conf_mat(truth = grade, estimate = .class)
summary(wq_qda_test_cm)                        # 2次判別の評価

# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             multiclass     0.604
 2 kap                  multiclass     0.383
 3 sens                 macro          0.502
 4 spec                 macro          0.844
 5 ppv                  macro          0.541
 6 npv                  macro          0.847
 7 mcc                  multiclass     0.383
 8 j_index              macro          0.346
 9 bal_accuracy         macro          0.673
10 detection_prevalence macro          0.25 
11 precision            macro          0.541
12 recall               macro          0.502
13 f_meas               macro          0.513

autoplot(wq_qda_test_cm, type = "heatmap") +
    labs(title = "2次判別 (試験データ)")

LOO交叉検証法による予測誤差の評価

LOO交叉検証

関数 MASS::lda()/qda() はオプションで LOO交叉検証を行うことができる．

CV オプションの指定方法

toy_lda <- lda(formula, toy_data, CV = TRUE)
toy_lda[["class"]] # LOO CV による予測結果
#' 特定のデータを除いて判別関数を構成し，そのデータの予測を行っている
toy_qda <- qda(formula, toy_data, CV = TRUE)
toy_qda[["class"]] # LOO CV による予測結果
#' 2次判別についても同様

問題

データセット MASS::biopsy を用いて2次判別の分析を行いなさい．
- 全てのデータを用いて訓練誤差を評価する．
- LOO交叉検証法を用いて予測誤差を評価する．
```
bio_qda_loo <- qda(class ~ ., data = bio_data, CV = TRUE)
```

解答欄

解答例

データを整理する．

bio_data <-
    MASS::biopsy |>
    as_tibble() |>
    na.omit()

散布図で項目間の関係を確認する．

bio_data |>
    select(-ID) |> # 患者IDを除く
    ggpairs(diag = list(mapping = aes(colour = class)),
            lower = list(mapping = aes(colour = class)))

以下は2次判別の LOO CV による評価の例である．線形判別も同様に行うことができる．

bio_formula <- class ~ . - ID # IDを除く
bio_qda <- qda(formula = bio_formula, data = bio_data) 
bio_qda_loo <- qda(formula = bio_formula, data = bio_data, CV = TRUE)

訓練誤差での評価を確認し，可視化する．

bio_qda_cm <-
    bio_qda |>
    augment(data = bio_data) |>
    conf_mat(truth = class, estimate = .class)
summary(bio_qda_cm) # 2次判別によるあてはめ値の評価(訓練誤差)

# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             binary         0.959
 2 kap                  binary         0.911
 3 sens                 binary         0.950
 4 spec                 binary         0.975
 5 ppv                  binary         0.986
 6 npv                  binary         0.914
 7 mcc                  binary         0.912
 8 j_index              binary         0.925
 9 bal_accuracy         binary         0.963
10 detection_prevalence binary         0.627
11 precision            binary         0.986
12 recall               binary         0.950
13 f_meas               binary         0.968

autoplot(bio_qda_cm, type = "heatmap") +
    labs(title = "訓練誤差")

LOO CV での評価を確認し，可視化する．オプション CV=TRUE を指定した場合には返値が list クラスになるので，別途処理する必要があるので注意する．

bio_qda_loo_cm <-
    bio_data |>
    bind_cols(.class = bio_qda_loo[["class"]]) |>
    conf_mat(truth = class, estimate = .class)
summary(bio_qda_loo_cm) # LOO CVによる予測の評価(予測誤差)

# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             binary         0.950
 2 kap                  binary         0.893
 3 sens                 binary         0.937
 4 spec                 binary         0.975
 5 ppv                  binary         0.986
 6 npv                  binary         0.893
 7 mcc                  binary         0.895
 8 j_index              binary         0.912
 9 bal_accuracy         binary         0.956
10 detection_prevalence binary         0.618
11 precision            binary         0.986
12 recall               binary         0.937
13 f_meas               binary         0.961

autoplot(bio_qda_loo_cm, type = "heatmap") +
    labs(title = "試験誤差 (LOO CV)")

あてはめ値による評価は LOO CV より若干良くなっており，あてはめ値では精度を過剰に評価する可能性があることが示唆される．

k重交叉検証法による予測誤差の評価

k重交叉検証法のための関数

package::tidymodels の関数群を利用してk重交叉検証を行うことができる．

Note

基本的な手続きは以下のとおりである．

#' 交叉検証用のデータ分割 
#' 詳細は '?rsample::vfold_cv' を参照
vfold_cv(data, v = 10, repeats = 1,
         strata = NULL, breaks = 4, pool = 0.1, ...)
#' 最も簡単な処理の流れ(以下の関数の組み合わせ)
#' 詳細は '?workflows::workflow'
#' および '?tune::fit_resamples' を参照
workflow() |> 
    add_formula(目的変数 ~ 説明変数) |>
    add_model(推定に用いるモデル) |> 
    fit_resamples(resamples = vfold_cvの出力)
#' 評価の取得
#' 詳細は '?tune::collect_metrics' を参照
collect_metrics(fit_resamplesの出力)

問題

Wine Quality Data Set を用いて線形判別と2次判別の分析を行いなさい．

LOO交叉検証法を用いて予測誤差を評価する．
k重交叉検証法を用いて予測誤差を評価する．

#' lda/qda での LOO CV の指定の方法
wq_lda_loo <- lda(formula = wq_formula, data = training(wq_split), CV = TRUE)
wq_qda_loo <- qda(formula = wq_formula, data = training(wq_split), CV = TRUE)

解答例

前の問題で既に整理してある wq_data/wq_split を用いる． wq_lda と比較する．

wq_lda_loo <- lda(formula = wq_formula, data = training(wq_split), CV = TRUE)
#' 訓練誤差による評価
wq_lda |>
    augment(data = training(wq_split)) |> 
    conf_mat(truth = grade, estimate = .class) |>
    autoplot(type = "heatmap") +
    labs(title = "訓練誤差 (線形判別)")

#' LOO CV 予測誤差による評価
wq_lda_loo_cm <-
    training(wq_split) |> 
    bind_cols(.class = wq_lda_loo[["class"]]) |>
    conf_mat(truth = grade, estimate = .class)
wq_lda_loo_cm |>
    autoplot(type = "heatmap") +
    labs(title = "LOO CV (線形判別)")

線形判別の過学習は微小であることがわかる．

2次判別の LOO CV を wq_qda と比較する．

wq_qda_loo <- qda(formula = wq_formula, data = training(wq_split), CV = TRUE)
#' 訓練誤差による評価
wq_qda |>
    augment(data = training(wq_split)) |> 
    conf_mat(truth = grade, estimate = .class) |>
    autoplot(type = "heatmap") +
    labs(title = "訓練誤差 (2次判別)")

#' LOO CV 予測誤差による評価
wq_qda_loo_cm <-
    training(wq_split) |> 
    bind_cols(.class = wq_qda_loo[["class"]]) |>
    conf_mat(truth = grade, estimate = .class)
wq_qda_loo_cm |> autoplot(type = "heatmap") +
    labs(title = "LOO CV (2次判別)")

2次判別は若干過学習していることが示唆される．

LOO CV により線形・2次判別の予測誤差を比較する．

summary(wq_lda_loo_cm) # 線形

# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             multiclass     0.600
 2 kap                  multiclass     0.357
 3 sens                 macro          0.465
 4 spec                 macro          0.839
 5 ppv                  macro          0.507
 6 npv                  macro          0.842
 7 mcc                  multiclass     0.357
 8 j_index              macro          0.304
 9 bal_accuracy         macro          0.652
10 detection_prevalence macro          0.25 
11 precision            macro          0.507
12 recall               macro          0.465
13 f_meas               macro          0.479

summary(wq_qda_loo_cm) # 2次

# A tibble: 13 × 3
   .metric              .estimator .estimate
   <chr>                <chr>          <dbl>
 1 accuracy             multiclass     0.576
 2 kap                  multiclass     0.330
 3 sens                 macro          0.467
 4 spec                 macro          0.832
 5 ppv                  macro          0.481
 6 npv                  macro          0.833
 7 mcc                  multiclass     0.330
 8 j_index              macro          0.300
 9 bal_accuracy         macro          0.650
10 detection_prevalence macro          0.25 
11 precision            macro          0.481
12 recall               macro          0.467
13 f_meas               macro          0.473

予測誤差の観点からは線形判別の方が良さそうであることが伺える．

以下では package::tidymodels を用いた k重交叉検証の基本的な手続きを紹介する．

まず lda/qda を tidymodels 用に宣言する．

library(discrim) # 以下の判別モデルを設定するために必要
tidy_qda <- discrim_quad() |> set_engine("MASS") |> set_mode("classification")
tidy_lda <- discrim_linear() |> set_engine("MASS") |> set_mode("classification")

交叉検証用にデータ分割を行う．

wq_folds <- vfold_cv(training(wq_split),
                     v = 10) # 10-fold を指定 (既定値)

検証に用いる評価指標を定義する．

wq_metrics <- metric_set(accuracy,  # 精度
                         sens,      # 感度 (真陽性率)
                         spec,      # 特異度 (真陰性率)
                         precision, # 適合率
                         recall,    # 再現率 (sensと同じ)
                         roc_auc,   # AUC
                         kap,       # kappa値
                         f_meas)    # f値

共通の処理を定義する．

wq_workflow <-
    workflow() |>           # 枠組の入れ物を用意
    add_formula(wq_formula) # formula を指定

線形判別を行い，評価する．

wq_workflow |>
    add_model(tidy_lda) |>                 # 線形判別モデルを用いることを宣言
    fit_resamples(resamples = wq_folds,    # 交叉検証を実行
                  metrics = wq_metrics) |>
    collect_metrics()                      # 評価指標を整理

# A tibble: 8 × 6
  .metric   .estimator  mean     n std_err .config        
  <chr>     <chr>      <dbl> <int>   <dbl> <chr>          
1 accuracy  multiclass 0.597    10 0.00753 pre0_mod0_post0
2 f_meas    macro      0.467    10 0.0221  pre0_mod0_post0
3 kap       multiclass 0.349    10 0.0130  pre0_mod0_post0
4 precision macro      0.511    10 0.0339  pre0_mod0_post0
5 recall    macro      0.456    10 0.0187  pre0_mod0_post0
6 roc_auc   hand_till  0.784    10 0.0109  pre0_mod0_post0
7 sens      macro      0.456    10 0.0187  pre0_mod0_post0
8 spec      macro      0.837    10 0.00291 pre0_mod0_post0

2次判別を行い，評価する．

wq_workflow |>
    add_model(tidy_qda) |>                 # 2次判別モデルを用いることを宣言
    fit_resamples(resamples = wq_folds,
                  metrics = wq_metrics) |>
    collect_metrics()

# A tibble: 8 × 6
  .metric   .estimator  mean     n std_err .config        
  <chr>     <chr>      <dbl> <int>   <dbl> <chr>          
1 accuracy  multiclass 0.568    10 0.00958 pre0_mod0_post0
2 f_meas    macro      0.445    10 0.0217  pre0_mod0_post0
3 kap       multiclass 0.317    10 0.0171  pre0_mod0_post0
4 precision macro      0.460    10 0.0256  pre0_mod0_post0
5 recall    macro      0.447    10 0.0201  pre0_mod0_post0
6 roc_auc   hand_till  0.732    10 0.0192  pre0_mod0_post0
7 sens      macro      0.447    10 0.0201  pre0_mod0_post0
8 spec      macro      0.830    10 0.00401 pre0_mod0_post0