第8講判別分析

基本的な考え方

Published

November 28, 2025

準備

以下で利用する共通パッケージを読み込む．

library(conflicted) # 関数名の衝突を警告
conflicts_prefer(   # 優先的に使う関数を指定
    dplyr::filter(),
    dplyr::lag(),
    dplyr::select(),
    scales::discard(),
    recipes::fixed(),
    yardstick::spec(),
    recipes::step(),
)
library(tidyverse)  
library(ggfortify)
library(ggrepel)
library(GGally)
library(tidymodels) # モデル評価のためのメタパッケージ
library(MASS)
library(gt)         
library(gtsummary)  
#' macOSのための日本語表示の設定
if(Sys.info()["sysname"] == "Darwin") { # macOSか調べる
    jp_font <- "HiraMaruProN-W4"
    theme_update(text = element_text(family = jp_font))
    update_geom_defaults("text", list(family = theme_get()$text$family))
    update_geom_defaults("label", list(family = theme_get()$text$family))
    update_geom_defaults("text_repel", list(family = theme_get()$text$family))
    update_geom_defaults("label_repel", list(family = theme_get()$text$family))
} else {jp_font <- NULL}

データセット

以下のデータセットを使用する

tokyo_weather.csv の概要

気象庁より取得した東京の気候データを整理したもの

year,month,day,day_of_week : 年，月，日，曜日
temp : 平均気温(℃)
rain : 降水量の合計(mm)
solar : 合計全天日射量(MJ/㎡)
snow : 降雪量合計(cm)
wdirs : 最多風向(16方位)
wind : 平均風速(m/s)
press : 平均現地気圧(hPa)
humid : 平均湿度(％)
cloud : 平均雲量(10分比)
参考 : 気象庁

データの読み込み方

tw_data <- read_csv("data/tokyo_weather.csv")

判別分析に用いる主な関数

判別分析を行うRの標準的な関数として

MASS::lda()
MASS::qda()

がある．講義で説明するように観測データに関する仮定が異なるが，使い方はほぼ同様であるので，ここで関連する関数をまとめて紹介しておく．

線形判別分析の実行

lda(formula, data, ..., subset, na.action)
#' formula: モデル式 (ラベル ~ 判別に用いる変数)
#' data: 必要な情報を含むデータフレーム
#' 詳細は '?MASS::lda' を参照

2次判別分析の実行

qda(formula, data, ..., subset, na.action)
#' formula: モデル式 (ラベル ~ 判別に用いる変数)
#' data: 必要な情報を含むデータフレーム
#' 詳細は '?MASS::qda' を参照

結果を得る基本関数

分析結果の概要を確認するには，関数 MASS::lda()/qda() の返値を関数 base::print() で単に表示すれば良い．

print(x, ...)
#' x: lda/qda クラス
#' command line では単に print は書かなくてもよい

関数 MASS::lda()/qda() にどのような情報が含まれているかを見るには関数 pillar::glimpse() を利用する．

glimpse(x, width = NULL, ...)
#' x: lda/qda クラス
#' width: 表示の長さを指定．既定値(NULL)は表示環境が可能な範囲
#' 詳細は '?pillar::glimpse' を参照
#' 関数 utils::str() も同様な働きをする

判別のあてはめ値や予測値は関数 MASS::predict.lda()/qda() を利用して得ることができる．

判別結果の取得

predict(object, newdata, prior = object$prior, dimen,
        method = c("plug-in", "predictive", "debiased"), ...)
#' object: lda クラス
#' newdata: 予測の対象とするデータフレーム
#' prior: ラベルの事前分布
#' 返値はリスト型で以下の項目がある
#' class: 判別関数による予測ラベル
#' posterior: ラベルの事後確率
#' x: 判別関数値
#' 詳細は '?MASS::predict.lda' を参照

predict(object, newdata, prior = object$prior,
        method = c("plug-in", "predictive", "debiased", "looCV"), ...)
#' object: qda クラス
#' newdata: 予測対象のデータフレーム．省略すると推定に用いたデータフレーム
#' 返値はリスト型で以下の項目がある
#' class: 判別関数による予測ラベル
#' posterior: ラベルの事後確率
#' 詳細は '?MASS::predict.qda' を参照

判別結果などをデータフレームとして取得するための関数は現在のところpackage::broom には実装されていないので，関数 MASS::predict.lda()/qda() を用いて tibble クラスに情報を集約する関数を作成して利用することにする．

分析結果の取得

関数 tidy() は汎用関数(generic function)で，第1引数のクラスに依存してその働きを定義できる．関数 lda()/qda() は lda/qda クラスを返すので，クラスごとに定義する．以下はこの回の演習問題のために用いる例であり，必要に応じて自由に改変して構わない．

#' lda クラス用
tidy.lda <- function(x, type = c("statistic","estimate")){
  type <- match.arg(type)
  lda_stat <- \(x){ # 事前確率およびクラスごとの標本平均
    bind_cols(x[["prior"]] |> enframe(value = "prior"),
              x[["means"]] |> as_tibble() |> rename_with(\(x)paste0(x,"_ave")))
  }
  lda_est <- \(x){  # 推定された判別関数の係数
    x[["scaling"]] |> as.data.frame() |> rownames_to_column() |> as_tibble()
  }
  switch(type,
         statistic = lda_stat(x),
         estimate = lda_est(x))
}
#' qda クラス用
tidy.qda <- function(x, type = c("statistic","estimate")){
  type <- match.arg(type)
  qda_stat <- \(x){ # 事前確率およびクラスごとの標本平均
    bind_cols(x[["prior"]] |> enframe(value = "prior"),
              x[["means"]] |> as_tibble() |> rename_with(\(x)paste0(x,"_ave")))
  }
  qda_est <- \(x){ # 分散1共分散0に変換する行列(共分散の-1/2乗)および1/2log_det共分散
    bind_cols(x[["scaling"]] |> as.data.frame() |> rownames_to_column() |>
              as_tibble() |> rename_with(\(x)paste0("scale_",x)),
              x[["ldet"]] |> as_tibble_col(column_name = "log_det/2"))
  }
  switch(type,
         statistic = qda_stat(x),
         estimate = qda_est(x))
}

判別のあてはめ値・予測値の計算

関数 augment() も汎用関数である．

#' lda クラス用
augment.lda <- function(x, data = NULL, newdata = NULL) {
  if(!is.null(newdata)) {
    tmp <- predict(x, newdata = newdata)
  } else {
    tmp <- predict(x)
  }
  out <- bind_cols(
    tmp[["class"]] |> as_tibble_col(column_name = ".class"),
    tmp[["posterior"]] |> as_tibble() |> rename_with(\(x)paste0(".posterior",x)),
    tmp[["x"]] |> as_tibble() |> rename_with(\(x)paste0(".x",x))
  )
  if(!is.null(newdata)) {return(bind_cols(newdata, out))}
  if(!is.null(data)) {return(bind_cols(data, out))}
  return(out)
}
#' qda クラス用
augment.qda <- function(x, data = NULL, newdata = NULL) {
  if(!is.null(newdata)) {
    tmp <- predict(x, newdata = newdata)
  } else {
    tmp <- predict(x)
  }
  out <- bind_cols(
    tmp[["class"]] |> as_tibble_col(column_name = ".class"),
    tmp[["posterior"]] |> as_tibble() |> rename_with(\(x)paste0(".posterior",x)),
  )
  if(!is.null(newdata)) {return(bind_cols(newdata, out))}
  if(!is.null(data)) {return(bind_cols(data, out))}
  return(out)
}

視覚化はこれらの関数で取得したデータフレームを用いて行うことができる．また，関数 augment() で得られたあてはめ値や予測値と真の値を比べて正答率を評価するには関数 yardstick::conf_mat() が利用できる．

混同行列の作成

conf_mat(
  data,
  truth,
  estimate,
  dnn = c("Prediction", "Truth"),
  case_weights = NULL,
  ...
)
#' data: データフレーム
#' truth: 真値の列
#' estimate: あてはめ値・予測値の列
#' 詳細は '?yardstick::conf_mat' を参照

線形判別

問題

以下の人工データを用いて線形判別分析を行いなさい．

#' 分散共分散が同じ2クラスのデータ
n <- 40
toy_data <-
    bind_rows(
        tibble(x1 = rnorm(n, mean=2, sd=2),
               x2 = rnorm(n, mean=1, sd=1),
               class = rep(1, n)),
        tibble(x1 = rnorm(n, mean=-2, sd=2),
               x2 = rnorm(n, mean=-1, sd=1),
               class = rep(2, n))) |>
    mutate(class = as_factor(class))

東京の気候データを用いて以下の分析を行いなさい．

9月と10月の気温と湿度のデータを抽出する．
半分のデータを用いて線形判別関数を構成し，残りのデータを用いて判別を行う．

#' 必要なデータの抽出
tw_subset  <- tw_data |>
    filter(month %in% c(9,10)) |>
    select(temp, humid, month) |>
    mutate(month = as_factor(month))   # 月を因子化

#' データの分割の例
idx <- seq(1, nrow(tw_subset), by = 2) # データの分割(1つおき)
tw_train <- tw_subset[ idx,]           # 訓練データ
tw_test  <- tw_subset[-idx,]           # 試験データ
tw_lda <- lda(month ~ temp + humid, data = tw_train) # 線形判別関数の推定

解答欄

解答例

人工データ

指示に従ってデータを生成する．

n <- 40
toy_data <-
    bind_rows(
        tibble(x1 = rnorm(n, mean=2, sd=2),
               x2 = rnorm(n, mean=1, sd=1),
               class = rep(1, n)),
        tibble(x1 = rnorm(n, mean=-2, sd=2),
               x2 = rnorm(n, mean=-1, sd=1),
               class = rep(2, n))) |>
    mutate(class = as_factor(class))

散布図でデータの性質を確認する．

toy_data |>
    ggplot(aes(x = x1, y = x2, colour = class)) +
    geom_point()

判別分析を行う．

toy_model <- class ~ x1 + x2
toy_lda <- lda(formula = toy_model, data = toy_data)
toy_lda # 結果を表示

Call:
lda(toy_model, data = toy_data)

Prior probabilities of groups:
  1   2 
0.5 0.5 

Group means:
         x1         x2
1  1.710849  0.9251349
2 -1.761402 -0.9750832

Coefficients of linear discriminants:
          LD1
x1 -0.3984087
x2 -0.5219086

判別結果を整理する．

toy_lda |>
    tidy()                   # 推定値を確認

# A tibble: 2 × 4
  name  prior x1_ave x2_ave
  <chr> <dbl>  <dbl>  <dbl>
1 1       0.5   1.71  0.925
2 2       0.5  -1.76 -0.975

toy_lda |>
    augment(data = toy_data) # あてはめ値を確認

# A tibble: 80 × 7
       x1       x2 class .class .posterior1 .posterior2  .xLD1
    <dbl>    <dbl> <fct> <fct>        <dbl>       <dbl>  <dbl>
 1  2.55   0.00753 1     1            0.922    0.0776   -1.04 
 2 -1.09  -0.582   1     2            0.154    0.846     0.717
 3  5.26   2.45    1     1            1.000    0.000314 -3.40 
 4  1.10  -0.629   1     1            0.579    0.421    -0.135
 5  2.22   1.11    1     1            0.972    0.0284   -1.49 
 6  1.78   1.42    1     1            0.971    0.0292   -1.47 
 7  2.16   0.946   1     1            0.963    0.0367   -1.38 
 8  0.972  1.09    1     1            0.911    0.0893   -0.978
 9  2.53   0.367   1     1            0.948    0.0518   -1.22 
10  2.44   0.544   1     1            0.954    0.0457   -1.28 
# ℹ 70 more rows

toy_lda |>
    augment(data = toy_data) |> 
    conf_mat(truth = class, estimate = .class)   # 混同行列

          Truth
Prediction  1  2
         1 36  3
         2  4 37

toy_lda |>                                         
    augment(data = toy_data) |>
    ggplot(aes(x = .xLD1, fill = class)) +       # 判別関数値の視覚化
    geom_histogram(aes(y = after_stat(density)),
                   bins = 30) +
    facet_grid(class ~ .)                        # クラスごとに表示

関数 autoplot() を用いると混同行列を視覚化することができる．詳しくは ‘?yardstick::conf_mat’ を参照のこと．

toy_lda |>
    augment(data = toy_data) |> 
    conf_mat(truth = class, estimate = .class) |>
    autoplot(type = "mosaic") # "mosaic"(既定値)か"heatmap"が指定できる

推定された線形判別関数の識別境界を図示する．

range_x <- range(toy_data[["x1"]])               # x1の範囲
range_y <- range(toy_data[["x2"]])               # x2の範囲
toy_grid_x <- pretty(range_x, 100)               # 格子点の作成
toy_grid_y <- pretty(range_y, 100)
toy_grid_xy <- expand.grid(x1 = toy_grid_x,      # 2次元の格子点を作成
                           x2 = toy_grid_y) 
toy_lda |>
    augment(newdata = toy_grid_xy) |>            # 格子点上の判別関数値を計算
    ggplot(aes(x = x1, y = x2)) +                # 予測されるラベルの図示
    geom_tile(aes(fill = .class), alpha = 0.3) +
    labs(fill = NULL)

上記の図にデータ点を重ね描きする．

last_plot() +
    geom_point(data = toy_data,
               aes(x = x1, y = x2, colour = class)) +
    labs(colour = NULL)

判別関数値を色づけして図示するには以下のようにすればよい．

toy_lda |>
    augment(newdata = toy_grid_xy) |>
    ggplot(aes(x = x1, y = x2)) +
    geom_raster(aes(fill = .xLD1), alpha = 0.5) +
    scale_fill_gradientn(colours=c("red","white","blue")) +
    geom_point(data = toy_data,
               aes(x = x1, y = x2, colour = class)) +
    labs(fill = NULL, colour = NULL)

東京の気候データ

必要なデータを整理する．

tw_data <- read_csv("data/tokyo_weather.csv")
tw_subset  <- tw_data |> 
    filter(month %in% c(9,10)) |>      # 9,10月のデータ
    select(temp, humid, month) |>      # 気温・湿度・月を選択
    mutate(month = as_factor(month))   # 月を因子化
idx <- seq(1, nrow(tw_subset), by = 2) # データの分割(1つおき)
tw_train <- tw_subset[ idx,]           # 訓練データ(推定に用いる)
tw_test  <- tw_subset[-idx,]           # 試験データ(評価に用いる)

ランダムに分割するには例えば以下のようにすれば良い．

n <- nrow(tw_subset); idx <- sample(1:n, n/2)

関数 lubridate::month() を用いれば月を文字列(因子)とすることもできる．

mutate(month = month(month, label = TRUE))

以下のようにして最初に月を因子化することもできるが，関数によっていは処理で使われない因子(例えば1月)も表示してしまうので注意が必要である．

read_csv("data/tokyo_weather.csv") |>
    mutate(month = as_factor(month))

また複数の項目をそれぞれ因子化するには例えば以下のようにすればよい．

mutate(across(c(year, month, day), as_factor))

気温と湿度の散布図を作成して，データを視覚化する．

tw_subset |>  
    ggplot(aes(x = temp, y = humid)) + 
    geom_point(aes(colour = month)) +       # 月ごとに点の色を変える
    labs(x = "Temperature", y = "Humidity",
         title = "September & October")

訓練データを用いて線形判別関数(等分散性を仮定)を作成する．

tw_model <- month ~ temp + humid
tw_lda <- lda(formula = tw_model, data = tw_train)

判別関数によるクラス分類結果を確認する．

tw_lda |>
    augment(data = tw_train)                      # 訓練データのあてはめ値を取得

# A tibble: 31 × 7
    temp humid month .class .posterior9 .posterior10  .xLD1
   <dbl> <dbl> <fct> <fct>        <dbl>        <dbl>  <dbl>
 1  26.2    97 9     9            0.887       0.113  -1.11 
 2  24.9    88 9     9            0.736       0.264  -0.587
 3  26.7    76 9     9            0.870       0.130  -1.03 
 4  28.7    80 9     9            0.963       0.0367 -1.73 
 5  28.3    80 9     9            0.953       0.0469 -1.60 
 6  29.6    79 9     9            0.978       0.0215 -2.01 
 7  29.6    73 9     9            0.975       0.0246 -1.94 
 8  29.9    75 9     9            0.981       0.0195 -2.06 
 9  27.7    79 9     9            0.931       0.0687 -1.39 
10  27.7    84 9     9            0.938       0.0618 -1.45 
# ℹ 21 more rows

tw_lda |>
    augment(data = tw_train) |> 
    conf_mat(truth = month, estimate = .class)    # 混同行列

          Truth
Prediction  9 10
        9  12  2
        10  3 14

tw_lda |>                                         
    augment(data = tw_train) |>
    ggplot(aes(x = .xLD1, fill = month)) +        # 判別関数値の視覚化
    geom_histogram(aes(y = after_stat(density)),
                   bins = 30) +
    facet_grid(month ~ .)                         # 月ごとに表示(y方向に並べる)

関数 base::table() で混同行列を作ることはできる．表の向きは引数の順序で制御できる．

table(Truth = tw_train[["month"]],
      Prediction = predict(tw_lda)[["class"]])

     Prediction
Truth  9 10
   9  12  3
   10  2 14

試験データによる評価を行う．

tw_lda |>
    augment(newdata = tw_test) |>                 # 試験データの予測値を取得
    conf_mat(truth = month, estimate = .class)    # 混同行列

          Truth
Prediction  9 10
        9  13  2
        10  2 13

tw_lda |>
    augment(newdata = tw_test) |>
    ggplot(aes(x = .xLD1, fill = month)) +        # 判別関数値の視覚化
    geom_histogram(aes(y = after_stat(density)),
                   bins = 30) +
    facet_grid(month ~ .)

推定された線形判別関数の識別境界を図示する．

range_x <- range(tw_subset[["temp"]])             # 気温の値域
range_y <- range(tw_subset[["humid"]])            # 湿度の値域
grid_x <- pretty(range_x, 100)                    # 気温の値域の格子点を作成
grid_y <- pretty(range_y, 100)                    # 湿度の値域の格子点を作成
grid_xy <- expand.grid(temp = grid_x,
                       humid = grid_y)            # 2次元の格子点を作成
tw_lda |>
    augment(newdata = grid_xy) |>                 # 格子点上の判別関数値を計算
    ggplot(aes(x = temp, y = humid)) +            # 判別関数により予測されるラベルの図示
    geom_tile(aes(fill = .class), alpha = 0.3) +
    geom_point(data = tw_subset,
               aes(x = temp, y = humid, colour = month)) +
    labs(x = "気温", y = "湿度", fill = NULL, colour = NULL,
         title = "線形判別分析")

2次判別

問題

以下の人工データを用いて線形判別分析を行いなさい．

#' 分散共分散が異なる2クラスのデータ
n <- 40
toy_data <-
    bind_rows(
        tibble(x1 = rnorm(n, mean=2, sd=2),
               x2 = rnorm(n, mean=0, sd=3),
               class = rep(1, n)),
        tibble(x1 = rnorm(n, mean=-2, sd=2),
               x2 = rnorm(n, mean=0, sd=1),
               class = rep(2, n))) |>
    mutate(class = as_factor(class))

東京の気候データを用いて以下の分析を行いなさい．
- 前問と同様な設定で2次判別を行いなさい．
- 別の月や変数を用いて判別分析を行いなさい．
```
#' 2次判別関数の推定
tw_qda <- qda(month ~ temp + humid, data = tw_train) 
```

解答欄

解答例

人工データ

指示に従ってデータを生成する．

n <- 40
toy_data <-
    bind_rows(
        tibble(x1 = rnorm(n, mean=2, sd=2),
               x2 = rnorm(n, mean=0, sd=3),
               class = rep(1, n)),
        tibble(x1 = rnorm(n, mean=-2, sd=2),
               x2 = rnorm(n, mean=0, sd=1),
               class = rep(2, n))) |>
    mutate(class = as_factor(class))

散布図でデータの性質を確認する．

toy_data |>
    ggplot(aes(x = x1, y = x2, colour = class)) +
    geom_point()

判別分析を行う．

toy_model <- class ~ x1 + x2
toy_qda <- qda(formula = toy_model, data = toy_data)
toy_qda # 結果を表示

Call:
qda(toy_model, data = toy_data)

Prior probabilities of groups:
  1   2 
0.5 0.5 

Group means:
         x1          x2
1  1.490707 -0.27092098
2 -1.888392 -0.04892961

判別結果を確認する．

toy_qda |>
    tidy()                                       # 推定値を確認

# A tibble: 2 × 4
  name  prior x1_ave  x2_ave
  <chr> <dbl>  <dbl>   <dbl>
1 1       0.5   1.49 -0.271 
2 2       0.5  -1.89 -0.0489

toy_qda |>
    augment(data = toy_data)                     # あてはめ値を確認

# A tibble: 80 × 6
        x1     x2 class .class .posterior1 .posterior2
     <dbl>  <dbl> <fct> <fct>        <dbl>       <dbl>
 1 -0.486   1.06  1     2            0.289    7.11e- 1
 2  4.50   -6.30  1     1            1.000    8.30e- 8
 3  1.80   -2.50  1     1            0.904    9.56e- 2
 4 -0.0850 -0.593 1     2            0.251    7.49e- 1
 5 -0.795  -8.21  1     1            1.000    1.66e-11
 6  0.996  -2.35  1     1            0.809    1.91e- 1
 7 -2.61   -2.61  1     1            0.503    4.97e- 1
 8  0.0160  7.16  1     1            1.000    1.40e- 9
 9  0.994   1.85  1     1            0.796    2.04e- 1
10  1.20    2.11  1     1            0.882    1.18e- 1
# ℹ 70 more rows

toy_qda |>
    augment(data = toy_data) |> 
    conf_mat(truth = class, estimate = .class)   # 混同行列

          Truth
Prediction  1  2
         1 32  3
         2  8 37

推定された線形判別関数の識別境界を図示する．

range_x <- range(toy_data[["x1"]])          # x1の範囲
range_y <- range(toy_data[["x2"]])          # x2の範囲
toy_grid_x <- pretty(range_x, 100)          # 格子点の作成
toy_grid_y <- pretty(range_y, 100)
toy_grid_xy <- expand.grid(x1 = toy_grid_x, # 2次元の格子点を作成
                           x2 = toy_grid_y) 
toy_qda |>
    augment(newdata = toy_grid_xy) |>       # 格子点上の判別関数値を計算
    ggplot(aes(x = x1, y = x2)) +           # 予測されるラベルの図示
    geom_tile(aes(fill = .class), alpha = 0.3) +
    geom_point(data = toy_data,
               aes(x = x1, y = x2, colour = class)) +
    labs(fill = NULL, colour = NULL)

東京の気候データ

前の練習問題で作成した tw_train および tw_test を利用する．

まず，訓練データで2次判別関数(等分散性を仮定しない)を作成する．線形判別関数と同じモデル式を用いる．

tw_qda <- qda(formula = tw_model, data = tw_train)
tw_qda |>
    augment(data = tw_train) |>                   # 判別関数によるクラス分類結果の取得
    conf_mat(truth = month, estimate = .class)    # 混同行列

          Truth
Prediction  9 10
        9  10  4
        10  5 12

試験データによる評価を行う．

tw_qda |>
    augment(newdata = tw_test) |>
    conf_mat(truth = month, estimate = .class)

          Truth
Prediction  9 10
        9  13  2
        10  2 13

tw_qda_predict <- predict(tw_qda, newdata = tw_test) 
table(true = tw_test[["month"]], # 混同行列
      predict = tw_qda_predict[["class"]])

    predict
true  9 10
  9  13  2
  10  2 13

tibble(true = tw_test[["month"]]) |> # 比較表
    mutate(predict = tw_qda_predict[["class"]])

# A tibble: 30 × 2
   true  predict
   <fct> <fct>  
 1 9     9      
 2 9     9      
 3 9     9      
 4 9     9      
 5 9     9      
 6 9     9      
 7 9     9      
 8 9     9      
 9 9     9      
10 9     9      
# ℹ 20 more rows

推定された2次判別関数を図示する．

tw_qda |>
    augment(newdata = grid_xy) |>                 # 格子点上の判別関数値を計算
    ggplot(aes(x = temp, y = humid)) +            # 判別関数により予測されるラベルの図示
    geom_tile(aes(fill = .class), alpha = 0.3) +
    geom_point(data = tw_subset,                         
               aes(x = temp, y = humid, colour = month)) +
    labs(x = "気温", y = "湿度", fill = NULL, colour = NULL,
         title = "2次判別分析")

多値判別

問題

東京の気候データを用いて以下の分析を行いなさい．

9月，10月，11月の気温と湿度のデータを用いて判別関数を作成しなさい．
別の月や変数を用いて判別分析を行いなさい．

#' 9-11月のデータを整理
tw_subset  <- tw_data |>
    filter(month %in% c(9,10,11)) |>
    select(temp, humid, month) |>
    mutate(month = as_factor(month))

#' 雨の有無を識別する例
tw_subset2 <- tw_data |>
    mutate(rain = factor(rain > 0),
           month = as_factor(month)) |>     # 雨の有無でラベル化する
    select(rain, temp, solar, wind, month)
tw_lda2 <- lda(rain ~ ., data = tw_subset2) # 'rain' をそれ以外で判別

解答欄

解答例

3値判別のためにデータの整理を行う．

tw_subset3  <- tw_data |>
    filter(month %in% c(9,10,11)) |>
    select(temp, humid, month) |>
    mutate(month = as_factor(month))

線形判別関数(3値)を作成する．

tw_lda3 <- lda(formula = tw_model, data = tw_subset3)
tw_lda3 |>
    augment(data = tw_subset3) |>              # 判別関数によるクラス分類結果の取得
    conf_mat(truth = month, estimate = .class) # 混同行列

          Truth
Prediction  9 10 11
        9  24  4  0
        10  6 23  3
        11  0  4 27

推定された線形判別関数(3値)により予測されるラベルを図示する．格子点は再計算する必要があるので注意する．

range_x <- range(tw_subset3[["temp"]])  # 気温の値域
range_y <- range(tw_subset3[["humid"]]) # 湿度の値域
grid_x <- pretty(range_x, 100)          # 気温の値域の格子点を作成
grid_y <- pretty(range_y, 100)          # 湿度の値域の格子点を作成
grid_xy <- expand.grid(temp = grid_x,
                       humid = grid_y)  # 2次元の格子点を作成
tw_lda3 |>
    augment(newdata = grid_xy) |>       # 格子点上の判別関数値を計算
    ggplot(aes(x = temp, y = humid)) +
    geom_tile(aes(fill = .class), alpha = 0.3) +
    geom_point(data = tw_subset3,
               aes(x = temp, y = humid, colour = month)) +
    labs(x = "気温", y = "湿度", fill = NULL, colour = NULL,
         title = "多値判別分析")

3値判別の場合には2つの判別関数を構成するので，これを用いてデータ点の散布図を作成することができる．

tw_lda3 |>
    augment(data = tw_subset3) |>                     # 格子点上の判別関数値を計算
    ggplot(aes(x = .xLD1, y = .xLD2)) + 
    geom_point(aes(colour = month)) +                 # 月ごとに色を変更
    labs(colour = NULL)

関数 geom_tile() が座標軸に沿った格子点を想定しているため，判別関数値の散布図上で判別境界とデータ点を表示するには工夫が必要となる．ここでは関数 geom_point() で代用した簡便な例を示す．

last_plot() + geom_point(data = augment(tw_lda3, newdata = grid_xy),
                         aes(colour = .class), alpha = 0.1)

12ヶ月分のデータを用いる．数が多いのでサンプリングする．

idx <- sample(nrow(tw_data), 100)
tw_subset12 <- slice(tw_data, idx) |>
    mutate(month = as_factor(month))
tw_lda12 <- lda(month ~ temp + solar + wind + humid,
                data = tw_subset12)
tw_lda12 |> 
    augment(data = tw_subset12) |>
    conf_mat(truth = month, estimate = .class) |> # 混同行列
    autoplot("heatmap")

tw_lda12 |>
    augment(data = tw_subset12) |>
    select(month,starts_with(".x")) |>
    ggpairs(aes(colour = month),
            upper = list(continuous = wrap("cor", size = 1.5))) +
    theme(aspect.retio = 1)

判別関数は説明変数の数までしか作成できないので，精度はあまり高くないことがわかる．

雨の有無を識別する例は以下のようになる．

tw_rain <- tw_data |>
    mutate(rain = factor(rain > 0),                    # 雨の有無でラベル化する
           month = as_factor(month))                   # 月ごとの気候の違いの補正のため
tw_rain_lda <- lda(rain ~ temp + solar + wind + month,
                   data = tw_rain,
                   subset = idx)                       # 一部のデータで推定する例
tw_rain_lda |>                                         # 判別関数値の視覚化
    augment(data = tw_rain |> slice(idx)) |>
    ggplot(aes(x = .xLD1, fill = rain)) + 
    geom_histogram(aes(y = after_stat(density)),
                   bins = 30) +
    facet_grid(rain ~ .)

予測の精度を検証する．

tw_rain_lda |>
    augment(data = tw_rain |> slice(idx)) |>       # 推定に用いたデータの混同行列
    conf_mat(truth = rain, estimate = .class)

          Truth
Prediction FALSE TRUE
     FALSE    61    7
     TRUE      4   28

tw_rain_lda |>
    augment(newdata = tw_rain |> slice(-idx)) |>  # 未知データに対する予測の混同行列
    conf_mat(truth = rain, estimate = .class)

          Truth
Prediction FALSE TRUE
     FALSE   146   32
     TRUE     30   58