第12講分散分析

複数のグループ間の違いを検証する

Published

July 3, 2026

準備

以下で利用する共通パッケージを読み込む．

library(conflicted)  # 関数名の衝突を警告
conflicts_prefer(    # 優先的に使う関数を指定
    dplyr::filter(),
    dplyr::select(),
    dplyr::lag(),
    )
library(tidyverse)
library(broom)       # 関数 tidy() を利用するため
#' 日本語を用いるので macOS ではフォントの設定を行う
if(Sys.info()["sysname"] == "Darwin") { # macOS か調べて日本語フォントを指定
    theme_update(text = element_text(family = "HiraginoSans-W4"))
    update_geom_defaults("text", list(family = theme_get()$text$family))
    update_geom_defaults("label", list(family = theme_get()$text$family))}

一元配置分散分析

問題

一元配置分散分析について確率シミュレーションを行いなさい．

解答欄

解答例

人工データにより一元配置分散分析の Monte-Carlo 実験を行う．

A,B,Cの3つの水準をもつデータを考える．

factors <- as.factor(rep(LETTERS[1:3], c(10,8,12)))          # 因子(水準A,B,Cが10,8,12個)
toy_data0 <-                                                 # 平均が全て等しい場合 (帰無仮説が正しい例)
    tibble(factor = factors,                                 # 水準
           value = rnorm(length(factors), mean = 3, sd = 2)) # 観測値

データの状況を図示する．

toy_data0 |>
    ggplot(aes(x = factor, y = value)) +
    geom_boxplot(fill = "pink") +
    geom_jitter(colour = "red", width = 0.2) +
    labs(title = "帰無仮説が正しい場合")

水準ごとの平均にずれのある対立仮説の例を作成する．

alt <- rep(c(1,-1,1), c(10,8,12))                                  # 水準ごとの平均のずれ
toy_data1 <-                                                       # 平均が水準ごとに異なる場合(対立仮説が正しい例)
    tibble(factor = factors,                                       # 水準
           value = rnorm(length(factors), mean = 3, sd = 2) + alt) # 観測値

同様に図示する．

toy_data1 |>
    ggplot(aes(x = factor, y = value)) +
    geom_boxplot(fill = "lightblue") +
    geom_jitter(colour = "blue", width = 0.2) +
    labs(title = "帰無仮説が誤りの場合")

これらの2つのデータに対して分散分析を行う．帰無仮説が正しい場合は以下のようになる．

aov(value ~ factor, data = toy_data0)

Call:
   aov(formula = value ~ factor, data = toy_data0)

Terms:
                  factor Residuals
Sum of Squares   0.72179  87.77530
Deg. of Freedom        2        27

Residual standard error: 1.803036
Estimated effects may be unbalanced

より詳細な情報を得るには関数 base::summary() または broom::tidy() を用いる．

aov(value ~ factor, data = toy_data0) |>
    summary() # 詳細な表示

            Df Sum Sq Mean Sq F value Pr(>F)
factor       2   0.72   0.361   0.111  0.895
Residuals   27  87.78   3.251

aov(value ~ factor, data = toy_data0) |>
    tidy()    # tibble形式のデータフレームに変換

# A tibble: 2 × 6
  term         df  sumsq meansq statistic p.value
  <chr>     <dbl>  <dbl>  <dbl>     <dbl>   <dbl>
1 factor        2  0.722  0.361     0.111   0.895
2 Residuals    27 87.8    3.25     NA      NA

関数 aov() のようにデータが第1引数でない場合にパイプ演算子を使うには _ (placeholder) を使って位置を明示すればよい．

toy_data0 |>
    aov(value ~ factor, data = _) |>
    tidy()

# A tibble: 2 × 6
  term         df  sumsq meansq statistic p.value
  <chr>     <dbl>  <dbl>  <dbl>     <dbl>   <dbl>
1 factor        2  0.722  0.361     0.111   0.895
2 Residuals    27 87.8    3.25     NA      NA

帰無仮説が誤り，すなわち対立仮説が正しい場合は以下のようになる．

toy_data1 |>
    aov(value ~ factor, data = _)

Call:
   aov(formula = value ~ factor, data = toy_data1)

Terms:
                   factor Residuals
Sum of Squares   33.15607 115.30465
Deg. of Freedom         2        27

Residual standard error: 2.066529
Estimated effects may be unbalanced

toy_data1 |>
    aov(value ~ factor, data = _) |>
    tidy()

# A tibble: 2 × 6
  term         df sumsq meansq statistic p.value
  <chr>     <dbl> <dbl>  <dbl>     <dbl>   <dbl>
1 factor        2  33.2  16.6       3.88  0.0330
2 Residuals    27 115.    4.27     NA    NA

Monte-Carlo 実験を行う．まず，試行のための関数を作成する．

mc_trial <- function(h0 = TRUE){
    if(h0) { # 帰無仮説が正しい場合
        tmp <-
            tibble(factor = factors,
                   value = rnorm(length(factors), mean = 3, sd = 2)) |>
            aov(value ~ factor, data = _) |>
            tidy()
    } else { # 対立仮説が正しい場合
        tmp <- 
            tibble(factor = factors, # 水準ごとに平均が異なる
                   value = rnorm(length(factors), mean = 3, sd = 2) + alt) |>
            aov(value ~ factor, data = _) |>
            tidy()
    }
    #' 分散分析表における factor 行(第1行)の統計量(第5列)とp値(第6列)を返す
    return(c(statistic = tmp[1,5, drop = TRUE],
             p.value   = tmp[1,6, drop = TRUE]))
}

関数 mc_trial() の返値は以下のように書いてもよい

tmp |>
    filter(term == 'factor') |>   # factor (第1行)
    select(statistic, p.value) |> # 統計量とp値 (第5,6列)
    unlist()                      # データフレーム(リスト)を展開してベクトル化

帰無仮説が正しい場合のF統計量およびp値の分布を調べる．この場合，p値は一様分布になることに注意する．

replicate(2000, mc_trial()) |>
    t() |> as_tibble() |> 
    pivot_longer(everything()) |>           # 描画用に変形
    ggplot(aes(x = value, fill = name)) +
    geom_histogram(show.legend = FALSE) +
    facet_wrap(vars(name),                 # 統計量とp値を個別に表示
               ncol = 2,                   # 2列で表示
               scales = "free") +          # それぞれ個別の座標で描画
    labs(title = "帰無仮説が正しい場合")

対立仮説が正しい場合のF統計量およびp値の分布を調べる．この場合，p値は小さな値に偏ることが確認できる．

replicate(2000, mc_trial(h0 = FALSE)) |>
    t() |> as_tibble() |>
    pivot_longer(everything()) |> 
    ggplot(aes(x = value, fill = name)) +
    geom_histogram(show.legend = FALSE) +
    facet_wrap(vars(name),
               ncol = 2,
               scales = "free") + 
    labs(title = "帰無仮説が誤りの場合")

問題

東京の気候データ (tokyo_weather.csv) の気温の項目について以下の問に答えよ．

曜日ごとの気温の平均と分散を求めよ．
曜日ごとに平均が異なるといえるかどうか分散分析を用いて検定しなさい．

ヒント

曜日を因子(factor)とするには例えば以下のような項目を加えればよい．

tw_data <- read_csv("data/tokyo_weather.csv") |>
    mutate(day_of_week = ordered(day_of_week, # 曜日を順序付き因子にする
                                 levels = wday(1:7, label = TRUE)))

関数 lubridate::wday() の出力は使用言語に依存するので，強制的に英語にするには locale を指定して wday(..., locale = "en_US") とすれば良い．省略形を用いない場合は abbr を指定する．以下はドイツ語の例．

wday(1:7, label = TRUE, abbr = FALSE, locale = "de_DE")

利用可能な言語を調べるには関数 stringi::stri_locale_list() を用いる．

解答欄

解答例

気候データにおいて曜日ごとの気温に差があるか否かを分散分析を用いて調べる．

データを読み込み，整理する．

tw_data <-
    read_csv("data/tokyo_weather.csv") |>
    mutate(day_of_week = ordered(day_of_week, # 曜日を順序付き因子にする
                                 levels = wday(1:7, label = TRUE)))

曜日ごとの分布を見るとともに平均・分散(標準偏差)を確認する．

tw_data |>
    ggplot(aes(x = day_of_week, y = temp)) +
    geom_boxplot(fill = "lavender") +
    labs(x = "曜日", y = "気温", title = "曜日ごとの気温の分布")

tw_data |>
    group_by(day_of_week) |>
    summarize(across(temp,
                     list(mean = mean,
                          var = var,
                          sd = sd)))

# A tibble: 7 × 4
  day_of_week temp_mean temp_var temp_sd
  <ord>           <dbl>    <dbl>   <dbl>
1 Sun              17.6     78.0    8.83
2 Mon              17.8     75.2    8.67
3 Tue              17.4     79.3    8.91
4 Wed              17.2     79.1    8.89
5 Thu              17.5     71.0    8.43
6 Fri              17.1     65.8    8.11
7 Sat              16.9     76.7    8.76

曜日ごとの気温差に関する分散分析を行う．

tw_data |>
    aov(temp ~ day_of_week, data = _)    # aovによる分析

Call:
   aov(formula = temp ~ day_of_week, data = tw_data)

Terms:
                day_of_week Residuals
Sum of Squares       29.139 26859.345
Deg. of Freedom           6       358

Residual standard error: 8.661761
Estimated effects may be unbalanced

tw_data |>
    aov(temp ~ day_of_week, data = _) |> 
    summary()                            # 分散分析表の表示(棄却されない)

             Df Sum Sq Mean Sq F value Pr(>F)
day_of_week   6     29    4.86   0.065  0.999
Residuals   358  26859   75.03

tw_data |>
    aov(temp ~ day_of_week, data = _) |> 
    tidy()                               # データフレーム (行・列の番号で必要な情報が取得できる)

# A tibble: 2 × 6
  term           df   sumsq meansq statistic p.value
  <chr>       <dbl>   <dbl>  <dbl>     <dbl>   <dbl>
1 day_of_week     6    29.1   4.86    0.0647   0.999
2 Residuals     358 26859.   75.0    NA       NA

tw_data |>
    aov(temp ~ day_of_week, data = _) |>
    model.tables(type = "means")         # 水準(曜日)ごとの平均値

Tables of means
Grand mean
         
17.36877 

 day_of_week 
      Sun   Mon   Tue   Wed   Thu   Fri  Sat
    17.62 17.77 17.42 17.23 17.54 17.11 16.9
rep 52.00 52.00 52.00 53.00 52.00 52.00 52.0

tw_data |>
    aov(temp ~ day_of_week, data = _) |>
    model.tables(type = "effects")       # 水準(曜日)ごとの効果

Tables of effects

 day_of_week 
        Sun     Mon      Tue     Wed     Thu     Fri     Sat
     0.2466  0.3985  0.05046 -0.1424  0.1755 -0.2553 -0.4707
rep 52.0000 52.0000 52.00000 53.0000 52.0000 52.0000 52.0000

検定のみ実行する場合は，関数 oneway.test() を用いることができる．

級内の分散が等しいことを仮定する場合は以下のように指定する．

tw_data |>
    oneway.test(temp ~ day_of_week, data = _, var.equal = TRUE)


    One-way analysis of means

data:  temp and day_of_week
F = 0.064732, num df = 6, denom df = 358, p-value = 0.9989

級内の分散が等しいことが仮定できない場合はWelchの近似法による検定が行われる．

tw_data |>
    oneway.test(temp ~ day_of_week, data = _)


    One-way analysis of means (not assuming equal variances)

data:  temp and day_of_week
F = 0.063645, num df = 6.00, denom df = 159.09, p-value = 0.999

月ごとの気温に差があるか否かを分散分析で検証する．図示した結果から棄却されることが予想される．

tw_data <-
    tw_data |>
    mutate(month = as_factor(month)) # 月を因子化
tw_data |>
    ggplot(aes(x = month, y = temp)) +
    geom_boxplot(fill = "lavender") +
    labs(x = "月", y = "気温", title = "月ごとの気温の分布")

tw_data |>
    aov(temp ~ month, data = _) |>
    summary()                       # 分散分析表の表示(棄却される)

             Df Sum Sq Mean Sq F value Pr(>F)    
month        11  24132  2193.8   280.9 <2e-16 ***
Residuals   353   2757     7.8                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

曜日の因子化などは package::lubridate に含まれる関数が利用できる．

曜日の文字列を得る方法には以下のものがある．

wday(1:7, label = TRUE)                 # 省略名

[1] Sun Mon Tue Wed Thu Fri Sat
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

wday(1:7, label = TRUE, abbr = FALSE)   # 正式名

[1] Sunday    Monday    Tuesday   Wednesday Thursday  Friday    Saturday 
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

wday(2:8, label = TRUE)                 # 月曜から始まる文字列を得る場合

[1] Mon  Tue  Wed  Thu  Fri  Sat  <NA>
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

wday(1:7, label = TRUE, week_start = 1) # 順序を月曜から始める場合

[1] Sun Mon Tue Wed Thu Fri Sat
Levels: Mon < Tue < Wed < Thu < Fri < Sat < Sun

日付から曜日を得る方法は例えば以下のようにすれば良い．

tw_data |>
    mutate(date = as_date(paste(year,month,day, sep = "-")), # 日付に変換
           weekday = wday(date, label = TRUE),               # 日付から曜日(順序付因子)を追加
           .before = year)                                   # year列の前に付加

二元配置分散分析

問題

package::datarium に含まれている jobsatisfaction データ (性別 (gender) と学歴 (education_level) の違いによる仕事の満足度 (score) を収集したデータ) について以下の問に答えよ．

データを適当な方法で可視化しなさい．
それぞれの因子で満足度の平均に違いがあるか，二元配置の分散分析を用いて検討しなさい．

ヒント

パッケージのインストールと読み込みは以下のように行うことができる．

install.packages("datarium") # インストール
library("datarium") # 読み込み

解答欄

解答例

datarium::jobsatisfaction は性別(gender)と学歴(education_level)による仕事の満足度(score)を集計したデータである．二元配置分散分析を用いて満足度にそれぞれの因子の効果があるかを検証する．

必要であれば package::datarium をインストールし，パッケージの読み込みを行う．

#' install.packages("datarium") # 検定用のデータが集められている
library("datarium")             # packageの読み込み
jobsatisfaction                 # データの内容を確認

# A tibble: 58 × 4
   id    gender education_level score
   <fct> <fct>  <fct>           <dbl>
 1 1     male   school           5.51
 2 2     male   school           5.65
 3 3     male   school           5.07
 4 4     male   school           5.51
 5 5     male   school           5.94
 6 6     male   school           5.8 
 7 7     male   school           5.22
 8 8     male   school           5.36
 9 9     male   school           4.78
10 10    male   college          6.01
# ℹ 48 more rows

データの視覚化を行う．

jobsatisfaction |>                   
    ggplot(aes(x = education_level,       # 学歴ごとに性別を比較
               y = score,
               fill = gender)) +          # 性別を色分け
    geom_boxplot()

jobsatisfaction |> 
    ggplot(aes(x = gender,                # 性別ごとに学歴を比較
               y = score,
               fill = education_level)) + # 学歴を色分け
    geom_boxplot()

二元配置分散分析を行い，結果をいくつかの方法で表示する．

jobsatisfaction |>
    aov(score ~ gender + education_level, data = _) |>
    summary()

                Df Sum Sq Mean Sq F value Pr(>F)    
gender           1   0.54    0.54   1.447  0.234    
education_level  2 113.68   56.84 152.172 <2e-16 ***
Residuals       54  20.17    0.37                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

jobsatisfaction |>
    aov(score ~ gender + education_level, data = _) |>
    model.tables(type = "means")

Tables of means
Grand mean
         
6.963276 

 gender 
      male female
     7.063   6.87
rep 28.000  30.00

 education_level 
    school college university
     5.594   6.351      8.846
rep 19.000  19.000     20.000

jobsatisfaction |>
    aov(score ~ gender + education_level, data = _) |>
    model.tables(type = "effects")

Tables of effects

 gender 
        male   female
     0.09994 -0.09328
rep 28.00000 30.00000

 education_level 
    school college university
    -1.369  -0.612      1.882
rep 19.000  19.000     20.000