統計データ解析I

第12講練習問題解答例

Published

July 11, 2025

準備

以下で利用する共通パッケージを読み込む．

library(conflicted)  # 関数名の衝突を警告
conflicts_prefer(    # 優先的に使う関数を指定
    dplyr::filter(),
    dplyr::select(),
    dplyr::lag(),
    )
library(tidyverse)
#' 日本語を用いるので macOS ではフォントの設定を行う
if(Sys.info()["sysname"] == "Darwin") { # macOS か調べて日本語フォントを指定
    theme_update(text = element_text(family = "HiraginoSans-W4"))
    update_geom_defaults("text", list(family = theme_get()$text$family))
    update_geom_defaults("label", list(family = theme_get()$text$family))}

一元配置分散分析

問題

一元配置分散分析について確率シミュレーションを行いなさい．

解答例

人工データにより一元配置分散分析の Monte-Carlo 実験を行う．

A,B,Cの3つの水準をもつデータを考える．

fact <- as.factor(rep(LETTERS[1:3], c(10,8,12))) # 因子(水準A,B,Cが10,8,12個)
toy_data0 <- # 平均が全て等しい場合 (帰無仮説が正しい例)
    tibble(factor = fact,                                 # 水準
           value = rnorm(length(fact), mean = 3, sd = 2)) # 観測値

データの状況を図示する．

toy_data0 |>
    ggplot(aes(x = factor, y = value)) +
    geom_boxplot(fill = "pink") +
    geom_jitter(colour = "red", width = 0.2) +
    labs(title = "帰無仮説が正しい場合")

水準ごとの平均にずれのある対立仮説の例を作成する．

alt <- rep(c(1,-1,1), c(10,8,12)) # 水準ごとの平均のずれ
toy_data1 <- # 平均が水準ごとに異なる場合(対立仮説が正しい例)
    tibble(factor = fact,                                       # 水準
           value = rnorm(length(fact), mean = 3, sd = 2) + alt) # 観測値

同様に図示する．

toy_data1 |>
    ggplot(aes(x = factor, y = value)) +
    geom_boxplot(fill = "lightblue") +
    geom_jitter(colour = "blue", width = 0.2) +
    labs(title = "帰無仮説が誤りの場合")

これらの2つのデータに対して分散分析を行う．帰無仮説が正しい場合は以下のようになる．

aov(value ~ factor, data = toy_data0)

Call:
   aov(formula = value ~ factor, data = toy_data0)

Terms:
                  factor Residuals
Sum of Squares   9.17250  82.59041
Deg. of Freedom        2        27

Residual standard error: 1.748972
Estimated effects may be unbalanced

aov(value ~ factor, data = toy_data0) |>
    broom::tidy() # 分散分析表

# A tibble: 2 × 6
  term         df sumsq meansq statistic p.value
  <chr>     <dbl> <dbl>  <dbl>     <dbl>   <dbl>
1 factor        2  9.17   4.59      1.50   0.241
2 Residuals    27 82.6    3.06     NA     NA

帰無仮説が誤り，すなわち対立仮説が正しい場合は以下のようになる．

aov(value ~ factor, data = toy_data1)

Call:
   aov(formula = value ~ factor, data = toy_data1)

Terms:
                  factor Residuals
Sum of Squares  68.24978  85.14212
Deg. of Freedom        2        27

Residual standard error: 1.775785
Estimated effects may be unbalanced

aov(value ~ factor, data = toy_data1) |>
    broom::tidy() # 分散分析表

# A tibble: 2 × 6
  term         df sumsq meansq statistic   p.value
  <chr>     <dbl> <dbl>  <dbl>     <dbl>     <dbl>
1 factor        2  68.2  34.1       10.8  0.000354
2 Residuals    27  85.1   3.15      NA   NA

Note

関数 aov() のようにデータが第1引数でない場合にパイプ演算子を使うには _ (placeholder) を使って位置を明示すればよい．

toy_data0 |> aov(value ~ factor, data = _)

Call:
   aov(formula = value ~ factor, data = toy_data0)

Terms:
                  factor Residuals
Sum of Squares   9.17250  82.59041
Deg. of Freedom        2        27

Residual standard error: 1.748972
Estimated effects may be unbalanced

toy_data0 |>
    aov(value ~ factor, data = _) |>
    broom::tidy()

# A tibble: 2 × 6
  term         df sumsq meansq statistic p.value
  <chr>     <dbl> <dbl>  <dbl>     <dbl>   <dbl>
1 factor        2  9.17   4.59      1.50   0.241
2 Residuals    27 82.6    3.06     NA     NA

Monte-Carlo 実験を行う．まず，試行のための関数を作成する．

mc_trial <- function(h0 = TRUE){
    if(h0) { # 帰無仮説が正しい場合
        tmp <-
            tibble(factor = fact,
                   value = rnorm(length(fact), mean = 3, sd = 2)) |>
            aov(value ~ factor, data = _) |> broom::tidy()
    } else { # 対立仮説が正しい場合
        tmp <- 
            tibble(factor = fact, # 水準ごとに平均が異なる
                   value = rnorm(length(fact), mean = 3, sd = 2) + alt) |>
            aov(value ~ factor, data = _) |> broom::tidy()
    }
    #' 分散分析表における factor 行の統計量とp値を返す
    return(c(statistic = tmp[1,5, drop = TRUE],
             p.value   = tmp[1,6, drop = TRUE]))
    #' 以下のように書いても良い
    #' return(tmp |> select(5:6) |> t() |> _[,1])
}

帰無仮説が正しい場合のF統計量およびp値の分布を調べる．この場合，p値は一様分布になることに注意する．

replicate(2000, mc_trial()) |>
    t() |> as_tibble() |> pivot_longer(everything()) |> # 描画用に変形
    ggplot(aes(x = value, fill = name)) +
    geom_histogram(show.legend = FALSE) +
    facet_wrap(vars(name), ncol = 2, scales = "free") + # それぞれ個別の座標で描画
    labs(title = "帰無仮説が正しい場合")

対立仮説が正しい場合のF統計量およびp値の分布を調べる．この場合，p値は小さな値に偏ることが確認できる．

replicate(2000, mc_trial(h0 = FALSE)) |>
    t() |> as_tibble() |> pivot_longer(everything()) |> 
    ggplot(aes(x = value, fill = name)) +
    geom_histogram(show.legend = FALSE) +
    facet_wrap(vars(name), ncol = 2, scales = "free") + 
    labs(title = "帰無仮説が誤りの場合")

問題

東京の気候データ (tokyo_weather.csv) の気温の項目について以下の問に答えよ．

曜日ごとの気温の平均と分散を求めよ．
曜日ごとに平均が異なるといえるかどうか分散分析を用いて検定しなさい．

Note

週日を因子(factor)とするには例えば以下のような項目を加えればよい．

tw_data <- read_csv("data/tokyo_weather.csv") |>
    mutate(day_of_week = ordered(day_of_week, # 曜日を順序付き因子にする
                                 levels = wday(1:7, label = TRUE)))

関数 lubridate::wday() の出力は使用言語に依存するので，強制的に英語にするには locale を指定して wday(..., locale = "en_US") とすれば良い．省略形を用いない場合は abbr を指定する．以下はドイツ語の例．

wday(1:7, label = TRUE, abbr = FALSE, locale = "de_DE")

利用可能な言語を調べるには stringi::stri_locale_list() を用いる．

解答例

気候データにおいて曜日ごとの気温に差があるか否かを分散分析を用いて調べる．

データを読み込み，整理する．

tw_data <- read_csv("data/tokyo_weather.csv") |>
    mutate(day_of_week = ordered(day_of_week, # 曜日を順序付き因子にする
                                 levels = wday(1:7, label = TRUE)))

曜日ごとの分布を見るとともに平均・分散(標準偏差)を確認する．

tw_data |>
    ggplot(aes(x = day_of_week, y = temp)) +
    geom_boxplot(fill = "lavender") +
    labs(x = "曜日", y = "気温", title = "曜日ごとの気温の分布")

tw_data |>
    group_by(day_of_week) |>
    summarize(across(temp, list(mean = mean,
                                var = var,
                                sd = sd)))

# A tibble: 7 × 4
  day_of_week temp_mean temp_var temp_sd
  <ord>           <dbl>    <dbl>   <dbl>
1 Sun              18.0     69.2    8.32
2 Mon              17.8     71.6    8.46
3 Tue              17.3     70.9    8.42
4 Wed              17.5     70.2    8.38
5 Thu              17.5     69.5    8.34
6 Fri              17.5     71.7    8.46
7 Sat              17.9     73.6    8.58

曜日ごとの気温差に関する分散分析を行う．

aov(temp ~ day_of_week, data = tw_data) # aovによる分析

Call:
   aov(formula = temp ~ day_of_week, data = tw_data)

Terms:
                day_of_week Residuals
Sum of Squares       20.965 25478.308
Deg. of Freedom           6       359

Residual standard error: 8.424382
Estimated effects may be unbalanced

aov(temp ~ day_of_week, data = tw_data) |>
    summary() # 分散分析表の表示(棄却されない)

             Df Sum Sq Mean Sq F value Pr(>F)
day_of_week   6     21    3.49   0.049      1
Residuals   359  25478   70.97

aov(temp ~ day_of_week, data = tw_data) |>
    broom::tidy() # 分散分析表のデータフレーム (行・列の番号で必要な情報が取得できる)

# A tibble: 2 × 6
  term           df   sumsq meansq statistic p.value
  <chr>       <dbl>   <dbl>  <dbl>     <dbl>   <dbl>
1 day_of_week     6    21.0   3.49    0.0492    1.00
2 Residuals     359 25478.   71.0    NA        NA

aov(temp ~ day_of_week, data = tw_data) |>
    model.tables(type = "means")   # 水準(曜日)ごとの平均値

Tables of means
Grand mean
         
17.64809 

 day_of_week 
      Sun   Mon   Tue   Wed  Thu   Fri   Sat
    18.03 17.75 17.32 17.52 17.5 17.49 17.93
rep 52.00 53.00 53.00 52.00 52.0 52.00 52.00

aov(temp ~ day_of_week, data = tw_data) |>
    model.tables(type = "effects") # 水準(曜日)ごとの効果

Tables of effects

 day_of_week 
        Sun     Mon     Tue     Wed     Thu     Fri     Sat
     0.3788  0.1047 -0.3254 -0.1308 -0.1442 -0.1577  0.2788
rep 52.0000 53.0000 53.0000 52.0000 52.0000 52.0000 52.0000

検定のみ実行する場合は，関数 oneway.test() を用いることができる．

級内の分散が等しいことを仮定する場合は以下のように指定する．

oneway.test(temp ~ day_of_week, data = tw_data, var.equal = TRUE)


    One-way analysis of means

data:  temp and day_of_week
F = 0.049235, num df = 6, denom df = 359, p-value = 0.9995

級内の分散が等しいことが仮定できない場合はWelchの近似法による検定が行われる．

oneway.test(temp ~ day_of_week, data = tw_data)


    One-way analysis of means (not assuming equal variances)

data:  temp and day_of_week
F = 0.048376, num df = 6.00, denom df = 159.54, p-value = 0.9995

月ごとの気温に差があるか否かを分散分析で検証する．図示した結果から棄却されることが予想される．

tw_data <- tw_data |> mutate(month = as_factor(month)) # 月を因子化
tw_data |>
    ggplot(aes(x = month, y = temp)) +
    geom_boxplot(fill = "lavender") +
    labs(x = "月", y = "気温", title = "月ごとの気温の分布")

aov(temp ~ month, data = tw_data) |>
    summary() # 分散分析表の表示(棄却される)

             Df Sum Sq Mean Sq F value Pr(>F)    
month        11  22916  2083.3   285.5 <2e-16 ***
Residuals   354   2583     7.3                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note

曜日の因子化などは package::lubridate に含まれる関数が利用できる．

曜日の文字列を得る方法には以下のものがある．

wday(1:7, label = TRUE) # 省略名

[1] Sun Mon Tue Wed Thu Fri Sat
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

wday(1:7, label = TRUE, abbr = FALSE) # 正式名

[1] Sunday    Monday    Tuesday   Wednesday Thursday  Friday    Saturday 
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday

wday(2:8, label = TRUE) # 月曜から始まる文字列を得る場合

[1] Mon  Tue  Wed  Thu  Fri  Sat  <NA>
Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat

wday(1:7, label = TRUE, week_start = 1) # 順序を月曜から始める場合

[1] Sun Mon Tue Wed Thu Fri Sat
Levels: Mon < Tue < Wed < Thu < Fri < Sat < Sun

日付から曜日を得る方法は例えば以下のようにすれば良い．

tw_data |>
    mutate(date = as_date(paste(year,month,day, sep = "-")), # 日付に変換
           weekday = wday(date, label = TRUE), # 日付から曜日(順序付因子)を追加
           .before = year) # year列の前に付加

# A tibble: 366 × 15
   date       weekday  year month   day day_of_week  temp  rain solar  snow
   <date>     <ord>   <dbl> <fct> <dbl> <ord>       <dbl> <dbl> <dbl> <dbl>
 1 2024-01-01 Mon      2024 1         1 Mon           9.2     0 11.7      0
 2 2024-01-02 Tue      2024 1         2 Tue           6.1     0  5.43     0
 3 2024-01-03 Wed      2024 1         3 Wed           7.4     0  5.91     0
 4 2024-01-04 Thu      2024 1         4 Thu           9.7     0 10.3      0
 5 2024-01-05 Fri      2024 1         5 Fri           7.4     0 11.4      0
 6 2024-01-06 Sat      2024 1         6 Sat           9.1     0 11.4      0
 7 2024-01-07 Sun      2024 1         7 Sun           8.3     0  7.85     0
 8 2024-01-08 Mon      2024 1         8 Mon           5.4     0 12.3      0
 9 2024-01-09 Tue      2024 1         9 Tue           5.4     0 11.9      0
10 2024-01-10 Wed      2024 1        10 Wed           7.2     0  9.35     0
# ℹ 356 more rows
# ℹ 5 more variables: wdir <chr>, wind <dbl>, press <dbl>, humid <dbl>,
#   cloud <dbl>

二元配置分散分析

問題

package::datarium に含まれている jobsatisfaction データ (性別 (gender) と学歴 (education_level) の違いによる仕事の満足度 (score) を収集したデータ) について以下の問に答えよ．

データを適当な方法で可視化しなさい．
それぞれの因子で満足度の平均に違いがあるか，二元配置の分散分析を用いて検討しなさい．

Note

パッケージのインストールと読み込みは以下のように行うことができる．

install.packages("datarium") # インストール
library("datarium") # 読み込み

解答例

datarium::jobsatisfaction は性別(gender)と学歴(education_level)による仕事の満足度(score)を集計したデータである．二元配置分散分析を用いて満足度にそれぞれの因子の効果があるかを検証する．

必要であれば package::datarium をインストールし，パッケージの読み込みを行う．

#' install.packages("datarium") # 検定用のデータが集められている
library("datarium") # packageの読み込み
jobsatisfaction     # データの内容を確認

# A tibble: 58 × 4
   id    gender education_level score
   <fct> <fct>  <fct>           <dbl>
 1 1     male   school           5.51
 2 2     male   school           5.65
 3 3     male   school           5.07
 4 4     male   school           5.51
 5 5     male   school           5.94
 6 6     male   school           5.8 
 7 7     male   school           5.22
 8 8     male   school           5.36
 9 9     male   school           4.78
10 10    male   college          6.01
# ℹ 48 more rows

データの視覚化を行う．

jobsatisfaction |> # 学歴ごとに性別を比較
    ggplot(aes(x = education_level, y = score, fill = gender)) +
    geom_boxplot()

jobsatisfaction |> # 性別ごとに学歴を比較
    ggplot(aes(x = gender, y = score, fill = education_level)) +
    geom_boxplot()

二元配置分散分析を行い，結果をいくつかの方法で表示する．

aov(score ~ gender + education_level, data=jobsatisfaction) |>
    summary()

                Df Sum Sq Mean Sq F value Pr(>F)    
gender           1   0.54    0.54   1.447  0.234    
education_level  2 113.68   56.84 152.172 <2e-16 ***
Residuals       54  20.17    0.37                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

aov(score ~ gender + education_level, data=jobsatisfaction) |>
    model.tables(type = "means")

Tables of means
Grand mean
         
6.963276 

 gender 
      male female
     7.063   6.87
rep 28.000  30.00

 education_level 
    school college university
     5.594   6.351      8.846
rep 19.000  19.000     20.000

aov(score ~ gender + education_level, data=jobsatisfaction) |>
    model.tables(type = "effects")

Tables of effects

 gender 
        male   female
     0.09994 -0.09328
rep 28.00000 30.00000

 education_level 
    school college university
    -1.369  -0.612      1.882
rep 19.000  19.000     20.000