Evaluating the Analysis
村田 昇
Posterior probability of each class after \(X=\boldsymbol{x}\) is observed, used to predict the class
\begin{equation} p_k(\boldsymbol{x})=P(Y=k|X=\boldsymbol{x}) \end{equation}
Discriminant functions: \(\delta_k(\boldsymbol{x})\) (\(k=1,\dots,K\))
\begin{equation} p_k(\boldsymbol{x}) < p_l(\boldsymbol{x}) \Leftrightarrow \delta_k(\boldsymbol{x}) < \delta_l(\boldsymbol{x}) \end{equation}
An easily computed function that preserves the ordering of the posterior probabilities
Covariance matrix \(\Sigma\): common to all classes
\begin{equation} f_k(\boldsymbol{x}) = \frac{1}{(2\pi)^{q/2}\sqrt{\det\Sigma}} \exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_k)^{\mathsf{T}} \Sigma^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_k)\right) \end{equation}
Linear discriminant function: a linear function of \(\boldsymbol{x}\)
\begin{equation} \delta_k(\boldsymbol{x}) = \boldsymbol{x}^{\mathsf{T}}\Sigma^{-1}\boldsymbol{\mu}_k -\frac{1}{2}\boldsymbol{\mu}_k^{\mathsf{T}}\Sigma^{-1}\boldsymbol{\mu}_k +\log\pi_k \end{equation}
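The linear form above follows from taking the logarithm of \(\pi_k f_k(\boldsymbol{x})\), which is proportional to the posterior probability, and dropping every term that does not depend on \(k\). A sketch of that step, assuming \(\pi_k\) denotes the prior probability of class \(k\) (the slides use the symbol without defining it):
\begin{align} \log\bigl(\pi_k f_k(\boldsymbol{x})\bigr) &= -\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_k)^{\mathsf{T}} \Sigma^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_k) +\log\pi_k -\frac{q}{2}\log(2\pi) -\frac{1}{2}\log\det\Sigma\\ &= \underbrace{\boldsymbol{x}^{\mathsf{T}}\Sigma^{-1}\boldsymbol{\mu}_k -\frac{1}{2}\boldsymbol{\mu}_k^{\mathsf{T}}\Sigma^{-1}\boldsymbol{\mu}_k +\log\pi_k}_{\delta_k(\boldsymbol{x})} -\frac{1}{2}\boldsymbol{x}^{\mathsf{T}}\Sigma^{-1}\boldsymbol{x} -\frac{q}{2}\log(2\pi) -\frac{1}{2}\log\det\Sigma \end{align}
The terms outside the brace are identical for every class, so they do not change the ordering of the classes. When the covariance matrix differs between classes, the quadratic term no longer cancels.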
Covariance matrix \(\Sigma_k\): different for each class
\begin{equation} f_k(\boldsymbol{x}) = \frac{1}{(2\pi)^{q/2}\sqrt{\det\Sigma_k}} \exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_k)^{\mathsf{T}} \Sigma_k^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_k)\right) \end{equation}
Quadratic discriminant function: a quadratic function of \(\boldsymbol{x}\)
\begin{equation} \delta_k(\boldsymbol{x}) = -\frac{1}{2}\log\det\Sigma_k -\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu}_k)^{\mathsf{T}} \Sigma_k^{-1}(\boldsymbol{x}-\boldsymbol{\mu}_k) +\log\pi_k \end{equation}
Fisher's criterion
\begin{equation} \text{maximize}\quad \boldsymbol{\alpha}^{\mathsf{T}} B\boldsymbol{\alpha} \quad\text{s.t.}\quad \boldsymbol{\alpha}^{\mathsf{T}} W\boldsymbol{\alpha}=\text{const.} \end{equation}
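One way to see how this constrained problem is solved is to introduce a Lagrange multiplier \(\lambda\). The sketch below assumes that \(B\) and \(W\) are the between-class and within-class scatter matrices (the usual reading of Fisher's criterion; the slides do not define them) and that \(W\) is invertible:
\begin{align} L(\boldsymbol{\alpha},\lambda) &= \boldsymbol{\alpha}^{\mathsf{T}} B\boldsymbol{\alpha} -\lambda\left(\boldsymbol{\alpha}^{\mathsf{T}} W\boldsymbol{\alpha}-\text{const.}\right)\\ \frac{\partial L}{\partial\boldsymbol{\alpha}} &= 2B\boldsymbol{\alpha}-2\lambda W\boldsymbol{\alpha}=\boldsymbol{0} \quad\Longrightarrow\quad W^{-1}B\boldsymbol{\alpha}=\lambda\boldsymbol{\alpha} \end{align}
At such a stationary point \(\boldsymbol{\alpha}^{\mathsf{T}} B\boldsymbol{\alpha}=\lambda\,\boldsymbol{\alpha}^{\mathsf{T}} W\boldsymbol{\alpha}\), so the maximum is attained by an eigenvector of \(W^{-1}B\) belonging to its largest eigenvalue.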
Simple error rate
\begin{equation} \text{(error rate)} =\frac{\text{(number of misclassified observations)}} {\text{(total number of observations)}} \end{equation}
Prediction / Truth | Actually positive | Actually negative |
---|---|---|
Predicted positive | True Positive (TP) | False Positive (FP) |
Predicted negative | False Negative (FN) | True Negative (TN) |

Truth / Prediction | Predicted positive | Predicted negative |
---|---|---|
Actually positive | True Positive (TP) | False Negative (FN) |
Actually negative | False Positive (FP) | True Negative (TN) |
Definitions
\begin{align} \text{(true positive rate)} &=\frac{TP}{TP+FN}\\ \text{(true negative rate)} &=\frac{TN}{FP+TN}\\ \text{(precision)} &=\frac{TP}{TP+FP}\\ \text{(accuracy)} &=\frac{TP+TN}{TP+FP+TN+FN} \end{align}
Sensitivity, also called recall
\begin{equation} \text{(true positive rate)} =\frac{TP}{TP+FN} \end{equation}
Specificity
\begin{equation} \text{(true negative rate)} =\frac{TN}{FP+TN} \end{equation}
Accuracy
\begin{equation} \text{(accuracy)} =\frac{TP+TN}{TP+FP+TN+FN} \end{equation}
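The rates defined above can be computed directly from the four cells of a confusion matrix. The following is a minimal sketch: the class name `ConfusionMatrix` and the counts are assumptions made for illustration, since the slides list no raw counts. With these particular counts the results round to 0.742, 0.700, 0.719 and 0.721, the same values as sens, spec, precision and accuracy in the first metrics table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Counts of a 2x2 confusion matrix for binary classification."""

    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives

    def sensitivity(self) -> float:
        """True positive rate (recall): TP / (TP + FN)."""
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        """True negative rate: TN / (FP + TN)."""
        return self.tn / (self.fp + self.tn)

    def precision(self) -> float:
        """Share of predicted positives that are correct: TP / (TP + FP)."""
        return self.tp / (self.tp + self.fp)

    def accuracy(self) -> float:
        """Share of all observations classified correctly."""
        return (self.tp + self.tn) / (self.tp + self.fp + self.tn + self.fn)


# Illustrative counts only (an assumption; the source lists no raw counts).
cm = ConfusionMatrix(tp=23, fp=9, tn=21, fn=8)
for name in ("sensitivity", "specificity", "precision", "accuracy"):
    print(f"{name}: {getattr(cm, name)():.3f}")
```

The sketch has no guard against an empty class or an empty set of predicted positives; in those cases the corresponding denominator is zero and the rate is undefined.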
Definition (F-measure, F-score)
\begin{align} F_{1}&=\frac{2}{{1}/{\text{(recall)}}+{1}/{\text{(precision)}}}\\ F_{\beta}&=\frac{\beta^{2}+1}{{\beta^{2}}/{\text{(recall)}}+{1}/{\text{(precision)}}} \end{align}
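As a small sketch of the two formulas, the function below evaluates \(F_{\beta}\) and obtains \(F_{1}\) as the special case \(\beta=1\). The function name `f_beta`, the zero-rate convention and the example rates are assumptions made for illustration.

```python
def f_beta(recall: float, precision: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of recall and precision (F_beta)."""
    if recall == 0.0 or precision == 0.0:
        # Convention assumed here: the score is 0 when either rate is 0.
        return 0.0
    b2 = beta ** 2
    return (b2 + 1.0) / (b2 / recall + 1.0 / precision)


# Illustrative rates (assumed counts TP=23, FN=8, FP=9).
recall, precision = 23 / 31, 23 / 32
f1 = f_beta(recall, precision)            # beta = 1 weights both rates equally
f2 = f_beta(recall, precision, beta=2.0)  # beta > 1 gives recall more weight
print(f"F1={f1:.3f}  F2={f2:.3f}")
```

Because \(\beta^{2}\) multiplies the recall term, a larger \(\beta\) pulls the score toward recall and a smaller one toward precision.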
Definition (Cohen's kappa)
\begin{align} p_{o} &=\frac{TP+TN}{TP+FP+TN+FN} \qquad\text{(accuracy)}\\ p_{e} &=\frac{TP+FP}{TP+FP+TN+FN}\cdot\frac{TP+FN}{TP+FP+TN+FN}\\ &\quad +\frac{FN+TN}{TP+FP+TN+FN}\cdot\frac{FP+TN}{TP+FP+TN+FN}\\ \kappa &= \frac{p_{o}-p_{e}}{1-p_{e}} = 1-\frac{1-p_{o}}{1-p_{e}} \end{align}
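The definition translates almost line by line into code. This is a sketch under assumed names (`cohen_kappa`) and assumed counts, not the implementation used to produce the tables in these slides.

```python
def cohen_kappa(tp: int, fp: int, tn: int, fn: int) -> float:
    """Cohen's kappa: agreement between truth and prediction beyond chance."""
    n = tp + fp + tn + fn
    p_o = (tp + tn) / n  # observed agreement (accuracy)
    # Agreement expected if prediction and truth were independent
    # while keeping the same marginal frequencies.
    p_e = ((tp + fp) / n) * ((tp + fn) / n) + ((fn + tn) / n) * ((fp + tn) / n)
    return (p_o - p_e) / (1.0 - p_e)


# Illustrative counts only (an assumption, not taken from the source).
kappa = cohen_kappa(tp=23, fp=9, tn=21, fn=8)
print(f"kappa={kappa:.3f}")
```

A value of 1 means perfect agreement and 0 means agreement no better than chance. The sketch does not handle the degenerate case \(p_{e}=1\), where all observations fall into a single cell.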
General form of a decision rule based on a discriminant function in binary classification
\begin{equation} H(x;c) = \begin{cases} \text{positive},&\delta(x)>c\\ \text{negative},&\text{otherwise} \end{cases} \end{equation}
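True positive rate and false positive rate
\begin{align} \mathrm{TPR}(c) &=P(\text{a positive is correctly classified as positive})\\ \mathrm{FPR}(c)&=P(\text{a negative is wrongly classified as positive})\\ &=1-P(\text{a negative is correctly classified as negative}) \end{align}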
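Sweeping the threshold \(c\) and recording the pair \((\mathrm{FPR}(c),\mathrm{TPR}(c))\) at each value traces out an ROC curve. The sketch below estimates both rates by empirical frequencies; the function name `roc_points` and the scores and labels are made up for illustration.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold c.

    scores: values of the discriminant function delta(x)
    labels: 1 for actually positive, 0 for actually negative
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # c = -inf declares everything positive; each observed score is then
    # tried as a threshold (rule: positive iff delta(x) > c).
    thresholds = [float("-inf")] + sorted(set(scores))
    points = []
    for c in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > c and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > c and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return sorted(points)  # ordered from (0, 0) to (1, 1)


# Made-up scores and labels, for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
for fpr, tpr in curve:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Raising \(c\) can only lower both rates, so sorting the pairs is enough to put the curve in order. Each threshold rescans the data, which is simple but quadratic in the sample size.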
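Since a mean always lies between the minimum and the maximum, the following holds:
\begin{equation} \min(\text{recall},\text{precision}) \le F_{1} \le\max(\text{recall},\text{precision}) \end{equation}
Moreover, by the inequality between the arithmetic and geometric means,
\begin{equation} F_{1} \le\text{(geometric mean)} \le\text{(arithmetic mean)} \end{equation}
also holds, where both means are taken over recall and precision.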
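The second chain can be checked in one line. Writing \(R\) for recall and \(P\) for precision (shorthand introduced here, not used in the slides), with \(A=(R+P)/2\) and \(G=\sqrt{RP}\):
\begin{equation} F_{1} =\frac{2}{1/R+1/P} =\frac{2RP}{R+P} =\frac{G^{2}}{A} =G\cdot\frac{G}{A} \le G \le A \end{equation}
The step \(G/A\le 1\) is exactly the arithmetic-geometric mean inequality, with equality only when \(R=P\).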
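The correlation coefficient between the true label \(Y\) and the predicted label \(\hat{Y}\) (positive coded as 1, negative as 0) can be computed directly from its definition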
\begin{equation} \rho = \frac{\mathrm{Cov}(Y,\hat{Y})} {\sqrt{\mathrm{Var}(Y)\mathrm{Var}(\hat{Y})}} \end{equation}
For example, the covariance in the numerator is computed as follows, where \(N=TP+FP+TN+FN\)
\begin{align} \mathrm{Cov}(Y,\hat{Y}) &= \mathbb{E}[(Y-\mathbb{E}[Y])(\hat{Y}-\mathbb{E}[\hat{Y}])]\\ &= \mathbb{E}[Y\hat{Y}]-\mathbb{E}[Y]\mathbb{E}[\hat{Y}]\\ &= \frac{TP}{N}-\frac{TP+FN}{N}\frac{TP+FP}{N}\\ &= \frac{TP(TP+FN+FP+TN)}{N^{2}}\\ &\qquad- \frac{(TP+FN)(TP+FP)}{N^{2}}\\ &= \frac{TP\cdot TN - FP\cdot FN}{N^{2}} \end{align}
Similarly, the variances in the denominator are
\begin{align} \mathrm{Var}(Y) &= \mathbb{E}[Y^{2}]-\mathbb{E}[Y]^{2}\\ &= \frac{(TP+FN)(TN+FP)}{N^{2}}\\ \mathrm{Var}(\hat{Y}) &= \mathbb{E}[\hat{Y}^{2}]-\mathbb{E}[\hat{Y}]^{2}\\ &= \frac{(TP+FP)(TN+FN)}{N^{2}} \end{align}
Putting these together gives
\begin{equation} \rho = \frac{TP\cdot TN-FP\cdot FN} {\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \end{equation}
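The closed form can be checked numerically against the definition of the correlation coefficient. In the sketch below the counts are illustrative assumptions; they are expanded into 0/1 label vectors so that both routes can be compared. This quantity is commonly called the Matthews correlation coefficient.

```python
import math


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Closed-form correlation between truth and prediction."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den


def pearson(xs, ys) -> float:
    """Correlation coefficient computed straight from its definition."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    return cov / math.sqrt(var_x * var_y)


# Illustrative counts only (an assumption, not taken from the source).
tp, fp, tn, fn = 23, 9, 21, 8
# Expand the counts into 0/1 vectors: y is the truth, y_hat the prediction.
y = [1] * tp + [0] * fp + [0] * tn + [1] * fn
y_hat = [1] * tp + [1] * fp + [0] * tn + [0] * fn
print(f"closed form: {mcc(tp, fp, tn, fn):.6f}")
print(f"definition:  {pearson(y, y_hat):.6f}")
```

Both lines print the same value up to floating-point rounding. The closed form is undefined when any row or column total of the confusion matrix is zero.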
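Linear discriminant analysis of August vs. September based on temperature and humidity
Figure 1: Linear discriminant analysis
Quadratic discriminant analysis of August vs. September based on temperature and humidity
Figure 2: Quadratic discriminant analysis
Figure 3: Confusion matrix for linear discriminant analysis
Figure 4: Confusion matrix for quadratic discriminant analysis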
Metric | Value |
---|---|
accuracy | 0.721 |
kap | 0.442 |
sens | 0.742 |
spec | 0.700 |
ppv | 0.719 |
npv | 0.724 |
mcc | 0.442 |
j_index | 0.442 |
bal_accuracy | 0.721 |
detection_prevalence | 0.525 |
precision | 0.719 |
recall | 0.742 |
f_meas | 0.730 |
Metric | Value |
---|---|
accuracy | 0.754 |
kap | 0.508 |
sens | 0.742 |
spec | 0.767 |
ppv | 0.767 |
npv | 0.742 |
mcc | 0.509 |
j_index | 0.509 |
bal_accuracy | 0.754 |
detection_prevalence | 0.492 |
precision | 0.767 |
recall | 0.742 |
f_meas | 0.754 |
Figure 5: ROC curve for linear discriminant analysis
Figure 6: ROC curve for quadratic discriminant analysis
https://archive.ics.uci.edu/ml/datasets/Wine+Quality
Wine Quality Data Set
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
We use winequality-red.csv
Explanatory variables (based on physicochemical tests)
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Response variable (based on sensory data)
12 - quality (score between 0 and 10)
Excerpt of the actual data
fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | grade |
---|---|---|---|---|---|---|---|---|---|---|---|---|
7.400 | 0.700 | 0 | 1.900 | 0.076 | 11 | 34 | 0.998 | 3.510 | 0.560 | 9.400 | 5 | C |
7.800 | 0.880 | 0 | 2.600 | 0.098 | 25 | 67 | 0.997 | 3.200 | 0.680 | 9.800 | 5 | C |
7.800 | 0.760 | 0.040 | 2.300 | 0.092 | 15 | 54 | 0.997 | 3.260 | 0.650 | 9.800 | 5 | C |
11.200 | 0.280 | 0.560 | 1.900 | 0.075 | 17 | 60 | 0.998 | 3.160 | 0.580 | 9.800 | 6 | B |
7.400 | 0.700 | 0 | 1.900 | 0.076 | 11 | 34 | 0.998 | 3.510 | 0.560 | 9.400 | 5 | C |
7.400 | 0.660 | 0 | 1.800 | 0.075 | 13 | 40 | 0.998 | 3.510 | 0.560 | 9.400 | 5 | C |
7.900 | 0.600 | 0.060 | 1.600 | 0.069 | 15 | 59 | 0.996 | 3.300 | 0.460 | 9.400 | 5 | C |
7.300 | 0.650 | 0 | 1.200 | 0.065 | 15 | 21 | 0.995 | 3.390 | 0.470 | 10 | 7 | A |
7.800 | 0.580 | 0.020 | 2 | 0.073 | 9 | 18 | 0.997 | 3.360 | 0.570 | 9.500 | 7 | A |
7.500 | 0.500 | 0.360 | 6.100 | 0.071 | 17 | 102 | 0.998 | 3.350 | 0.800 | 10.500 | 5 | C |
6.700 | 0.580 | 0.080 | 1.800 | 0.097 | 15 | 65 | 0.996 | 3.280 | 0.540 | 9.200 | 5 | C |
7.500 | 0.500 | 0.360 | 6.100 | 0.071 | 17 | 102 | 0.998 | 3.350 | 0.800 | 10.500 | 5 | C |
5.600 | 0.615 | 0 | 1.600 | 0.089 | 16 | 59 | 0.994 | 3.580 | 0.520 | 9.900 | 5 | C |
7.800 | 0.610 | 0.290 | 1.600 | 0.114 | 9 | 29 | 0.997 | 3.260 | 1.560 | 9.100 | 5 | C |
8.900 | 0.620 | 0.180 | 3.800 | 0.176 | 52 | 145 | 0.999 | 3.160 | 0.880 | 9.200 | 5 | C |
8.900 | 0.620 | 0.190 | 3.900 | 0.170 | 51 | 148 | 0.999 | 3.170 | 0.930 | 9.200 | 5 | C |
8.500 | 0.280 | 0.560 | 1.800 | 0.092 | 35 | 103 | 0.997 | 3.300 | 0.750 | 10.500 | 7 | A |
8.100 | 0.560 | 0.280 | 1.700 | 0.368 | 16 | 56 | 0.997 | 3.110 | 1.280 | 9.300 | 5 | C |
7.400 | 0.590 | 0.080 | 4.400 | 0.086 | 6 | 29 | 0.997 | 3.380 | 0.500 | 9 | 4 | D |
7.900 | 0.320 | 0.510 | 1.800 | 0.341 | 17 | 56 | 0.997 | 3.040 | 1.080 | 9.200 | 6 | B |
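
A table like the one above could be read along the following lines. This is only a sketch under stated assumptions: the semicolon delimiter and quoted column names are assumed for winequality-red.csv, the function name `read_wine` is hypothetical, and the quality-to-grade mapping is inferred from the rows shown (7 is A, 6 is B, 5 is C, 4 is D). Scores outside that range do not appear in the excerpt, so they are left unmapped.

```python
import csv
import io

# Mapping read off the excerpt above; other scores are not shown there,
# so they are deliberately left without a grade (None).
GRADE_BY_QUALITY = {7: "A", 6: "B", 5: "C", 4: "D"}


def read_wine(stream, delimiter=";"):
    """Parse rows into dicts of floats plus an integer quality and a grade."""
    rows = []
    for rec in csv.DictReader(stream, delimiter=delimiter):
        row = {name: float(value) for name, value in rec.items()
               if name != "quality"}
        row["quality"] = int(rec["quality"])
        row["grade"] = GRADE_BY_QUALITY.get(row["quality"])
        rows.append(row)
    return rows


# In-memory stand-in for the file, with only a few of the columns.
sample = io.StringIO(
    '"fixed acidity";"volatile acidity";"alcohol";"quality"\n'
    "7.4;0.7;9.4;5\n"
    "11.2;0.28;9.8;6\n"
)
wines = read_wine(sample)
print(wines[0]["grade"], wines[1]["grade"])
```

To read the real file, the in-memory sample would be replaced by an open file object, for example `open("winequality-red.csv", newline="")`.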
Figure 7: Training error
Figure 8: Prediction error
Figure 9: Training error
Figure 10: Prediction error
Figure 11: Linear discriminant analysis
Figure 12: Quadratic discriminant analysis