[Statistics] Day 2-3 Hypothesis Testing and Analysis Methods

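Hypothesis testing is a core concept in statistics: it is the process of using data to decide whether there is enough evidence to reject a given hypothesis. It is built on the key elements below.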

Null and alternative hypotheses
- Null hypothesis (H0): the hypothesis the researcher sets out to reject, usually stating that there is no difference or no effect. The significance level (alpha) is fixed before the test, and the test result determines whether the null hypothesis is rejected.
- Alternative hypothesis (H1): the hypothesis the researcher wants to support, usually the opposite of the null hypothesis. It can be one-sided (a difference in one direction) or two-sided (a difference in either direction).
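
As a concrete illustration, the hypotheses for a test about a population mean can be written as follows. The symbols $\mu$ (the population mean) and $\mu_0$ (the reference value stated by the null hypothesis) are introduced here only for this example.

```latex
H_0:\ \mu = \mu_0
\qquad \text{versus} \qquad
\begin{cases}
H_1:\ \mu \neq \mu_0 & \text{(two-sided)} \\
H_1:\ \mu > \mu_0 \ \text{ or } \ H_1:\ \mu < \mu_0 & \text{(one-sided)}
\end{cases}
```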

Significance level and p-value
- Significance level (alpha): the threshold for rejecting the null hypothesis. It is the probability of rejecting the null hypothesis when it is actually true (a Type I error) that the researcher is willing to accept. It is commonly set to 0.05 or 0.01 and is chosen before looking at the results.
- p-value: the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed. A small p-value means the data are hard to reconcile with the null hypothesis; it is not the probability that the null hypothesis is false. If the p-value is smaller than the significance level, the null hypothesis is rejected.
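
The decision rule above can be sketched with the standard library alone. This is a minimal illustration, assuming a test statistic that follows the standard normal distribution under the null hypothesis; the value of `z_observed` and the helper name `z_p_values` are hypothetical and chosen only to show how one-sided and two-sided p-values can lead to different decisions at the same alpha.

```python
from statistics import NormalDist


def z_p_values(z):
    """Return (two-sided, upper-tail) p-values for a standard normal statistic."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    upper_tail = 1.0 - std_normal.cdf(z)
    two_sided = 2.0 * (1.0 - std_normal.cdf(abs(z)))
    return two_sided, upper_tail


alpha = 0.05
z_observed = 1.8  # hypothetical test statistic, for illustration only

p_two_sided, p_upper = z_p_values(z_observed)

# Reject the null hypothesis only when the p-value falls below alpha.
reject_two_sided = p_two_sided < alpha
reject_one_sided = p_upper < alpha

print(f"two-sided p = {p_two_sided:.4f}, reject: {reject_two_sided}")
print(f"one-sided p = {p_upper:.4f}, reject: {reject_one_sided}")
```

With this statistic the one-sided test rejects at alpha = 0.05 while the two-sided test does not, which is why the form of the alternative hypothesis has to be fixed before the data are examined.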

t-test and z-test
- t-test: tests whether the difference between a sample mean and a population mean is statistically significant. It is mainly used when the population standard deviation is unknown and has to be estimated from the sample, which matters most when the sample is small.
- z-test: tests the same kind of difference, but is mainly used when the sample is large and the population standard deviation is known.
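
The two statistics differ only in which standard deviation goes into the standard error. The sketch below computes both by hand; the measurements in `sample`, the reference mean `mu0`, and the "known" standard deviation `known_sigma` are all hypothetical values made up for the illustration.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical measurements and reference values, for illustration only.
sample = [5.1, 4.9, 5.6, 5.8, 5.0, 5.3, 5.4, 5.2]
mu0 = 5.0          # population mean stated by the null hypothesis
known_sigma = 0.3  # assumed known population standard deviation (z-test only)

n = len(sample)
x_bar = mean(sample)

# t-statistic: the standard deviation is estimated from the sample
# (statistics.stdev uses the n - 1 denominator).
t_stat = (x_bar - mu0) / (stdev(sample) / sqrt(n))

# z-statistic: the population standard deviation is taken as known.
z_stat = (x_bar - mu0) / (known_sigma / sqrt(n))

print(f"t = {t_stat:.3f} (compare with Student's t, {n - 1} degrees of freedom)")
print(f"z = {z_stat:.3f} (compare with the standard normal distribution)")
```

In practice the t-statistic and its p-value are usually obtained in one call, for example with `scipy.stats.ttest_1samp(sample, mu0)`.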

One-way ANOVA, chi-square test, two-way ANOVA
- One-way ANOVA: tests whether the means of three or more groups differ significantly. It uses the F-statistic, the ratio of the between-group variance to the within-group variance.

import pandas as pd
import numpy as np
from scipy import stats

# Mock data: quality scores from 1 to 5 for three production methods,
# 10 samples per method, 30 samples in total

# np.random.seed(1)  # uncomment to make the random draw reproducible
data = {
    'A': np.random.randint(1, 5+1,10),
    'B': np.random.randint(1, 5+1,10),
    'C': np.random.randint(1, 5+1,10)
}
df = pd.DataFrame(data)
print(df)

>>>

   A  B  C
0  3  3  4
1  2  5  1
2  1  5  3
3  3  2  2
4  5  1  5
5  5  3  3
6  3  5  5
7  4  3  1
8  4  2  2
9  1  3  3

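anova_stat, p_val = stats.f_oneway(df['A'], df['B'], df['C'])
print(f"One-way ANOVA statistic : {anova_stat}")
print(f"P-Value : {p_val}")

alpha = 0.05
if p_val < alpha:
    print("Reject the null hypothesis")
else:
    # A large p-value does not prove the null hypothesis;
    # it only means the data give no grounds to reject it.
    print("Fail to reject the null hypothesis")

>>>

One-way ANOVA statistic : 0.11371841155234658
P-Value : 0.8929344313337509
Fail to reject the null hypothesis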
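
To see where the F-statistic comes from, the sketch below recomputes it by hand from the 30 scores printed in the table above, using only the standard library. The variable names are illustrative; the calculation is the textbook ratio of the between-group mean square to the within-group mean square.

```python
from statistics import mean

# Scores copied from the DataFrame printed above.
groups = {
    "A": [3, 2, 1, 3, 5, 5, 3, 4, 4, 1],
    "B": [3, 5, 5, 2, 1, 3, 5, 3, 2, 3],
    "C": [4, 1, 3, 2, 5, 3, 5, 1, 2, 3],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = mean(all_values)
k = len(groups)            # number of groups
n_total = len(all_values)  # total number of observations

# Between-group sum of squares: how far each group mean is from the grand mean.
ss_between = sum(len(v) * (mean(v) - grand_mean) ** 2 for v in groups.values())

# Within-group sum of squares: how far each score is from its own group mean.
ss_within = sum((x - mean(v)) ** 2 for v in groups.values() for x in v)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_stat = ms_between / ms_within

print(f"F = {f_stat:.6f}")
```

The result agrees with the statistic reported by `stats.f_oneway` above (about 0.1137): the group means (3.1, 3.2, 2.9) are almost identical, so the between-group variance is tiny compared with the spread inside each group.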

- Chi-square test: tests whether the observed frequencies of categorical data differ significantly from the frequencies expected under the null hypothesis (for example, that two categorical variables are independent).
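
A minimal sketch of a chi-square test of independence, computed by hand with the standard library. The 2x2 table is hypothetical (two production methods against pass/fail counts), and the p-value shortcut used here is valid only for 1 degree of freedom.

```python
from math import erfc, sqrt

# Hypothetical 2x2 contingency table: rows are production methods,
# columns are (pass, fail) counts.
observed = [
    [30, 20],
    [20, 30],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [[r * c / total for c in col_totals] for r in row_totals]

# Chi-square statistic: sum of (observed - expected)^2 / expected over all cells.
chi2_stat = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
dof = (len(observed) - 1) * (len(observed[0]) - 1)

# For 1 degree of freedom, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
p_value = erfc(sqrt(chi2_stat / 2))

print(f"chi2 = {chi2_stat:.3f}, dof = {dof}, p = {p_value:.4f}")
```

For tables of any size, `scipy.stats.chi2_contingency` performs the same test. Note that it applies Yates' continuity correction to 2x2 tables by default, so its statistic for this table would be smaller than the uncorrected value computed here.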

- Two-way ANOVA: examines how two independent variables (factors) affect a dependent variable, including their interaction effect. Modeling both factors together shows whether the effect of one factor depends on the level of the other.
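
The model behind a two-way ANOVA with interaction can be written as below. The notation is the usual textbook one and is introduced here for illustration: $Y_{ijk}$ is the $k$-th observation at level $i$ of the first factor and level $j$ of the second, $\alpha_i$ and $\beta_j$ are the main effects, $(\alpha\beta)_{ij}$ is the interaction, and $\varepsilon_{ijk}$ is the error term.

```latex
Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}

H_0^{(A)}:\ \alpha_i = 0 \ \text{for all } i, \qquad
H_0^{(B)}:\ \beta_j = 0 \ \text{for all } j, \qquad
H_0^{(AB)}:\ (\alpha\beta)_{ij} = 0 \ \text{for all } i, j
```

Each of the three null hypotheses gets its own F-test, which is why the ANOVA table has one row per factor plus one row for the interaction.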

import pandas as pd
import seaborn as sns
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
import matplotlib.pyplot as plt

titanic_df = pd.read_csv("./data/Titanic_data.csv")

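# Build the two-way ANOVA model.
# In the formula, ':' adds only the interaction term, while '*' expands to both
# main effects plus their interaction, so the commented-out formula below
# specifies the same model.
# model = ols('Survived ~ C(Sex) + C(Pclass) + C(Sex)*C(Pclass)', data=titanic_df)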
model = ols('Survived ~ C(Sex) + C(Pclass) + C(Sex):C(Pclass)', data=titanic_df)
model = model.fit()
print(model.summary())

>>>

                            OLS Regression Results                            
==============================================================================
Dep. Variable:               Survived   R-squared:                       0.394
Model:                            OLS   Adj. R-squared:                  0.390
Method:                 Least Squares   F-statistic:                     114.9
Date:                Fri, 11 Aug 2023   Prob (F-statistic):           1.32e-93
Time:                        17:10:43   Log-Likelihood:                -399.13
No. Observations:                 891   AIC:                             810.3
Df Residuals:                     885   BIC:                             839.0
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
=================================================================================================
                                    coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------------
Intercept                         0.9681      0.039     24.700      0.000       0.891       1.045
C(Sex)[T.male]                   -0.5992      0.052    -11.490      0.000      -0.702      -0.497
C(Pclass)[T.2]                   -0.0470      0.059     -0.802      0.423      -0.162       0.068
C(Pclass)[T.3]                   -0.4681      0.050     -9.290      0.000      -0.567      -0.369
C(Sex)[T.male]:C(Pclass)[T.2]    -0.1644      0.077     -2.130      0.033      -0.316      -0.013
C(Sex)[T.male]:C(Pclass)[T.3]     0.2347      0.064      3.648      0.000       0.108       0.361
==============================================================================
Omnibus:                       80.766   Durbin-Watson:                   1.945
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              101.275
Skew:                           0.817   Prob(JB):                     1.02e-22
Kurtosis:                       3.247   Cond. No.                         13.5
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
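
One way to read the coefficient table is to add the coefficients up for each combination of Sex and Pclass. The sketch below does this with the values copied from the summary above; the dictionary keys and the helper `predicted_survival` are hypothetical short names used only for this illustration. The baseline cell (female, Pclass 1) is the intercept, and every other cell adds the relevant main effects and interaction.

```python
# Coefficients copied from the OLS summary above (treatment coding:
# the baseline cell is Sex=female, Pclass=1).
coef = {
    "Intercept": 0.9681,
    "male": -0.5992,
    "class2": -0.0470,
    "class3": -0.4681,
    "male:class2": -0.1644,
    "male:class3": 0.2347,
}


def predicted_survival(sex, pclass):
    """Fitted mean of Survived for one Sex x Pclass cell."""
    value = coef["Intercept"]
    if sex == "male":
        value += coef["male"]
    if pclass in (2, 3):
        value += coef[f"class{pclass}"]
        if sex == "male":
            value += coef[f"male:class{pclass}"]
    return value


cells = {
    (sex, pclass): round(predicted_survival(sex, pclass), 4)
    for sex in ("female", "male")
    for pclass in (1, 2, 3)
}

for (sex, pclass), rate in cells.items():
    print(f"{sex:6s} Pclass={pclass}: {rate:.4f}")
```

Because the model contains both main effects and the full interaction, it has one parameter per cell, so these fitted values are the observed survival rates in each cell (up to the rounding of the printed coefficients). The significant interaction terms reflect that the gap between women and men is not the same in every class.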

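# Run the two-way ANOVA
anova_results = anova_lm(model, typ=2)
print(anova_results)  # ANOVA table: sum_sq, df, F, PR(>F) for each term

sns.catplot(data=titanic_df, x='Sex', y='Survived', hue='Pclass', kind='bar')
plt.show()

>>>

[Figure: bar plot of mean Survived by Sex, with bars grouped by Pclass]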

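Hypothesis testing and the analysis methods above are central to data analysis: they are how meaningful information and conclusions are drawn from data. Stating the hypotheses correctly and choosing an appropriate test are what make the results of an analysis trustworthy.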
