Akaike and Bayesian Information Criterion
"All models are wrong, but some are useful."
- George Box
One of the key problems in machine learning is knowing which model to use. The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are powerful model selection tools that can help us determine which model best fits our data. In this article we will see what AIC and BIC are and how to use them.
Important Background Information
Before diving into what AIC is, we must establish some foundational knowledge and terminology.
First, consider a collection of candidate models, given by $$\mathscr{M}_j = \{f_j(z \mid \boldsymbol{\theta}) \mid \boldsymbol{\theta}\in \Theta_j \}.$$ That is, each model $\mathscr{M}_j$ is a family of distributions indexed by a parameter vector $\boldsymbol{\theta}$ in the parameter space $\Theta_j$. Let $\widehat{\boldsymbol{\theta}}_j$ be the maximum likelihood estimator of $\boldsymbol{\theta}$ in model $j$.
It is important to note that the true distribution $f(z)$ of the data might not be in any $\mathscr{M}_j$. In that case, we want to identify the model that is closest to the true distribution. When other model selection approaches (such as a train-test split or cross-validation) are not practical, AIC and BIC can help us determine which model is best.
Akaike Information Criterion
The main idea behind AIC is that we want to choose the model $\mathscr{M}_j$ whose fitted distribution, obtained by plugging in the MLE $\widehat{\boldsymbol{\theta}}_j$, minimizes the relative entropy (KL divergence) from the true distribution $f(z)$. So, given the estimate $\widehat{\boldsymbol{\theta}}_j$ with $\widehat{f}_j(z) = f_j(z\mid\widehat{\boldsymbol{\theta}}_j)$, we want to minimize $$\mathscr{D}_{\text{KL}}(f\,\|\,\widehat{f}_j) = \int f(z)\log(f(z))\,dz - \int f(z)\log(\widehat{f}_j(z))\,dz.$$ Note that the first integral is completely independent of $j$ and $\widehat{\boldsymbol{\theta}}_j$. This means that to minimize the relative entropy with respect to $j$, we only need to minimize the second term, given by $$\widehat{K}_j = - \int f(z)\log(\widehat{f}_j(z))\,dz.$$ Since this integral is intractable (and $f(z)$ is generally unknown), we can instead use the plug-in approximation $$ \widehat{K}_j \approx -\frac{1}{n}\sum_{i=1}^n\log(\widehat{f}_j(z_i)).$$
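To make this concrete, here is a minimal sketch (using NumPy and SciPy, which are not used elsewhere in this article) of the plug-in approximation: we simulate data from a hypothetical "true" distribution, fit two candidate families by maximum likelihood, and compute the average negative log-likelihood for each. The specific distributions and sample size are purely illustrative choices.
import numpy as np
from scipy import stats
# Simulated data from a "true" distribution (a t-distribution here, purely illustrative).
rng = np.random.default_rng(0)
z = rng.standard_t(df=5, size=1000)
# Candidate model 1: the normal family, fit by maximum likelihood.
mu, sigma = stats.norm.fit(z)
K_normal = -np.mean(stats.norm.logpdf(z, mu, sigma))
# Candidate model 2: the Laplace family, fit by maximum likelihood.
loc, scale = stats.laplace.fit(z)
K_laplace = -np.mean(stats.laplace.logpdf(z, loc, scale))
# Smaller is better: the family with the smaller plug-in estimate of K_j is,
# by this criterion (before bias correction), closer to the true distribution.
print(K_normal, K_laplace)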
Nice as this may seem, it is important to note that this estimator is biased. This is because the data $z_1, \dots, z_n$ are used to estimate $\widehat{\boldsymbol{\theta}}_j$, which is then used again to estimate $\widehat{K}_j$. However, Akaike showed that the bias of this estimator is approximately $-\frac{k_j}{n}$, where $k_j$ is the dimension of the parameter space $\Theta_j$. So, a bias-corrected approximation of $\widehat{K}_j$ is given by $$\widehat{K}_j \approx \frac{k_j}{n}-\frac{1}{n}\sum_{i=1}^n\log(\widehat{f}_j(z_i)).$$ Multiplying by $2n$ gives the AIC, which is typically written as $$\text{AIC}(j) = 2k_j - 2\ell_j(\widehat{\boldsymbol{\theta}}_j),$$ where $\ell_j(\widehat{\boldsymbol{\theta}}_j) = \sum_{i=1}^{n}\log(f_j(z_i\mid\widehat{\boldsymbol{\theta}}_j))$ is the maximized log-likelihood and $k_j = \dim(\Theta_j)$ is the dimension of the parameter space. The factor of $2$ is present mostly for historical reasons; since we are minimizing the AIC, a constant factor does not change which model is selected.
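In code, AIC is a one-liner once you have the maximized log-likelihood. Here is a minimal sketch; the numbers plugged in below are made up purely for illustration.
def aic(log_likelihood, k):
    # 2k penalizes the number of free parameters; -2*log_likelihood rewards fit.
    return 2 * k - 2 * log_likelihood
# Hypothetical model with 3 free parameters and a maximized log-likelihood of -120.5.
print(aic(-120.5, 3))  # 247.0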
Bayesian Information Criterion
Next we talk about the Bayesian information criterion (BIC). BIC is an alternative to AIC and is similar in many ways. The key difference is that instead of minimizing the relative entropy between the true distribution and the fitted distribution, BIC maximizes the posterior probability of the selected model. The BIC is defined as $$\text{BIC}(j) = k_j \log(n) - 2\ell_j(\widehat{\boldsymbol{\theta}}_j),$$ where, as before, $\ell_j(\widehat{\boldsymbol{\theta}}_j) = \sum_{i=1}^{n}\log(f_j(z_i\mid\widehat{\boldsymbol{\theta}}_j))$ is the maximized log-likelihood, $k_j = \dim(\Theta_j)$ is the dimension of the parameter space, and $n$ is the number of data points. (For those interested in the derivation and proof of the BIC, you can find it here. Note the notational difference between the proof and the formula presented here.)
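The corresponding sketch for BIC only swaps the penalty term (again, the plugged-in numbers are hypothetical).
import numpy as np
def bic(log_likelihood, k, n):
    # The penalty k*log(n) grows with the sample size, unlike AIC's constant 2 per parameter.
    return k * np.log(n) - 2 * log_likelihood
# The same hypothetical fit as above, now assuming n = 500 observations.
print(bic(-120.5, 3, 500))  # about 259.6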
Differences Between AIC and BIC
It is easy to see that AIC and BIC differ only in the first term of their formulas: AIC's first term is $2k_j$, while BIC's is $k_j \log(n)$. Since $\log(n) > 2$ whenever $n > e^2 \approx 7.4$, which is essentially always the case in practice, BIC penalizes additional parameters more heavily than AIC. This means that BIC will tend to select simpler models than AIC. Thus, AIC is more likely to choose a model that is too complex, and BIC is more likely to choose a model that is too simple. So, which one you choose depends entirely on your situation and what you are trying to accomplish. For example, if you think there are unnecessary parameters in your model, then BIC might be the better choice. If you think there are important parameters that you do not want to leave out, then AIC might be the better choice.
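To see how quickly the penalties diverge, here is a quick sketch comparing the per-parameter penalty of each criterion at a few sample sizes (the last value matches the size of the housing dataset used below).
import numpy as np
# AIC charges a flat 2 per parameter; BIC charges log(n) per parameter.
for n in [10, 100, 1_000, 20_640]:
    print(f"n = {n:>6}: AIC penalty per parameter = 2, BIC penalty per parameter = {np.log(n):.2f}")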
A Quick Python Example
Now that we have seen what AIC and BIC are, let's see how we can use them in Python. For this example, we will use the OLS model from statsmodels and the California housing dataset from scikit-learn.
First, begin by importing the necessary libraries and loading in the data.
# Import the California housing dataset from sklearn and the statsmodels API.
from sklearn.datasets import fetch_california_housing
import statsmodels.api as sm
# Load in the data as pandas objects, split into features X and target y.
X, y = fetch_california_housing(return_X_y=True, as_frame=True)
Briefly looking at the data, we see that there are 8 features and 20,640 data points, with the $X$ looking like
X.head()
   MedInc  HouseAge  AveRooms  AveBedrms  Population  AveOccup  Latitude  Longitude
0  8.3252      41.0  6.984127   1.023810       322.0  2.555556     37.88    -122.23
1  8.3014      21.0  6.238137   0.971880      2401.0  2.109842     37.86    -122.22
2  7.2574      52.0  8.288136   1.073446       496.0  2.802260     37.85    -122.24
3  5.6431      52.0  5.817352   1.073059       558.0  2.547945     37.85    -122.25
4  3.8462      52.0  6.281853   1.081081       565.0  2.181467     37.85    -122.25
and the $y$ looking like
y.head()
0 4.526
1 3.585
2 3.521
3 3.413
4 3.422
Name: MedHouseVal, dtype: float64
We now want to add a constant to the $X$ data (since statsmodels doesn't do that by default), and then fit the model.
# Add a constant to the X matrix, as statsmodels doesn't do that by default.
X_sm = sm.add_constant(X)
# Initialize our OLS model, and fit the data.
model_sm = sm.OLS(y, X_sm)
results = model_sm.fit()
Now that we have fit the model, we simply use the aic and bic attributes of the results object to get the AIC and BIC values.
# Get the AIC and BIC
results.aic
>>> 45265.54161
results.bic
>>> 45336.95649
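As a sanity check, we can reproduce these values by hand from the maximized log-likelihood stored in results.llf. The sketch below assumes the statsmodels convention that, for an OLS model with an intercept, the parameter count used is df_model + 1 (the error variance is not counted); if that assumption is off, the recomputed values will differ slightly.
import numpy as np
# Recompute AIC and BIC by hand from the fitted results object above.
k = results.df_model + 1  # slopes plus the intercept (assumed convention)
print(2 * k - 2 * results.llf)                     # should match results.aic
print(k * np.log(results.nobs) - 2 * results.llf)  # should match results.bic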
Now, these numbers are not very meaningful on their own. However, once we begin to perform model selection (such as stepwise feature removal), we can compare them across candidate models to determine which model is best, as sketched below.
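For example, here is one possible step of backward elimination, reusing the X, y, and results objects from above: refit the regression with each feature dropped in turn and compare the criteria to the full model. This is only a sketch of the idea, not a complete stepwise procedure.
# One step of backward elimination: drop each feature in turn and compare.
print(f"full model: AIC = {results.aic:.1f}, BIC = {results.bic:.1f}")
for col in X.columns:
    X_reduced = sm.add_constant(X.drop(columns=col))
    reduced_results = sm.OLS(y, X_reduced).fit()
    print(f"without {col}: AIC = {reduced_results.aic:.1f}, BIC = {reduced_results.bic:.1f}")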
Conclusion
And there you have it! AIC and BIC are powerful ideas that can help us build useful models for our data. I hope you can now see how useful and important these ideas are and how they can help you with your next machine learning project. To see the code used in this article, you can find it on my GitHub.