Empirical Risk Minimization
When an animal agent interacts with its environment, it is privy to a subset of observable components of the environment as opposed to complete information about the environment. Its understanding of its environment is constantly updated based on new evidence and its relevance to the animal, and, at all times, it is exposed to only the observable. This means the animal has to form a model of the environment, such that it captures the underlying distribution over possible events in the given environment as closely as possible. The animal has to make relevant inferences and the costs of error are often detrimental. Risky.
So what?
In machine learning, we face the same problem. In all scenarios where we deal with data from the real world, that is, the data has not been generated by the experimenter according to a known distribution, we only have partial access to the data that makes up the data we are interested in. We don’t have access to all images ever, we only have access to a subset of all images ever. Whatever claim we desire to make about the nature of the set of all images ever will be purely based on what can be learned from the subset of available images. Consequently, any claim we can make about the set of all images ever has a margin of error. Most often, the objective we optimize is this error margin, which we want to minimize. How do we quantify this error? What are the bounds of the optimization that can be performed? Are there ‘good’ datasets for this task? How does it contribute to the error in our estimation of the underlying distribution? These are some of the questions we attempt to tackle here using the statistical learning framework.
First, we define relevant terminology and state assumptions about a formal model that can learn tasks. Then we move on to defining what we mean by empirical risk and how we go about minimizing it.
The learner has access to the following:
- \(\mathcal{X}\), the domain of the learner, or the input space. For example, this is a subset of all images ever. Each point \(x_i\) in this space can be thought of as a vector of pixels of the image in grayscale. Thus, the dimensionality of the input space is defined by the number of features the input has.
- \(\mathcal{Y}\), the set of labels, or the label space, to which \(\mathcal{X}\) can be mapped. Note here that the set of labels (or classes) may be much smaller than \(\mathcal{X}\).
Our aim is for the learner to be able to learn this mapping from input space to label space. The learner does not have direct access to the entire input space. Instead, it has access to a subset \(S\), which is can train on to make a hypothesis \(h\) about the relationship between the input space an the label space.
-
The setAbuse of the word set; Sets by definition have unique elements, but training sets are not restricted that way. The better word would be a collection or a sequence, depending upon whether the order is relevant to the algorithm. \(S\) is a finite set of pairs of the form \(\mathcal{X} \times \mathcal{Y}\), i.e, \(S = \{(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)\}\), where \(x_i \in \mathcal{X}\) and \(y_i \in \mathcal{Y}\).
-
The hypothesis or prediction rule or classifier, \(h : \mathcal{X} \rightarrow \mathcal{Y}\) is the function that the learner outputs; The function is a hypothesis of the true mapping between the input space and the label space. This predictor can be used to predict the label of new points from the input space, even those that are not in the training subset. \(h\) can also be represented as \(h = Alg(S)\), notating that the learning algorithm \(Alg\) upon training on the set \(S\) outputs the hypothesis that the underlying mapping is the function \(h\).
Here, we note that it is important to state out assumptions about how the training data is generated. We assume that \(\mathcal{D}\) is the underlying probability distribution over \(\mathcal{X}\), the input space. Observe that this generative process is independent of the learner’s training. Our hope is that the learner produces a hypothesis such that it captures the underlying distribution correctly, all the while having no complete, direct access to it. Consequently, we can now understand the learner’s task as trying to approximate the true underlying mapping between the input and label spaces, which is, the oracle function \(f : \mathcal{X} \rightarrow \mathcal{Y}\). Therefore, the true labels are given as \(y_i = f(x_i)\) \(\forall i\).
To summarise:
- The training data \(S\) is assumed to be produced by a generative process.
- Points in \(S\) are generated by samplng points \(x_i\) from \(\mathcal{X}\) distributed according to \(\mathcal{D}\), then labelled using the oracle \(f\).
- The task of the learnerIt is important to remember that the learner can ONLY see \(S\); it does not have access to the true distribution \(D\) or the true mapping \(f\). is to figure out what this \(f\) is as closely as it can. This means we ideally want \(h(x_i) = f(x_i)\) for all \(i\).
How do we measure error?
The error of a classifier is defined as the probability that it does not predict the correct label on a random data point generated by the process described above. In other words, the error of the hypothesis \(h\) is the probability that for an instance \(x\) drawn randomly from \(\mathcal{X}\) according to \(\mathcal{D}\), \(h(x) \neq f(x)\).
Formally, the error of a classifier, \(h: \mathcal{X} \rightarrow \mathcal{Y}\) is defined as
\[L_{D, f}(h) := \mathbb{P}_{x \sim \mathcal{D}}[h(x) \neq f(x)]\]This is commonly referred to as the generalization error, or the risk, or the true error of \(h\). Here, \(L\) is used to represent the fact that the error contributes to a loss for the learner, and the subscript is meant to specify that the loss is with respect to the given distribution and oracle function.
This however, is a sort of wishful thinking. How can we measure this error if we do not have access to the true distribution in the first place? We now try to define an error that captures the loss that is resultant from the observable data, that is, data from the training subset \(S\). We call this error training error and denote it using \(L_S(h)\) to emphasize that it is dependent on the training set, and
\[L_S(h) := \frac{|\{i \in \{1,...,m\} : h(x_i) \neq y_i\}|}{m}\]This is the error that the classifier incurs over the training sample. Since it dependes on the observable, it is also called the empirical error or empirical risk. We now slightly rephrase our understanding of the goal of the learner: the goal of the learner is to find the hypothesis \(h_S\) that minimizes the empirical risk, i.e, the risk over learning what is observable. This learning paradigm is naturally, what we call Empirical Risk Minimization (ERM).
Overfitting, and how not to.
The ERM rule is susceptible to overfitting the training set. This is expected since we are not guaranteed that \(S\) is a representative dataset (with respect to \(\mathcal{D_X}\)). To try to rectify this, we need to find conditions such that we are guaranteed (with a margin of error) that the hypothesis output by the algorithm following the ERM rule is not overfitting on the training set. Typically, one can spot overfitting by looking for situations where \(h_S\) is such that \(L_S(h) < L_{D, f}(h)\).
One solution is to restrict the hypothesis search space. The learner restricts itself to a hypothesis class \(\mathcal{H}\), i.e, a set of predictors that will likely reduce the extent of overfitting.
For a given \(H\) and \(S\), the \(ERM_{\mathcal{H}}\) learner uses the ERM rule to choose a hypothesis \(h \in \mathcal{H}\) such that it has the lowest trainint error. That is,
\(ERM_{\mathcal{H}}(S)= h_S \in\) argmin\(_{h \in \mathcal{H}}L_S(h)\)
By doing this, we bias the learner towards a particular set of predictors. This restriction is also referred to as an inductive bias. And such restrictions are made based on some prior knowledge about the problem at hand.. We note here that we do not require that only one such minimum error hypothesis has to exist. We want to find any such \(h\) in our hypothesis class such that it minimizes the training loss. However, restricting the hypothesis search space is simply not enough to guarantee that overfitting will not happen, we need to somehow incorporate the dependence of \(h\) on \(S\), and most importatnly, find out over which hypothesis classes the ERM rule will not result in overfitting.
A ‘mild’ restriction to add is to restrict classes to being finite. This means that the number of predictors per class is finite. Provided that the learner trains on a sufficiently large training sample (what we mean by this is explained below), we can show that for finite hypthesis classes, \(ERM_{\mathcal{H}}\) will not overfit.
The Realizability Assumption
We define the Realizability Assumption as follows.
There exists \(h^* \in \mathcal{H}\) such that \(L_{\mathcal{D}, f}(h^*) = 0\). This assumption also implies that we have \(L_S(h^*) = 0\) with probability \(1\) if the above holds true.
What this assumption implies is that for every hypothesis picked according to the ERM rule, acting on the training set \(S\), the training error is \(0\). However, we are interetested in minimizing the generalization error of these hypotheses, not just the training error (a.k.a empirical risk). So, given what we are trying to guarantee, we need to tease out the relationship between the sample \(S\) and the underlying data distribution \(\mathcal{D}\). It is quitessential in our case to do this since the training sample is the learner’s only window to the world. In lieu of this, we make the additional assumption that the instances in the training set are independedntly and identically distributed. Let’s say that there are \(m\) instances in \(S\), we write \(S \sim \mathcal{D}^m\) to denote that a sample set \(S\) with \(m\) elements has been drawn in an i.i.d manner from \(\mathcal{D}\).
We can now say that the generalization error is a random variable, since it depends on the training set, which is picked by a random process. This simply means that it is not necessary that the sample picked is representative of the true distribution. Thus, we introduce a confidence (rather, diffidence) parameter \(\delta\), which denotes the probability of picking a nonrepresentative sample; then, \((1 - \delta)\) is the confidence parameter, i.e, the probability that we picked a representative sample. We can restrict \(\delta\) to be arbitrarily small.
Another thing we cannot guarantee is the accuracy of label prediction, so we introduce \(\epsilon\) as an inaccuracy parameter. The higher the inaccuracy, the higher the loss incurred. Essentially, if \(L_{\mathcal{D}, f}(h_S) \leq \epsilon\), the algorithm’s output is an arbitrarily approximately correct predictor. On the other hand, if \(L_{\mathcal{D}, f}(h_S) > \epsilon\), it indicates that the learner has failed.
Therefore, fixing an oracle \(f\), we want to find an upper bound on the probability of finding bad hypotheses as a consequence of picking misleading samples. Let \(\mathcal{H}_B\) be the set of bad hypotheses.
Then, let
\[\mathcal{H}_B = \{h \in \mathcal{H} : L_{(\mathcal{D}, f)}(h_S) > \epsilon\}\]and
\[M = \{S : \exists h \in \mathcal{H}_B, L_S(h) = 0\}\]this is just the mathy way to say that \(M\) is the set of misleading sample sets, which are identifiable by there being at least one hypothesis in the set of bad hypotheses such that the training error is zero. The samples are said to be misleading as, if you look at just the training error, the error on learning from it is zero, but the generalization error is above threshold \(\epsilon\), and therefore the hypothesis is a bad one.
be the set of misleading samples. Formally, we have shown that all bad samples are a subset of \(M\). Rewriting \(M = \cup_{h \in \mathcal{H}_B}\{S:L_S(h) = 0\}\) we can write,
\[\mathcal{D}^m(\{S: L_{(\mathcal{D}, f)}(h_S) > \epsilon \}) \leq \mathcal{D}^m(M) = \mathcal{D}^m(\cup_{h \in \mathcal{H}_B}\{S:L_S(h) = 0\})\]Applying the union bound, we get
\[\mathcal{D}^m(\{S: L_{(\mathcal{D}, f)}(h_S) > \epsilon \}) \leq \sum_{h \in \mathcal{H}_B}\mathcal{D}^m(\{S:L_S(h) = 0\})\]Since the event \(L_S(h) = 0\) is the same as the event where all predictions are correct, we can write the last term in terms of \(h(x_i)\) and \(y_i\) for all \(i\).
Thus, we can write
\[\mathcal{D}^m(\{S:L_S(h) = 0\}) = \mathcal{D}^m(\{S: \forall i, h(x_i) = f(x_i)\}) = \Pi_{i=1}^m\mathcal{D}(\{x_i:h(x_i) = y_i\})\]In the case of each individual sample of an element of \(S\) we have
\[\mathcal{D}(\{x_i:h(x_i) = y_i\}) = 1 -L_{\mathcal{D,f}}(h) \leq 1-\epsilon \leq e^{-\epsilon}\]Therefore, for the \(m\) elements in the set, we can write
\[\mathcal{D}^m(\{S: L_S(h) = 0 \}) \leq (1-\epsilon)^m \leq e^{-\epsilon m}\]Now we can rewrite in terms of the generalizaion error: \(\mathcal{D}^m(\{S: L_{(\mathcal{D}, f)}(h_S) > \epsilon \}) \leq |\mathcal{H}_B|e^{-\epsilon m} < |\mathcal{H}|e^{-\epsilon m}\).
From this, we get that if \(m \geq \frac{\log(\|\mathcal{H}\|/\delta)}{\epsilon}\), then for any oracle labeller \(f\) and any distribution \(D\) such that the realizability assumption holds, with a confidence of at least \(1 - \delta\) over an i.i.d sample \(S\) of size \(m\), for every ERM hypothesis \(h_S\), the generalization error
\[L_{(\mathcal{D}, f)}(h_S) \leq \epsilon\]This tells us that for a sufficiently large sample size, the ERM rule hypothesis over a finite hypothesis class is probably approximately correct within a error margin upper bounded by \(\epsilon\), with a confidence of \(1- \delta\).
In the next post, we explore the model of Probably Approximately Correct learning (PAC learning) in more detail.