Learning From Data Final

2023-11-10T21:59:16-08:00

PDF of problems

Code Repository

1. [e]

Use the formula for the dimensionality of the $Q$-th order polynomial transform of a $2$-dimensional feature space: counting the monomials $x_1^i x_2^j$ with $1 \le i + j \le Q$ (the constant coordinate is excluded) gives $\frac{(Q+1)(Q+2)}{2} - 1 = \frac{Q(Q+3)}{2}$.

$$ \begin{aligned} \frac{Q(Q+3)}{2} &= \frac{10(13)}{2} \\ &= 65 \end{aligned} $$
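
As a sanity check, here is a quick enumeration of the monomials; a minimal sketch, independent of the course code:

```python
# Count the monomials x1^i * x2^j with 1 <= i + j <= Q (constant excluded),
# i.e. the dimension of the Q-th order polynomial transform of 2 features.
Q = 10
dim = sum(1 for i in range(Q + 1) for j in range(Q + 1) if 1 <= i + j <= Q)
print(dim, Q * (Q + 3) // 2)  # both print 65
```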

2. [d]

Consider a target function shaped like a bell curve, where points with extreme feature values are labeled $-1$ and more moderate points are labeled $+1$.

Logistic regression wouldn't be an appropriate model to include in any hypothesis set for learning this target, but for this question we consider the average hypothesis obtained by averaging the $g^{\mathcal{D}}$'s over many data sets. An average of logistic (sigmoid) curves need not itself be a logistic curve, so it's plausible the average would look nothing like anything in the logistic regression hypothesis set.
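
To make that concrete, here is a minimal sketch with two hypothetical one-dimensional logistic hypotheses: if their average were itself a logistic curve $\sigma(wx + b)$, its logit would be linear in $x$, and it isn't.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    return np.log(p / (1.0 - p))

# Two hypothetical logistic hypotheses and their average.
g1 = lambda x: sigmoid(4.0 * x - 2.0)
g2 = lambda x: sigmoid(4.0 * x + 2.0)
g_bar = lambda x: 0.5 * (g1(x) + g2(x))

# If g_bar were sigmoid(w*x + b), then logit(g_bar(x)) would be linear in x,
# so its second difference over equally spaced points would be ~0.
xs = np.array([0.0, 0.5, 1.0])
print(np.diff(logit(g_bar(xs)), n=2))  # clearly nonzero
```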

3. [d]

We know overfitting occurs when we pick a hypothesis such that $E_{in}$ is minimized but we see a larger $E_{out}$ when that hypothesis is used out of sample, compared to other hypotheses. Using this fact as our guiding star, let's go through the answer choices.

[a]. In order for us to have picked one hypothesis over another, it must be the case that one of them had a lower $E_{in}$.

[b]. We must also be able to say that some other hypothesis has lower generalization error than the one selected.

[c]. We must have some estimate of $E_{out}$ for our hypothesis and for some other hypothesis that we should have picked, but no time for regrets. Even if both $E_{out}$ values were equal, the hypotheses must have had different $E_{in}$'s for us to have selected the poorer one, so $E_{out} - E_{in}$ must differ between the two hypotheses.

[e]. We must have a choice between 2 or more hypotheses in order to say we overfit the data by selecting some hypothesis. Overfitting is like admitting you had a choice between 2 women to be your girlfriend and you picked the one who was deceptively sweet at first (low $E_{in}$) but not very sweet once you picked her (high $E_{out}$). Looking back you realize you should have picked the one who didn't put up false pretenses (slightly higher $E_{in}$) and would treat you like a king (low $E_{out}$).

After analyzing incorrect answer choices, we can make a statement about the correct answer choice.

[d]. Comparing $E_{out}-E_{in}$ values is not a principled indicator of overfitting. We may think that a larger $E_{out}-E_{in}$ value corresponds to overfitting, but imagine a case where the hypothesis we pick has $E_{in} = 0$ and $E_{out} = 0.25$, while some other hypothesis has $E_{in} = 0.50$ and $E_{out} = 0.60$. If we go by the difference, the other hypothesis looks better, but in reality it has higher out-of-sample error.

Using the difference between out-of-sample and in-sample error would not reliably detect overfitting, since the hypothesis we pick may very well be the best one for that data set among the available hypotheses, yet still have the larger difference.

4. [d]

Stochastic noise captures the probabilistic essence of real-world target functions, shifting the notion of a target function to a target _distribution_. It's what allows the same input point to have different labels: for instance, two credit card applicants with identical application details, where one is approved and the other is declined.

It does not relate to the hypothesis set.

Why not the other options?

[a]. Real-world problems essentially always have some mix of deterministic and stochastic noise, and the two occur together.

[b]. Deterministic noise very much depends on the hypothesis set. A more complex hypothesis set has less deterministic noise than a simple one when fitting a complex target.

[c]. Deterministic noise captures the intricacies of the target function that cannot be approximated by the hypothesis set, so it certainly depends on the target function.

[e]. Stochastic noise is generated by the probabilistic target distribution; that is why we speak of a target _distribution_ rather than a target _function_.

5. [a]

If $\vec{w}_{lin}$ already lies in the constrained hypothesis set $\mathcal{H}(C)$, the constraint is not active, so regularization changes nothing and $\vec{w}_{reg} = \vec{w}_{lin}$.

6. [b]

Being able to define an augmented error allows us to solve an unconstrained optimization problem instead of a constrained one, with the closed-form solution

$$ \vec{w}_{reg} = (Z^TZ + \lambda{I})^{-1}Z^T\vec{y} $$
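
A minimal numerical sketch of that closed form, assuming a hypothetical data matrix $Z$ and target vector $\vec{y}$:

```python
import numpy as np

# Weight-decay solution w_reg = (Z^T Z + lambda*I)^{-1} Z^T y,
# computed with a linear solve rather than an explicit inverse.
rng = np.random.default_rng(0)
Z = rng.normal(size=(20, 3))   # hypothetical 20 points, 3 features
y = rng.normal(size=20)        # hypothetical targets
lam = 0.1                      # regularization parameter lambda

w_reg = np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)
print(w_reg)
```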

Why the others are incorrect:

[a]. The reverse is true: hard-order constraints can be written as soft-order constraints by using extreme $\gamma$'s on each weight.

[c]. I'm not aware of any direct relation to the VC dimension, beyond the fact that regularization can effectively decrease it, since we shift to counting the effective number of parameters. In practice the constraint (the amount of regularization) is determined by validation.

[d]. Regularization trades increases in $E_{in}$ for decreases in $E_{out}$.

For problems 7 through 10, refer to this output.

    ❯ python3 final/regress.py --digit=1 --other=5
    0 versus all.   E_in: 0.22946  E_out: 0.22770
    1 versus all.   E_in: 0.13770  E_out: 0.13104
    2 versus all.   E_in: 0.10026  E_out: 0.09865
    3 versus all.   E_in: 0.09025  E_out: 0.08271
    4 versus all.   E_in: 0.08943  E_out: 0.09965
    5 versus all.   E_in: 0.07626  E_out: 0.07972
    6 versus all.   E_in: 0.09107  E_out: 0.08470
    7 versus all.   E_in: 0.08847  E_out: 0.07324
    8 versus all.   E_in: 0.07434  E_out: 0.08271
    9 versus all.   E_in: 0.08833  E_out: 0.08819

    K = 0.01 Digit 1 versus 5. E_in: 0.03011  E_out: 0.06840
    K = 1.00 Digit 1 versus 5. E_in: 0.03011  E_out: 0.06132

    ❯ python3 final/regress.py --digit=1 --other=5 --transform
    0 versus all.   E_in: 0.10232  E_out: 0.10663
    1 versus all.   E_in: 0.01234  E_out: 0.02192
    2 versus all.   E_in: 0.10026  E_out: 0.09865
    3 versus all.   E_in: 0.09025  E_out: 0.08271
    4 versus all.   E_in: 0.08943  E_out: 0.09965
    5 versus all.   E_in: 0.07626  E_out: 0.07922
    6 versus all.   E_in: 0.09107  E_out: 0.08470
    7 versus all.   E_in: 0.08847  E_out: 0.07324
    8 versus all.   E_in: 0.07434  E_out: 0.08271
    9 versus all.   E_in: 0.08833  E_out: 0.08819

    K = 0.01 Digit 1 versus 5. E_in: 0.00448  E_out: 0.02830
    K = 1.00 Digit 1 versus 5. E_in: 0.00512  E_out: 0.02594

7. [d]

Without the transform, 8 versus all has the lowest $E_{in}$, at 0.07434.

8. [b]

With the transformation applied, 1 versus all has the lowest $E_{out}$, at 0.02192.

9. [e]

The transformation decreases $E_{out}$ for the 5 versus all classifier from 0.07972 to 0.07922, a marginal improvement of well under 5%.

10. [a]

    K = 0.01 Digit 1 versus 5. E_in: 0.00448  E_out: 0.02830
    K = 1.00 Digit 1 versus 5. E_in: 0.00512  E_out: 0.02594

Going from the larger regularization (1.00) down to the smaller one (0.01), $E_{in}$ decreases from 0.00512 to 0.00448 while $E_{out}$ increases from 0.02594 to 0.02830, which is the signature of overfitting.
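
For reference, here is a minimal sketch of this kind of regularized regression for classification, assuming the feature transform $\Phi(x_1, x_2) = (1, x_1, x_2, x_1 x_2, x_1^2, x_2^2)$ from the problem statement and hypothetical stand-in data; the actual experiment is final/regress.py in the repository.

```python
import numpy as np

def transform(X):
    # (1, x1, x2, x1*x2, x1^2, x2^2), applied row-wise
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1 * x2, x1**2, x2**2])

def fit_ridge(Z, y, lam):
    # w_reg = (Z^T Z + lambda*I)^{-1} Z^T y
    return np.linalg.solve(Z.T @ Z + lam * np.eye(Z.shape[1]), Z.T @ y)

def binary_error(Z, y, w):
    # fraction of points misclassified by sign(w^T z)
    return np.mean(np.sign(Z @ w) != y)

# Hypothetical 2-feature data with +/-1 labels, standing in for the digit features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.sign(X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=200))

Z = transform(X)
for lam in (0.01, 1.0):
    w = fit_ridge(Z, y, lam)
    print(f"lambda={lam}: E_in={binary_error(Z, y, w):.4f}")
```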

11. [c]

First transform each data point using the given transformation, then plot the transformed points (for instance in Desmos) with the first transformed coordinate $z_1$ on the $x$-axis and the second, $z_2$, on the $y$-axis.

The only listed hyperplane that separates the transformed data correctly is the vertical line $0 = 1 \cdot z_1 + 0 \cdot z_2 + (-0.5)$, i.e. $z_1 = 0.5$, which corresponds to choice [c].

12. [c]

    ❯ python3 final/svm.py
    libsvm: 5
    [0.02 0.01 0.01 0.04 0.   0.   0.  ]
    Using threshold of 0.001
    Dual:   4

libsvm reports 5 support vectors, while solving the dual QP directly and thresholding the $\alpha$'s at 0.001 gives 4.
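
As a rough cross-check, here is a sketch of counting support vectors with scikit-learn's SVC, using a large C to approximate a hard margin and the polynomial kernel $(1 + \mathbf{x}^T\mathbf{x}')^2$; the data below is a hypothetical placeholder, not the problem's points, which final/svm.py uses.

```python
import numpy as np
from sklearn.svm import SVC

# Placeholder data (NOT the problem's points), just to show the mechanics.
X = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, -1.0], [-1.0, 0.0]])
y = np.array([-1, -1, 1, 1])

# With these settings the kernel is (gamma * <x, x'> + coef0)^degree = (<x, x'> + 1)^2;
# a very large C approximates the hard-margin SVM.
clf = SVC(C=1e10, kernel="poly", degree=2, gamma=1.0, coef0=1.0)
clf.fit(X, y)
print(len(clf.support_))  # number of support vectors
```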

Refer to the output below for problems 13 through 16.

RBF Model versus RBF Kernel SVM with $\gamma = 1.5$ and 9 clusters.

    ❯ python3 final/rbf.py --centers=9 --gamma=1.5

    Data was inseparable in Z space 0.0%

    SVM Kernel beat RBF Model 84.9%

    SVM E_in:       0.0000
    RBF E_in:       0.0352  and was zero 3.1%

    SVM E_out:      0.0319
    RBF E_out:      0.0545

RBF Model versus RBF Kernel SVM with $\gamma = 1.5$ and 12 clusters.

    ❯ python3 final/rbf.py --centers=12 --gamma=1.5

    Data was inseparable in Z space 0.0%

    SVM Kernel beat RBF Model 79.2%

    SVM E_in:       0.0000
    RBF E_in:       0.0231  and was zero 8.5%

    SVM E_out:      0.0314
    RBF E_out:      0.0447

13. [a]

Practically none of the generated data sets were inseparable in the $\mathcal{Z}$ space (0.0% in the runs above).

14. [e]

With 9 clusters and $\gamma = 1.5$, the kernel form beats the regular RBF model in terms of $E_{out}$ 84.9% of the time.

15. [d]

With 12 clusters, the kernel form wins 79.2% of the time.

16. [d]

Going from 9 to 12 clusters, the regular RBF model's $E_{in}$ drops from 0.0352 to 0.0231 and its $E_{out}$ drops from 0.0545 to 0.0447: both error values go down.
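
For context, here is a minimal sketch of the regular RBF model used in these runs (Lloyd's algorithm for the centers, Gaussian features, pseudo-inverse for the weights), on data generated from the target $f(\mathbf{x}) = \operatorname{sign}(x_2 - x_1 + 0.25\sin(\pi x_1))$ as I recall it from the problem statement; the actual experiment is final/rbf.py in the repository.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

# Generate a training set from the target.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))
y = np.sign(X[:, 1] - X[:, 0] + 0.25 * np.sin(np.pi * X[:, 0]))

K, gamma = 9, 1.5

# Regular RBF model: K centers from Lloyd's algorithm (K-means), then a linear
# fit via pseudo-inverse on the Gaussian features exp(-gamma * ||x - mu_k||^2).
centers = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X).cluster_centers_

def rbf_features(X, centers, gamma):
    return np.column_stack([np.ones(len(X)),
                            np.exp(-gamma * cdist(X, centers, "sqeuclidean"))])

Phi = rbf_features(X, centers, gamma)
w = np.linalg.pinv(Phi) @ y
E_in = np.mean(np.sign(Phi @ w) != y)
print(f"E_in = {E_in:.4f}")
```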

17. [c]

    ❯ python3 final/rbf.py --centers=9 --gamma=1.5

    Data was inseparable in Z space 0.0%

    SVM Kernel beat RBF Model 85.8%

    SVM E_in:       0.0000
    RBF E_in:       0.0347  and was zero 3.3%

    SVM E_out:      0.0318
    RBF E_out:      0.0539

    ❯ python3 final/rbf.py --centers=9 --gamma=2

    Data was inseparable in Z space 0.0%

    SVM Kernel beat RBF Model 87.5%

    SVM E_in:       0.0000
    RBF E_in:       0.0399  and was zero 2.3%

    SVM E_out:      0.0336
    RBF E_out:      0.0594

Going from $\gamma = 1.5$ to $\gamma = 2$ with 9 clusters, the regular RBF model's $E_{in}$ rises from 0.0347 to 0.0399 and its $E_{out}$ rises from 0.0539 to 0.0594: both error values go up.

18. [a]

Referring to the runs above with 9 clusters and $\gamma = 1.5$, the regular RBF model achieves zero in-sample error approximately 3% of the time (3.1% and 3.3% in the two runs shown).

19. [b]

We have a Bayesian prior in this problem. Let's use Bayes' rule:

$$ \begin{aligned} P(h=f|\mathcal{D}) &= \frac{P(\mathcal{D}|h=f)\cdot P(h=f)}{P(\mathcal{D})} \\ &\propto P(\mathcal{D}|h=f)\cdot P(h=f) \end{aligned} $$

We sampled one person and that person ended up having a heart attack. Let's see what information we can draw from this.

The prior is uniform, $h \sim \textrm{Uniform}(0,1)$, so the prior density $P(h=f)$ is constant (equal to $1$ on $[0,1]$). From the data set of size 1 that we sampled, we know that $P(\textrm{1 Heart Attack Patient} | h=f) = h$, since $h(\mathbf{x})$ outputs the probability of a heart attack given some features. Ultimately, we get a modified expression for the posterior

$$ \begin{aligned} P(h=f|\mathcal{D}) &= \frac{P(\mathcal{D}|h=f)\cdot P(h=f)}{P(\mathcal{D})} \\ &\propto P(\mathcal{D}|h=f)\cdot P(h=f) \\ &= h \times 1 \end{aligned} $$

This means the posterior increases linearly with $h$ over $[0,1]$.
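
A quick Monte Carlo sanity check of that conclusion (a sketch, not part of the course code): draw $h$ uniformly, simulate one patient whose heart-attack probability is $h$, and keep only the draws where the patient did have a heart attack; the surviving $h$'s should follow a density proportional to $h$.

```python
import numpy as np

rng = np.random.default_rng(0)
h = rng.uniform(0, 1, size=1_000_000)            # prior draws of h
had_attack = rng.uniform(0, 1, size=h.size) < h  # one patient per draw
posterior = h[had_attack]                        # condition on the observed data

# A density proportional to h puts mass 1/9, 3/9, 5/9 in the thirds of [0, 1].
counts, _ = np.histogram(posterior, bins=[0, 1/3, 2/3, 1])
print(counts / counts.sum())  # approximately [0.111, 0.333, 0.556]
```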

20. [c]

$g(\mathbf{x})$ outputs the average of the predictions of $g_1(\mathbf{x})$ and $g_2(\mathbf{x})$, so by convexity of the squared error its deviation from the target cannot be worse than the average of the deviations of $g_1(\mathbf{x})$ and $g_2(\mathbf{x})$.

In fact, $g$ is strictly better than the average of $g_1$ and $g_2$ whenever $g_1(\mathbf{x}) \neq g_2(\mathbf{x})$, and exactly equal to it when the two agree.
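
Concretely, writing $a = g_1(\mathbf{x}) - f(\mathbf{x})$ and $b = g_2(\mathbf{x}) - f(\mathbf{x})$ for the pointwise deviations, the squared deviation of $g$ satisfies

$$ \left(\frac{a+b}{2}\right)^2 = \frac{a^2 + b^2}{2} - \left(\frac{a-b}{2}\right)^2 \le \frac{a^2 + b^2}{2}, $$

with equality exactly when $a = b$, i.e. when $g_1(\mathbf{x}) = g_2(\mathbf{x})$. Taking the expectation over $\mathbf{x}$ gives $E_{out}(g) \le \tfrac{1}{2}\left(E_{out}(g_1) + E_{out}(g_2)\right)$ for the squared-error measure.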