Chapter 9 – “The Man Who Set Back Deep Learning (Not Really)”
This chapter is all about the Universal Approximation Theorem and what it actually says versus what people later claimed it implied. Ananthaswamy uses George Cybenko’s 1989 result to show that a neural network with one hidden layer and sigmoid activations can approximate any continuous function on a compact domain (Cybenko works on the unit hypercube) to any desired accuracy, provided you allow enough hidden units.
We first go back to Minsky and Papert’s critique of single-layer perceptrons: they proved that simple perceptrons can’t solve problems like XOR because they only implement linear decision boundaries. That critique was sometimes (over)extended to multi-layer networks, even though the math never actually ruled those out. Cybenko enters the story wanting to formalize what a single-hidden-layer network can do, not to shut down deeper architectures.
Mathematically, Cybenko’s proof treats functions as vectors in (possibly infinite-dimensional) spaces. If you discretize a function by sampling it at many points, you can think of it as a long vector. Then a network with a single hidden layer effectively builds up complex shapes by linearly combining many shifted and scaled sigmoid curves, a bit like approximating a function using a flexible basis expansion. The theorem is an existence result: it guarantees you can approximate a target function with such a network, not that doing so is efficient or practical.
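To make the "linear combination of shifted and scaled sigmoids" picture concrete, here is a minimal sketch that hand-constructs a one-hidden-layer network as a staircase of steep sigmoids. This is an illustrative construction only, not the argument in Cybenko’s proof (which is non-constructive). The target function, the `steepness` value, and all function names are assumptions chosen for the example.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def target(x):
    """Hypothetical target function to approximate on [0, 1]."""
    return math.sin(2.0 * math.pi * x)

def build_network(n_units, steepness=20.0):
    """Hand-construct (not train) a one-hidden-layer sigmoid network.

    Hidden unit i computes sigmoid(w * x + b_i): a steep step centred on the
    midpoint between grid points i-1 and i. Its output weight is the jump
    target(x_i) - target(x_{i-1}), so the sum traces a smoothed staircase.
    """
    h = 1.0 / n_units                      # grid spacing
    w = steepness / h                      # shared input weight (step sharpness)
    biases, out_weights = [], []
    for i in range(1, n_units + 1):
        midpoint = (i - 0.5) * h
        biases.append(-w * midpoint)       # shifts the step to the midpoint
        out_weights.append(target(i * h) - target((i - 1) * h))
    out_bias = target(0.0)
    return w, biases, out_weights, out_bias

def predict(net, x):
    """Output = out_bias + sum_i out_weights[i] * sigmoid(w * x + biases[i])."""
    w, biases, out_weights, out_bias = net
    return out_bias + sum(c * sigmoid(w * x + b)
                          for c, b in zip(out_weights, biases))

def max_error(n_units, n_test=1001):
    """Largest absolute error over an evenly spaced test grid on [0, 1]."""
    net = build_network(n_units)
    xs = [k / (n_test - 1) for k in range(n_test)]
    return max(abs(predict(net, x) - target(x)) for x in xs)

for n in (10, 50, 200):
    print(n, "hidden units -> max error", round(max_error(n), 4))
```

The error shrinks as hidden units are added, which is the existence claim in miniature. Note how many units it takes even for a smooth one-dimensional function: nothing here says the construction is efficient.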
The chapter also notes that in high dimensions, the number of required neurons may explode because of the curse of dimensionality. So, in principle, a shallow wide network is universal, but in practice you might need a ridiculous number of units. The theorem never claimed that shallow networks were sufficient in practice, which is why the idea that it “delayed deep learning” is unfair: what really mattered later was data, compute, and better training algorithms, not just the theoretical possibility of shallow universality.
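A rough way to feel the curse of dimensionality is to count the cells in a regular grid over the input space. The sketch below assumes a naive grid-based approximation with one building block per cell. That is only one possible construction, not a statement about the minimum number of neurons any particular function needs. The function name and the choice of 10 points per axis are assumptions.

```python
def grid_cells(points_per_axis, dims):
    """Number of cells in a regular grid over the unit hypercube [0, 1]^dims."""
    return points_per_axis ** dims

for d in (1, 2, 10, 100):
    print(f"d = {d:>3}: {grid_cells(10, d):.3e} cells at 10 points per axis")
```

Even at a modest resolution, the count grows exponentially with the input dimension, which is why a guarantee of existence can coexist with hopeless impracticality.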
Chapter 10 – “The Algorithm That Put Paid to a Persistent Myth”
Chapter 10 shifts from expressive power to trainability, focusing on backpropagation and the myth that Minsky and Papert somehow “killed” neural networks. Ananthaswamy follows Geoffrey Hinton’s trajectory: he becomes obsessed with how brains represent and store information, and ends up defending the idea that neural nets were never truly dead—just waiting for the right math + hardware combo.
The chapter walks through the delta rule for a single neuron: define a loss (like squared error), take gradients with respect to the weights and bias, and update via gradient descent. This is standard calculus: power rule, chain rule, partial derivatives in higher dimensions. The XOR problem reappears as the canonical example of why single-layer models aren’t enough: you need hidden layers to combine multiple linear boundaries into a nonlinear decision surface, and you need differentiable activations (e.g., sigmoid) in place of the perceptron’s hard threshold, whose gradient is zero almost everywhere and so gives gradient descent nothing to follow.
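The delta rule described above can be sketched in a few lines. This is a minimal illustration, assuming a single sigmoid neuron, a squared-error loss, per-example updates, and zero initial weights. The learning rate, epoch count, and function names are assumptions, not values from the book.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(data, lr=1.0, epochs=10000):
    """Fit one sigmoid neuron with the delta rule (per-example gradient descent).

    Loss per example: L = 0.5 * (y - t)^2 with y = sigmoid(w1*x1 + w2*x2 + b).
    Chain rule: dL/dw_i = (y - t) * y * (1 - y) * x_i
                dL/db   = (y - t) * y * (1 - y)
    """
    w = [0.0, 0.0]
    b = 0.0
    for _ in range(epochs):
        for (x1, x2), t in data:
            y = sigmoid(w[0] * x1 + w[1] * x2 + b)
            delta = (y - t) * y * (1.0 - y)   # error times sigmoid derivative
            w[0] -= lr * delta * x1
            w[1] -= lr * delta * x2
            b -= lr * delta
    return w, b

def classify(w, b, x):
    """Threshold the neuron's output at 0.5 to get a class label."""
    return 1 if sigmoid(w[0] * x[0] + w[1] * x[1] + b) > 0.5 else 0

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

for name, data in (("AND", AND), ("XOR", XOR)):
    w, b = train_neuron(data)
    preds = [classify(w, b, x) for x, _ in data]
    print(name, "targets", [t for _, t in data], "predictions", preds)
```

AND is linearly separable, so the single neuron can fit it. XOR is not, so no setting of two weights and a bias can classify all four points correctly, however long training runs.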
Backpropagation is presented as a clever way to apply the chain rule efficiently across layers. Instead of manually computing partial derivatives for every parameter in a huge network, you do a forward pass (compute all activations and store intermediates) and then a backward pass where you propagate the error gradients layer by layer, from the output back toward the input. This reuses intermediate quantities, so computing all the gradients costs roughly a small constant multiple of one forward pass, rather than a separate pass per parameter. Random initialization of weights breaks symmetries: hidden units that start with identical weights receive identical gradients and stay identical, so they need different starting points to learn different features.
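As a sketch of the forward and backward passes, here is a tiny network (2 inputs, 3 sigmoid hidden units, 1 sigmoid output) with hand-written backprop on XOR, plus a finite-difference check that the backprop gradient matches a numerical estimate. The architecture, seed, learning rate, step count, and all names are assumptions made for illustration. Gradient descent on such a small network can settle in a poor local minimum, so no particular final loss is claimed.

```python
import copy
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

XOR = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]

def init_params(n_hidden=3, seed=0):
    """Small random weights so hidden units do not start out identical."""
    rng = random.Random(seed)
    return {
        "W1": [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(n_hidden)],
        "b1": [0.0] * n_hidden,
        "W2": [rng.uniform(-1, 1) for _ in range(n_hidden)],
        "b2": 0.0,
    }

def forward(p, x):
    """Forward pass: return hidden activations (kept for reuse) and the output."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(p["W1"], p["b1"])]
    y = sigmoid(sum(w * hj for w, hj in zip(p["W2"], h)) + p["b2"])
    return h, y

def loss(p, data):
    """Summed squared error over the dataset."""
    return sum(0.5 * (forward(p, x)[1] - t) ** 2 for x, t in data)

def backprop(p, data):
    """Backward pass: gradient of the summed squared error for every parameter."""
    n_hidden = len(p["W2"])
    g = {"W1": [[0.0, 0.0] for _ in range(n_hidden)], "b1": [0.0] * n_hidden,
         "W2": [0.0] * n_hidden, "b2": 0.0}
    for x, t in data:
        h, y = forward(p, x)
        delta_out = (y - t) * y * (1.0 - y)        # dL/dz at the output unit
        g["b2"] += delta_out
        for j in range(n_hidden):
            g["W2"][j] += delta_out * h[j]
            # Propagate the output error back through W2 and the hidden sigmoid.
            delta_hid = delta_out * p["W2"][j] * h[j] * (1.0 - h[j])
            g["b1"][j] += delta_hid
            for i in range(2):
                g["W1"][j][i] += delta_hid * x[i]
    return g

def numeric_grad_W1(p, data, j, i, eps=1e-5):
    """Central finite difference for one weight, used only as a sanity check."""
    plus, minus = copy.deepcopy(p), copy.deepcopy(p)
    plus["W1"][j][i] += eps
    minus["W1"][j][i] -= eps
    return (loss(plus, data) - loss(minus, data)) / (2.0 * eps)

def gradient_step(p, g, lr=0.5):
    """One full-batch gradient descent update, applied in place."""
    for j in range(len(p["W2"])):
        p["W2"][j] -= lr * g["W2"][j]
        p["b1"][j] -= lr * g["b1"][j]
        for i in range(2):
            p["W1"][j][i] -= lr * g["W1"][j][i]
    p["b2"] -= lr * g["b2"]

params = init_params()
grads = backprop(params, XOR)
print("backprop :", grads["W1"][0][1])
print("numerical:", numeric_grad_W1(params, XOR, 0, 1))

for step in range(5000):
    gradient_step(params, backprop(params, XOR))
print("loss after training:", loss(params, XOR))
```

The finite-difference check needs two extra loss evaluations per weight, while `backprop` produces every gradient from a single forward and backward sweep per example. That difference is the efficiency argument in the paragraph above.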
Historically, the algorithm has a messy lineage: Werbos describes it in his 1974 PhD thesis, and Rumelhart, Hinton, and Williams popularize it in 1986. Backprop enables early successes like LeNet for handwritten digit recognition, but progress stalls for a while because computers are too slow and other methods (like SVMs) are easier to use. Once GPUs and massive datasets arrive, backprop-trained deep networks (e.g., AlexNet in 2012) suddenly dominate, and the earlier “neural nets are dead” narrative looks pretty silly in hindsight.
Key Takeaways (Chapters 9–10)
- Expressivity vs. Efficiency
  - Cybenko’s Universal Approximation Theorem says a single-hidden-layer network with sigmoids is theoretically universal, but it says nothing about how many neurons you need or how long training takes. Existence ≠ practicality.
- Functions as High-Dimensional Objects
  - Thinking of functions as vectors in large or infinite-dimensional spaces makes neural networks feel a lot like basis expansions: sigmoids (or other activations) are building blocks, and the network finds coefficients to approximate a target function.
- Depth Is About Structure, Not Just Power
  - Even if a huge shallow network can approximate any function, deep architectures encode useful compositional structure (features built on top of features), often with far fewer parameters. The math of approximation doesn’t automatically tell you the best architecture.
- Backprop Is Just the Chain Rule at Scale
  - Conceptually, backprop is not mystical: it’s just a systematic way to apply the multivariable chain rule through a computation graph. If you’re comfortable with calculus, you’re already most of the way there.
- Training Is as Important as Theory
  - Universal approximation is nice, but useless without an algorithm to actually find good parameters. Backprop is the bridge between “this network can represent the function” and “this network actually learns the function from data.”
- History Is Messy and Nonlinear (Just Like the Models)
  - Neural networks didn’t die because of one book or one theorem. Progress depended on a mix of theoretical insights, empirical results, hardware advances, and cultural trends in the research community.
- For Math-Oriented Readers
  - These chapters are a reminder that core deep learning ideas rest on very familiar math: linear algebra, analysis of function spaces, and gradient-based optimization. The fancy stuff sits on top of tools you probably already know from undergrad math.