Chapter 11 – The Eyes of a Machine
Chapter 11 felt like the “oh wow, CNNs are basically math-ified biology” chapter. Ananthaswamy walks through Hubel and Wiesel’s experiments with cat visual cortex—tiny receptive fields in retinal ganglion cells, then simple cells that like specific edge orientations, then complex and hypercomplex cells that pool over space. It’s basically a biological version of feature hierarchies: edges → shapes → objects.
What I liked is how cleanly this story maps to convolutional neural networks. Fukushima’s Neocognitron copies the hierarchy with S-cells and C-cells (feature detection + pooling), and then LeCun’s LeNet turns that into something trainable with backprop and an explicit loss function. Same core idea, but now you can actually optimize it instead of hard-coding all the structure.
From a math point of view, it demystifies CNNs a lot: a convolution is just a sliding dot product with shared weights, and pooling is a simple downsampling operator that trades spatial resolution for invariance and lower dimension. Stack enough of those linear-ish operations plus nonlinearities and you get a super expressive function class that still respects the geometry of images. It’s kind of satisfying that something as “black-boxy” as modern vision models boils down to linear algebra with some clever symmetry assumptions.
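To convince myself of the "sliding dot product" description, here is a minimal NumPy sketch. It is my own illustration, not code from the book: the toy image, the hand-picked edge kernel, and the function names `conv2d_valid` and `max_pool2d` are all assumptions made for the example.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over `image`, taking a dot product at each position.

    This is cross-correlation (no kernel flip), which is what deep learning
    libraries usually call "convolution". No padding, stride 1.
    """
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # same weights at every (i, j)
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; drops trailing rows/cols that don't fit."""
    h = feature_map.shape[0] // size
    w = feature_map.shape[1] // size
    trimmed = feature_map[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

# Toy 8x8 image: dark left half, bright right half (one vertical edge).
image = np.zeros((8, 8))
image[:, 4:] = 1.0

# Hand-picked vertical-edge detector (illustrative, not learned).
kernel = np.array([[-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0],
                   [-1.0, 0.0, 1.0]])

response = np.maximum(conv2d_valid(image, kernel), 0.0)  # ReLU nonlinearity
pooled = max_pool2d(response, size=2)

print(response[0])  # each row is [0, 0, 3, 3, 0, 0]: the filter fires at the edge
print(pooled)       # 3x3 map, each row [0, 3, 0]: coarser, but the edge survives
```

The convolution step is linear in the image, and the ReLU and the max in pooling are the "simple nonlinearities." In a trained CNN the kernel values would be learned by backprop instead of typed in by hand.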
Chapter 12 – Terra Incognita
Chapter 12 zooms out and basically says: “Okay, these models work absurdly well, but the theory is kind of freaking out.” It starts with grokking—tiny networks that first memorize the training set, then, after more training, suddenly line up with the underlying rule and generalize perfectly. That feels like watching gradient descent move from one basin in function space into another, but we don’t really have a clean theory of what that landscape looks like yet.
Then the chapter attacks the standard bias–variance intuition. In the classical picture, you get a U-shaped test error curve in model complexity. Here, deep nets blow that up: you can massively overparameterize, interpolate the training data, and still see test error go down again (“double descent”). It’s like the usual finite-dimensional, fixed-hypothesis-space theory was only telling us a small part of the story, and now we’re operating in some weird infinite-capacity but implicitly regularized regime.
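A small experiment along these lines can be sketched with random features and minimum-norm least squares. This is a toy setup of my own, not one taken from the book: the sine target, the noise level, the random cosine features, and the list of model sizes are all assumptions. Whether a visible bump in test error shows up near the interpolation threshold (number of features close to the number of training points) depends on those choices, so I am not claiming any particular numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

def target(x):
    return np.sin(np.pi * x)

# Small, evenly spaced training set with noisy labels; larger clean-ish test set.
n_train = 20
x_train = np.linspace(-1.0, 1.0, n_train)
y_train = target(x_train) + 0.1 * rng.normal(size=n_train)
x_test = rng.uniform(-1.0, 1.0, size=500)
y_test = target(x_test)

def random_cos_features(x, freqs, phases):
    """Map each scalar input to p random cosine features."""
    return np.cos(np.outer(x, freqs) + phases)

results = {}
for p in [2, 5, 10, 15, 20, 25, 40, 100, 400]:
    freqs = rng.normal(scale=10.0, size=p)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=p)
    phi_train = random_cos_features(x_train, freqs, phases)
    phi_test = random_cos_features(x_test, freqs, phases)
    # When p > n_train there are many exact fits; lstsq returns the one with
    # the smallest 2-norm, a simple stand-in for "implicit regularization".
    coef, *_ = np.linalg.lstsq(phi_train, y_train, rcond=None)
    train_mse = np.mean((phi_train @ coef - y_train) ** 2)
    test_mse = np.mean((phi_test @ coef - y_test) ** 2)
    results[p] = (train_mse, test_mse)

for p, (train_mse, test_mse) in results.items():
    print(f"p={p:4d}  train MSE={train_mse:.2e}  test MSE={test_mse:.2e}")
```

The part I am confident about is the training column: once there are far more features than training points, the model fits the noisy labels essentially exactly. The interesting question the chapter raises is what the test column does on either side of that point.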
The self-supervised part also hit me: predicting masked tokens or masked image patches is just another loss function, but in practice it gives rise to these gigantic foundation models that feel qualitatively different from “classical” supervised setups. LLMs, masked autoencoders, and models like Minerva are all just doing gradient descent on cleverly chosen objectives, yet they start to look like they’re reasoning—or at least faking it convincingly. The chapter doesn’t pretend to resolve whether that’s “real” reasoning, but it makes it clear the math we have isn’t yet strong enough to answer the question cleanly.
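The "data generates its own labels" idea is easy to sketch. The snippet below is a simplified illustration of my own: the `[MASK]` placeholder, the `mask_tokens` helper, and the whitespace tokenization are assumptions for the example, and real systems use more elaborate masking schemes and tokenizers.

```python
import random

MASK = "[MASK]"  # placeholder symbol; real tokenizers define their own

def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Hide a random subset of tokens and return (inputs, targets).

    `targets` maps each masked position to the original token, so the
    training signal comes from the text itself rather than a human label.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_prob * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    inputs = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        inputs[pos] = MASK
    return inputs, targets

tokens = "the cat sat on the mat and watched the rain".split()
inputs, targets = mask_tokens(tokens, mask_prob=0.3)

print(inputs)   # the sentence with three tokens replaced by [MASK]
print(targets)  # position -> original token the model should predict
```

A model trained this way would be scored only on the masked positions, typically with a cross-entropy loss over the vocabulary. No one had to annotate anything, which is why the bottleneck moves from labeled data to raw data and compute.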
Finally, there’s the whole social side: bias, opacity, and the worry that people will over-trust models that are both confident and occasionally very wrong. Plus, the brain analogy is still shaky—backprop isn’t biologically realistic, and the energy/embodiment gaps between brains and GPUs are enormous. The chapter ends up feeling like a status update from the frontier: the experiments are racing ahead, and the theory (and society) is scrambling to catch up.
Takeaways from Chapters 11 & 12
CNNs are just structured linear algebra with priors
Convolutions and pooling can be understood as linear maps plus simple nonlinearities, designed to respect spatial structure and translation invariance. The biology story is inspiring, but the implementation is firmly mathematical.
Hierarchy is doing the heavy lifting
Both in visual cortex and in CNNs, building complex features out of simpler ones is the main trick. Depth isn’t just about more parameters; it’s about composition of functions in a way that matches how data is structured.
Our old generalization story is incomplete
Grokking, benign overfitting, and double descent all point to the same idea: classical bias–variance intuition doesn’t fully describe what’s happening in high-dimensional, overparameterized neural nets trained with SGD. The math here is still very much “terra incognita.”
Optimization dynamics matter as much as architecture
The path gradient descent takes through parameter space (shaped by learning rates, noise from minibatches, and training time) seems to affect whether a model just memorizes or actually “discovers” structure. That suggests we need more dynamical-systems-style thinking about learning, not just static capacity bounds.
Self-supervised learning changes the data game
Using the data to generate its own training signal (masked tokens, masked patches) is mathematically simple but conceptually huge. It shifts the bottleneck from labeled data to compute and raw data, and it’s clearly the engine behind modern LLMs and large vision models.
Brains vs. deep nets is still an open math problem
There are tantalizing similarities (CNNs predicting neural responses, etc.) and big differences (energy efficiency, learning rules, embodiment). The book’s suggestion that the same “elegant math” might underlie both feels plausible—but also like a challenge problem for the next few decades.
“Living With” Why Machines Learn – Semester-Long Reflection
Spending a whole semester “living with” Anil Ananthaswamy’s Why Machines Learn was a very different experience from just powering through a textbook in a week. I liked that the assignment forced me to keep coming back to the same book as my classes progressed, but it also made me notice that this is not an easy book to dip in and out of casually. Some sections are very narrative and historical, and I could read those on the train or between classes. But other sections, especially the ones that really dig into the math behind learning algorithms, demanded a lot more focus. When I tried to read those parts in short bursts during random free pockets of time, I often had to reread pages (or entire subsections) to get my bearings again.
One thing that made the book fun to “live with” was how well Ananthaswamy weaves math and history together. He doesn’t just drop a theorem or an algorithm on you out of nowhere; he walks you through the people, the context, and the intellectual drama around it. Seeing theorems framed by the struggles, misunderstandings, and rivalries of the researchers made the math feel less abstract and more like part of a long, messy conversation. As someone who’s interested in mathematics, I really appreciated how he treated the math seriously while also giving enough narrative to show why the results matter.
In terms of difficulty, I’d say the math was right at my level as a college student who’s comfortable with calculus and linear algebra. I wouldn’t recommend this book to someone who wants a completely math-free story of AI, but I also don’t think non–math-inclined readers should be scared off. The key, in my opinion, is patience. This is not a skim-friendly book: if you just glance at the equations and keep going, you miss a lot of the core ideas. I found that pausing on the formulas, cross-checking the text, and sometimes re-deriving the steps myself made a big difference.
Marking up the book turned out to be one of the best things I could have done while reading it. Normally I’m hesitant to write too much in margins, but here, underlining definitions, drawing little arrows between related equations, and jotting down quick explanations in my own words helped me understand the mathematical concepts more deeply. Over the semester, my copy basically turned into a hybrid between a book and a notebook, and when I had to revisit topics for reflections or class discussions, those notes made it much easier to re-enter the material without starting from scratch.
Overall, Why Machines Learn was a challenging book to “live with” in the sense that it never fully let me stay on autopilot—but that was also what made the experience valuable. The mix of rigorous math, historical storytelling, and conceptual depth meant I had to slow down, reread, and actively engage. By the end of the semester, I felt like I hadn’t just read about machine learning; I’d actually wrestled with its underlying ideas in a way that connected formulas, people, and history into one continuous thread.