
Why This Book?
I picked up Anil Ananthaswamy’s Why Machines Learn because I was stuck in that weird middle ground with LLMs: I use systems like ChatGPT all the time, I see “AI will change everything” headlines constantly, but I didn’t actually know what math was behind the scenes or what was actually happening. I didn’t want another vibes-only “AI will save/destroy the world” book; I wanted something that actually walks through the equations without feeling like a grad-level textbook.
From the very beginning, Ananthaswamy makes it clear he’s trying to do two things at once: tell a human story about LLMs, and unpack the math that makes those stories possible. The prologue frames the book as a tour through linear algebra, calculus, probability, statistics, and optimization, all in service of understanding how machines learn patterns from data and why that matters for society and policy.
For me, that combination of narrative + math + ethics was the sweet spot. In addition, the prologue gives you a good sense of how the book is going to be laid out. Ananthaswamy briefly describes the ways in which humans learn and preps the reader to see concepts repeated throughout the book. While this may sound redundant, he’s actually helping us understand the mathematical concepts more deeply by offering different perspectives on them and their uses.
Chapter 1 — Desperately Seeking Patterns
Chapter 1 dives into what you could call the “origin myth” of machine learning: the perceptron. Ananthaswamy starts with Frank Rosenblatt’s Navy-funded work and the famous press conference where the perceptron was hyped as the embryo of a machine that would one day basically do everything a human can do. He’s pretty honest that the hype was wild and overblown, but he also treats the perceptron as legitimately foundational, not because it actually delivered sci-fi robots, but because it crystallized the core idea of machines learning patterns from data.
The chapter explains the perceptron as the first artificial neural network: a single layer of artificial “neurons” that take in inputs, multiply each input by a weight, add a bias, and then pass the result through a threshold function to decide on a binary output. You can think of it as a very simple “yes/no” decision system. The magic is in how it learns: if it misclassifies an example, it nudges its weights and bias in a direction that would have made the correct answer more likely. Do this repeatedly over many examples, and the model gradually carves out a decision boundary that separates one class from another.
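To make that learning rule concrete for myself, I wrote a minimal sketch in Python. This is my own toy version, not code from the book: the function names, the learning rate, the epoch cap, and the tiny AND dataset are all assumptions I made up for illustration.

```python
def predict(weights, bias, x):
    """Threshold unit: output 1 if the weighted sum plus bias is positive, else 0."""
    total = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if total > 0 else 0


def train_perceptron(examples, epochs=20, lr=1.0):
    """Nudge the weights and bias after every misclassified example."""
    n_features = len(examples[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, label in examples:
            error = label - predict(weights, bias, x)  # -1, 0, or +1
            if error != 0:
                # Move the weights toward (or away from) the input that was misclassified.
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: every example is classified correctly
            break
    return weights, bias


# Made-up toy data: logical AND, which is linearly separable.
AND_DATA = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_DATA)

for x, label in AND_DATA:
    print(x, "->", predict(weights, bias, x), "(expected", label, ")")
```

The only "learning" here is the two update lines inside the `if error != 0` branch. Everything else is bookkeeping.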
Ananthaswamy also brings in the perceptron convergence theorem: if the data are linearly separable, meaning you can separate the two classes with a single straight line (or, more generally, a hyperplane; yes, think back to those calc days with Prof. Ghrist), then a perceptron will find a separating boundary in a finite number of updates just by following its learning rule. That’s a big conceptual moment: you start to see learning as a geometric process. You’re not just throwing numbers into a black box; you’re tilting a line in some feature space until the points fall cleanly on different sides.
What I liked about this chapter is that it also quietly sets up the tension between hype and reality that still exists in AI. The 1950s headlines promised consciousness; what we actually got was a mathematically elegant algorithm that works beautifully on linearly separable data and fails hard when that assumption breaks. That mismatch between grand promises and clean but limited math is kind of the recurring theme of modern AI, and Ananthaswamy uses the perceptron story to show that this dynamic has been there from the very beginning.
Chapter 2 — We Are All Just Numbers Here…
Chapter 2 is where Ananthaswamy leans into the math more explicitly, but in a way that’s still pretty approachable if you’ve seen vectors before. This chapter reframes everything from Chapter 1 in vector language: data points become vectors, and the perceptron’s decision rule becomes a geometric statement about angles and hyperplanes.
Here’s the basic move: instead of thinking of an input as “bedrooms = 3, square footage = 1500, etc.,” you represent that as a vector in a multidimensional space. Each feature is a coordinate. The weights that the perceptron learns are also a vector. Now, the core quantity is the dot product between the weight vector and the input vector. If that dot product plus the bias is positive, the perceptron says “yes”; if it’s negative, it says “no.” (He does a way better job of explaining it than I do, but hopefully you get the idea.)
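Here is how I picture that decision rule in code, using the house example. The feature values come from the example above, but the weights, the bias, and the second house are hypothetical numbers I picked so the arithmetic is easy to follow. A real perceptron would learn them from data.

```python
def dot(u, v):
    """Dot product: multiply matching coordinates and add them up."""
    return sum(a * b for a, b in zip(u, v))


def perceptron_says_yes(weights, bias, x):
    """The decision rule: 'yes' when the dot product plus bias is positive."""
    return dot(weights, x) + bias > 0


# Each house is a vector: [bedrooms, square footage].
house = [3.0, 1500.0]
small_house = [1.0, 600.0]   # hypothetical second data point

# Hypothetical weights and bias, chosen by hand for illustration.
weights = [0.5, 0.002]
bias = -4.0

score = dot(weights, house) + bias  # roughly 1.5 + 3.0 - 4.0
print("score:", score)
print("house ->", perceptron_says_yes(weights, bias, house))
print("small house ->", perceptron_says_yes(weights, bias, small_house))
```

With these made-up numbers the first house lands on the "yes" side and the smaller one on the "no" side, which is all the perceptron ever decides.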
Geometrically, the dot product is tied to the cosine of the angle between two vectors (this is also part of its definition). So the perceptron is basically asking: Is this data point pointing roughly in the same direction as the weight vector or not? The decision boundary, the “line” that separates the two classes, is the set of points where the dot product plus bias equals zero. That set is a hyperplane perpendicular to the weight vector. Training the perceptron becomes a matter of rotating and shifting that hyperplane in vector space until it cleanly separates the labeled data.
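I wanted to check the "perpendicular to the weight vector" claim for myself, so here is a small sketch. The weight vector, bias, and the two boundary points are numbers I chose by hand (assumptions, not from the book) so that everything works out to whole numbers.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    """Length of a vector."""
    return math.sqrt(dot(v, v))


def cosine(u, v):
    """Cosine of the angle between u and v, from dot(u, v) = |u| |v| cos(angle)."""
    return dot(u, v) / (norm(u) * norm(v))


# Hypothetical weight vector and bias; the boundary is 3*x1 + 4*x2 - 12 = 0.
w = [3.0, 4.0]
b = -12.0

# Two hand-picked points that sit exactly on the boundary.
p = [4.0, 0.0]
q = [0.0, 3.0]
assert dot(w, p) + b == 0 and dot(w, q) + b == 0

# A direction that runs along the boundary, from p to q.
along_boundary = [qi - pi for pi, qi in zip(p, q)]

print("cosine(w, along_boundary):", cosine(w, along_boundary))  # 0 means perpendicular
print("cosine(w, 2*w):", cosine(w, [6.0, 8.0]))                  # same direction as w
```

Moving along the boundary never changes the dot product with `w`, which is exactly why the boundary has to be perpendicular to it.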
What’s nice is that the examples stay grounded. You can imagine a practical setup like classifying patients based on medical data to predict some risk outcome. Sorry, I’m probably projecting too much here, but I’m really excited to see what LLMs will end up being able to do for actuaries (they do risk management for insurance companies). Conceptually, though, it’s the same story: each patient is a vector; the model learns a weight vector that separates one group from another as best it can. Moving from abstract math to high-stakes decisions makes the “we are all just numbers” title feel intentionally provocative. How can we put data in its simplest form for these machines to take in when humans are such complicated creatures?
These chapters leave me with so many questions, but the only way to answer them is to keep reading! :) I’ll keep you guys updated. I hope you liked my summaries and interpretations!