Chapters 5 and 6 of Anil Ananthaswamy’s Why Machines Learn felt like the point in the book where the math stopped being just “background” and started to really shape how I think about what modern ML is doing behind the scenes. Chapter 5 zooms in on algorithms that literally operate on the idea that “similar things live near each other,” and Chapter 6 jumps into matrices and dimensionality reduction as tools for dealing with messy, high-dimensional reality. Put together, they made me feel both more impressed by how much structure there is in ML, and more aware of how fragile it can be if we measure “similarity” or “structure” in the wrong way.
Chapter 5 – “Birds of a Feather”
Chapter 5 is built around the idea that a lot of machine learning is basically formalized “birds of a feather flock together.” In math terms, that’s nearest-neighbor methods: if you want to classify a new point, look at the points nearby in feature space and let them vote. The book uses penguins as the running example—classifying species based on measurements like bill length and bill depth—and then brings in Voronoi diagrams and k-nearest neighbors (k-NN) to show how this intuition becomes an actual algorithm.
What I liked is that this chapter quietly attacks the myth that ML is always some mysterious deep network. k-NN is almost insultingly simple: store the training data, choose a distance measure (Euclidean, Manhattan, etc.), and when a new point shows up, find the k closest examples and take a majority vote on their labels. That’s it. There’s no training loop, no weights, no backprop. And yet, if you plot the decision boundaries, you get surprisingly intricate, almost organic-looking shapes. Voronoi diagrams make this especially vivid: they carve space into cells, one per training point, where each cell holds every location that is closer to that point than to any other.
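To convince myself it really is that simple, here is a minimal sketch of the procedure in plain Python. The function name `knn_predict` and the penguin-ish numbers are my own inventions for illustration (they are not real measurements and not taken from the book), and I’m assuming Euclidean distance.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Training" is just storing the data; all the work happens at query time.
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),  # Euclidean distance
    )
    votes = Counter(label for _, label in by_distance[:k])
    # On a tie, the tied label that was counted first (the nearer one) wins.
    return votes.most_common(1)[0][0]

# Invented (bill length mm, bill depth mm) pairs, NOT real measurements.
points = [(38.0, 18.5), (39.5, 17.8), (37.2, 19.0),
          (47.0, 14.8), (48.5, 15.2), (46.2, 14.5)]
labels = ["Adelie"] * 3 + ["Gentoo"] * 3

print(knn_predict(points, labels, (47.5, 15.0)))  # Gentoo
print(knn_predict(points, labels, (38.5, 18.2)))  # Adelie
```

Swapping `math.dist` for a different distance function is the only change needed to try another notion of “near,” which is part of why the choice of metric matters so much.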
The interesting tension is between flexibility and overfitting. If you set k = 1, the algorithm memorizes the training set: every point gets its own little Voronoi region, and the decision boundary snakes wildly around every noisy detail. As you increase k, you smooth things out, and the decision boundary becomes more stable but potentially less sensitive to subtle differences. The chapter uses this to show, in a very visual way, what “generalization” actually means: it’s not about scoring perfectly on the data you’ve already seen (k = 1 does that trivially); it’s about drawing boundaries that still hold up on new points and don’t freak out whenever one data point is slightly weird.
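A tiny one-dimensional toy makes the effect of k concrete. This is my own made-up example, not one from the book: a single “B” point sits in the middle of “A” territory, and a query lands right next to it.

```python
from collections import Counter

def knn_predict(xs, labels, query, k):
    """Majority vote among the k training values closest to `query` (1-D)."""
    nearest = sorted(zip(xs, labels), key=lambda pair: abs(pair[0] - query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

# Invented data: the point at 2.5 is the odd one out among the A's.
xs     = [1.0, 2.0, 2.5, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
labels = ["A", "A", "B", "A", "A", "B", "B", "B", "B"]

query = 2.6
for k in (1, 3, 5):
    print(k, knn_predict(xs, labels, query, k))
# k=1 follows the lone odd point and says B; k=3 and k=5 say A.
```

With k = 1 the one strange point gets to decide; with a larger k its neighbors outvote it.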
As someone who’s seen k-NN in a stats/ML context before, this chapter still felt useful because it connected the algorithm to geometry. Thinking in terms of Voronoi cells and distances makes it clearer that k-NN quietly bakes in a lot of assumptions:
- That the chosen distance metric actually captures the right notion of “similarity.”
- That the dataset is dense enough in the regions we care about.
- That the features have been scaled and chosen in a way that makes sense for distance-based methods.
If those assumptions fail, “birds of a feather” quickly turns into “random dots in a weirdly stretched feature space.”
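The scaling assumption is the easiest one to see break. In this sketch I pair bill length (millimeters) with body mass (grams). Body mass is my own addition for the sake of the example, and all the numbers are invented. Because grams run into the thousands, they swamp the millimeters in a raw Euclidean distance; standardizing each feature to z-scores (one common choice, not the only one) changes which neighbor is “nearest.”

```python
import math
from statistics import mean, pstdev

def nearest_label(points, labels, query):
    """Label of the single closest training point (1-NN, Euclidean)."""
    dists = [math.dist(p, query) for p in points]
    return labels[dists.index(min(dists))]

def standardize(points, query):
    """Rescale every feature to zero mean and unit standard deviation."""
    columns = list(zip(*points))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    def scale(p):
        return tuple((v - m) / s for v, m, s in zip(p, mus, sigmas))
    return [scale(p) for p in points], scale(query)

# Invented (bill length mm, body mass g) pairs, NOT real measurements.
points = [(39.0, 3700), (38.0, 4300), (47.0, 3750), (48.0, 4350)]
labels = ["short-billed", "short-billed", "long-billed", "long-billed"]
query = (39.5, 3745)

print(nearest_label(points, labels, query))        # long-billed: mass dominates
scaled_points, scaled_query = standardize(points, query)
print(nearest_label(scaled_points, labels, scaled_query))  # short-billed
```

Same data, same query, different answer, purely because of the units.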
I also appreciated the subtle human angle: the idea that a lot of ML systems we interact with (recommendation systems, simple classifiers, even some content filters) are basically doing nearest-neighbor-ish things in some high-dimensional embedding space. From the outside, it feels like “the algorithm knows me”; from the inside, it’s just “you are close to these other users who liked X, so here’s X.” This chapter made that feel less magical and more mechanical, but in a good way.
Chapter 6 – “There’s Magic in Them Matrices”
Chapter 6 pivots from “who are your neighbors?” to “what do we even do with data that has way too many dimensions?” Ananthaswamy anchors the chapter in a real medical example: anesthesiologist Emery Brown and colleagues trying to understand patients’ brain states during anesthesia from EEG data. One electrode recording over hours can already produce a huge number of features; multiply that across electrodes and time and you’re suddenly in a terrifyingly high-dimensional space.
The chapter uses this to motivate Principal Component Analysis (PCA) and, more broadly, the role of matrices, eigenvalues, and eigenvectors in ML. The basic PCA idea, as the book (and some of the summaries I looked at) describe it, is:
- You have high-dimensional data with a ton of correlated features.
- You build a covariance matrix that captures how the features move together.
- You compute eigenvectors and eigenvalues of that matrix.
- The eigenvectors with the largest eigenvalues give you “directions” in feature space where the data varies the most.
- You project the original data onto a smaller set of those directions—your principal components.
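Written out, the steps above look roughly like this. The notation is mine, not necessarily the book’s: I’m assuming n data points x_i in d dimensions, the usual sample covariance with an n - 1 denominator, and k as the number of components kept.

```latex
\begin{align*}
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i
  && \text{(mean of the data)} \\
C &= \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})^{\top}
  && \text{(covariance matrix, } d \times d\text{)} \\
C\,v_j &= \lambda_j v_j, \quad \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0
  && \text{(eigenvectors and eigenvalues)} \\
z_i &= V_k^{\top}(x_i - \bar{x}), \quad V_k = [\,v_1 \;\cdots\; v_k\,]
  && \text{(projection onto the top } k \text{ directions)}
\end{align*}
```

Each eigenvalue is the variance of the data along its eigenvector, which is why sorting by eigenvalue is the same as sorting directions by how much the data spreads along them.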
Visually, the “baby PCA” example is nice: imagine 2D data made of circles and triangles; you rotate your axes so that one axis lines up with the direction where the data spreads out the most, then project everything onto that line. When the two classes happen to differ along that direction, a messy blob becomes a one-dimensional separation that a simple classifier can handle. (PCA never looks at the labels, so that part isn’t guaranteed.)
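Here is a sketch of that 2D case in plain Python. For a 2x2 covariance matrix the top eigenvector has a closed form, so no linear algebra library is needed. The function name `pca_2d` and the circle/triangle coordinates are invented for illustration, and this shortcut only works in two dimensions.

```python
import math

def pca_2d(points):
    """Return (first principal direction, eigenvalues, 1-D projections)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)

    # Angle of the direction of greatest variance (top eigenvector).
    theta = 0.5 * math.atan2(2 * b, a - c)
    direction = (math.cos(theta), math.sin(theta))

    # Closed-form eigenvalues of a symmetric 2x2 matrix, largest first.
    half_trace = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    eigenvalues = (half_trace + radius, half_trace - radius)

    # Project each centered point onto the first principal direction.
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    return direction, eigenvalues, projected

# Invented data: three "circles" followed by three "triangles".
data = [(1.0, 1.2), (1.5, 1.4), (2.0, 2.1),
        (4.0, 3.9), (4.5, 4.6), (5.0, 4.8)]
direction, eigenvalues, projected = pca_2d(data)
print(direction)
print(eigenvalues[0] / sum(eigenvalues))  # share of variance kept
print(projected)  # circles come out negative, triangles positive
```

In this toy the classes sit at opposite ends of the main axis of spread, so one number per point is enough to tell them apart. That is a property of the data I made up, not something PCA promises.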
What hit me is that PCA is doing something similar to what k-NN does—finding underlying structure in data—but at a more global, algebraic level. Instead of asking “who are your neighbors?” it asks “if I step back and look at this cloud of points, what are the main directions in which it actually changes?” That feels almost philosophical: we’re treating the dataset like an object with its own geometry, and matrices are the language that lets us interrogate that geometry.
The chapter also brushes up against the “curse of dimensionality”: once you have too many dimensions, the nearest and farthest points end up almost the same distance away, so “nearest” stops meaning much and algorithms like k-NN start to break down. PCA is one response to that curse: compress the data down to the few directions that carry most of the variance and drop the rest, on the bet that what you drop is mostly noise. Of course, that comes with risk, since a low-variance direction can still carry something important, but the alternative is drowning in dimensions where everything looks equally far apart.
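The “everything looks equally far apart” part is easy to check numerically. This sketch is my own experiment, not the book’s: it scatters points uniformly at random in a unit cube (real data is rarely that uniform, so treat it as a worst-case cartoon) and measures how much farther the farthest point is than the nearest one, relative to the nearest.

```python
import math
import random

def distance_contrast(dim, n_points=200, seed=0):
    """(farthest - nearest) / nearest distance from one random query point."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    query = [rng.random() for _ in range(dim)]
    dists = [
        math.dist(query, [rng.random() for _ in range(dim)])
        for _ in range(n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)

for dim in (2, 10, 100, 1000):
    print(dim, round(distance_contrast(dim), 3))
```

In two dimensions the contrast is large, because some points are right next to the query and others are across the square. By a thousand dimensions it shrinks to a small fraction, which is the sense in which a “nearest” neighbor is barely nearer than anything else.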
From a student perspective, I liked that this chapter basically gives a narrative reason to care about eigenvalues and eigenvectors beyond “they’re on the exam.” Seeing them show up in the context of real EEG data and clinical decision-making (like tuning anesthesia) made the math feel surprisingly consequential.
What I’m Taking Away
A few things I’m walking away with after Chapters 5 and 6:
- Similarity is a choice, not a fact. Whether two data points are “near” depends entirely on the features we pick, how we scale them, and what distance metric we use. That’s a design decision, not a neutral truth.
- Compression is opinionated. PCA feels like a neutral way to “summarize” data, but it optimizes for variance, not for fairness, interpretability, or any particular downstream goal. What gets compressed away might matter a lot, especially in high-stakes contexts.
- Math shapes ethics. It’s easy to think of bias and error in ML as social issues tacked on at the end, but these chapters show that a lot of the fairness and reliability questions start with very technical choices: which neighbors we trust, which directions we keep, which dimensions we ignore.
- Understanding the tools changes how I see AI hype. When people claim “the algorithm found patterns no human could see,” I now instinctively want to ask: okay, but what was the distance metric? What was the dimensionality reduction step? It makes the whole thing feel less mystical and more like a very elaborate set of coordinate choices.
Overall, Chapters 5 and 6 made the math behind ML feel more grounded and, weirdly, more human. They’re about how we formalize intuition—“things that are similar are close together,” “high-dimensional stuff actually lives on some simpler shape”—and then hand those formalizations to machines. The book keeps insisting that if we want to live in a world increasingly shaped by these systems, we can’t treat that math as someone else’s problem. After these chapters, I’m starting to believe that.