I’m certainly not a scholar on the topic, but I am interested in the history of Machine Learning, and this morning I discovered a concept known as the Fisher Information. This is the same Sir Ronald Fisher who introduced the Iris Dataset in his famous 1936 paper (the measurements themselves were collected by the botanist Edgar Anderson). It is most certainly a Machine Learning dataset, even though it predates ENIAC, often called the first true computer, which was completed in 1945. The point being that the Iris Dataset was way ahead of its time: it uses measurable characteristics of flowers to determine the species of those flowers. This is a deep idea, in that you have a mathematical classification of species, which I would argue goes beyond the anatomical and brings biology into the mathematical sciences.
But on top of this, and what seem to be many other achievements I don’t know much about, he had a really clever idea about how much information observed data carry about the parameters of a distribution. Specifically, how much does a given probability distribution $p(x;\theta)$ change as a function of its parameter $\theta$? His answer was to look at the derivative of $\log p(x;\theta)$ as a function of $\theta$, though the specific formula used is a bit more complicated. Nonetheless, the basic idea is: how sensitive is a distribution to one of its parameters, and what does that tell me?
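In modern textbook notation (which is not Fisher’s original), that “more complicated” formula is the Fisher information: the expected squared derivative of the log-likelihood with respect to the parameter:

$$I(\theta) = \mathbb{E}_{x \sim p(x;\theta)}\!\left[\left(\frac{\partial}{\partial \theta} \log p(x;\theta)\right)^{2}\right]$$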
This is exactly what Machine Learning engineers do all the time, which is to test the relevance of a dimension. Just imagine you had a dataset with dimensions $x_1$ through $x_n$, and that you have a prediction function on that dataset, $f(x_1, \ldots, x_n)$. Now imagine you add a set of weights $w_i \in [0,1]$, for $i = 1, \ldots, n$, so that you instead consider the function $f(w_1 x_1, \ldots, w_n x_n)$. That is, we’ve added weights that will reduce the contribution of each dimension simply by multiplying by a constant in $[0,1]$. This is one of the most basic things you’ll learn in Machine Learning, and the rate of change in accuracy as a function of each $w_i$ will provide information about how important each dimension is to the prediction function. This is basically what Fisher did, except almost one hundred years ago, effectively discovering a fundamental tool of Machine Learning.
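To make that concrete, here is a minimal sketch of the weight-sensitivity experiment in Python, using scikit-learn and, fittingly, Fisher’s own Iris dataset. The choice of logistic regression and the size of the nudge `eps` are my own illustrative assumptions, not part of the argument above:

```python
# A minimal sketch of the weight-sensitivity idea: scale each dimension
# by a weight w_i, and see how much accuracy drops as w_i moves away
# from 1. The model and step size are illustrative choices.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

def accuracy_with_weights(w):
    """Accuracy of the fixed model when dimension i is scaled by w[i]."""
    return model.score(X * w, y)

n = X.shape[1]
base = accuracy_with_weights(np.ones(n))

# Accuracy is a step function of the weights, so we use a fairly large
# nudge rather than an infinitesimal one to get a visible estimate.
eps = 0.5
for i in range(n):
    w = np.ones(n)
    w[i] -= eps
    sensitivity = (base - accuracy_with_weights(w)) / eps
    print(f"dimension {i}: sensitivity ≈ {sensitivity:.3f}")
```

On Iris, you would likely see the petal measurements register the largest drops, since they separate the three species almost on their own.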
The point is more than just historical. I think Machine Learning is a buzzword that covers up the fact that a lot of this stuff was known a long time ago, that Artificial Intelligence is, generally speaking, far more advanced than the public realizes, and that, as a matter of logical implication, most of what we believe to be new and exciting breakthroughs are mundane adaptations of existing methods and technology. The fact that so much money is being poured into the market is disturbing, because I have no idea what these people do all day.