If people paid attention to simple statistics and did not get carried away by machine learning, they would unravel quite a bit. A strong foundation in statistics makes for a firmer understanding of machine learning than is otherwise possible.
We live at a time when people gravitate to a high level without a corresponding foundation. I am talking specifically of the hundreds of thousands of people who are fascinated with machine learning (ML) algorithms but lack any foundation in statistics, which really is the building block of ML algorithms. I have been disturbed by many students talking of multiple regression and related algorithms while lacking any understanding of basic statistical concepts and relationships.
One of the commonest such gaps lies in the understanding of averages, so central to our attempts to grasp the world. One of the chief requirements of any average is that it be representative of the individual values of which it is the average. For this to happen, the individual values should not be far apart from one another. This deviation of individual values from their average is measured by the standard deviation, which is really an average deviation. Taking one step further, with simple arithmetic we can express the standard deviation as a percentage of the average value. If this percentage is high, the average is clearly not representative, as the deviations from it are large; the contrary holds if it is low. In statistics this is called the ‘coefficient of variation’, a simple and fine expression that conveys very well what it intends. It is a basic concept taught in the early part of any standard textbook on statistics. In the five years that I have been interviewing students for scholarships, I have yet to come across one who knew this. This is appalling.
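The arithmetic above is small enough to fit in a few lines. Here is a minimal sketch in Python (the data values are made up purely for illustration): the same function applied to a tight cluster and to a scattered set shows how the coefficient of variation flags when an average stops being representative.

```python
import statistics

def coefficient_of_variation(values):
    """Standard deviation expressed as a percentage of the average."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # population standard deviation
    return 100 * sd / mean

# Tightly clustered values: low CV, the average is representative
print(round(coefficient_of_variation([48, 50, 52, 49, 51]), 1))  # → 2.8

# Widely scattered values with the same average of 50: high CV
print(round(coefficient_of_variation([5, 90, 20, 70, 65]), 1))   # → 64.2
```

Both lists average 50, yet only in the first does 50 tell you much about any individual value.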
Regression forms a substantial chunk of machine learning algorithms, but I have encountered so many students who could not explain the basic regression equation, let alone interpret it. A regression equation helps us estimate the value of one variable given the value of another, with the first depending on the second. In statistical language, one is an independent variable which is used to estimate the dependent variable. If many factors play a role in estimating the value of the other variable, we call it multiple regression. Anyone using this equation ought to know this; otherwise the understanding is vacuous, despite the seemingly sophisticated machine learning language. Whether such an equation is effective in estimating the dependent variable is a function of how well constructed the equation is, and there are ways to decipher this. I keep getting the feeling that such students are on a higher floor, constructed without a foundation.
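To make the paragraph concrete, here is a sketch of the basic (one-variable) regression equation, fitted by ordinary least squares. The data (hours studied vs. exam score) are invented for illustration; the R² function shows one of the "ways to decipher" how well the fitted equation actually estimates the dependent variable.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x: b is the slope, a the intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    return a, b

def r_squared(x, y, a, b):
    """Share of the variation in y that the fitted line explains (0 to 1)."""
    my = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_res / ss_tot

# Hypothetical data: hours studied (independent) and exam score (dependent)
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 65, 70]

a, b = fit_line(hours, scores)
print(a, b)                         # intercept and slope of the equation
print(a + b * 6)                    # estimated score for 6 hours of study
print(r_squared(hours, scores, a, b))  # how well the equation fits
```

Multiple regression follows the same idea with several independent variables, each contributing its own slope.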
I am keeping this short, but there is one more aspect that I simply have to talk about: sampling. You don’t have to be a student of statistics to recognize that we cannot study an entire population, whatever that may be – cancer patients, students taking to AI, companies employing women executives and so on. We study only a sample, from which we derive conclusions about the entire population. Again, it is obvious that how the sample is chosen and studied is central to our understanding of the population’s behaviour. The size of the sample is an absolutely critical factor, and there is a basic formula that helps us choose it wisely. It is distressing to note that no one (except, of course, specialized professionals) is aware of this.
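The source does not name the formula it has in mind; a common textbook candidate is the sample size needed to estimate a proportion, n = z²·p·(1−p)/e². A minimal sketch under that assumption:

```python
import math

def sample_size(z=1.96, margin_of_error=0.05, proportion=0.5):
    """Minimum sample size for estimating a population proportion.

    Textbook formula n = z^2 * p * (1 - p) / e^2, where z is the
    value for the desired confidence level (1.96 for 95%), e is the
    acceptable margin of error, and p = 0.5 is the most conservative
    assumption about the unknown proportion.
    """
    n = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    return math.ceil(n)

# 95% confidence, 5% margin of error: the familiar "about 385 people"
print(sample_size())                        # → 385

# Tightening the margin of error to 3% raises the requirement sharply
print(sample_size(margin_of_error=0.03))    # → 1068
```

Note how the required size depends on the margin of error and confidence level, not on the population size: this is exactly the kind of non-obvious result the formula makes wise use of.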
This can be easily addressed by suitably designing the syllabus. Statistics is a mature discipline; relatively speaking, ML is not. Common sense dictates that you take the help of the mature discipline to create a better foundation for the relatively new one. Will someone pay heed?
Takeaways
Simple statistics is often enough to uncover insight
The statistical roots of machine learning need to be emphasized, lest there be just a technology focus
Photo by Alex Knight from Pexels, Image by Tumisu from Pixabay