The first couple of chapters of the ISLR deal with vocabulary and underlying concepts. I’ve chosen to summarize them in a series of hopefully more engaging and descriptive posts that break the information down topically. If you come from a STEM background, all of this should be at least passingly familiar, if not outright dull. If you don’t have a STEM background, then bear with me, because the material (pretty much from here on out) is going to be dry and sometimes full of strange symbols that will frighten and enrage and confuse you at first. Rest assured, they are not sinister – they are in fact mostly chosen arbitrarily or aesthetically. They are merely tools, and you are here to learn how to make them do your bidding.
I’ll include links that explain various things in more detail if you’re like me and consistently find yourself having anywhere from 30 to 100 browser tabs open at one time that you never really fully read through. So if you’re ready to add to your tab count, or otherwise just want a brief overview of some important basic statistics concepts and vocabulary, then by all means, jump down the rabbit hole!
First, what is statistics? You could just read the Wikipedia Article. A good chunk of what I am going to say probably originated from there anyways. But it’s not really necessary to dive too deep into the world of statistics to at least start on our journey into data science–after all, don’t computers exist as a means to eliminate tedium? To answer the question, statistics in the most boiled down form is the study of populations.
Perhaps a better way of putting it is “What information can I know about a population given some number of facts or details about some members of the population?” In this context, think of population as a collection of, well, anything really. Apples. Rocks. People. Stars. Particles. You get to choose.
How statistics is connected to basically everything we know isn’t really the scope of this discussion, however. We’re here to talk about how to talk about statistics!
Let’s start with some basic vocabulary.
Statistical study always starts with a sample. Sometimes you are so lucky as to have all the relevant information for an entire population, in which case, the sample and the population are the same. Unfortunately, chaos cannot be eliminated from life if life is to continue, so typically we are limited to working with samples. A sample is just a subset of your population. Sampling is an art unto itself, and fortunately the more esoteric bits don’t come up that often in the domain of data science.
Samples have some relevant characteristics of their own, probably the most critical of which is N, the size of the sample, meaning the number of items in it.
Of course, samples and populations are made up of individuals or items. Each item has at least one measured trait, and each recorded value of a trait is an observation. The traits themselves are collectively referred to as variables or features; you will also see them loosely called parameters, as I do here, though statisticians usually reserve that word for numbers that describe a whole population. So if I randomly selected 100 people, collected their age, gender, and income, and asked them if they’d rather learn about statistics or cuddle a cactus, N would be 100, the items would be people, and the age, gender, income, and answers to the question would be the parameters or features.
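If it helps to see that vocabulary as data, here is a minimal Python sketch. Everything in it is invented for illustration: the three rows stand in for the 100 people, and the key names (`age`, `gender`, `income`, `prefers`) are just my guesses at how you might label the columns.

```python
# Hypothetical survey data: every value below is made up for illustration.
sample = [
    {"age": 34, "gender": "F", "income": 52000, "prefers": "cactus"},
    {"age": 51, "gender": "M", "income": 67000, "prefers": "cactus"},
    {"age": 27, "gender": "F", "income": 41000, "prefers": "statistics"},
]

n = len(sample)             # N: the number of items in the sample
features = list(sample[0])  # the traits measured on every item
ages = [person["age"] for person in sample]  # one feature across all items

print(n)
print(features)
print(ages)
```

Each dictionary is one item, each key is a feature, and each value is a single observation.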
Obviously, if I sampled 100 people at random, you’d expect them to differ from one another, and you’d expect those differences to show up in the data–aside from the question answers, which I expect to lean heavily in favor of the cactus. The differences between individuals for a particular parameter are conceptually known as variation. It’s a little hard to capture mathematically though, so we have a few different tools at our disposal to measure it. It’s also related to error, which I’ll get to in a little bit here.
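As a rough sketch of what a few of those tools look like, the snippet below computes the range, the sample variance, and the sample standard deviation for a handful of made-up ages, using Python's standard `statistics` module. The numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```python
import statistics

ages = [34, 51, 27, 45, 38]  # made-up ages, not real data

mean_age = statistics.mean(ages)          # the center of the data
age_range = max(ages) - min(ages)         # crudest measure: largest minus smallest
age_variance = statistics.variance(ages)  # sample variance (divides by n - 1)
age_stdev = statistics.stdev(ages)        # square root of the sample variance

print(mean_age, age_range, age_variance, round(age_stdev, 3))
```

None of these is "the" measure of variation. They just summarize the same spread in different ways, and which one is most useful depends on the situation.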
Statistical distributions are the various ways in which populations are shown to vary. Depending on what you are looking at, there is a really good chance that it will conform to a well known pattern. The most common of these is the normal distribution. Population distributions are hugely useful tools and stand-ins in statistics. It seems very likely that entire future posts will be devoted to the discussion of just a few of the more useful ones. There is one glaring problem with them though – no population ever measured has exactly matched a particular distribution. That difference between statistical ideals and the disheartening limitations of the material plane is one of the big things that prevent us from accurately predicting, well, just about anything really.
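To make the gap between the ideal and the measured a bit more concrete, here is a small simulation sketch. It draws 1,000 values from a standard normal distribution and compares the fraction that land within one standard deviation of the mean against what the ideal curve predicts. The seed, the sample size, and the choice of a standard normal are all arbitrary assumptions on my part.

```python
import random
from statistics import NormalDist

rng = random.Random(0)  # fixed seed so the run is repeatable
ideal = NormalDist(mu=0.0, sigma=1.0)

# Draw a finite sample from the ideal distribution.
draws = [rng.gauss(ideal.mean, ideal.stdev) for _ in range(1000)]

# Fraction of the draws that fall within one standard deviation of the mean.
observed = sum(-1.0 <= x <= 1.0 for x in draws) / len(draws)

# What the ideal distribution says that fraction should be (about 0.683).
expected = ideal.cdf(1.0) - ideal.cdf(-1.0)

print(f"observed {observed:.3f} vs expected {expected:.3f}")
```

Even with data generated directly from the ideal, the observed fraction only lands near the expected one. Real populations are messier still.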
Another key concept is probability. Probability is pretty easy to understand intuitively. After all, it is both easy to grasp and demonstrate via one of humanity’s oldest vices: gambling. Probability is merely the chance, typically expressed as a percentage or a decimal or fraction, that an event will happen given other circumstances or conditions. The simplest example is a toss of a “fair” coin. Either face of the coin has a 50% probability of showing after the toss. But the math of probability can get exceedingly complex. More on this in a future post.
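The fair coin is easy to play with in code. This is only a sketch: the function name `heads_fraction`, the seed, and the toss counts are my own choices, and a pseudo-random number generator is standing in for a physical coin.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable


def heads_fraction(n_tosses):
    """Simulate n_tosses fair coin flips and return the fraction of heads."""
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


# With more tosses, the observed fraction tends to settle closer to 0.5.
for n in (10, 100, 10_000):
    print(n, heads_fraction(n))
```

A short run can easily come out lopsided. The 50% only shows itself reliably over many tosses.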
Finally, I’ll chat a bit about error and margins of error and uncertainty. Most people tend to think of these things as a single concept – and for good reason, seeing as how they are related. But they have individual names for better reasons.
Error is a problem that is fundamental to any and every measurement. It is, quite simply, the difference between a measurement and the “real” value. We do not know with absolute certainty the value of anything in the world–pure math/geometry doesn’t count because it is not in the world per se.
Measurement error results from two fundamental mechanisms: problems with measurement precision and problems with measurement accuracy. Precision is a concept that deals with measurement consistency. Errors in precision are random. Random error is actually a special case of variation. Accuracy is a concept that deals with how close a measurement is to the “real” value of what is being measured. Errors in accuracy are systematic. The study of how to measure stuff is called metrology. Unsurprisingly, it is perhaps best considered as a branch of statistics. Also unsurprisingly, it is absurdly uninteresting and tedious even at the most basic, practical level. Entire fields look upon metrology as a speed bump to progress. Let us consider it no further–as data scientists, we’ll take a professional metrologist’s word as law when applicable, because to do otherwise is to stare into the abyss.
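A quick simulation can make the precision/accuracy split easier to see. In the sketch below, the "real" value is known only because I made it up, and the two imaginary instruments (one with a constant offset, one with a lot of random noise) are assumptions for illustration, not a model of any real device.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable
TRUE_VALUE = 100.0      # the "real" value, known only because this is a simulation


def measure(bias, noise, n=2000):
    """Return n simulated readings with a systematic offset and random scatter."""
    return [TRUE_VALUE + bias + rng.gauss(0.0, noise) for _ in range(n)]


# Consistent readings that are all off by about the same amount.
precise_but_inaccurate = measure(bias=5.0, noise=0.1)
# Readings centered on the real value but scattered widely around it.
accurate_but_imprecise = measure(bias=0.0, noise=5.0)

for name, readings in [
    ("precise but inaccurate", precise_but_inaccurate),
    ("accurate but imprecise", accurate_but_imprecise),
]:
    offset = statistics.mean(readings) - TRUE_VALUE  # systematic error
    spread = statistics.stdev(readings)              # random error
    print(f"{name}: offset {offset:+.2f}, spread {spread:.2f}")
```

The offset is the accuracy problem and the spread is the precision problem. Averaging more readings shrinks the effect of the second but does nothing for the first.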
As a data scientist, you’re not typically responsible for dealing with measurement uncertainty and measurement error. However, the concepts of error, accuracy, and precision are the sweet, colorful fruit of the earliest metrologists and statisticians. We use them all the time as data scientists–they tend to be our most useful tools in a lot of situations. Each comes in different forms and flavors, some more useful or meaningful than others, and the choice of which to use is often one of semantics or preference or situation.
Margin of error and uncertainty are related concepts that try–and dismally fail–to put a number on how meaningful a statistical result is. They are based on the concepts of population distributions and confidence intervals. Probability is easy to grasp in principle and hard to apply in real life; combine that gap with some highly refined dopamine manipulation, and a good handle on uncertainty is what allows Vegas casinos to always come out on top. The practical application of uncertainty is often seen in the fields of Quality Assurance and Risk Assessment, which, practically speaking, are fields that tend to be at least partially populated with metrologists.
Ultimately, when someone claims via statistics that they are 95% confident in something, understand they mean exactly what it sounds like, which is “I’m pretty sure.” The 95% itself is kind of arbitrary, and is the general opinion of a man whom no one should value for his opinions. The word “confidence” doesn’t have some magical meaning in the context of statistics, just a very specific one.
Basically, a 95% confidence interval is the logical equivalent of “our method gave us an acceptable result within this range 95% of the time in the past, so we’re pretty sure this result that we can’t otherwise verify should be acceptable too.” Related to this is statistical hypothesis testing and p-values. The confidence interval/p-value is a measure of how well observed data agrees with or disagrees with expectations–it is not a statement on whether or not the results or even the expectations are correct. Taken alone, a p-value or margin of error or confidence interval is meaningless. Comforting thought, isn’t it?
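The "worked 95% of the time" reading can be checked by simulation, as long as you cheat and invent a population whose true mean you already know. The sketch below does that. The population values, sample size, trial count, and seed are all arbitrary assumptions, and the interval uses the common normal approximation (mean plus or minus roughly 1.96 standard errors), which is only one of several ways to build a confidence interval.

```python
import random
import statistics
from statistics import NormalDist

rng = random.Random(7)           # fixed seed so the run is repeatable
TRUE_MEAN, TRUE_SD = 50.0, 10.0  # invented population, known only in simulation
N, TRIALS = 100, 2000            # sample size and number of repeated samples

z = NormalDist().inv_cdf(0.975)  # about 1.96 for a two-sided 95% interval

hits = 0
for _ in range(TRIALS):
    sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    center = statistics.mean(sample)
    half_width = z * statistics.stdev(sample) / N ** 0.5  # margin of error
    # Did this sample's interval capture the true mean?
    if center - half_width <= TRUE_MEAN <= center + half_width:
        hits += 1

coverage = hits / TRIALS
print(f"intervals containing the true mean: {coverage:.3f}")
```

Roughly 95% of the intervals should contain the true mean. For any single interval, though, you have no way of telling whether it is one of the lucky ones, which is exactly the frustration.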
Even more fun is the fact that hypothesis testing and confidence intervals are some of the most crucial tools we use as data scientists. Used properly, they are effective and powerful. The important thing to know is what they mean and how to intelligently apply them.
Ok, I’m getting off the “hating confidence intervals” soapbox to go cry alone over my frustrations that coming up with something better continues to be beyond my grasp.