What’s with the new name?

I’ve recently made the decision to focus this blog on my journey in data science. With that, I wanted to give a brief explanation of the name change, what I’ve been up to recently, and what I plan on doing moving forwards.

For those of you who don’t know what a GINI impurity is, I’ll need to back up a bit. Data science is pretty much the application of statistical learning with computers. Statistical learning is any number of techniques by which data is analyzed so that mathematical predictions or conclusions can be made.

Probably the most common example is linear regression. This is, along with quadratic regressions, a statistical tool that I have used every single day of my professional life. Regression basically means you plot the data and draw a line through it that best represents a general trend in the data. Regression lines are also commonly called trend lines. 

A simple linear regression, image from Wikipedia
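To make "draw a line through it" a bit more concrete, here is a minimal sketch of an ordinary least squares fit in plain Python. The function name `fit_line` and the data points are made up for illustration; in practice you would probably reach for a library rather than roll your own.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that roughly follows y = 2x + 1 (not from any real dataset).
xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]

slope, intercept = fit_line(xs, ys)
print(f"trend line: y = {slope:.2f}x + {intercept:.2f}")
```

The "best" line here is the one that minimizes the sum of squared vertical distances between the points and the line, which is the usual meaning of a trend line.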

GINI impurity (not to be confused with the GINI coefficient!) is relevant to a different type of statistical learning tool called a decision tree. Basically, a decision tree is a specific type of flowchart in which, rather than basing decision pathways on domain knowledge, experience, or intuition, the branches are determined using an optimization algorithm. Decision trees are most commonly used to predict what category or outcome an unknown might represent, given other information about the unknown. They use an algorithm to search for thresholds of features in the data that best correspond to proper identification of the target category or outcome.

A basic decision tree, also from Wikipedia
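As a rough sketch of what "searching for thresholds" can look like, the snippet below tries every midpoint between neighboring values of a single feature and keeps the one that misclassifies the fewest rows. Everything here is an assumption for illustration: the name `best_threshold`, the made-up data, and the choice to score by simple misclassification count. Real decision tree libraries use other scoring criteria and handle many features at once.

```python
def best_threshold(values, labels):
    """Return (threshold, errors) for the single split with the fewest errors.

    Each side of the split predicts its own majority label, and an
    "error" is any row whose label disagrees with that majority.
    """
    distinct = sorted(set(values))
    # Candidate thresholds: midpoints between neighboring distinct values.
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

    best = None
    for t in candidates:
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        errors = 0
        for side in (left, right):
            majority_count = max(side.count(c) for c in set(side))
            errors += len(side) - majority_count
        # Strict "<" keeps the first threshold found when there is a tie.
        if best is None or errors < best[1]:
            best = (t, errors)
    return best


# Made-up feature values and labels, purely for illustration.
values = [5, 8, 10, 30, 45, 60]
labels = ["no", "no", "no", "yes", "no", "yes"]

threshold, errors = best_threshold(values, labels)
print(f"split at {threshold} with {errors} misclassified")
```

A full tree just repeats this idea: pick the best split, then run the same search again inside each resulting group.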

A quick and dirty example: Given data about many of the passengers of the Titanic, predict whether or not a passenger whose actual fate is unknown may have survived. You have two categories into which you are trying to bin the passenger: “alive” or “dead”. You have a table of data for many other passengers that includes features such as port of embarkation, sex, age, cabin class, fare cost, and whether or not they survived the disaster.

Given the data, a decision tree can select a feature, set a threshold or separation criterion for that feature, then evaluate whether or not that split better separates the target categories. The GINI impurity is one way to evaluate this: it measures how mixed the categories are within each group that a split produces, so lower is better. So if we fit a decision tree model to the Titanic data, GINI impurity is one way to judge how effective each “branch” of the tree is at properly categorizing the results. One branch of the tree might look at sex and see that women had a much higher survival rate than men. If the algorithm finds this to be a good predictor of survival, then it will create a branch to bin according to sex and then create new branches based on other features to further refine the previous separation and improve the model accuracy.
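For a group of passengers, the impurity is 1 minus the sum of the squared fractions of each category, so a group that is all “alive” or all “dead” scores 0 and a 50/50 group scores 0.5. Here is a minimal sketch of that calculation for a split on sex. The passenger counts are made up for the arithmetic and are NOT the real Titanic numbers, and the function names are my own.

```python
from collections import Counter


def gini_impurity(labels):
    """1 - sum(p_k ** 2) over the categories present in labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(left, right):
    """Impurity of a split: each side's impurity weighted by its share of rows."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Hypothetical toy counts, not real Titanic data.
women = ["alive"] * 7 + ["dead"] * 3
men = ["alive"] * 2 + ["dead"] * 8
everyone = women + men

before = gini_impurity(everyone)     # 9 alive, 11 dead, all in one group
after = split_impurity(women, men)   # the same people, split by sex

print(f"before split: {before:.3f}")  # 0.495
print(f"after split:  {after:.3f}")   # 0.370
```

Because the impurity drops after the split, sex would count as a useful branch for this toy data. The tree-building algorithm compares candidate splits this way and keeps the one with the lowest weighted impurity.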

I’ll do a more thorough description of decision tree models and GINI calculations at a later date. But if you came here for a long-winded explanation of why this is no longer The Blink Lab, then understand that it’s because one of my instructors included a promo photo from “I Dream of Jeannie” on a slide when he was delivering a lecture about decision trees. We all gave the obligatory chuckle at the bad pun. Three months later, it was literally the first thing that came to mind when I re-designed the blog. The domain was available, and cheap to boot, which just goes to show how humorless most statisticians are (assuming sobriety). Thus the new page was conceived.

Since my last post, quite some time has passed. This is mostly because I have been spending a lot of time watching videos of AWS engineers gamely trying to explain how they charge for so many seemingly identical services. I’ve pretty much hit my wall with that, and so I’ve decided to split my time between trying to stay awake through boring and poorly constructed training videos and going through “An Introduction to Statistical Learning” in order to teach myself R and reinforce my knowledge of stats. One good thing about this exercise is the opportunity to share code and topics on the blog while I do it. I’m already pretty comfortable with Python, so I may even write code in both languages. If it goes well and people seem to like it, then I will step up my game and go through a text that I am a little less comfortable with.

Oh, and I also have a new interactive piece that I am working on. It’s progressed beyond the “cocktail napkin diagrams” phase and is now in the “negligently incomplete and highly suspect planning” phase. Rough cost estimates indicate that even under a best case scenario, I wouldn’t be able to afford it without some kind of professional income, but it’s complex enough that figuring out wiring and building a pattern emulator would probably be a good idea. . . .

5 thoughts on “What’s with the new name?”

  1. So, to be sure I understand, is a decision tree an automated thing? Can you tell me more about “threshold or separation criteria”?

    1. In practice, they are typically implemented via code. You define the data, parameters, and hyperparameters, and the algorithm crunches the numbers for you. The separation criteria, per the Titanic example, might look like dividing by sex, or finding an age threshold of some kind that improves the separation. These are optimized algorithmically, so you don’t have to study the data to determine those types of relationships manually.

  2. Can we create a decision tree about the humor of statisticians? What factors are there besides sobriety? 😉
