What I Learned from a Data Science Bootcamp

Those of you who know me personally may be aware that I recently completed General Assembly’s Data Science Intensive Fellowship. I’ve had a few months to internalize the experience, figure out what other skills might help me get employed, and get a taste of the current job market. What follows is my deconstruction of the program:

First and foremost, my cohort started in mid-March. The faculty had decided to move to a fully remote setting, which made our experience unique, particularly because the transition was so sudden. That being said, my experience may differ significantly from others’, as I know they were making some big changes to how they delivered the material.

As mentioned, classes started in mid-March. I had taken a little time to learn some Python prior to the class, so I wasn’t completely lost on day 1. It probably would not have mattered much, seeing as the curriculum was designed to teach people who were Python-naive, if not programming-naive. More important was my background in chemistry, which gave me reasonable exposure to statistics.

The first lesson was that old standby, “hello world,” followed by the classic “fizz-buzz” replacement exercise. From there, we covered data types, GitHub, and some basic command line work. That’s pretty much where the general programming education stopped and the data-specific education began.
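
For reference, a minimal fizz-buzz in Python looks something like this (the exact wording of the class exercise may have differed; this is the standard version):

```python
# FizzBuzz: print 1-100, replacing multiples of 3 with "Fizz",
# multiples of 5 with "Buzz", and multiples of both with "FizzBuzz".
for n in range(1, 101):
    if n % 15 == 0:
        print("FizzBuzz")
    elif n % 3 == 0:
        print("Fizz")
    elif n % 5 == 0:
        print("Buzz")
    else:
        print(n)
```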

It’s also important to note that the environment most data bootcamps and fellowships work in is either Jupyter Notebooks or RStudio. Neither of these environments is particularly similar or relevant to command line/production programming. They are intended for research and exploration, which is a good chunk of what happens at the beginning of any data project. Jupyter is particularly well suited for Exploratory Data Analysis (EDA); commenting can be done in separate markdown cells for clarity, instead of inline comments. But for production pipelines and real-time analysis, Jupyter and RStudio are maybe not the best options. If you want to go down that road, definitely get comfortable with text editors and the command line.
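
For a sense of what that notebook-style EDA looks like, here is a minimal pandas sketch (the file name and columns are hypothetical, just for illustration) — each line would typically live in its own Jupyter cell:

```python
import pandas as pd

# Load a (hypothetical) dataset and take a first look -- the kind of
# cell-by-cell exploration that Jupyter is built for.
df = pd.read_csv("housing.csv")

df.info()                 # column types and non-null counts
print(df.head())          # first few rows
print(df.describe())      # summary statistics for numeric columns
print(df.isnull().sum())  # missing values per column
```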

By the beginning of week 2, we were working with NumPy and pandas. By week three, we were into scikit-learn and statsmodels for linear and logistic regressions. About three weeks in, I realized that I had been doing basically all of this same stuff, just with worse tools and in a more limited scope. Pandas in particular was a revelation – I can’t imagine ever working in Excel again. Each week focused on something new – Bayesian methods, NLP, web scraping, SQL queries, clustering, decision trees, neural nets. We covered examples of all of the most commonly employed statistical learning techniques (aka models).
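
To give a sense of how low the barrier is with these libraries, a linear regression in scikit-learn comes down to a handful of lines. This is a sketch on synthetic data, not anything from the course itself:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data: y is roughly a linear function of two features plus noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression()
model.fit(X_train, y_train)

print("coefficients:", model.coef_)
print("R^2 on test set:", model.score(X_test, y_test))
```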

Perhaps most revelatory was the fact that hardly any of these techniques are new. Even neural nets are older than I am. What has changed is the data engineering and warehousing side of things – the technologies that make data both available and usable in the context of complex statistical analysis. On the availability side are technologies like Spark that allow for real-time access and manipulation of large data sets; on the usability side are new or improved programming libraries that make all of that linear algebra and matrix math as streamlined and conceptually intuitive as possible, while also offering more options for crunching those numbers on distributed hardware.
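
To make the availability side concrete, here is a minimal PySpark sketch (the file and column names are hypothetical) that reads and aggregates a dataset that would choke a spreadsheet:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("events-example").getOrCreate()

# Read a (hypothetical) large CSV and compute a grouped aggregate.
# The same code runs on a laptop or a cluster.
events = spark.read.csv("events.csv", header=True, inferSchema=True)

daily_counts = (
    events.groupBy("event_date")
          .agg(F.count("*").alias("event_count"))
          .orderBy("event_date")
)
daily_counts.show()
```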

So what did I get?

For me, the value was in the exposure to the tools and how to use them. A good chunk of the stats was already known to me, in a kind-of-sort-of-half-forgotten way, before I took on the fellowship. The refresher was nice, and then we quickly went into deeper statistical water than I had ever dared tread. Conceptually, I feel pretty good about it. Technically, well, let’s just say that NumPy, scikit-learn, and Keras/TensorFlow are awesome. I don’t think anyone is expecting me to start whiteboarding equations for eigenvectors of n-dimensional matrices. Being able to say what the eigenvector of a Principal Component represents and how it is applied is probably as far as I need to go down that road.
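
To make that concrete, here is roughly what “the eigenvector of a Principal Component” looks like in practice with scikit-learn, on random data standing in for a real feature matrix:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random data with one induced correlation, standing in for real features.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))
X[:, 1] = 0.8 * X[:, 0] + rng.normal(scale=0.3, size=300)

X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
pca.fit(X_scaled)

# Each row of components_ is the eigenvector defining one principal component:
# the direction in feature space capturing the most remaining variance.
print("component directions (eigenvectors):\n", pca.components_)
print("variance explained by each:", pca.explained_variance_ratio_)
```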

The pace of it was similar to working at a startup, and probably with good reason. The sheer volume of material for which I wrote working code was insane considering the 12-week timeline. Immersive is an accurate description. I lived and breathed Python, Scala, and SQL for three months. I even had dreams in Python, which is kinda weird and hard to explain, and the reason for the name of this blog.

One of the key things the program provides is job-seeker support: data banks of technical interview questions, career coaches, support resources, and best practices. The coaches are responsive and knowledgeable about recruiting practices. They’ll help with mock interviews, advice, cheerleading, networking, and resumes. They’ll critique your LinkedIn profile for you. They’ll keep tabs on who is hiring and warn you off known bad actors. The alumni tend to be pretty approachable as well, and frequently post openings at the companies where they work.

So, distilled down: I am comfortable using any of the techniques that I learned, with the caveat of “so long as I have access to the documentation and the time to figure out what some of it means.” Conceptually, I feel that I can explain almost everything we learned in lay terms, off the top of my head. I know how all the pieces work, how to put them together, and how to explain what it all means. Technically, I am in way over my head in actually understanding all of the math that underlies everything, but I think the whole point is that doing this kind of math by hand is so impractical that it might as well be impossible. Could I tell you step by step what the DBSCAN algorithm is doing? No. Can I tell you what it starts with, what it gives you, and the implications and assumptions that come along with it? Absolutely.
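
That inputs-and-outputs level of understanding is exactly what the libraries encourage. A DBSCAN run in scikit-learn, for example, on toy data:

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Toy two-moon data: a shape k-means handles badly but DBSCAN clusters cleanly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# What it starts with: a distance threshold (eps) and a minimum neighborhood
# size (min_samples). What it gives back: a cluster label per point, with -1
# marking points treated as noise.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("clusters found:", set(labels) - {-1})
print("noise points:", (labels == -1).sum())
```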

More than anything, the fellowship made me realize that a career in tech is a pretty active and engaged thing – you may end up with a job (in fact, you likely will if you put in the effort) that has reasonable work-life balance, pays a good salary, and is comfortably paced. But if you want to keep that job, you really need to be thinking about your career and your skills; you truly need to reorient your identity toward being a data scientist/analyst/whatever. Nothing in the field is particularly static, and staying connected, continuing your education, and practice, practice, practice are all needed to stay relevant. The program opened a door, but it led to a path, not a room. It made me think of a career as a process, not a series of jobs.

In the meantime, I have been learning new tools that seem to come up on most of my job applications, like Tableau and Microsoft Power BI. I’m spending time practicing SQL queries, since at least some knowledge of SQL seems to be pretty useful. I’m spending time on Codewars and HackerRank to keep up with Python and Scala. But more than anything, I spend time networking. LinkedIn seems to be the only game in town for recruiting during a pandemic. I try to add new contacts every day. I meet with friends in tech on Zoom to chat, get mentorship, and see which way things are going and which new techniques are good for what purposes. I collect referrals like a terrier after rats. I stalk companies that I want to work for, try to make connections with their employees, and follow their corporate officers. The more your name gets in front of people, the more you can control what people know about you and how they perceive you, and the more people see that you are engaged and interested in your field beyond just making a wage, the more offers and opportunities will come your way.
