This is a yearly update I send out to close friends and mentors. This installment is from March 2016.
This year, I've refined my academic and career focus and gotten very interested in using data science to tackle problems in biology. I'll share stories below from my summer at The New York Times, from my travel to Japan and elsewhere, and from my side project at school that is now being used by half the campus. Finally, I'd like to share links to some of the photos and music from 2015.
I've had another great year at Princeton. As a junior majoring in computer science, I've nearly completed my degree requirements. Now I have the chance to refocus my remaining class slots.
For the last five years, the theme behind my studies and work has been to apply computer science to other fields. I've narrowed my computer science interest to data science in particular. The popular definition of "data scientist" is someone who is better at statistics than any software engineer and better at software engineering than any statistician... so my software engineering experience has being quite helpful in this pursuit! And I've moved between several application fields -- from neuroscience, to economics, to politics, and now to tackling problems in biology and healthcare using data science methods.
A combination of conversations, classes, and readings sparked my recent interest in biology and healthcare. I took a wonderful seminar taught by Professor Shirley Tilghman, a renowned biologist and the previous president of Princeton, that focused on genetics and public policy. We discussed fascinating topics ranging from eugenics, to the role of genetics in the criminal justice system, to direct-to-consumer testing (e.g. the 23andMe--FDA saga), to gene therapy and genetic editing. At the same time, I was having conversations with computational biologists on campus about their work, and reading books like The Emperor of All Maladies, which presents the history of cancer, cancer research, and cancer treatment in such an interesting way; I highly recommend it.
The current semester is a crash course, accompanied by many conversations from which I am trying to ascertain a zeitgeist of biotech:
- On the biology side, I'm doing a deep dive into genomics, specifically learning about how sequencing works and about computational analysis methods tailored to biology; I think it's going to be very helpful to have the intuition for how biological data is actually produced from my genomics seminar.
- I'm also taking a class where we discuss and critique papers from the relatively young field of neuroimmunology, which investigates how the immune system affects the brain, and vice-versa. In the pursuit of learning to read scientific papers with a critical eye and practicing how to communicate them accurately to the public, I'm learning, as a byproduct, about the possible connection between the gut microbiome and autism, about the terrible science communication that spawned the anti-vaccine movement, and more.
- Finally, on the healthcare policy side, I'm taking a class on the history of health reform in the U.S. and the workings of the Affordable Care Act -- what better way is there to be introduced to the complexities of our political system than to jump right into the mess of a debate that is health reform? (Watching House of Cards has helped, too.) It's particularly interesting to consider the subtle challenges in and strategies for patching a big existing ecosystem.
(Other classes I took over the last year focused on machine learning and statistical theory, artificial intelligence, compilers, microeconomic theory, and linguistics.)
So far, this interest is still quite broad. But as I continue along this dive into biology and healthcare, I'm tracing the connections to technology, and specifically looking for the problems I can solve with a data science approach. This school year, I've been working on a research project with my good friend Andrew and with Professor Barbara Engelhardt, whose work lies in the intersection of computer science and biology. Specifically, we are focusing on improving experimental design for CRISPR experiments.
CRISPR is a new, revolutionary technique that makes gene editing straightforward. To edit DNA with CRISPR, all you have to do is design a guide sequence -- made of RNA -- that will guide CRISPR to the location in the genome that you want to edit. Basically, the guide RNA will bind with the piece of DNA that you want to edit, and then the system's Cas9 protein will cut the existing genome at that location. If you also put some new DNA nearby, built-in DNA repair mechanisms will incorporate it to fill the gap. Biology researchers are jumping all over this simple and general gene editing technique. Democratizing gene editing will transform the next 50 years the same way democratizing technology and the Internet have transformed the past few decades. Of course, CRISPR is now the subject of a huge patent war between Stanford, Berkeley, MIT, Harvard, and the Broad...
The challenging part in using CRISPR is designing a guide sequence to match the DNA portion you would like to target but not make changes elsewhere in the genome (where similar sequences might appear, perhaps). Researchers have to meticulously tune their RNA sequence construction until the CRISPR system is able to bind precisely to a piece of the gene to be edited. This problem is very similar to tuning parameter settings in many other experimental fields of science (or, for that matter, in machine learning).
Today, biology labs try to do a "grid search" through the parameter space, meaning they slowly try all possible guide sequences (parameter settings) until they find one that works well. A more intelligent technique, named Bayesian optimization, has been developed for the analogous parameter tuning problem in machine learning. Instead of naively trying all parameters, this technique decides which next experiment to run -- which parameter settings to try next -- by estimating the expected improvement of a new set of parameters or by trying to gain as much information about the problem as possible through the next experiment. While Bayesian optimization works quite well when you have a single-digit number of parameters, it does not scale. CRISPR guide sequences are 20 bases long, and physical experiments often call for even more parameters. Thus, our goal is to scale this intelligent experimental design algorithm to work for higher-dimensional parameter spaces. Moreover, we'd like to produce something that researchers could use interactively in their lab to accelerate science.
I spent summer 2015 working on the data science team at The New York Times, run by Professor Chris Wiggins, a professor at Columbia and a long-term mentor of mine. The team embeds out into different business and newsroom divisions to improve the ways in which content reaches readers and to keep The Times in business. My understanding of our job was to be evangelists of data-informed decision making -- an interesting and creative task at a 160-year old company like The Times -- and to introduce some more skepticism into the product design process, i.e. to design and run experiments to evaluate hunches and inform business decisions.
I was offered several projects when I joined but instead forged my own path slightly after chatting with many newsroom and business people from across the company. I focused my summer on a particular customer retention challenge as well as on a long-term strategy experiment.
Working with Chris Wiggins and his team was a lot of fun, and being on the data team at NYT was truly a great learning environment, in many ways. I practiced how to reframe business problems as machine learning challenges and how to make the output of statistical analysis interpretable to business leaders -- a task that involves careful evangelism of data science in an organization still adapting to the digital world. Chris Wiggins, whose research now focuses on biology, helped me understand how to do this reframing in the biology context as well, which I think will be helpful with my new interest. On a meta level, working cross-functionally / across the organization at NYT and chatting a lot with Chris about his philosophy in team structure and orientation gave me good mental models for how to organize data science teams and taught me about what powers team dynamics.
Besides all that, it was a thrill to spend 10 weeks at The Times. My hope was to immerse myself in the culture and understand the ethos of the people. It's hard to convey the subtleties I noticed, but there were a few particularly memorable moments:
- I visited the Morgue (it has its own Tumblr), a basement room the size of a large 1-bedroom that functions as the archive of the paper; it was fascinating to see how the journalists do historical research and how exactly a print archive is used, not just in writing obits but in tracking down photographs hidden away in an out-of-print book no one else has that are suddenly relevant to a story today, for example.
- There was a visit to the printing press -- I am a sucker for factory tours. (Some photos included at the end.)
- We had conversations with NYT lawyers describing exactly how the paper defends its freedom of speech, and how that can be much more difficult in countries like Russia.
- There's a beautiful Steinway concert grand piano hidden away in a storage room in the building -- not enough do I get to play on a piano like that! Sad to see it unused; it made for some fabulous lunch breaks, and makes me want to keep up my unused piano sharing project.
- Finally, Chris Wiggins introduced me to several journalists who invited me to sit in on their meetings. Behind the scenes at a paper like The Times, after the stories have been written, there starts an intricate game of when to pitch each story. News desks used to be separate, warring fiefdoms (there are several rich histories of The Times that have been published, if you are interested in reading more about this); today, they are much more collaborative, but there is still a struggle over which stories get front page -- or more importantly, home page treatment.
A colleague and I started a weekly group meeting to individually work through tutorials on technologies we wanted to learn, ranging from Git (which I had used for ages but never sat down and learned fully/correctly) to survival modeling. I'm now starting a similar "doing group" at Princeton to get my hands dirty and learn more about the following:
- Hadoop, Spark, ETL pipelines, database internals, message buses and queues
- Docker (more on this in a moment)
- Build tools, continuous integration
- Common stacks for high-throughput applications, and building systems at scale
- Load testing
- Monitoring best practices, and more
The highlight of the year was a trip to Japan in August-September. My girlfriend Shannon, who is in the same year at Princeton and studies environmental science and environment studies (a major she designed herself), speaks Japanese, so we were able to visit some fascinating places. After Tokyo and Kyoto, we traveled to some beautiful rural mountain villages like Takayama. I recently took a class on 20th century Japanese history, so it was particularly interesting to see Japanese lifestyles first-hand. We even witnessed an anti-rearmament protest -- another cultural subtlety I would have missed without a translator like Shannon!
I also visited friends in Boston and Toronto over school breaks. Most recently, Shannon and I went on a beautiful road trip from the Bay Area to Ashland and Portland, Oregon over Intersession (aka ski week in late January, which exists because Princeton is still in the stone age and has finals after the winter holidays). I'm including pictures at the end!
With a couple of friends, I launched a webapp at Princeton this year that took off. It's called ReCal, and it helps students pick their courses and design their schedule in a sleek and intuitive way. It so happened that the university decided to pay a vendor an ungodly amount of money to make its own version, called TigerHub, that, frankly, is incredibly confusing and annoying to use. In fact, leading with "Frustrated with TigerHub?" has gotten us to over 3,000 users making schedules in ReCal (and has made some administrators bitter!).
ReCal used to be a class project for Brian Kernighan's COS 333. Then two of my friends from our class team isolated the class selection component and made that the new ReCal. I rejoined them to finish the project and especially to coordinate the launch. Having wanted to get some more operational experience, it was fun to handle PR and marketing and focus less on the technical aspects.
Now I'm broadening my involvement, for several reasons: I don't want to paint student creations in a bad light for the administration, my friends are graduating soon, and I'd like to make ReCal less of a hack and more of a killer app. Some institution needs to be involved for ReCal not to suffer the fate of nearly every other Princeton app and fade after I graduate. As a new side project, I'm playing developer advocate by designing a sustainable, long-term hosting and maintenance solution and brokering it between the undergraduate student government and the computer science department. This will keep ReCal alive and will make it far easier for students to launch apps on campus. Since I've been playing with Docker a lot recently, I'm building the system on top of Docker containers. A Docker container is just like a virtual machine but without the overhead -- you can run many Docker containers on the same machine and they will share resources nicely. That means we can containerize all student apps and standardize how we monitor, backup, and run maintenance on them all. Long story short, it's a fun project for me to train my dev chops a bit further, and it will help keep my side project alive.
Finally, earlier in the year I worked on a side project with Professor Sam Wang, who does autism research by day and political forecasting by night. I found his work when I noticed a Twitter war between him and Nate Silver of FiveThirtyEight. We decided to look for ways to detect and quantify gerrymandering that are so simple a judge/the legal system could understand them. Though we did not publish our data analysis work, Prof. Wang has separately published an interesting law journal paper with statistical metrics of gerrymandering that is summarized in NYT pieces (1) and (2). Worth a read!
Goals and habits
The excitement of being at Princeton has started to wear off, unfortunately. I'm focusing again on my routines to build a stable lifestyle and continue following my excitement. My Mastermind group with friends Andrew and Fiz is continuing -- we meet weekly to help each other think through long-term goals and build habits, and especially to continue living deliberately despite the treadmill that is the Princeton experience. I'm currently doubling down on my fitness, piano, and reading habits.
I made good progress on my goals from 2014. First, after a maximum-entropy search process, I've found a new specialty: biology. If that ceases to interest me, I'll continue following the path of things that seem most exciting, and may take a gap year (since I have two extra years working for me) to read broadly and travel. Second, I've continued to go deeper into machine learning and data science through classes, the summer, and by getting my hands dirty on my own. Finally, I've had more opportunities to hone my operational skills: it was nice to live on the business side at NYT and to handle product management and PR for ReCal.
I'm setting some new goals for 2016:
- Continue refining my specialization -- especially since I will need to decide what to do after Princeton. I want to work further to narrow my interest in biology and healthcare and identify specific research areas of interest. This will involve getting a good understanding for the landscape of biotech in industry and academia. I am working on this now by engaging in many diverse conversations and trying to formulate a mental model for the landscape. This summer I am hoping to design an internship or research experience that will help me choose an approach for my final year at Princeton (in terms of what to write a thesis on) and beyond.
- Dedicate a lot of time in my last year at Princeton to meet more people from diverse fields. The present transition to biology has had the side effect of exposing me to people from walks of life quite different from my own; I'd like to find a way to continue doing that on a regular basis. Here's one idea: I used to attend a discussion group at UCSD called "Dangerous Ideas", and I'm toying with setting up a similar group at Princeton where students with a diverse set of interests (at most one or two people from any department) can share the most interesting things they're learning about and the philosophies of their fields that others are not attuned to.
- Rekindle my involvement in music and my reading to become more cultured. It would be a shame to drop music after making it such a central part of my childhood, and I'm sorry to have focused less on it over the last two years.
I'm so happy to be where I am now, and I'm very excited to see what this next year holds in store. I am very grateful to all my close friends and mentors for their kind support and advice. If you have any feedback, I would appreciate it if you could please send it my way!
Thank you for being a part of this chapter of my life, and all the best.
P.S. I always include some multimedia at the end. Here are some pictures from 2015 (click the info button to see descriptions on individual photos). And here is a curated Spotify playlist with some of the music I've been listening to.
Finally, here's a selection of the books I read over the past year:
- The Wind-Up Bird Chronicle, Haruki Murakami
- Snow Crash, Neal Stephenson -- wonderful audiobook
- The Emperor of All Maladies: A Biography of Cancer, Siddhartha Mukherjee
- Stiff: The Curious Lives of Human Cadavers, Mary Roach
- The Man Who Mistook His Wife For a Hat, Oliver Sacks
- The Martian, Andy Weir
- Anthem, Ayn Rand
- The Box: How the Shipping Container Made the World Smaller and the World Economy Bigger, Marc Levinson -- featured by Bill Gates
- Blue Ocean Strategy, W. Chan Kim and Renée A. Mauborgne
- How to Read a Book, Mortimer Adler and Charles Van Doren