A hitchhiker’s guide to data science for molecular biology
You are a molecular biologist who stumbled into this whole data science thing out of genuine interest but really somewhat by accident. You enjoy data analysis and want to continue, but you don’t know where to go after finishing your basic courses. In a way it feels like you traveled to a foreign country and now you need to understand the language just a bit better to be able to roam the place on your own. Let’s get right to it – here’s the travel guide for the data aficionado molecular biologist.
a country called data
Now let’s imagine you’ve traveled to this foreign country called data. Once you arrive everything is new, shiny and exciting. But after a week of sightseeing and trying new dishes you get somewhat tired of the precooked version presented to you by tour guides and the same old brochures for day-trippers. You want to know more. Unfortunately, all the other materials you can find are written in the strange local language and clearly assume you’ve grown up here. This makes it difficult for you to identify all the implicit knowledge you need to catch up on in order to make sense of it all and to get a grasp of what this country called data is all about. One evening, while you are trying to figure out whether this nearby city called ‘the elements of statistical learning’ is worth a visit, a slightly older, slightly more broke, but apparently confident and happy older-version-of-yourself person sits down next to you in the common area of your hostel.
“So you’re this biologist who’s about to visit all of statistics on his own?”, he says, “And your current approach seems to be to ‘comprehensively understand’ every aspect there is to it? I’m impressed you haven’t given up yet. And sorry to say but you must know by now that’s not gonna work.”
You dislike this guy on the spot and your expression must have shown it.
“Ok, ok, relax!”, he smiles, “Seems we gotta manage some expectations here. Bad news first: you’re not going to get to the same level of fluency everyone else here has. That takes half a life of growing up here. But there’s still a lot of interesting stuff to learn about that will allow you to do important and fulfilling things connecting the cultures. Also you’re already on the right track. So, if I may, let me point you to a few key sights to visit which will make you independent enough to explore further on your own.”
You’re still not sure why this guy should know any more about this stuff than you do. But in a way it seems as if he’s been in your place not too long ago. You look up from your pile of destinations and decide to give it a try.
“May I?”, he grabs a pen and one of the sheets of paper you’ve been reading, turns it around and begins to draw.
linking biology to data science
“The first thing to keep in mind is that you’d want to play to your strengths a bit. Data science projects can require extensive domain knowledge, and molecular biology is one such domain. For now, it’d make sense to narrow your search down to places which have some connection to molecular biology.”
“As you can see there are still so many places in the bio related space you can visit. If I were about to start in this country again I’d probably start in a place called ‘applied stats town’. This is where most of the data intensive methods used in molecular biology flow together. In fact there are some renowned tour guides I’d recommend getting in touch with if you want to find your way around town. Here are their contacts:”
- Josh Starmer, in his guided tour you get a fundamental understanding of the main concepts behind statistical and machine learning methods. With many examples from biology and of course with multiple *bam!*s. Highly recommended to get an overview and identify some gaps to fill.
- Richard McElreath has written the equivalent of Lonely Planet’s travel guides – just for statistics. Both the book and the lecture series do a great job in making the transition from descriptive statistics to model based statistics. This means you will treat your statistical models like little machines with clear engineering principles which define how they work rather than an obscure collection of tests. Don’t be discouraged if you don’t know what use all this strange talk about ‘Bayesian statistics’ is. Even if you never use Bayesian methods in your work, his tour will still teach you how to swim. So far, this is the resource from which I have learned the most about this place.
- There’s another travel agency in ‘applied stats town’: Modern Statistics for Modern Biology. They provide a unique collection of highlights for the molecular biologist interested in the statistical underpinnings for analyzing the data generated by modern experimental methods. I’d recommend skimming their program and booking only the tours you’re most interested in. It’s good stuff, their main audience just seems to be people who already know quite a bit about the topics.
You begin to realize that this is gonna take some time. You glance at the counter and discreetly order another round of coffee before putting on a smile and nodding to the guy appreciatively.
“Ah right, I see you’re interested”, he says, relaxing his focused expression into a smile as well. “That’s great, and we haven’t even started yet! The next place I’d highly recommend visiting is ‘clean code national park’. This place emerged from a camp of software developer dropouts who never fully let go of their cultic obsession with coding conventions. These people spend most of their time worshiping the almighty unit test and contemplating the cleanliness of code. It’s a strange and fascinating place! But even if you don’t sign up for their beliefs there’s a ton of practical advice on how to make your programming both more efficient and more effective which can be learned from their religious texts. And if you talk to them nicely they’re always happy to share their wisdom. I’d start by looking for someone who is fluent in one of the languages you’re already familiar with. I assume you mostly used R in your work so far because this is what biology and statistics people usually teach their kids. Maybe Python and bash too? These three dialects are what is spoken in the entire country of data. But the people in the bio related space still use R a lot. So I’d recommend at least exploring it a bit further so you’re able to talk to people. The area is huge but I can get you in touch with a few people over there if you want:”
- One thing to realize is that the inhabitants of clean code national park regard themselves as authors. This means they care a great deal about style and readability and organization of code in general. There’s a plethora of things to learn but I’d recommend keeping it simple at first. William Stafford Noble and Vince Buffalo are great to talk to first. Ask people from the Turing Way project if you want to learn about how to set up your work in a reproducible way.
- If you were raised in the base R world I’d highly recommend talking to Hadley Wickham first. He’s one of the leaders of the tidyverse tribe. A lot of the concepts forming their beliefs may seem strange and obscure at first. But under the hood these principles are a full-blown domain-specific language for how to ‘think data’. These collections of ‘grammar of’, such as ‘grammar of graphics’, teach you how to think about your data objects and scripts with the end in mind. Wickham also founded the advanced R monastery which is always worth a visit.
- Edouard Mathieu runs the cozy little library in clean code national park and can point you to some good resources if you want to advance your R skills.
- There is no need to be dogmatic about the language per se. People in clean code are a tolerant bunch. Personally I haven’t had too many encounters with the Python tribe yet but they can teach you valuable skills that are relevant if you want to visit machine learning beach or the deep learning mountain resort later on. Here are some people I talked to: Jose Portilla, Wes McKinney, Eric Matthes and some guy who introduced himself simply as google.
You begin to wonder if this guy receives some sort of commission from the travel agencies he’s mentioning… or if he’s ever going to stop.
“One of the key places I’d highly recommend visiting is the ‘communicate your work to others camp’. These people also tend to obsess about the quality of your writing. In contrast to the inhabitants of the ‘clean code national park’, though, they specialize in the proper use of our natural languages, mostly English, and the use of graphics for communication. But the main souvenir to take away from this is your ability to think clearly about your analyses. You’ll acquire this simply by practicing the skills they teach over there. Here are some people I know from the camp:”
- Claus Wilke runs an exciting theme park on data visualization. I’d also want to point out his annotated bibliography which can be found at the end if you’re interested in learning more. For example he advertises the school of thought put forward by Alberto Cairo. If you have never thought about the syntax composing your visuals these two people are a great start. Also some people from Nature have put together a lighter collection of useful data vis principles for biological data.
- They also have a lively market place which they, quite preposterously, call ‘the agora’. Nonetheless, if you stay away from the snake oil salesmen there is good stuff to be found. You could talk to Josh Schimel, Steven Pinker and Benjamin Dreyer to get started.
“On your way to the ‘communicate your work to others camp’ you pass by the ‘mathematical concepts rebel district’ and the ‘probability theory state’. I haven’t spent too much time there yet myself and these areas tend to be populated by immigrants from other regions in the country of data, not necessarily from the bio related space. Recently, there have been some political uprisings demanding a more rigorous treatment of the theoretical constitutions of the country of data in the curricula of students from the bio sphere. But don’t worry, as long as you navigate around the sensitive political issues the people over there are super helpful and always keen on describing the ideas underlying the methods you’ve met so far with rigorous and powerful abstractions. Three people I’d highly recommend getting in touch with to get familiar with the place:”
- 3Blue1Brown publishes shiny graphical videos explaining important mathematical concepts in an intuitive way. A great place to start exploring the district.
- Keith Devlin runs a small NGO helping students from the bio sphere make the shift from regarding mathematics as a collection of recipes and weird symbols to regarding it as a particular and powerful way of thinking. He calls this a program to develop ‘mathematical thinking skills’.
- Joe Blitzstein offers a thorough tour through the world of probability. You won’t need all of this content to find your way around the biological areas in the country of data. But if you really want to understand how your methods work from the ground up I’d highly recommend getting in touch with him.
Slowly, very slowly you begin to grasp the geographical logic of this strange country and you become curious how this all fits together in the end.
“You must have wondered by now how this all fits together in the end and how it links to your initial field of study, right?”, he asks. “In essence, science is all about asking and answering questions about the natural world. And the people in ‘lake statistical learning’ are skilled craftsmen in building computational tools which will allow you to ask and answer questions. I’d recommend visiting this place only after you roughly know your way around applied stats town. The best thing about lake statistical learning is that they have a lot of open workshop sessions where you can watch how real masters of the trade perform their work:”
- The most famous school of martial arts in ‘lake statistical learning’ is run by Trevor Hastie and his colleagues. Although their most famous course will be a bit tough for tourists from molecular biology, they also have a very nice introduction for beginners.
“And there are a bunch of places close by which lie progressively further outside the molecular biology sphere but which are also very interesting to visit. Especially since people back home in molecular biology have become aware of them only recently:”
- In ‘causal inference cliff’, you can learn how to tell causal statistical associations apart from non-causal ones, a topic the fellows from genetics ville have become very interested in. There is one person I know over there: Jonas Peters.
- ‘Machine learning beach’ is a beautiful place full of data and sun. Andrew Ng runs a batch bar there which seems to be the place to be. You can also have a chat with Andriy Burkov who usually hangs out there and is always happy to welcome newcomers.
- Finally, you could climb up all the way to the ‘deep learning mountain resort’. The mountain is home to a whole bunch of world famous mountaineers who are all professionals. But you don’t have to compete with them in order to learn a bunch of useful skills. I’d recommend starting with Rachel Thomas’s and Jeremy Howard’s fast track program. The skills you learn there are widely applicable but especially relevant if you’d want to stay in ‘omics integration county’ for some time afterwards.
your ticket home
He stops talking and looks out of the window, lost in thought, with a faraway look on his face. Then, suddenly, he seems to remember where he was and continues.
“You know, in the end, data science is really just some sort of hat we put on. The names of the methods are all fashion and most of the principles have been around for some time. They’re incredibly useful though, but still just a tool. So my last piece of advice is really to come back to your motivation, to why you came here in the first place. While exploration and learning new skills are important, in the end you probably were motivated by genuine interest in asking biological questions. Don’t forget to get a ticket home whenever you’re ready.”