Q&A with Tianyin Xu, visiting scientist at Facebook Core Systems and assistant professor at UIUC
In this monthly series of interviews, we put members of the academic community and their critical research in the spotlight – as thinkers, collaborators and independent contributors.
For May, we nominated Tianyin Xu, a visiting scholar from the University of Illinois at Urbana-Champaign (UIUC). Before Xu began his professorship at UIUC, he joined Facebook’s Core Systems Disaster Recovery team to explore real-world system applications. Visiting researcher positions are short-term employee (STE) roles sponsored by research teams and posted on the Facebook careers page.
In this Q&A, Xu shares his experience as a visiting scholar at Facebook, discusses the research projects he’s worked on so far, and offers advice to academics considering spending some time in industry.
Q: Tell us about your role at UIUC and the type of research that you and your department specialize in.
Tianyin Xu: I am an assistant professor in the Computer Science Department at UIUC. My research interests are primarily in computer systems, with an emphasis on software and system reliability. I am particularly interested in computer systems that operate at cloud and data center scale.
UIUC has a very strong, active computer science department with more than a hundred faculty members. With such a large department, we are strongly represented in almost every area of computer science.
Q: What inspired you to spend some time at Facebook Core Systems at the beginning of your professorship?
TX: A 6–12 month stay (a so-called prebbatical) is now a common practice for new assistant professors in computer science. I liked the idea too – it would help me take a break to be physically and mentally ready and, more importantly, give me time to think about the type of research I would like to do for my faculty job.
As a new graduate with a faculty job lined up, I was looking for an environment that was drastically different from the ivory tower. In particular, I looked for opportunities that would let me work on real large-scale systems and understand the issues that really matter. I believed such experiences would be invaluable to my growth as a systems researcher. A key question I am always trying to answer is, for example: “Why do existing systems still fail in practice despite rigorous software development processes and the wide adoption of reliability techniques?” Answers to such questions help me think clearly and make relevant technical contributions. However, it is difficult to answer such questions accurately and comprehensively in a purely academic setting.
Facebook Core Systems provides a fantastic environment in which I can gain firsthand experience of large production systems and develop a deep, comprehensive understanding of real-world challenges. The open culture gives me access to almost all resources and encourages me to connect with researchers and engineers with different levels of expertise and experience. One very special thing I found is the incredibly flat organization: everyone sits in the same open space and is close to each other, whether they’re a vice president, a director, or a Level 3 engineer. I was always seeking people out, walking over to their desks, asking them questions, and having great conversations.
Q: What is it like to be a visiting scholar at Facebook?
TX: The position offers the luxury of understanding large distributed systems from the inside out while thinking about fundamental research problems. Very few jobs offer both at the same time. I had a wonderful experience: I learned a lot (much of which can never be learned in an ivory tower), did really interesting research, had great fun, made strong connections, made very close friends, and ate too much gourmet (and free!) food.
Q: What research projects have you been working on?
TX: I’ve worked on two infrastructure systems, Maelstrom and Taiji. Maelstrom mitigates data center-level disasters by draining interdependent traffic safely and efficiently, and Taiji manages global user traffic for large-scale Internet services at the edge. We later published papers on both systems at leading computer systems conferences: Maelstrom at OSDI 2018 and Taiji at SOSP 2019.
One question I get a lot is why I didn’t choose to work on configuration management systems. Configuration management was my doctoral thesis topic, and it is what connected me to Facebook researchers (I met CQ Tang and the Configerator team at SOSP 2015, where they published the paper “Holistic Configuration Management at Facebook”). In fact, I had always thought I’d join the Configerator team.
When I finally showed up in Menlo Park in October 2017, CQ suggested that I meet a few teams in Core Systems to explore more ways to work together. At one of these meetings, I spoke to Kaushik Veeraraghavan and Justin Meza from the Disaster Recovery (DR) team. Kaushik posed an incredibly fascinating research problem: What can we do if an entire data center fails (due to fiber cuts, for example)? I had no answer, as none of the reliability techniques I could think of can handle failures of this magnitude. That was the problem Maelstrom was trying to solve.
When I joined the DR team, my original plan was to move to a new team after six months (so I could see different systems and research problems). In the end, however, I spent my entire prebbatical on the DR team because I enjoyed the work and my colleagues so much.
Q: What impact has your STE experience had on your research and teaching?
TX: This is also a common question I get! There are too many effects to list; trying to cover them all would overwhelm this interview. Let me give you a few examples.
A doctorate is more about depth. I worked on one research problem (misconfiguration detection and prevention) and explored it in depth to earn my PhD. After graduation, however, my knowledge was very narrow because I only knew my own subject. I asked myself, “How can I be a professor, who is supposed to have broad knowledge?” Yes, I had read papers on many other systems topics, but I often found it difficult to get to the bottom of problems by reading papers alone.
The STE experience helped me develop a direct, holistic understanding of many real-world problems and answer many of my questions and doubts. Additionally, my work on the Disaster Recovery team led me to understand different types of production systems and how each system fits into the whole (our mission was to prepare every system at Facebook for disaster). Based on this understanding and experience, I later created a new course at UIUC titled “Reliability of Cloud-Scale Systems”. It was a success: the course was rated “excellent” and received high praise for the relevance and importance of its materials.
The STE experience is also very beneficial to my research. In particular, it helps me think much more deeply about the practicality of my work, which is very close to my heart. For example, I took some time to rethink my doctoral thesis on configuration management in light of the configuration-related failures at Facebook. That rethinking led to my most recent project, configuration testing (also known as ctest), a more practical technique for preventing misconfigurations and production failures. The work will be published at OSDI 2020 and is supported by the Facebook Distributed Systems Research Award.
Q: What advice would you give university researchers who want to become visiting scholars at Facebook?
TX: I shared my experience of transitioning from PhD student to Facebook engineer internally because I learned a lot from it. It wasn’t easy at first. I suddenly felt that I was no longer good at what I did and that I lacked many skills. I later changed the way I worked at Facebook and started being effective and having fun. Here’s what I learned:
- Don’t work alone. Many doctoral students usually work alone because independence is required in graduate school. However, independence does not mean working alone. Working alone is a common pitfall, and teamwork is a key to success. If you want to understand something, don’t spend two weeks reading code and documentation by yourself. Instead, talk to a coworker and you will likely get things done a lot faster. You will be much more effective once you know how to work with people.
- Focus on impact, not papers. If you do a great job and make an impact at Facebook, you will have no problem getting your work published in top academic venues. This may not work the other way around. Note that impact is always much harder to achieve than a paper, and it is weighted heavily when you look for a job (for both academic and industrial jobs).
- Learn more than your project. Facebook has a very open culture and you can access almost all of its many resources. My advice is to take the opportunity to learn more than just your own project and fully understand the important systems and problems.
- Establish connections. There are many researchers at Facebook, senior and junior, well known and up and coming. Build your connections! I still keep in touch with my peers and mentors at Facebook and keep pestering them for feedback and advice.
- Make friends. I made a lot of personal friends at Facebook. The year I started at UIUC, my technical director, David Chou, flew all the way from Menlo Park to Champaign to visit and check on me.
Q: Where can people find out more about your research?
TX: Please visit my website for more information. If anyone ever wants to discuss anything about my research, they can always reach out to me.