Q&A with Stanford Professor Balaji Prabhakar, longtime academic collaborator with Facebook

In this monthly series of interviews, we put members of the academic community and their critical research in the spotlight – as thought partners, collaborators, and independent contributors.

For June, we are highlighting Balaji Prabhakar, a professor at Stanford University. Through the Stanford Platform Lab, Prabhakar, along with several other Stanford faculty members, has been an active member of the Facebook research community for over a decade.

In this Q&A, Prabhakar shares what the early days of research collaboration with Facebook were like and how he sees the partnership going forward. He also discusses his latest project with Facebook, On-Ramp (NSDI 2021).

Q: Tell us about your role at Stanford and the type of research you specialize in.

Balaji Prabhakar: I am a professor of electrical engineering and computer science; my title is VMware Founders Professor of Computer Science. I also have a courtesy appointment at the Stanford Graduate School of Business. My research is mainly in data centers and cloud computing. Early in my career I worked on network theory, algorithms, and protocols; now I do much more systems work, building larger systems rather than algorithms that sit inside a few subsystems.

I also worked with Facebook on reducing road traffic. This was around 2011, with the team responsible for transportation at Facebook. We discussed using incentives to encourage Facebook commuters to travel or take shuttles during off-peak hours, basically to reduce traffic on the street: a different kind of congestion than the one in computer networks.

Q: When and how did your relationship with Facebook begin?

BP: It started over a decade ago. Facebook had an office in downtown Palo Alto, likely its first office after the house where the company started. At the time, Facebook was growing rapidly and the social graph was growing almost ten-fold, literally, every few months, a gigantic rate of growth. So the question was: what kind of systems (databases, computers, network infrastructure) are needed to support this growth? What would the backend look like at scale? The initial networks were all one gigabit per second, but what happens when we move to higher speeds like 10 or 40 Gbps? That was the original conversation.

At the time, we had the Stanford Experimental Data Center Lab, of which I was the faculty director. The Stanford Platform Lab (SPL) developed from it. We spoke to Jeff Rothschild, who at the time was overseeing much of the technology work at Facebook. I think Mike Schroepfer had just been hired, so it was very much the beginning. On the Stanford side, my colleague John Ousterhout was beginning to formulate the RAMCloud project. Our colleagues in the lab at the time were Mendel Rosenblum and Christos Kozyrakis. Christos became the faculty director of the SPL this year.

So that was the very first conversation, and it led to our first collaboration; pretty quickly, it also led to a couple of PhD students interviewing there. I remember that Berk Atikoglu, my former PhD student, received an offer to work at Facebook. What he learned during that experience later became the basis for his dissertation.

Read the paper

Workload Analysis of a Large-Scale Key-Value Store

At the time, Berk and his mentor were looking for ways to improve the performance of Memcached, which is important to Facebook’s infrastructure. Fortunately for Berk, Facebook’s office was on Cambridge Ave. near his dorm, so he could easily work at Facebook while staying in touch with Stanford. His thesis work, which he published as a conference paper, analyzed the Memcached workload.

We have had fruitful interactions with Facebook like this one on several fronts since the early days. Because of its size and the breadth and depth of the technical issues the company deals with, many of my faculty colleagues also work with Facebook.

Q: Tell us about your latest collaboration with Facebook, On-Ramp.

BP: On-Ramp is described in the USENIX NSDI 2021 paper “Breaking the Transience-Equilibrium Nexus: A New Approach to Datacenter Packet Transport”. The collaboration with Facebook started after the paper was accepted to NSDI 2021 but before it was published. We worked with Facebook to test our solution in a real testbed, and the results were incorporated into the final paper. There was quite a bit of work on the Facebook engineering team’s side to take something from our side and try it out, for which we’re really grateful.

Read the paper

Breaking the Transience-Equilibrium Nexus: A New Approach to Datacenter Packet Transport

What is On-Ramp? It is a way of reducing congestion in the network by using very accurately synchronized clocks. With precisely synchronized clocks, which we developed at Stanford, you can measure exactly how long a data packet has been in the network. This gives a very accurate way of measuring congestion, down to the sub-microsecond range. The challenge is that as Facebook’s networks get faster and faster, the time you have to react to congestion keeps shrinking.
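The measurement idea can be illustrated with a small sketch. This is not the paper’s implementation; the Packet, now_ns, and DelayMonitor names are invented for this example. The code simply computes a one-way delay from a sender timestamp and a synchronized receiver clock, and treats anything above the smallest delay seen so far as queuing (congestion) delay.

    import time
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class Packet:
        payload: bytes
        send_ts_ns: int  # stamped by the sender's synchronized clock

    def now_ns() -> int:
        # Stand-in for a precisely synchronized clock source.
        return time.monotonic_ns()

    class DelayMonitor:
        """Estimates queuing (congestion) delay from one-way packet delays."""

        def __init__(self) -> None:
            self.base_delay_ns: Optional[int] = None  # lowest delay seen; roughly the uncongested path

        def on_receive(self, pkt: Packet) -> int:
            owd_ns = now_ns() - pkt.send_ts_ns            # one-way delay, meaningful only with synced clocks
            if self.base_delay_ns is None or owd_ns < self.base_delay_ns:
                self.base_delay_ns = owd_ns               # propagation/processing floor
            return owd_ns - self.base_delay_ns            # excess delay, i.e., time spent in queues

The key assumption is the synchronized clocks: without them, the difference between two machines’ timestamps would not be a meaningful one-way delay.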

It’s basically the same concept as on the road: the faster vehicles go, the less time the driver has to react and stop if necessary. The problem in data networks is exacerbated by the fact that as network speeds increase, the buffer memory in switches effectively shrinks, because it is too resource-intensive to provide a lot of buffering at these very high speeds. It’s kind of a double whammy: speeds have increased and the space to store packets in switch buffers has decreased. So congestion has to be detected quickly and accurately, and stopped immediately.

Because On-Ramp measures congestion on network paths so quickly and accurately, it can hold traffic back from entering the network as soon as it detects congestion, hence the name On-Ramp. It’s like metering at highway on-ramps: the traffic lights let cars onto the highway more slowly when congestion is heavy and more quickly when traffic is light. It’s pretty much the same idea: if I can measure congestion on the network path very quickly because I have accurate clocks, then I can hold traffic at the edge of the network or let it go, depending on whether there is congestion.
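Continuing the illustration above, and again only as a rough sketch rather than On-Ramp’s actual algorithm, an edge “gate” could pause and resume transmission based on the measured queuing delay; the EdgeGate class and the 10-microsecond threshold are assumptions made up for this example.

    THRESHOLD_NS = 10_000  # example value: pause when ~10 us of queuing delay is observed

    class EdgeGate:
        """Holds traffic at the network edge while the path looks congested."""

        def __init__(self, threshold_ns: int = THRESHOLD_NS) -> None:
            self.threshold_ns = threshold_ns
            self.paused = False

        def update(self, queuing_delay_ns: int) -> None:
            # Like a metering light: red while the measured delay is high,
            # green again once the path drains below the threshold.
            self.paused = queuing_delay_ns > self.threshold_ns

        def may_send(self) -> bool:
            return not self.paused

A sender would feed each new queuing-delay estimate, for example from the DelayMonitor sketch above, into update() and transmit new packets only while may_send() returns True.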

The first tests were done by Facebook engineers for the paper, and the results were so encouraging that an intern is now going to Facebook this summer to work with Facebook’s network engineering team. He will run tests in different scenarios and in different parts of the Facebook infrastructure, examine the effectiveness of On-Ramp, and figure out how to solve some of the delicate problems that arise in production.

Facebook is very, very good at absorbing scientific ideas, trying them out at scale, and, if they work, implementing them. Oftentimes, students go to companies and come back with really interesting ideas, but not necessarily with a path to deployment like the one they find at Facebook.

Q: How do you see Facebook and Stanford working together to solve the challenges of future data center needs?

BP: In the beginning, the relationship was mostly about major infrastructure design challenges. Then, when we had some ideas, Facebook was ready to let us try things, even on their production network, which was great. And then Facebook joined our industry partner program and funded our research. Internal champions are critical to making university collaborations really work, in both directions, so I want to acknowledge people like Jay Parikh and Omar Baldonado. Manoj Wadekar opened up the latest engagement between us. When the direct relationship between engineering teams at Facebook and researchers at universities falls away, the glue that binds the collaboration together sometimes loosens.

In general, Stanford is in a fortunate position given where it is located. We work with a number of companies. We are also fortunate that some of our former students now work at these companies. Going forward, I’m interested in Facebook engineers spending time at Stanford on a part-time or sabbatical basis. Omar Baldonado of the Facebook networking team recently spent a few weeks at Stanford and it was a huge success. With Stanford so close to Facebook headquarters, I hope this is easy to do. Facebook people can attend group meetings and discussions at Stanford, describe key next-generation challenges that Stanford researchers could address, collaborate with them, and possibly feed successful results back to Facebook for implementation.

In terms of new projects to work on in the future, many of the infrastructure challenges are in the direction of either machine learning or data. Several of my colleagues are working on problems related to building systems for machine learning or, conversely, using machine learning to build better systems. These are going to be big topics in the near future, and I think they’re interesting for universities as well as for companies like Facebook.

The pandemic has revealed a few other things as well, including that video communication is here to stay. That’s the future, and as a big tech company, Facebook has a role to play in it. After all, a core component of Facebook’s mission is helping people connect and communicate, and if communication shifts to video, then I think it makes sense for Facebook to focus on that. Large-scale video communication at the group level presents some very interesting technical challenges.

Q: Where can people find out more about you and your work?

BP: People can find out more about my research on my website. To learn more about the Platform Lab, visit our website.
