From 5 PB of daily logs to thousands of ML models - Story of an ML pipeline

Scale
06/11/2018 - 11:00 to 11:40
Moon Lounge
long talk (40 min)
Intermediate

Session abstract: 

Machine learning is at the heart of the Criteo platform. We use it to decide which ad to display to which user, and when. Models based on user browsing history and advertiser attributes are trained offline, then used online to choose the ad shown to a given user.

In this talk, I will present Criteo's ML pipeline, which lets us read 5 PB of logs to train and deploy several thousand models a day, continuously improve the features of our models, and use these models online to predict which ads to display. Displaying 3.5 billion ads a day requires 500,000 predictions per second.

I will describe the problems we encountered and how we solved them. The pipeline combines open source technologies such as HDFS, YARN, and Spark with technologies built in-house for our ML algorithms, scheduling, testing, and monitoring.

Video: 

Slide: