Headquartered in New York City, Dataiku was founded in Paris in 2013 and achieved unicorn status in 2019. Today, more than 1,000 employees work across the globe, in our offices and remotely. Backed by a renowned set of investors and partners including CapitalG, Tiger Global, and ICONIQ Growth, we’ve set out to build the future of AI.
Internship goal
Enhance the Data Drift computation feature to address clients’ production use cases: data drift on big data, drift on images, drift on LLMs, etc.
Detailed description
Once deployed in production, AI models need to be monitored in order to ensure their consistency over time. In Dataiku, such monitoring is done through the Evaluation Recipe.
In the Evaluation Recipe, we compute:
Drift metrics: Is the input data distribution different from that of the training dataset? What about the distribution of the predictions? Which features are drifting, and how? Drift metrics quantify the answers to these questions.
Performance metrics: Given the ground truth for a prediction, we can score the model and compute metrics such as Accuracy, F1 Score, Precision, etc.
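To make the idea of a drift metric concrete, here is a minimal sketch of one widely used measure, the Population Stability Index (PSI), for a single numerical feature. The posting does not say which metrics the Evaluation Recipe actually computes, so the choice of PSI, the function name `psi`, and the bin count are all assumptions made for illustration.

```python
import math
import random


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bins are equal-width over the range of the reference sample
    (an illustrative choice); current values outside that range
    are clamped into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width for constant data

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)
            counts[idx] += 1
        total = len(values)
        # eps keeps empty bins from producing log(0) or a division by zero
        return [max(c / total, eps) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


# Synthetic data: "production" values shifted by one standard deviation.
rng = random.Random(0)
train = [rng.gauss(0.0, 1.0) for _ in range(5000)]
prod_same = [rng.gauss(0.0, 1.0) for _ in range(5000)]
prod_shifted = [rng.gauss(1.0, 1.0) for _ in range(5000)]

score_same = psi(train, prod_same)
score_shifted = psi(train, prod_shifted)
```

A score near zero suggests the two samples follow similar distributions, and larger values indicate stronger drift. Any alerting threshold would have to be chosen per use case.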
In a real production environment, clients most likely do not have the ground truth, so only the drift metrics can be computed. This makes drift metrics essential in MLOps.
As of today, drift is almost always computed on samples of the training and production data. However, the samples might not be representative of the full data, and you can end up with misleading results. On the other hand, running the computation on all the data might be far too expensive in computation time. We need to find a smart solution to this problem, e.g. adaptive sampling, a metric to evaluate the relevance of a sample, allowing drift computation on the whole data while disabling some model evaluation capabilities, etc.
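One possible direction, sketched here purely as an illustration and not as the approach Dataiku uses or intends, is to avoid sampling by streaming the full data in chunks and keeping only per-bin counts. Memory then depends on the number of bins, not on the number of rows. The helper names (`make_edges`, `accumulate_counts`, `psi_from_counts`) and the use of PSI as the drift score are assumptions.

```python
import math
from bisect import bisect_right


def make_edges(lo, hi, n_bins):
    """Interior edges of n_bins equal-width bins over [lo, hi]."""
    step = (hi - lo) / n_bins
    return [lo + i * step for i in range(1, n_bins)]


def accumulate_counts(chunks, edges):
    """One pass over an iterable of chunks, keeping only bin counts.

    Values below the first edge fall in bin 0 and values at or above
    the last edge fall in the last bin.
    """
    counts = [0] * (len(edges) + 1)
    for chunk in chunks:
        for v in chunk:
            counts[bisect_right(edges, v)] += 1
    return counts


def psi_from_counts(ref_counts, cur_counts, eps=1e-6):
    """PSI computed from two histograms that share the same bin edges."""
    ref_total = sum(ref_counts)
    cur_total = sum(cur_counts)
    score = 0.0
    for r, c in zip(ref_counts, cur_counts):
        rp = max(r / ref_total, eps)
        cp = max(c / cur_total, eps)
        score += (cp - rp) * math.log(cp / rp)
    return score


# Usage: chunks could come from any reader that yields batches of values.
edges = make_edges(0.0, 10.0, 5)
ref_counts = accumulate_counts([[1, 3], [5, 7, 9]], edges)
cur_counts = accumulate_counts([[9, 9, 8], [7, 9]], edges)
drift = psi_from_counts(ref_counts, cur_counts)
```

Because the counts are additive, chunks can also be processed in parallel and merged by summing the count lists. The trade-off is that the bin edges must be fixed before the pass starts.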
Also, only numerical and categorical data are currently supported for drift computation. Other feature types are becoming important, such as images for deep learning models and text for LLMs, and they need to be supported.
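For unstructured features, a common starting point is to map each image or text to an embedding vector and compare the two sets of vectors. The sketch below is a deliberately simple, hypothetical example: it assumes embeddings have already been produced by some model (not specified in this posting) and uses the cosine distance between the two centroids as a crude drift signal. The function names are illustrative only.

```python
import math


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def embedding_drift(reference_embeddings, current_embeddings):
    """Cosine distance between the centroids of two embedding sets."""
    return cosine_distance(
        centroid(reference_embeddings), centroid(current_embeddings)
    )


# Toy 2-dimensional "embeddings" standing in for real model outputs.
reference = [[1.0, 0.0], [0.9, 0.1]]
current = [[0.1, 0.9], [0.0, 1.0]]
score = embedding_drift(reference, current)
```

A centroid comparison only detects a shift in the mean embedding. Changes in spread or shape would need a richer comparison, such as a classifier trained to tell the two sets apart.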
During this internship, you will:
Get familiar with Dataiku and the Evaluation code base
Research a way to compute drift efficiently on big data
Suggest ways to integrate this research into the existing evaluation feature
Develop the suggested feature, implementing both frontend and backend
Possibly extend data drift computation to other feature types: image, text, etc.
Stack