Software project name: Scikit-Learn
Source code :
Project lead(s) : Gaël Varoquaux (Inria), Olivier Grisel (Inria)
Candidate's last name & first name : MENSCH Arthur
Laboratories : Inria, Neurospin
Managing laboratory : UPSud
Teams involved : Parietal, Neurospin
Duration and dates of the assignment : October 2016 - October 2017
Context and Objectives :
This project proposes a set of improvements and a code audit for scikit-learn.
scikit-learn is an open-source Python library, published under the BSD license, that provides an API accessible to non-experts for running state-of-the-art machine learning algorithms on arbitrary datasets. It relies on the well-known numpy and scipy toolboxes to perform linear algebra routines and to represent data in memory. It aims to be easy to use, cheap to maintain and efficient. This is made possible by the use of Python, a language that is easy to learn for inexperienced programmers and concise and readable for codebase maintainers. Performance is kept high, and comparable to that of compiled-code libraries, through the occasional integration of compiled code, relying either on the library's own Cython code or on open-source non-copyleft libraries. The work done within this project will improve the quality of several key functionalities of the library, in terms of both performance and API. Specifically, this project focuses on incremental and stochastic learning algorithms.
Project Content :
This work involves three directions for improving the scikit-learn library, which have been identified as crucial for its development.
Stopping criteria
A large-scale homogenization will be performed on scikit-learn estimators that involve a stopping criterion. At the moment, many estimators based on stochastic solvers use different criteria to determine convergence, without giving the user much insight into these criteria. This is undesirable, as it encourages users to treat estimators as black boxes and to accept estimator output without questioning the convergence of the underlying model. Some estimators track the change in iterate value, others the change in objective function value (i.e. training cost) and gradient value (e.g. the stochastic average gradient solver in the Ridge / LogisticRegression estimators), and others use the duality gap (e.g. the Lasso / ElasticNet estimators) as an indicator of convergence. No clear indication is given to the user of how the stopping criterion is built from the specified tol keyword, or of how it relates to the estimator model. Worse, some optimizers (e.g. the SGDClassifier estimator) only allow the user to specify a number of epochs to perform on the data, and offer no control over algorithmic convergence.
Early stopping
Recent work on online solvers (especially in the field of deep learning) has shown the benefit of early stopping, namely evaluating a stopping criterion on a validation set instead of the training set, so as not to lose time refining parameters that are already good enough for the model to generalize to unseen data. Estimator control based on validation data is currently not implemented in scikit-learn, although it would be greatly beneficial for large-scale applications and for algorithmic performance (for stochastic solvers, it has been shown that learning rate scheduling based on validation data yields a better goodness of fit in many cases).
Memory management
Online solvers typically rely on keeping a few small arrays in memory and updating them as data is streamed. In scikit-learn, this logic is exposed through the partial_fit method. Using this method typically requires users to set up their own pipeline to stream a data source to an estimator, and to control the transfer of data from disk to memory themselves. At the moment, work is in progress to better handle objects that are key to such transfers, namely memory maps in numpy, and to provide automatic loading of data from disk for online algorithms (e.g. MiniBatchDictionaryLearning). The present proposal involves continuing this work and providing new tools for handling data flows from disk to core memory.
Method and organization
The student selected for this project will be expected to review existing pull requests related to this project and to provide fresh code and documentation. He/she will follow the rules of the scikit-learn community, which involves providing readable, tested and maintainable code, as well as relevant documentation and examples for the new features that will be designed. The successful applicant will be required to have a good knowledge of scientific Python and experience with incremental/stochastic algorithms in machine learning. He/she will be allowed to work in relative autonomy, but will provide regular feedback to his/her mentors. Bi-weekly meetings will be set up to assess the progress of the project. The student can expect help and support from both his/her mentors and the large scikit-learn community. He/she will use the GitHub interface to submit work to the community and to review submissions.
Expected results and impact for the community :
Stopping criteria
The student selected for this project will audit the estimators that rely on online solvers, and list all the different stopping criteria used throughout the library. A unified API will be provided, with a max_iter keyword acting as a safeguard (necessary in time-constrained settings) and a well-defined tolerance criterion. By default, this criterion should relate to the function value; the user could instead set it to relate to the iterate value, which is usually a harder problem. Moreover, the documentation will be enriched to explain clearly how online solvers behave, what their stopping criteria consist of, and how to select the most appropriate solver. This will involve writing a substantial, didactic section in the user guide on how and when to use these solvers for typical machine learning models (OLS / logistic loss with a convex / strongly convex penalty).
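As a rough illustration of what such a unified criterion might look like, the sketch below runs plain gradient descent on a one-dimensional quadratic and stops either when the relative change falls below tol or when max_iter is reached. The function name minimize, the criterion argument and its values "objective" and "iterate" are hypothetical and are not part of scikit-learn.

```python
def minimize(grad, objective, x0, step=0.1, tol=1e-6, max_iter=1000,
             criterion="objective"):
    """Gradient descent with a unified stopping rule (hypothetical sketch).

    criterion="objective": stop when |f_prev - f| <= tol * max(1, |f_prev|).
    criterion="iterate":   stop when |x_prev - x| <= tol * max(1, |x_prev|).
    max_iter always bounds the number of iterations.
    Returns (x, n_iter, converged).
    """
    if criterion not in ("objective", "iterate"):
        raise ValueError("unknown criterion: %r" % criterion)
    x = x0
    f = objective(x)
    for n_iter in range(1, max_iter + 1):
        x_new = x - step * grad(x)
        f_new = objective(x_new)
        if criterion == "objective":
            change, scale = abs(f - f_new), max(1.0, abs(f))
        else:
            change, scale = abs(x - x_new), max(1.0, abs(x))
        x, f = x_new, f_new
        if change <= tol * scale:
            return x, n_iter, True   # converged within tolerance
    return x, max_iter, False        # stopped by the max_iter safeguard


# Toy problem: f(x) = (x - 3)^2, minimized at x = 3.
def f(x):
    return (x - 3.0) ** 2


def df(x):
    return 2.0 * (x - 3.0)


x_obj, n_obj, ok_obj = minimize(df, f, x0=0.0, criterion="objective")
x_it, n_it, ok_it = minimize(df, f, x0=0.0, criterion="iterate")
x_cap, n_cap, ok_cap = minimize(df, f, x0=0.0, max_iter=5)
```

On this toy problem the iterate-based criterion needs more iterations than the objective-based one for the same tol, and the run capped at five iterations reports that it did not converge instead of silently returning.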
Early stopping
A consistent API will be designed to allow the user to specify early stopping criteria for estimators based on online solvers. This will require specifying an interface for providing validation data to a scikit-learn estimator. Being able to track validation and training scores during the execution of stochastic/incremental algorithms will also be of great interest, as it gives the user insight into how difficult the data he/she wants to mine is. A consistent API for tracing these scores will therefore be provided.
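One possible shape for such an interface, given purely as a sketch, is an outer loop that calls partial_fit on training batches, scores the estimator on held-out data after each epoch, and stops once the validation score has stopped improving. The helper fit_with_early_stopping, its patience and min_delta parameters, and the toy ConstantSGD estimator are all hypothetical; the only assumption about the estimator is that it exposes partial_fit(X, y) and score(X, y).

```python
def fit_with_early_stopping(estimator, batches, X_val, y_val,
                            max_epochs=100, patience=3, min_delta=0.0):
    """Train with partial_fit until the validation score plateaus.

    Hypothetical helper: `batches` is a list of (X, y) training chunks.
    Returns the per-epoch history as a list of (train_score, val_score).
    """
    history = []
    best_val = float("-inf")
    epochs_without_gain = 0
    for _ in range(max_epochs):
        for X, y in batches:
            estimator.partial_fit(X, y)
        # Average the score over training batches, then score validation data.
        train_score = sum(estimator.score(X, y) for X, y in batches) / len(batches)
        val_score = estimator.score(X_val, y_val)
        history.append((train_score, val_score))
        if val_score > best_val + min_delta:
            best_val = val_score
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break  # validation score has stopped improving
    return history


class ConstantSGD:
    """Toy stand-in for an online estimator: fits one constant by SGD."""

    def __init__(self, lr=0.5):
        self.lr = lr
        self.value = 0.0

    def partial_fit(self, X, y):
        for target in y:
            self.value += self.lr * (target - self.value)
        return self

    def score(self, X, y):
        # Negative mean squared error, so that higher is better.
        return -sum((t - self.value) ** 2 for t in y) / len(y)


batches = [([[0.0]], [1.0]), ([[0.0]], [1.0])]
history = fit_with_early_stopping(ConstantSGD(), batches, [[0.0]], [1.0],
                                  max_epochs=100, min_delta=1e-6)
```

The returned history holds the training and validation scores for every epoch, which is the kind of trace the paragraph above calls for. With a scikit-learn classifier such as SGDClassifier, partial_fit additionally needs the classes argument on its first call, which this sketch does not handle.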
Memory management
First, the successful applicant will complete the work that has been started to make scikit-learn fully compatible with memory-mapped input. More importantly, this proposal involves the development of new tools for automating data streaming from a limited number of sources. Currently, many users rely on the same kinds of data sources (SQL databases, HDF5 files, S3 buckets, etc.) for their applications, and are required to write boilerplate code to stream array chunks. A few lightweight tools will be designed to help developers with these tasks. For the sake of maintainability, the intention is not to provide a framework for loading and extracting data, but to design helpers that generate data streams compatible with scikit-learn estimators from a few generic data sources.
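To make the idea of a lightweight streaming helper concrete, here is a minimal sketch of a chunk generator. The name iter_chunks and its signature are hypothetical; the helper only assumes that X and y support len() and slicing, which holds for lists, numpy arrays and disk-backed arrays such as numpy.memmap.

```python
def iter_chunks(X, y, chunk_size):
    """Yield successive (X_chunk, y_chunk) slices of at most chunk_size rows.

    Hypothetical helper: X and y only need len() and slicing, so a
    disk-backed array can be consumed one slice at a time.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be a positive integer")
    n_samples = len(X)
    if len(y) != n_samples:
        raise ValueError("X and y must have the same number of rows")
    for start in range(0, n_samples, chunk_size):
        stop = min(start + chunk_size, n_samples)
        yield X[start:stop], y[start:stop]


# Plain lists stand in here for an on-disk array.
X = [[float(i), float(i) ** 2] for i in range(10)]
y = [i % 2 for i in range(10)]
chunks = list(iter_chunks(X, y, chunk_size=4))
chunk_sizes = [len(X_chunk) for X_chunk, _ in chunks]
```

Each yielded pair can be passed directly to an estimator's partial_fit method, so the user-side boilerplate reduces to a single for loop over the generator.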
Added value of this funding :