Open-source Python library dedicated to machine learning

Project : Scikit-learn
Web page : http://scikit-learn.org/stable/
Source code :
  • https://github.com/scikit-learn/scikit-learn/
Axis: DataSense
Coordinator : Hervé Bredin
Candidate : Tom DUPRE LA TOUR
Laboratories : Inria, LTCI
Managing laboratory : IMT
Team : Parietal & S2A
Engagement : January 2018 - December 2018

Context :
Scikit-learn is an open-source Python library dedicated to machine learning. Extremely popular in both academic research and industry, it is developed by a very active international community of over 900 contributors. Its user base keeps growing, with over 400,000 unique visitors per month to the online documentation and over 2,200 citations per year in research articles.

Objective :
As part of this doctoral mission, I want to improve the preprocessing algorithms and several estimators already present in scikit-learn. More precisely, I want to extend the feature engineering tools by implementing methods already proven in data science: feature binning, supervised and unsupervised categorical encoding, and the Box-Cox and Yeo-Johnson transforms. I also want to improve existing estimators by working on their computing speed and their ability to process large datasets. In particular, the DBSCAN and t-SNE algorithms could be greatly accelerated by parallelizing the nearest-neighbor computation. The t-SNE algorithm could also benefit from better parallelization with OpenMP, thanks to the upcoming integration in scikit-learn of loky, a more robust multi-core parallelization library. Finally, I would like to dedicate part of my mission to in-depth code reviews. Reviewing is essential to the quality of a library, and it is often the bottleneck of software development.
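
To make the preprocessing methods named above more concrete, here is a minimal plain-Python sketch of the Box-Cox transform, the Yeo-Johnson transform, and equal-width feature binning. It is an illustration of the standard formulas only, not the scikit-learn implementation: the function names (box_cox, yeo_johnson, equal_width_bins) and the sample values are hypothetical, and the parameter lmbda is fixed by hand here, whereas in practice it would typically be estimated from the data.

```python
import math


def box_cox(x, lmbda):
    """Box-Cox transform of a single strictly positive value."""
    if x <= 0:
        raise ValueError("Box-Cox is only defined for strictly positive values")
    if lmbda == 0:
        return math.log(x)
    return (x ** lmbda - 1.0) / lmbda


def yeo_johnson(x, lmbda):
    """Yeo-Johnson transform, which also accepts zero and negative values."""
    if x >= 0:
        if lmbda == 0:
            return math.log1p(x)
        return ((x + 1.0) ** lmbda - 1.0) / lmbda
    # Negative branch: mirrored power transform with exponent (2 - lmbda).
    if lmbda == 2:
        return -math.log1p(-x)
    return -(((1.0 - x) ** (2.0 - lmbda)) - 1.0) / (2.0 - lmbda)


def equal_width_bins(values, n_bins):
    """Map each value to the index of one of n_bins equal-width bins."""
    low, high = min(values), max(values)
    if high == low:
        return [0 for _ in values]
    width = (high - low) / n_bins
    # The maximum lies on the upper edge, so clamp it into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


# Hypothetical sample data, chosen only to show a skewed feature.
skewed = [0.5, 1.0, 2.0, 4.0, 8.0, 64.0]
print([round(box_cox(v, 0.0), 3) for v in skewed])
print([round(yeo_johnson(v, 0.5), 3) for v in [-3.0, 0.0, 3.0]])
print(equal_width_bins(skewed, 4))
```

With lmbda = 1, the Yeo-Johnson transform leaves values unchanged, which is a quick sanity check on the two branches. Box-Cox rejects non-positive inputs, which is the main practical reason to offer both transforms.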

Expected results :

Added value :