1 Technical and scientific description of the project
1.1 Program description, vision, ambition and scientific strategy
DigiWorlds builds on the observation that software, networks and data have become ubiquitous and that, for the digital revolution to continue over the next decades, research and education in ICT must grow faster in order to guarantee the sustainability of present and future digital worlds.
Software is ubiquitous. More and more control is delegated to embedded software: computer programs control aircraft and trains and are taking over automobiles. Software is also present in domestic appliances, e.g., TVs, microwave ovens and coffee machines, as well as in telephones, smart cards and medical devices. Software also manages communications, wired and wireless networks, and the transactions that occur through these media. Computer programs are constantly downloaded, upgraded and executed without notice by the end user. In fact, the computer-based management of information lies at the heart of almost all organizations.
Networks are ubiquitous. The Internet has gone from research curiosity to fundamental infrastructure in a fairly short period of time. In terms of societal impact, the global interconnection brought by the Internet has changed the way we live, work and play, and altered our notions of democracy, education, health care, entertainment and commerce. This global interconnection has been a powerful engine for technological innovation and societal evolution. The global trend is that these interconnections are still growing and involve ever more diverse technologies: fixed and mobile networks and the power grid each encompass hundreds of millions, if not billions, of nodes. They exhibit complex and often dynamic patterns of links between the nodes that need to be better understood, modeled and designed.
Data is ubiquitous. It is gradually becoming impossible to remember how the world was before the Information Age and before Google. Most scientific, social, medical, engineering and commercial processes now generate massive amounts of data, and most processes also use data to guide their decisions. E. Schmidt, former CEO of Google, claims that the amount of data now generated in two days is comparable in volume to all the data generated by mankind before 2003. The Information Age aims at a principled, flexible and innovative way of taking advantage of the data present everywhere, and of its interaction with users, decision makers, engineers and scientists, to achieve strategic goals.
Computer components are essential to many crucial societal issues such as health and energy. They participate in complex systems that require a multi-disciplinary and integrative approach. However, because ICT is a young discipline, less than a century old and still in its infancy, we believe it is important to build on the Campus de Saclay a strong research program in ICT, focused on the core subjects of Computer Science and Telecommunications: software, data and networks.
From the research point of view, the main activity of DigiWorlds is to imagine new models, languages and algorithms that can tackle the distributed and heterogeneous nature of software and information and the growing size of data and networks. DigiWorlds covers a fairly broad set of subjects that are interrelated in intricate ways: for example, the issue of security, which used to be studied mostly for software, is becoming essential in the management of data and networks; game theory has proved to be a powerful tool to model complex systems and is relevant to security, data and networks; probabilistic methods are also proving useful to model and analyze systems running in uncertain environments. These are examples of transverse activities that will be strongly supported by DigiWorlds: the program on emerging projects will be used to initiate or reinforce important research topics by funding a small group of researchers for three years. Summer schools and mini-courses given by invited professors will also be used to develop trans-disciplinary awareness.
The main actors in ICT research and development in France are present on the Campus de Saclay or will be there soon, including all the actors of “Allistene”, the French “alliance” for digital sciences and technologies. DigiWorlds is an essential instrument for these actors to work together by sharing problems, methods and tools in order to bring solutions to the challenges of a sustainable digital age.
The general theme on distributed programs, data and architectures is quite broad. However, the project is structured along three action lines: SciLex, ComEx and DataSense. Within each action line, we have identified specific tasks that correspond to challenges identified for at least the next four years (see section 1.2 for details). These tasks are focused enough to be the topic of effective daily collaboration between partners but are also reasonably open to foster larger interactions. Integration will be achieved through cross-team collaborations within each task, but also via common events such as the DigiWorlds conference, summer school and industry days. The participation in a common education program is also an effective way to build a strong community despite the large number of institutional partners.
With about 340 researchers, DigiWorlds will act as a well-connected and visible core group within the larger ICT department of the IDEX Paris-Saclay. In the long term, DigiWorlds will contribute to the emergence of new ICT research units on the Campus de Saclay. The emergence of the new Université Paris-Saclay will create a collective identity for our discipline and will consequently facilitate interactions with industry partners and multi-disciplinary projects with other departments.
A recent study by the French ministry of research and education showed that the Campus de Saclay represents 17% of the national research activity in ICT, mainly located at U. Paris-Sud, INRIA and Institut Télécom. More than 800 researchers belong to research units that were ranked A and A+ by the national evaluation agency AERES. Overall, ICT represents more than 1000 researchers and is the second-largest cluster on the Campus de Saclay after Biology, Medicine and Health.
DigiWorlds represents a unique concentration of higher-education institutes: the most selective engineering schools in France, including Ecole Polytechnique, Telecom ParisTech and Ecole Centrale Paris, two universities, and an Ecole Normale Supérieure focused on education through research (Cachan). By 2017, all but the nearby UVSQ will be located on the Campus de Saclay. The number of students graduating in computer science from these schools is still low, and their general background in the discipline is heterogeneous. One important objective of DigiWorlds is to develop a coherent curriculum that will help raise both students' skills in computer science and the number of students choosing to graduate (and continue with a PhD) in computer science. While each school currently proposes its own set of courses depending on its resources (both students and teachers), DigiWorlds will be able to offer a larger and more coherent choice, taking advantage of the best teachers on campus for each subject. This will result in a curriculum that is attractive at the national and international level.
Research action lines
DigiWorlds addresses the challenges of a sustainable digital world by focusing on three topics: safety, scalability, and usability (see the detailed presentation in section 1.2):
- SciLex (see 1.2.1): the long-term goal is to create robust digital worlds in terms of software reliability and data security. Digital robustness must make as few assumptions as possible about the physical environment: software and data resources may be outsourced; programs may run in noisy and/or malicious environments, etc. SciLex has three main milestones:
(i) Achieving software reliability and security through modularity, i.e. through the massive and distributed reuse of reliable ICT “building blocks”;
(ii) Using models and verification to prove software reliability with a unified perspective on continuous and discrete systems, since modern applications increasingly involve both kinds of models;
(iii) Bridging the gap between high- and low-level certification, i.e., between “what” is to be designed and “how” to carry out the design.
- ComEx (see 1.2.2): the long-term goal is to create autonomous and scalable digital systems based on efficient, agile and transparent communication means and supported by reconfigurable and heterogeneous mobile networks. ComEx has three milestones:
(i) Extending the key concepts of information theory and coding to distributed settings;
(ii) Designing network-driven distributed architectures;
(iii) Investigating the complementary perspective of node-centered distributed architectures, e.g., swarm computing.
- DataSense (see 1.2.3): the long-term goal is to enable citizens and professionals to seize the multi-faceted opportunities offered by the exponential growth of digital data. DataSense has five milestones:
(i) Managing large-scale and complex data, with a focus on scalability and safety properties and on the usability of the management tools, e.g., the expressiveness of the query language;
(ii) Uncovering the meaning of data from prior knowledge or from usage, e.g. by exploiting annotations and user traces;
(iii) Making (machine) learning from data widely usable outside research labs and making its results transferable, e.g., across domains or applications;
(iv) Facilitating data-supported decisions by considering that a data set describes several interacting perspectives (the “players”) and supports several goals (multi-objective optimization);
(v) Putting the user at the center of the data processing loop and developing a sixth sense, the sense of data, through (collaborative) visualization and interactions.
DigiWorlds will also use its programs for invited professors, emerging projects and PhD grants to actively support joint work with other disciplines. Continuing the long-standing and fruitful collaboration between ICT and Mathematics, DigiWorlds will interact with the labex Mathématiques de Saclay in the areas of Logic & Algebra (cryptography, verification, see SciLex 1.2.1); Probability (information theory, stochastic network theory, see ComEx 1.2.2 task 1); and Statistics (machine learning, see DataSense 1.2.3 task 3). Likewise, DigiWorlds will continue the long-standing cooperation with bioinformatics, specifically in the area of Systems Biology (IDEX interdisciplinary program, DataSense action line). Techniques developed in DataSense (task 3) will also be applied to the analysis of medical images. A new project on neuro-inspired hardware for computing will be developed at the interface between ICT and nanotechnology. Last but not least, cooperations will be initiated with the Social Sciences and Humanities department with a project to create an Institute on “Designing the Technology Society Together”.
DigiWorlds is a unique opportunity for the institutions on the Campus de Saclay to design an integrated and comprehensive educational program in ICT, benefitting from the complementary strengths of all partners. The program will be set up at the pre-college, undergraduate, graduate and PhD levels, including:
- Popularization of ICT for pre-college students;
- A selective University curriculum (L1-L2) with a focus on computer science, taking into account the introduction of computer science in high school programs in 2012;
- An international curriculum (L3-M1) dedicated to fundamental computer science with an initiation to research, open to University and engineering schools students;
- International Masters dedicated to the DigiWorlds special fields. We shall take advantage of the existing Masters curricula, e.g., the MPRI and ICTLabs Masters programs, to foster the cooperation and integration of the DigiWorlds partners;
- One or two disciplinary doctoral schools consolidating the various existing doctoral schools (about 1000 PhD students in ICT).
The goal is to improve the skills of French students in ICT, the number of ICT graduate students and PhDs, and to build Masters curricula with an excellent international visibility. DigiWorlds will support this goal by initiating appropriate collaborations among the institutions and by supporting innovative educational projects, including early access to research labs and to platforms such as Digiscope, and development of entrepreneurship scenarios. Note that while DigiWorlds will initiate the curricula, they will be gradually supported by the partner institutions.
DigiWorlds will also provide grants that combine Masters grants (almost non-existent in the French system at the moment) and PhD funding, which will increase the attractiveness of DigiWorlds at the international level. DigiWorlds will fund the additional administrative support these programs require, as they typically involve extra interactions among the partners and extra care in welcoming foreign students. The education program and initiatives are detailed in section 1.3.
The Campus de Saclay already benefits from institutional networks designed to foster innovation and partnership between industry and academic partners in the ICT area. These include the Systematic cluster and the Digiteo maturation programs. In addition, the mission of the proposed SATT technology transfer initiative is to organize knowledge transfer at the Campus de Saclay level. DigiWorlds is de facto part of these networks.
DigiWorlds will bring added value through its comprehensive education programs at both the graduate and PhD level, which will attract more students and higher-profile students. Innovation in ICT will also benefit from closer relationships between research-oriented and entrepreneurship-oriented people, facilitated by the cooperation between universities, research institutes and engineering schools. The main partners in the Campus de Saclay have initiated a joint programme called PEEPS to foster entrepreneurship among students. Specific programs are currently designed as part of the ICTLabs actions, including Master programs with a minor in innovation and entrepreneurship, a PhD+innovation and entrepreneurship diploma, and doctoral training centers. Through DigiWorlds, these programs will be available to a larger number of partners and will be offered to more students.
The innovation ecosystem has already expressed interest in DigiWorlds: the two ICT competitiveness clusters in the greater Paris area, Systematic and Cap Digital, have written letters of support, as well as major industrial actors such as Dassault-Aviation, EADS, IBM, Thalès (see letters in annex).
Regarding education, a major goal is to attract more students to ICT both at the national level, especially in the engineering schools, and at the international level. This goal will be achieved by offering fundamental and applied curricula in computer science at the pre-college, undergraduate, graduate and PhD levels, with early exposure to research activities. The proposed activities range from fundamental to applied research and to technology transfer, thanks to the thriving local ICT industry. Attracting more and better students is therefore also a means to further develop research and innovation activities in the Campus de Saclay.
By joining research forces and creating new interactions between teams originally working in different fields, DigiWorlds will produce innovative models, methods and algorithms to manage the distributed nature of digital worlds. This problem is notoriously difficult, and new ideas have to be found “outside the box”. DigiWorlds will create a privileged research environment, with easy access to good students and to companies, where new ideas can be transformed into new products and where problems can be identified, discussed and transformed into scientific questions.
The ICT landscape in the Campus de Saclay is going to change quickly with the arrival of major actors such as Institut Telecom and ENS Cachan (renowned for training the best French students for academic research). DigiWorlds will significantly contribute to their integration in the research and education landscape. More generally, DigiWorlds will organize joint events such as summer schools and conferences, structure joint activities in research, teaching and dissemination, and offer an efficient framework for resource pooling in order to address emerging challenges in ICT.
1.2 Scientific description of the research project
Programs greatly facilitate our lives. Unfortunately, they are not fully reliable: they often have “bugs”.
Software engineering and software verification have been a concern for many years. Development practices and software usage, however, have dramatically evolved in recent years. The main evolution comes from the explosion of communications. This is both an opportunity and a threat: an opportunity because we have access to many more resources; a threat because the execution environment of a program cannot be guessed beforehand. Also, in such an open world, malicious users can disturb computations or gain access to private information. How can we get some assurance of the security of communications and applications? Task 1 addresses this issue.
Many devices involve both physical components and discrete software control: these are hybrid systems. The physical devices use and measure real (continuous) values, while the programs compute over discrete domains. The computations are therefore performed on approximations. Furthermore, the operations performed by these programs may themselves be approximate, e.g., when dividing floating-point numbers. Can we trust such programs, even when they are certified? This requires a robustness guarantee. Task 2 addresses this and other related issues.
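As a minimal illustration (the example is ours, not part of the project description), even elementary floating-point arithmetic only approximates the real-valued quantities that a hybrid system manipulates, which is why certification must come with a robustness guarantee:

```python
# Illustrative sketch: floating-point arithmetic only approximates the
# reals, so a discrete controller accumulating a continuous quantity
# drifts away from the ideal mathematical value.

def repeated_add(step: float, n: int) -> float:
    """Accumulate `step` n times, as a naive discrete integrator might."""
    total = 0.0
    for _ in range(n):
        total += step
    return total

# 0.1 has no exact binary representation, so ten additions miss 1.0.
approx = repeated_add(0.1, 10)
print(approx == 1.0)        # False
print(abs(approx - 1.0))    # a small but nonzero rounding error
```

A robustness guarantee must ensure that such drifts, however small, cannot flip the verdict of a verification.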
Finally, while many companies, e.g., Airbus or Astrium, use verification and test techniques intensively at various levels in their critical embedded software, there is still a problem with the combination of the verification tasks performed at these various levels. For example, a theorem prover or a model-checker can be used to prove a specification; invariants can be computed, or a static analysis can be performed and verified, on some high-level implementation; finally, tests and verifications can be performed at the bytecode or circuit level. One major concern from industry is to improve the consistency of these verification tasks. Task 3 addresses this issue and related ones.
Task 1: Safe and reusable distributed programs
Key people: Jean Goubault-Larrecq (ENS Cachan, LSV 2.2.6).
Research on this task in DigiWorlds will address the following topics:
Security: Several tools have been designed for the verification of security schemes and protocols. The scope of these tools is however mostly restricted to a formal (abstract) model of protocols and to some basic security properties, typically confidentiality and integrity.
Cryptography goes well beyond confidentiality and authenticity. A new class of applications, ranging from electronic voting to electronic passports, and including health information systems, requires privacy or anonymity instead, i.e., the impossibility to link two pieces of data, e.g., name and disease.
There is a need for automated verification tools for such properties. New probabilistic models and tools from information theory and semantics are also required to quantify information leakage. DigiWorlds will provide a unique blend of researchers and students on these themes.
(De)composition: The classical control-flow (resp. data-flow) analysis of programs is not well-suited for large concurrent applications. First, it does not scale up well. Second, there are new issues, such as information flow, that are specific to distributed programs.
For subtle control-flow properties such as deadlocks or serializability, DigiWorlds plans on using recent geometric characterizations of (abstractions of) spaces of runs up to so-called directed homotopy equivalence. This will have the extra benefit that these techniques scale up much better than classical interleaving-based methods.
For data-flow properties, in particular for quantifying information leakage in implementations, not just specifications, of anonymity protocols such as Crowds, DigiWorlds will complement this with an analysis of the information flow through each component abstracted as communication channels, using an information-theoretic perspective.
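The information-theoretic view of a component as a channel can be made concrete with a small sketch (the channel matrices and numbers below are hypothetical, ours, not taken from any protocol analysis): leakage is the mutual information between the secret input and the observable output.

```python
import math

def mutual_information(prior, channel):
    """I(X;Y) in bits, given a prior p(x) and a channel matrix p(y|x)."""
    ny = len(channel[0])
    p_y = [sum(prior[x] * channel[x][y] for x in range(len(prior)))
           for y in range(ny)]
    info = 0.0
    for x, px in enumerate(prior):
        for y in range(ny):
            pxy = px * channel[x][y]
            if pxy > 0:
                info += pxy * math.log2(pxy / (px * p_y[y]))
    return info

# Hypothetical anonymity scenario: the secret is which of two users sent
# a message; the attacker observes an output correlated with the sender.
prior = [0.5, 0.5]
perfect = [[0.5, 0.5], [0.5, 0.5]]   # output independent of the secret
leaky   = [[0.9, 0.1], [0.1, 0.9]]   # output mostly reveals the secret

print(mutual_information(prior, perfect))  # 0.0 bits leaked
print(mutual_information(prior, leaky))    # ~0.531 bits leaked
```

Analyzing an implementation rather than a specification then amounts to deriving such a channel matrix from the program's actual behavior.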
Web services: Web services include mashups, SOAP and WS-* based services. Their purpose is to integrate and reuse Web applications, e.g., Google Maps. Currently, research has mostly focused on the orchestration problem: given a specification of component services, how to combine them in order to reach the desired goal. But there is currently no guarantee that a service reaches its goal or does not violate access rights, for instance through embedded mashups.
Another issue that has also received little attention is the security of the combination: can we guarantee that the composed service will behave as expected, even in the presence of a malicious user, or if a component service does not follow its specification?
Mobility: Network topology changes over time, as processes enter or leave the network. Verifying the properties of programs that are executed on such networks is a challenging new problem.
Verifying highly non-deterministic processes that model mobile and reconfigurable architectures requires new insights. DigiWorlds will capitalize on previous work on fault tolerant systems to address these issues.
Task 2: Continuous and discrete systems: models and verification
Key people: Eric Goubault (CEA, LIST 2.2.3).
Collaborations: S. Gaubert (INRIA, labex Maths).
State of the art and its limitations.
- Static analysis of numerical programs:
- There are results and tools, e.g., Fluctuat, for the analysis of floating-point arithmetic. There is no theory or tool, however, that considers the combination of numerical programs and physical devices such as sensors. This is important, if only to assess embedded software in planes, cars or trains.
- Real time systems:
- There are models of real-time systems and verification tools, e.g., Uppaal. There is no theory or tool, however, that considers approximate data/delays with a robustness guarantee.
- Hybrid systems:
DigiWorlds will develop hybrid models that include both the semantics of the embedded programs and the behavior of the physical devices connected to them, as well as tools for the static analysis of such hybrid systems. One particular challenge here is solving the equations defining the physical part of the system. Doing this on a computer involves solving them up to some degree of accuracy; it is then important not only to compute a solution, but also to guarantee upper bounds on the resulting discretization error. Note that DigiWorlds has among its members the founders and the current coordinator (F. Lamnabhi-Lagarrigue) of the HYCON2 NoE, whose complementary expertise will be key to this action.
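One classical building block for such guaranteed bounds is interval arithmetic, sketched below in a toy form (our own sketch; production analyzers additionally use outward directed rounding): every operation returns an interval guaranteed to enclose the exact real result.

```python
# Toy interval arithmetic: enclosures may overapproximate the exact
# range of a computation, but by construction they never miss it.

class Interval:
    def __init__(self, lo: float, hi: float):
        self.lo, self.hi = lo, hi

    def __add__(self, other: "Interval") -> "Interval":
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other: "Interval") -> "Interval":
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))

    def __contains__(self, x: float) -> bool:
        return self.lo <= x <= self.hi

# Enclose x*x + x for x in [1, 2]: the exact range [2, 6] is guaranteed
# to lie inside the computed enclosure.
x = Interval(1.0, 2.0)
enclosure = x * x + x
print(enclosure.lo, enclosure.hi)
```

The same principle, applied step by step to a discretized differential equation, yields validated enclosures of the physical trajectory rather than a single approximate solution.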
- Robust verification:
- In analyzing programs connected to physical systems, time, probabilities and other physical quantities are never known exactly. DigiWorlds will explore techniques for robust verification, i.e., such that, if verification succeeds, one not only knows that the system (program plus physical environment, timed automaton, probabilistic automaton) behaves correctly under the specified parameters, but also that it will continue to behave correctly if the parameters are changed slightly.
- Quantitative verification:
- Quantitative properties such as “a car brake feedback loop is guaranteed to bring the car to a full stop in less than 3 seconds, from an initial speed of 100 km/h” are often more relevant than a “yes/no” answer. DigiWorlds will explore such questions for each of the three themes: programs that interact with the physical world, timed automata, probabilistic automata; also for other quantities such as costs and weights.
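The quoted braking property can be illustrated with elementary kinematics (our simplifying assumption: constant deceleration, ignoring reaction time and vehicle dynamics), which shows the quantitative flavor of such questions:

```python
# Sketch of a quantitative property check under a constant-deceleration
# assumption: the stopping time from initial speed v0 is v0 / a.

def stopping_time(v0_kmh: float, decel_ms2: float) -> float:
    """Seconds to reach a full stop from v0 under constant deceleration."""
    v0 = v0_kmh * 1000.0 / 3600.0   # convert km/h to m/s
    return v0 / decel_ms2

# Does a brake sustaining ~1 g (9.81 m/s^2) meet the 3-second bound
# from 100 km/h?
t = stopping_time(100.0, 9.81)
print(f"stop in {t:.2f} s -> property {'holds' if t < 3.0 else 'fails'}")
```

A verification tool must establish such bounds not for one idealized trajectory but for all behaviors of the program-plus-environment model.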
Task 3: From high-level to low-level certification
Many techniques are known today to certify programs: refinement, theorem proving, model-checking, static analysis, testing, and so on. None is enough in isolation to verify large pieces of software, from specification to machine code. Moreover, each technique has its own strengths and weaknesses, but cooperation between existing tools and with tools of different natures is limited.
Currently, verification is done at each stage of the development. This involves useless repetition of tasks at various abstraction levels (specification, model, source code, machine code), using different techniques and tools.
One way to combine verification techniques is to implement them inside the same logical framework: proof assistants such as Coq and Isabelle/HOL have been successfully used as environments to support tools for program analysis, including abstract interpreters, model checkers and test case generators. However, because they are interactive and based on a general mathematical language, these tools require a great deal of expertise before they can address real-life problems. The next step is to move towards program verification platforms (for example Why or Frama-C) that handle the computer memory model more directly and use recent advances in automated deduction (especially SMT solvers).
- Combination of verification techniques.
The main goal is to get the best of these different worlds, enabling for example the use of domain-specific methods within a program verification platform. DigiWorlds features an unusual number of expert teams in proof technologies, both interactive (Coq, Isabelle) and automated (Alt-Ergo, Bedwyr). They study applications both to program verification, including floating-point computation, and to mathematical problems, in particular real analysis, which is essential for Task 2. DigiWorlds also features expert teams in algorithmic methods for program verification that will benefit from access to general-purpose proof systems. DigiWorlds plans to deliver program verification platforms that consider realistic memory models as well as compilation and architecture-dependent features, and that provide access to the advanced algorithmic verification techniques developed in the project.
- Cooperation of verification techniques
- In practice, all of these techniques (proofs, model checking, static analysis and testing) are used to increase our confidence in programs. The main challenge is to transfer the information collected by one method to improve the others. For example, test suites can be extracted from specifications and proofs. Conversely, concrete execution traces obtained from test suites help infer good candidate invariants that can then be submitted to proof assistants. DigiWorlds also plans to investigate similar transfers between static analyzers and provers.
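The trace-to-invariant direction can be sketched in a few lines (a toy, Daikon-style filter; the candidate grammar and variable names are ours): any candidate that survives every observed state becomes a conjecture to hand to the prover.

```python
# Toy dynamic invariant inference: start from a fixed pool of candidate
# invariants and keep only those that hold in every observed state.

def infer_candidates(traces):
    """Return the candidate invariants true in all states of all traces."""
    candidates = {
        "x >= 0":      lambda s: s["x"] >= 0,
        "y >= 0":      lambda s: s["y"] >= 0,
        "x <= y":      lambda s: s["x"] <= s["y"],
        "x + y == 10": lambda s: s["x"] + s["y"] == 10,
    }
    return sorted(name for name, check in candidates.items()
                  if all(check(state) for trace in traces for state in trace))

# States observed while testing a loop that moves units from y to x.
traces = [
    [{"x": 0, "y": 10}, {"x": 3, "y": 7}, {"x": 5, "y": 5}],
    [{"x": 2, "y": 8}, {"x": 4, "y": 6}],
]
print(infer_candidates(traces))  # ['x + y == 10', 'x <= y', 'x >= 0', 'y >= 0']
```

Such conjectures are only as good as the test suite: the proof assistant then decides which survivors are genuine invariants.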
Computer networks, and especially the Internet, have become a fundamental infrastructure with a huge impact on our everyday life. Fixed and mobile networks as well as power grids are becoming ever larger, encompassing hundreds of millions of nodes. These networks involve various technologies and interact recursively with one another. As a result, the infrastructure is so large and diverse that it is now of a distributed nature: each component has many parameters that should be tuned to achieve a proper behavior of the whole system. A central problem is therefore how distributed algorithms and decisions can result in a fair equilibrium.
Moreover, new communication systems are constantly proposed for standardization, with the goal of using the available bandwidth better. Large improvements are expected, provided that the design becomes more global, with a possible relaxation of the separation between layers. Cross-layer design is a first step in this direction, but more flexibility is needed, which requires new tools and paradigms. Cooperation and auto-reconfigurability of distributed networks are such paradigms and are already widely studied. Finally, full benefits will be obtained if the distributed software framework (middleware) is designed with both functionality and performance in mind. This is addressed via three tasks: understanding the ultimate performance of good models of the situations of interest (Task 1); global design of network architectures (Task 2); and terminal-centric design of networks (Task 3).
Task 1: Network information theory and coding
Originally, Information Theory (IT) addressed point-to-point communications. It has found applications in many areas (statistical inference, natural language processing, cryptography, neurobiology, evolution, ecology, quantum computing, and many forms of data analysis). IT has also been generalized to multi-agent communications and became what is now called “Network Information Theory”. Network IT is also a tool to compute a function, make a decision, or coordinate an action based on distributed information. How much communication is needed to perform such a task is still an open problem. Security issues can also be addressed, e.g. through the model of the Wiretap Channel.
Network Coding/Distributed Coding. IT establishes fundamental limits, whereas Coding provides a means to achieve these limits. Like Information Theory, Coding was first designed for either channel coding or source coding. Now, we find coding in almost all areas where Network Information Theory applies. A popular example is Network Coding (NC).
For Wireless Networks, some schemes of Physical Layer NC, such as “Lattice NC”, establish a bridge between distributed computation and communication on autonomous interference networks. NC (at the packet level) can also be used to offer higher-layer reliability in network settings where traditional protocols, e.g., TCP, achieve poor performance due to “impairments” on wireless links (TCP misinterprets errors and long delays as congestion). NC is also promising for broadcasting, multicasting and dissemination of information as sub-packets. Reliability can be obtained without the need for acknowledgements.
Distributed Computation and distributed coding over (wireless) networks. Our goals are to (i) study how distributed computation behaves on a realistic network of nodes whose point-to-point links are not perfect, (ii) determine what is achievable in terms of rates, power, delay, etc. (iii) propose distributed protocols and coding schemes. Network IT is a basis to evaluate the achievable performance of such networks, whereas distributed coding for multi-agent compression (sensor networks) or computation codes (communicating results of local computations over the network) are bricks for their design.
Network Coding to provide quality of service and low consumption in distributed IP networks. IP networks are composed of many shared radio links and frequently forward multicast traffic from multiple sources to multiple destinations. These conditions are ideally suited to NC schemes, which should be able to increase the bitrates and reduce the energy consumption while respecting the quality-of-service requirements of the network applications.
Network IT and Stochastic geometry. The capacity of a wireless network clearly depends on the location of the nodes. Based on this spatial point of view, stochastic geometry and the theory of point processes can be used for the analysis of large-scale Self Organized Networks (SONs) in order to model and quantify interference, outage probability, etc. Specifically, we propose to develop a framework combining IT and stochastic geometry in order to investigate fundamental limits on information flow in SONs that take into account dynamics over the time scales of interest.
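A hedged Monte Carlo sketch of this point of view (all parameters, the bounded path-loss model and unit transmit powers are our own assumptions): interferers form a Poisson point process on a disc, and outage is the event that the signal-to-interference ratio at a receiver falls below a threshold.

```python
import math
import random

def poisson_sample(rng, lam):
    """Knuth's method for drawing a Poisson(lam) count."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while p > limit:
        k += 1
        p *= rng.random()
    return k - 1

def outage_probability(density, radius, r0=1.0, alpha=4.0, theta=1.0,
                       trials=4000, seed=1):
    """Estimate P(SIR < theta) for a receiver at the origin.

    Interferers: Poisson point process of the given density on a disc of
    the given radius; bounded path loss max(r, 1)^(-alpha); unit powers.
    """
    rng = random.Random(seed)
    signal = max(r0, 1.0) ** (-alpha)
    outages = 0
    for _ in range(trials):
        n = poisson_sample(rng, density * math.pi * radius ** 2)
        interference = sum(
            # sqrt trick gives a uniformly distributed distance on the disc
            max(radius * math.sqrt(rng.random()), 1.0) ** (-alpha)
            for _ in range(n)
        )
        if interference > 0 and signal / interference < theta:
            outages += 1
    return outages / trials

# Denser interferer fields yield a higher outage probability.
print(outage_probability(0.005, 50.0))
print(outage_probability(0.02, 50.0))
```

The analytical program described above replaces such simulations with closed-form or computable expressions derived jointly from IT and point-process theory.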
Task 2: Network centric design of distributed architectures
Collaborations: F. Baccelli and B. Błaszczyszyn (TREC team, INRIA & ENS, labex SMP6).
Today’s applications such as the Internet of Things, smart grids or cloud computing rely on more and more dynamic and distributed architectures. This evolution increases the need for self-organized networks and raises many issues regarding network scalability, network management, data and infrastructure security, trust domain management and the use of scarce resources (energy or bandwidth). All these issues need to be addressed in a distributed manner. For example, cellular networks are now facing bandwidth scarcity, with solutions such as dynamic spectrum allocation (DSA) and advanced mobility management. In non-stationary architectures, the network must also reshape itself to distribute resources geographically according to the users’ traffic demands in order to scale. While such adaptation has so far been limited to radio networks, it will soon reach wired networks, thanks to cloud computing.
Among the classical techniques, network planning aims to design the network in order to provide resources at appropriate locations on the basis of long-term predictions over statistical data. On the other hand, traffic engineering exploits existing resources as efficiently as possible using local optimization. Both approaches have issues that need to be addressed in distributed architectures.
Network engineering (NE) emerged in the 2000s through the scheduled-traffic concept. Its goal is to fill the gap between network planning and traffic engineering by dynamically providing bandwidth, with high reactivity, where traffic is expected. It relies on scheduled demand, which corresponds to dynamic yet predictable traffic. Virtual Private Network (VPN) provisioning is a typical example of such a technique. In this situation, NE exploits the time-space correlation between the elements of the traffic matrix through global optimization tools (e.g., ILP/MILP).
NE now needs to face the multi-tenant nature of networks. In cloud computing, the cloud service provider (CSP) aims at satisfying its end-users’ and sub-contractors’ expectations with an imprecise and incomplete knowledge of the network, as network operators do not provide a detailed view of their architecture and traffic matrix to the CSP. In wireless distributed networks, and more generally in complex networks, the traffic matrix and topology change too rapidly for information to be reliable. Network engineering therefore needs to exploit the concept of abstracted topologies and resources, using advanced graph theory and complex-networks techniques.
Mixed optimization evolutions. When looking at the various global objectives (security, performance, energy, etc.), optimization problems often combine multiple objectives over a large set of constraints. The variables to be optimized are a mix of discrete and continuous values, which can be addressed by analogy with a fully continuous situation; this, however, is only an approximation. Any advance in algorithm design for mixed optimization would result in much more efficient and realistic algorithms. Connections with what is called hybrid control in control theory might be established. The wireless networks case, through adaptive modulation and coding, tuning of the power levels (discrete in 3G systems), etc., constitutes a perfect field of experimentation for such techniques.
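The continuous-relaxation approximation mentioned above can be illustrated by a naive relax-and-round heuristic on a one-dimensional mixed problem; the dense grid search below merely stands in for a real continuous solver, and the whole sketch is an illustration rather than a competitive algorithm:

```python
import math

def relax_and_round(cost, lo, hi, steps=1000):
    """Naive relax-and-round heuristic for minimising `cost` over the
    integers of [lo, hi]: solve the continuous relaxation first (here a
    dense grid search stands in for a real continuous solver), then
    round to the better neighbouring integer."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    x_star = min(xs, key=cost)                 # continuous optimum
    candidates = {max(lo, min(hi, math.floor(x_star))),
                  max(lo, min(hi, math.ceil(x_star)))}
    return min(candidates, key=cost)           # best nearby integer

# e.g., minimise (x - 2.6)^2 over the integers 0..10
best = relax_and_round(lambda x: (x - 2.6) ** 2, 0, 10)
```

Rounding the relaxed optimum can be arbitrarily suboptimal in general mixed problems, which is precisely why dedicated mixed-optimization algorithms are sought.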
Abstract objectives. If making a more efficient or fairer use of network resources is an objective that is easily characterized, considerations such as security, interoperability, heterogeneity management, genericity or reliability are harder to express. Satisfying them in a distributed manner requires the design of new models, of adaptive and distributed algorithms, and of generic middleware.
Task 3: Terminal centric design of networks
Paradigms such as game theory and networked optimization appear fully relevant to analyze distributed systems and to design appropriate algorithms at the terminals. This is complementary to network IT, which does not capture concepts such as equilibrium, coordination, selfishness or bounded rationality. The focus here is on terminals, which complements operator-centric design.
Game theory and networked optimization. In networks where nodes share a common resource, good local decisions are interdependent. In this case, game theory provides useful concepts for designing distributed algorithms and characterizing their equilibria. It appears to be the dominant paradigm when the network is poorly (or not at all) coordinated, as in ad hoc or cognitive networks. Game theory also has strong links with learning theory: it can make it possible to predict the convergence of algorithms with little information about the environment. However, its application to networks opens several challenges. Namely, when some coordination is implementable, it becomes possible to move either to advanced equilibrium concepts, e.g., bargaining solutions, or to networked optimization (assuming a sufficient degree of coordination).
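As a toy illustration of equilibria reached by purely local decisions, the following sketch runs best-response dynamics in a simple channel-selection congestion game; the payoff model (rate = 1/load) and the synchronous sweep are illustrative assumptions:

```python
def best_response_dynamics(n_players=6, max_sweeps=50):
    """Channel-selection congestion game: each player picks channel 0
    or 1 and receives rate 1 / (number of players on that channel).
    Players repeatedly switch whenever switching improves their own
    rate; the process stops at a pure Nash equilibrium."""
    choices = [0] * n_players            # everyone starts on channel 0
    for _ in range(max_sweeps):
        moved = False
        for i in range(n_players):
            load = [choices.count(0), choices.count(1)]
            other = 1 - choices[i]
            # after switching, my new channel would carry load[other] + 1
            if 1.0 / (load[other] + 1) > 1.0 / load[choices[i]]:
                choices[i] = other
                moved = True
        if not moved:                    # no profitable deviation left
            break
    return choices

equilibrium = best_response_dynamics()   # converges to a balanced split
```

Congestion games of this kind are potential games, which is why such myopic dynamics are guaranteed to converge; the networks targeted here are far less well behaved.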
Game-theoretic modeling of communication networks. In order to model distributed communication networks with non-stationarities, appropriate tradeoffs remain to be found between a precise description of the interactions and implementation aspects. Existing advanced dynamic game models are accurate but lead to many practical problems, such as strong information assumptions. On the other hand, learning-based approaches, relying on the “automaton versus environment” model, correspond to games with reasonable assumptions and complexity but often lead to inefficient strategies.
Green cooperative networks. The volume of traffic is increasing exponentially, and so is energy consumption. However, many devices are not used constantly and sometimes exchange only control traffic. If they could be shut down in a distributed manner, with the network dynamically reconfiguring itself, important energy savings could result. However, in the envisioned situations, devices are not only network users but also cooperate to improve performance. Hence putting a device in sleep mode has specific drawbacks which drastically change the games involved.
Devising globally efficient strategies. Assuming a certain degree of coordination, we will address networked optimization by exploiting swarm intelligence (SI). SI is the collective behavior of self-organized systems, natural or artificial. Several algorithms, such as “Ant Colony Optimization”, have proven their efficiency. When network intelligence is used to change the nature of the resources assigned to a given area, algorithms such as “River Formation Dynamics” can be used; this appears to be a promising tool for the planning of large, multi-rate and non-stationary networks. A challenge ahead is to enrich this approach to make it relevant to wireless networks and to connect it with game theory. The design of the middleware that sits in the terminal will follow autonomic computing principles.
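The Ant Colony Optimization idea can be sketched in a few lines for a shortest-path problem; all parameter values (evaporation rate, ant count) are illustrative, and a real deployment would of course run distributed across the network:

```python
import random

def aco_shortest_path(graph, src, dst, n_ants=20, n_iters=30, rho=0.5, seed=0):
    """Minimal Ant Colony Optimization sketch for shortest paths.
    `graph` maps each node to {neighbour: edge_length}. Ants build
    paths with probability proportional to pheromone / length;
    pheromone then evaporates (rate rho) and is reinforced by 1/cost."""
    random.seed(seed)
    tau = {(u, v): 1.0 for u in graph for v in graph[u]}  # pheromone
    best = None
    for _ in range(n_iters):
        tours = []
        for _ in range(n_ants):
            node, path, cost, seen = src, [src], 0.0, {src}
            while node != dst:
                nxt = [v for v in graph[node] if v not in seen]
                if not nxt:            # dead end: abandon this ant
                    path = None
                    break
                weights = [tau[(node, v)] / graph[node][v] for v in nxt]
                v = random.choices(nxt, weights=weights)[0]
                cost += graph[node][v]
                path.append(v)
                seen.add(v)
                node = v
            if path is not None:
                tours.append((cost, path))
                if best is None or cost < best[0]:
                    best = (cost, path)
        for edge in tau:               # evaporation
            tau[edge] *= 1.0 - rho
        for cost, path in tours:       # reinforcement
            for u, v in zip(path, path[1:]):
                tau[(u, v)] += 1.0 / cost
    return best

# Toy graph: the A-B-D route (cost 2) beats A-C-D (cost 5)
graph = {"A": {"B": 1.0, "C": 4.0}, "B": {"D": 1.0}, "C": {"D": 1.0}, "D": {}}
cost, path = aco_shortest_path(graph, "A", "D")
```

The pheromone update is the part that would need rethinking to connect SI with the game-theoretic view of the terminals.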
Pervasive, overwhelming information is gradually reshaping the way individuals and societies think, learn, decide and interact. The so-called “perils and promises of big data” call for integrated research efforts as they raise unprecedented scientific, ethical and cognitive issues. First, the feasibility of many tasks radically changes when sufficient data is available. Information finding through the Web is a prime example, and so is machine translation: the wealth of multi-lingual corpora enabled a new computational approach to translation, through statistical alignment of text fragments. Second, new goals become reachable through a smart exploitation of casual data. The early detection of flu outbreaks from the analysis of Google queries offers an example of such opportunistic uses of existing data. Third, the data deluge might bring into question some of our values or practices, such as data privacy and freedom. Likewise, the ability to analyze massive amounts of data provides experimental scientists with unprecedented opportunities; but does this modify the nature of scientific methodology?
DataSense targets three out of the many questions raised by the explosion of data: how to handle larger and larger amounts of data; how to make sense of, learn from and decide with data; and how to leverage human expertise in data-intensive tasks. Principled, well-grounded, unfoldable models are needed to harness data and master its ever-increasing volume and complexity. These models must accommodate the primary requirements of the digital New World: robustness and effectiveness, e.g., through massive distributed processing including cloud environments (see ComEx), and safety, e.g., in terms of data privacy and access control (see SciLex). Further, data should support the production of new knowledge through the enabling technology of statistical machine learning, benefitting from the strong interactions between ICT and Mathematics on Campus de Saclay. Going one step further, while the production of new knowledge is indeed a goal per se, it is also a means for optimal decision making, again at the crossroads of ICT and Mathematics.
Finally, the bandwidth of interaction between human users and machine-hosted data must increase, requiring significant advances in two regards. On the one hand, visual and non-visual rendering of massive data must be improved to better support human expertise in human-machine interaction and human-human communication. On the other hand, users’ expectations, profiles and capabilities must be modeled to support the social intelligence of the machine.
Task 1: Scalable, expressive and secure tools for large-scale data
Today’s computers store massive and growing volumes of data, spanning storage and computing clusters and more recently thousands of computers in the cloud. This growth affects economic actors but also individuals, whose personal data becomes scattered and shared over multiple-ownership, remote-storage systems. This topic is at the core of the “Computing in the Cloud” action line of ICTLabs, in which DigiWorlds members participate.
An important application domain for large-scale data management techniques is science. In particular, in life sciences, Bioinformatics relies on biological data in order to model, simulate and predict biological processes at several scales: molecular, intermolecular, cellular, with an important focus on both structural biology and systems biology, in close collaboration with biologists.
Scalable data management in the cloud: The advent of cloud infrastructures and the MapReduce style of parallel programming has fostered higher-level frameworks that provide program abstractions, such as tuples and operators, to be compiled into MapReduce [7, 15, 50]. These frameworks, like the commercial cloud database servers, do not reach the expressivity of complex structured data. We will devise cloud-based platforms for efficient and expressive management of Web data, taking advantage of cloud elasticity and reliability. This requires models for multi-layered, indexed cloud-based stores, as well as efficient algorithms for processing queries and updates.
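The MapReduce programming abstraction referred to above can be sketched in-process: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group. Word count is the canonical example; in a real framework each phase would run in parallel across the cluster:

```python
from collections import defaultdict
from itertools import chain

def map_phase(doc):
    # map: emit one (word, 1) pair per token
    return [(word.lower(), 1) for word in doc.split()]

def shuffle(pairs):
    # shuffle: group all values by key, as the framework would do
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: aggregate each key's list of values
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = chain.from_iterable(map_phase(d) for d in docs)
counts = reduce_phase(shuffle(pairs))
```

Higher-level frameworks compile declarative operators (joins, filters, aggregates) into pipelines of exactly such phases.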
Data privacy: Human activity (health, vehicle movement, banking, e-commerce, etc.) is tracked by more and more digital data sources and ends up on remote storage servers. This results in unprecedented threats to privacy. Existing approaches to safeguarding data privacy assume that servers are fully trusted, which does not always hold. We plan to exploit new devices, such as SIM cards or secure USB sticks, that combine portability, secure processing and massive storage (terabyte NAND Flash chips). The goal is to embed in these devices software components capable of acquiring, storing and managing personal data, and to enable user control over the sharing conditions related to their data, with tangible guarantees about their enforcement.
Multi-scale life science data: While the principles of Systems Biology are now clear, the computational tools necessary for completing the required steps are still lacking. Notably, we are involved in an ambitious project whose aim is to build G-Protein Coupled Receptor signaling networks. For that purpose, we have to design an innovative pipeline of computational methods encompassing all the tasks from the numerous heterogeneous data sets to a predictive dynamic model. We also plan to investigate methods for using the model to generate new hypotheses, leading to the discovery of new therapeutic opportunities.
Task 2: Making sense of complex, heterogeneous data
Key people: Serge Abiteboul (INRIA, Saclay 2.2.8).
As more and more data gets produced, precious information lies hidden within that data and efficient tools are required to extract it. A first important source is the Web, the world’s premier information repository. Exploiting the Web requires formal models for its data, interactions, and evolution; highly dynamic systems such as social networks, wikis or blogs change at a tremendous pace. Extracted data comes with uncertainty and context, such as its source and trust, which must be preserved for further processing. Another challenge is crossing the semantic gaps between independently conceived, heterogeneous data sources. All these themes are at the core of projects that we coordinate, such as the ERC Advanced Grant WebDam (http://webdam.inria.fr) on Web data management, and the EIT ICTLabs “Data Bridges” on data integration for digital cities.
Another important class of data streams originates from sensors that hide a topological or geometrical structure, which computational geometry attempts to identify, as in the DARPA project “Topological Data Analysis”. Massive and multidimensional data often appear to be concentrated around low-dimensional geometric structures. The inference and analysis of these structures is a fundamental problem that has led to the development of new topological and geometric approaches to data analysis [21, 31].
Models for Web data: The formal analysis of Web models based on XML languages [183, 256] and distributed data exchange models [2, 27, 345] is an active area. Our goal is to develop a formal framework for describing complex and flexible interacting Web applications featuring data exchange, sharing, integration, querying and updating.
Web knowledge extraction and integration: Within the Web corpus lies knowledge in many forms, which our research seeks to extract and to integrate. First, common-sense knowledge is extracted as large ontologies [66, 57], tables, class attributes, propositions, descriptions of situations, etc. An example from the medical domain is the i2b2/VA 2010 challenge [71, 28], focused on the extraction of medical concepts and relationships. Second, public sentiment on the Web can be gathered from Web documents, classifying them according to their polarity, estimating the sentiment and identifying the opinion target and the holder [63, 77]. Third, the dynamics and structure of the Web (and in particular of social networks) are rich with information.
Our goal is to build unsupervised, scalable systems for extracting semantics from the Web by exploiting linguistic information resulting, e.g., from syntactic parsing, linked data and domain-specific knowledge. We envision long-term, incremental processes in which the acquired knowledge and the extraction process interact closely in a bootstrapping fashion. We seek to understand the structure and dynamics of a Web 2.0 information collection (with social networks as use cases) and to learn how to predict future properties and behavior based on its past evolution. Web-extracted data and knowledge is heterogeneous and thus must be reconciled [29, 49, 73, 78]; mappings among sources must be automatically maintained as new facts are discovered and new connections understood.
Scalable techniques and uncertainty: The available techniques for handling Web-extracted data involve provenance management and probabilistic databases. Complexity results show that these techniques scale poorly with the amount of data. We plan to introduce scalable techniques for managing lineage and uncertainty in data, relying on approximation techniques and algebraic simplification of provenance, while allowing a wide range of manipulated operations, from traditional database queries to cloud-based MapReduce workflows.
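As a toy example of probability computation over provenance, the sketch below evaluates a read-once provenance expression (an OR of ANDs over independently present tuples); the independence and read-once assumptions are simplifications that real probabilistic databases must go beyond, which is exactly where the complexity barriers arise:

```python
def query_probability(clauses, probs):
    """Probability that a query answer holds, given its provenance as
    an OR of ANDs over independently present tuples. Assumes each tuple
    appears in at most one clause (a read-once expression), so the
    clauses are themselves independent events."""
    p_no_clause = 1.0
    for clause in clauses:
        p_clause = 1.0
        for t in clause:
            p_clause *= probs[t]       # AND of independent tuples
        p_no_clause *= 1.0 - p_clause  # OR via the complement
    return 1.0 - p_no_clause

# answer derivable from (t1 AND t2) OR t3, with hypothetical tuple probabilities
p = query_probability([{"t1", "t2"}, {"t3"}],
                      {"t1": 0.5, "t2": 0.4, "t3": 0.3})
```

When tuples are shared across clauses, exact evaluation becomes #P-hard in general, motivating the approximation and algebraic-simplification techniques mentioned above.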
When extracting geometrical structures, most existing topological methods are challenged by data corrupted by noise and outliers. Building on our recent works, we intend to develop new effective methods that take into account the statistical nature of data.
Task 3: Machine learning: meta-learning and multi-task learning
Key people: François Yvon, (UPSud, LIMSI 2.2.2).
Collaborations: Lab. Maths Orsay (P. Massart, P. Pansu); CMLA-Cachan (N. Vayatis), AgroParisTech (A. Cornuéjols), Laboratoire Accélérateur Linéaire, UPSud (B. Kégl).
Knowledge is perhaps the scarcest resource in the information age. Through pervasive computer technologies, however, traces of knowledge abound within, e.g., databases, user logs or customer/patient records. Machine learning (ML) studies the principled extraction of knowledge (models, hypotheses, classifiers, regularities, anomalies) from whatever digital evidence is available, guided by expert prior or posterior background knowledge. ML thereby defines a new programming paradigm suited to pattern recognition. For tasks such as information retrieval, machine translation or fraud detection, to name a few, efficient programs can hardly be specified in the sense of the SciLex action line; they can instead be built from the available evidence. DigiWorlds has a strong research record in the area of principled adaptive systems design, specifically in information retrieval and language technologies [354, 288, 368], medical and other imagery, autonomic systems [381, 359] and games [317, 296]. Notably, both LRI and the Maths Lab participate in the renowned PASCAL Network of Excellence in Machine Learning (Pattern Analysis, Statistical Modelling and Computational Learning, 2003-2013).
Two ML applications are particularly relevant to the Saclay scientific ecosystem. The first one, e-Science, is about exploiting the huge amounts of data gathered by experimental sciences (including ML itself) to extract conjectures and models consistent with the available data, and to design the new experiments best able to confirm or refute these conjectures. The second application aims at simplified models. The goal is to sustain new design processes on top of the massive software systems that encapsulate the know-how accumulated over decades by many companies and research labs. Whereas such systems are usually computationally heavy and demand a high level of expertise to be used, e.g., in Numerical Engineering, ML can provide an approximation of these systems at a fraction of their computational cost. Such an approximation can be used to speed up the design cycle by one or several orders of magnitude. Both applications raise two core challenges related to algorithm control and the use of prior knowledge:
Meta-Learning. Fundamental research is required to address the main ML bottleneck, dating back to the late 1990s and referred to as Meta-Learning. While ML algorithms are efficient and versatile, they often involve quite a few hyper-parameters, and their performance critically depends on carefully tuning them. Self-adjusting these hyper-parameters is required to “cross the chasm” and let ML algorithms be used outside (computer) research labs. Benefitting from the critical mass of ML applications on the Campus de Saclay, our goal is to go beyond the mere empirical comparison of algorithms on problems and build descriptive features for ML problems, paralleling the principled approach of the Constraint Programming community. An effective description of ML problems would amount to a Meta-Learning breakthrough.
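The hyper-parameter tuning problem at the heart of Meta-Learning can be sketched with a toy self-tuning loop: a one-dimensional ridge model whose regularization strength is selected on held-out data. The model, the grid and the data are illustrative assumptions, not a proposed method:

```python
def fit_ridge_slope(xs, ys, lam):
    # 1-D ridge regression through the origin: w = sum(x*y) / (sum(x*x) + lam)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def tune_lambda(train, valid, grid):
    """Pick the regularization strength minimising validation error,
    the simplest instance of a self-adjustment loop."""
    def valid_error(lam):
        xs, ys = zip(*train)
        w = fit_ridge_slope(xs, ys, lam)
        return sum((y - w * x) ** 2 for x, y in valid)
    return min(grid, key=valid_error)

# noiseless data y = 2x: no regularization should win here
best_lam = tune_lambda([(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)],
                       [(4.0, 8.0)],
                       [0.0, 1.0, 10.0])
```

Meta-Learning aims to replace such per-problem search with descriptive features that predict good hyper-parameters across problems.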
Multi-task Learning. A second fundamental challenge is the representation of the problem at hand, e.g., the design of relevant descriptive features. Feature selection and construction are acknowledged to be critical to the success of ML. The proposed approach will reconsider feature selection and construction in the perspective of multi-task learning (when several learning goals are defined w.r.t. the same data and/or the same representation). The multi-task perspective should drive the sought representation toward a higher and sounder level of abstraction, away from ad hoc features.
Task 4: Distributed decision making: partially observable dynamic games and multi-objective policy optimization
Key people: Marc Schoenauer (INRIA, Saclay 2.2.8).
One of the primary goals of information and knowledge is to support optimal decision making. While optimization is pervasive in sciences, from Mathematics to Physics to Computer Science, DigiWorlds will specifically focus on distributed decision making and optimization (DDMO). DigiWorlds is acknowledged as one of the leading players worldwide in the fields of MINLP (LIX) [135, 101, 173], bio-inspired algorithms for optimization (INRIA, CEA) [266, 276, 265], and games (L2S, LRI) [317, 341]. Notably, strategic games epitomize optimal decision making under uncertainty; as such, their handling is relevant to problems such as optimal energy policy. Interestingly, game theory and optimization are also at the core of distributed networks (see ComEx), albeit with a complementary perspective. One strategic research initiative of DigiWorlds is to investigate the transfer and sharing of fundamental results and algorithms developed in both areas.
DigiWorlds is specifically interested in three challenges raised by DDMO. Distributed optimization, often motivated by computational (large-scale) or physical (swarm robotics, sensor networks) settings, faces a critical trade-off between the quality of the overall solution and the amount of communication between computational agents or nodes; the first challenge is how to control this trade-off depending on the agents’ autonomy and the criticality of the problem.
The decision-making process involves partially observable information: on the one hand, each agent may have its own agenda; on the other hand, the agent’s environment is subject to constant stochastic fluctuations, e.g., energy policies must consider climatic and financial variables. Optimal policy finding thus raises challenges such as estimating its computational complexity, or the “price of anarchy”, i.e., the loss of performance incurred compared to a global decision-making process.
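The “price of anarchy” can be made concrete with Pigou's classic two-link routing example, in which selfish behavior costs 4/3 of the social optimum:

```python
def pigou_price_of_anarchy():
    """Pigou's two-link example: one unit of traffic chooses between a
    link with constant latency 1 and a link whose latency equals the
    fraction x of traffic using it. Selfishly, everyone takes the
    variable link (its latency never exceeds 1), for a total cost of 1;
    the social optimum splits the traffic half and half."""
    selfish_cost = 1.0 * 1.0                # all traffic at latency 1
    x = 0.5                                 # minimises x*x + (1 - x)*1
    optimal_cost = x * x + (1.0 - x) * 1.0  # = 0.75
    return selfish_cost / optimal_cost      # = 4/3
```

Bounding such ratios for the energy-policy and network settings above is one of the quantitative questions DDMO raises.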
Finally, the agents themselves can evolve along the process, e.g. by adapting to the game. The learning dynamics and its impact on the volatility of the system calls for new developments.
These challenges will be considered through two milestones, a fundamental research one and an application-driven one:
Partially Observable Dynamic Games. Partial observation is pervasive in real-world applications, and some games (dark chess, phantom Go, kriegspiel, which originated in military operational research training) provide extensive testbeds. Among the fundamental issues involved in partially observable games are the un/decidability of optimal strategies, the computational complexity thereof, and building/updating probabilistic belief states. Tackling highly visible challenges, e.g., phantom Go or poker, can be viewed as a major asset for the attractiveness of Université Paris-Saclay.
Multi-objective policy optimization; applications to sustainable development and power management. Energy management, and more generally sustainable development, is being deeply revisited due to the distributed nature of sources, e.g., photovoltaic units, and sensors, raising a huge number of interdependent local decision-making problems. New algorithmic approaches have been proposed in the last decade to yield optimal sequences of local decisions with delayed rewards (early decisions can significantly affect the policy outcome hundreds of moves later); strategic games constitute an ideal workbench in this respect. New developments are needed to achieve multi-objective policy optimization, at the crossroads of policy optimization and multi-objective optimization, enabling policy makers to weigh short- and long-term costs and benefits.
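A basic building block of multi-objective optimization is extracting the Pareto front of non-dominated solutions, sketched below for minimization on every objective (the sample points are, of course, invented):

```python
def pareto_front(points):
    """Return the non-dominated points, minimising every objective:
    p is discarded iff some other point q is at least as good on all
    objectives while differing from p (i.e., strictly better somewhere)."""
    def dominated(p):
        return any(q != p and all(q[i] <= p[i] for i in range(len(p)))
                   for q in points)
    return [p for p in points if not dominated(p)]

# two objectives to minimise, e.g. short-term cost vs long-term cost
front = pareto_front([(1, 5), (2, 2), (5, 1), (3, 3), (4, 4)])
```

Multi-objective policy optimization must maintain such fronts over whole decision sequences rather than single decisions, which is where the new developments are needed.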
Task 5: Interaction and Visualization
While it is now possible to manage very large and complex data sets with computers, human users often have limited abilities to understand such data. However, many problems, such as testing hypotheses, detecting unexpected patterns or supporting creativity, require the combination of human intelligence and computing power through rich interaction (between humans and computers) and collaboration (among humans using computers) [34, 16]. As demonstrated by initiatives such as the NVAC Center or the FODAVA program in the US, novel tools are necessary to address these challenges by combining human-computer interaction, visualization and collaboration.
DigiWorlds gathers a critical mass of researchers in data visualization, virtual reality and human-computer interaction who recently received funding for the 22 million euro DIGISCOPE “equipment of excellence” project. This project will provide the partners with a unique network of high-performance interactive and collaborative visualization rooms to tackle the above issues.
Interactive Data Visualization. The first milestone will be to develop new methods for visualizing and interacting with complex data, taking advantage of the capabilities of the DIGISCOPE platform. This includes the visualization of very large networks, including social networks, and of 3D models of objects and phenomena on advanced display surfaces. It also involves the development of interaction techniques based on large touch surfaces, free-hand gestures and multimodal interaction to manipulate this data. Applications include assisting scientists of all disciplines in their data analysis tasks, and supporting the design, engineering and manufacturing of complex products.
Collaborative Data Management. The second milestone will be to facilitate collaborative tasks where multiple users, either locally or remotely, can share data and pool their expertise to solve complex, multidisciplinary tasks [19, 10]. This will take advantage of the communication facilities of DIGISCOPE to support remote collaboration while visualizing large data sets. In particular, we will extend traditional approaches to telepresence by supporting the collaborative interaction of multiple remote users with shared data. Applications include those in the previous milestone, e.g., [47, 55], as well as decision support in distributed environments, in particular in crisis situations, and distance teaching and training.
1.3 Impact on training
In France, education in computer science is under-developed in many curricula: most students enter engineering schools two years after high school without any training in computer science except as end-users of various computer tools. Not all engineering schools offer well-identified curricula in computer science. For example, out of 350 students graduating each year from Ecole Centrale, 60 have a major in ICT and only 6 continue with a PhD; at Telecom ParisTech, out of 300 students graduating each year, 120 graduate in ICT and about 12 continue with a PhD. This is in sharp contrast with the dominance of ICT in many tasks and processes of our professional and personal lives.
Our goal is to increase the number of good students choosing computer science as their major. To that end, DigiWorlds will devote a large part of its resources to education. DigiWorlds will form an education board in charge of preparing, in collaboration with partner institutions, new programs for the next campaign of diploma assessment (to be ready in 2013, starting in 2015 or earlier for pilot operations). We will develop actions and curricula in computer science and telecommunications at the pre-college, undergraduate, graduate and PhD level along the following lines:
Raising awareness for Computer Science in high school.
The partners will leverage and coordinate their actions at events such as “Fête de la Science” to better explain the challenges and achievements of ICT to a wide audience and attract more high-school students into Computer Science. The topics addressed by DigiWorlds in programming, communications and data management raise deep scientific questions but also relate to concerns that can easily be understood by the general public about security, knowledge retrieval or social networks. We will recruit PhD students to contribute to this program as part of the “non-research” activity of their contract.
In 2012, the senior year in French high schools will offer computer science as an elective. DigiWorlds researchers are already contributing to the training of high school teachers, and will continue to do so, providing another channel to reach out to future students.
A selective undergraduate curriculum (L1-L2) with a focus on Computer Science.
Today, college freshmen cannot take a major in Computer Science during their first two years. This creates a gap for students who are interested in these topics, especially given the new elective course in high school described above. We plan to create at U. Paris-Sud a selective curriculum with a major in Computer Science, building on the existing tracks in Physics and Chemistry. The goal is to start in 2014 and offer a program that not only leads to a University undergraduate degree (at level L3), but also to the selective exams to enter Grandes Ecoles. This program will be designed to mesh with the international curriculum described below (L3-M1) and with the common curricula between university and the Grandes Ecoles on Campus de Saclay.
An international curriculum (L3-M1) in fundamentals of Computer Science.
Several university Computer Science programs already exist in years 3 and 4. Our goal at these levels is to attract an additional 50 to 100 students from the partner Grandes Ecoles and from other French and foreign universities. Recruitment will be selective, and the program will offer scholarships to the best (mainly foreign) students. Telecom ParisTech, U. Paris-Sud and ENS Cachan have already agreed to work together to launch such a program by 2015 while the other partners will study the integration of this track into their own curricula. This coordination will establish a common basis of knowledge that will facilitate the mobility of students among institutions.
The core courses (two to three half-days a week) will be taught in English and will focus on the foundations of computing. The rest of the curriculum will be specific to each partner school/university. The curriculum will include research internships every year, and conferences and mini-courses given by invited professors and industry partners. After this program, students will be able to continue into a research or professional Masters at Université Paris-Saclay or elsewhere in France and abroad.
International Masters (M2) in the main areas of the labex.
At the Masters level we plan to leverage the success of the existing programs in the areas covered by DigiWorlds: the joint Parisian Research Master in Computer Science (MPRI) already matches the topics of SciLex; U. Paris-Sud will offer, starting in 2012, special tracks as part of the international Masters program of the KIC ICTLabs in two areas of the DataSense action line: Distributed Systems & Services and Human-Computer Interaction. For the ComEx action line, the partner institutions offer several successful programs in different locations, given that the specialized schools in Telecommunication will not move to the Saclay area before 2015. We will work to ensure better consistency among these programs and improve their visibility and attractiveness, and will study possible further integration after the move. Finally, we will continue to contribute to multi-disciplinary Masters such as the Masters in BioInformatics and Biostatistics (BIBS) and the Masters in Complex Systems Engineering (COMASIC).
Currently, the PhD students associated with DigiWorlds are enrolled in 7 different doctoral schools, some disciplinary, some associated with a single institution. Since there are approximately 1,000 PhD students in ICT on the Campus de Saclay, we plan to reorganize these into one or two disciplinary schools. A proposal will be put forward in 2013 in order to be operational by 2015.
As described in section 1.4, tracks will be developed to increase the innovation and entrepreneurship awareness of PhD students. Finally, DigiWorlds will organize an annual summer school, which will allow PhD students to keep up with advances and results in the areas of DigiWorlds as well as progress and new issues regarding societal challenges.
1.4 Socio-economic impact
The coordinating partner, FCS, will provide communication tools to each Labex project as well as support for national and international events. Particular attention will be devoted to the disclosure of project content and results to the media and the general public. As mentioned in section 1.3, special care will be taken to popularize computer science.
Technology transfer. The scientific topics of DigiWorlds correspond to areas where industry expects progress and is keen to undertake collaborative activities with academic teams. Several companies have already expressed their interest in the project (see their letters in the appendix).
In order to organize partnerships with industry, DigiWorlds can rely on a very active innovation network in ICT in the Saclay area: the Systematic cluster, bringing together more than 500 players (140 large companies, 310 SMEs-SMIs and 90 research institutions) working in the greater Paris area on software-intensive systems, and the “Carnot Institutes13” of CEA, INRIA and Institut Telecom. Moreover, DigiWorlds partners are involved in the French node of ICTLabs together with industry with the goal of boosting innovation in ICT, and they form the academic basis of the SystemX IRT, which will develop new kinds of collaborative projects with industry. They will also be founders of the SATT, which will develop a Technology Maturation program at the level of the Université Paris-Saclay, extending a process initiated in ICT to select promising technologies and support additional development to demonstrate their commercial potential.
In addition, recognizing that one of the greatest challenges is to significantly increase the number of start-ups, DigiWorlds will develop specific actions in its education programs:
Education modules in innovation and entrepreneurship in ICT, including awareness sessions on IPR and the specific features of software, will be developed in collaboration with the “PEEPS” project and with support from ICTLabs. The emphasis will be on learning by practice. Junior Entreprises14 already exist in several schools and universities. DigiWorlds will help them coordinate their activities in ICT, and DigiWorlds members will be committed to proposing activities to them.
Doctoral training center.
This center, similar to those in the UK, is currently planned as part of the ICTLabs Doctoral School and will be designed in collaboration with the SystemX IRT. It will be a dedicated space where PhD students working in an industrial context can meet on a regular basis and be tutored by industrial and academic researchers. In the long run, it will help companies, especially SMEs, identify the right researchers for developing PhD projects that match their R&D needs.