
Project 39: High-efficiency AI and ML distributed systems at Big-Learning scales

Suitable Majors

Computer Science

Research Area

Machine Learning, distributed systems

Internship Description

The position will be in the context of a project whose goal is to make distributed AI and ML more efficient. Enabled by the rise of big data, today's AI and ML solutions are achieving remarkable success in many fields thanks to their ability to learn complex models with millions to billions of parameters. However, these solutions are expensive: running AI and ML algorithms at large scales requires clusters with tens or hundreds of machines to meet the high computation and communication demands of these algorithms. Many systems exist for performing AI and ML tasks in a distributed environment, yet performance requirements and input data sizes keep growing. Reaching the next level of efficiency requires addressing key challenges such as network communication bottlenecks and uneven cluster performance. Moreover, the fidelity of ML models is very sensitive to many hyperparameters, so producing accurate models requires tuning these hyperparameters well. This, in turn, means exploring a large space of possible configurations, which must be done efficiently.
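To make the hyperparameter-tuning problem concrete, below is a minimal random-search sketch in Python. It is illustrative only: the search space, the budget, and the train_and_score function are hypothetical placeholders, not the project's actual search strategy (improving on such baselines is precisely one of the goals of the work).

    import random

    # Toy hyperparameter space; real spaces are larger and often continuous.
    SEARCH_SPACE = {
        "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
        "batch_size": [64, 128, 256, 512],
        "momentum": [0.0, 0.9, 0.99],
    }

    def sample_config():
        # Draw one candidate configuration uniformly at random.
        return {name: random.choice(values) for name, values in SEARCH_SPACE.items()}

    def random_search(train_and_score, budget=20):
        # Each evaluation is a full (expensive) training run, so the budget
        # is the resource that a smarter search algorithm would spend better.
        best_config, best_score = None, float("-inf")
        for _ in range(budget):
            config = sample_config()
            score = train_and_score(config)  # hypothetical: trains a model, returns validation score
            if score > best_score:
                best_config, best_score = config, score
        return best_config, best_score

In practice, smarter strategies (for example, successive halving or Bayesian optimization) aim to find good configurations with far fewer full training runs than plain random search.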

The internship work will generally fit into the challenges currently faced in the project, whether that is investigating the trade-offs between reduced communication and model precision or identifying bottlenecks and developing new algorithms to overcome them.
Ongoing directions are (1) exploring the use of new networking hardware and architectures to make network-based communication more efficient and (2) designing new search algorithms that make better use of resources and determine optimized hyperparameters more efficiently.
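As a concrete illustration of the communication/precision trade-off, below is a Python sketch of top-k gradient sparsification, one well-known compression technique (chosen here for illustration; it is not necessarily the mechanism used in this project). Only the k largest-magnitude gradient entries are sent, and the receiver rebuilds a dense, approximate gradient.

    import numpy as np

    def sparsify_top_k(gradient, ratio=0.01):
        # Keep only the `ratio` fraction of entries with largest magnitude.
        flat = gradient.ravel()
        k = max(1, int(flat.size * ratio))
        idx = np.argpartition(np.abs(flat), -k)[-k:]
        return idx, flat[idx]  # roughly `ratio` of the dense gradient's volume

    def densify(idx, values, shape):
        # Receiver side: rebuild a dense gradient with zeros elsewhere.
        flat = np.zeros(int(np.prod(shape)), dtype=values.dtype)
        flat[idx] = values
        return flat.reshape(shape)

Transmitting only the (idx, values) pairs cuts communication roughly in proportion to the ratio, at the cost of gradient precision; quantifying how this affects model accuracy and training time is exactly the kind of trade-off the internship would study.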

We build real systems. That is, we build prototypes that directly improve the lives of real users, and we learn general principles and valuable lessons about what works in practice. Why? The big problem is that there is really no theory of building systems. These artifacts appear to be far too complex for a rigorous, constructive approach to dealing with them. In some sense, it is like dealing with biological organisms: they are among the most complex systems that exist, and while we understand how tiny disconnected pieces operate on their own, we know little about how they interact. In the absence of a constructive theory, the typical systems person will formulate a problem, get an idea, build a prototype, measure and analyze it, tweak it, go back to building, measure again, and repeat these steps over time.

Prerequisites

Candidates should be motivated to work on research-oriented problems in a fast-paced and tight-knit team. They should have a strong computing or engineering background, with a solid grounding in algorithms, machine learning, distributed systems, and networking. Ideally, they will have experience in building and working with large software systems and tools, and proven knowledge of C++/Java.

Further background readings:
https://sands.kaust.edu.sa/daiet/
https://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-159.html
http://research.baidu.com/bringing-hpc-techniques-deep-learning/

Deliverables/Expectations

Students are expected to study the existing solutions and, with the assistance of the supervisor, devise theoretically sound approaches to improve their performance. Students will also be able to collaborate with other team members and to evaluate the resulting mechanisms on real-world datasets on a state-of-the-art testbed. These results, if completed, are considered novel and can result in a publication (with the agreement of the supervisor).

Other Comments

Internship dates: 19 May to 26 July, 23 June to 29 August, or 7 July to 14 September

Division

Computer, Electrical and Mathematical Sciences and Engineering

Faculty Name

Marco Canini