Computing Reviews
FfDL: a flexible multi-tenant deep learning platform
Jayaram K., Muthusamy V., Dube P., Ishakian V., Wang C., Herta B., Boag S., Arroyo D., Tantawi A., Verma A., Pollok F., Khalaf R. Middleware 2019 (Proceedings of the 20th International Middleware Conference, Davis, CA, Dec 9-13, 2019), 82-95, 2019. Type: Proceedings
Date Reviewed: Feb 3 2021

Cloud computing is becoming the preferred choice in many domains because of its availability, flexibility, and scalability, among other reasons. However, it also brings challenges, notably adapting cloud computing to specific domains such as deep learning (DL), where learning jobs must be maintained and executed at large scale in the cloud environment.

Precisely written, this paper covers a flexible DL platform used at IBM, which has also been open sourced to the community. The authors clearly outline the challenges of installing, configuring, and providing fault tolerance for DL infrastructure. At the same time, they highlight key shortcomings in managing workloads and the need for a middleware platform specialized “to support the distributed training of DL models in the cloud.”

The paper presents the architectural design and components of FfDL, which is “a cloud-hosted and multi-tenant dependable distributed DL platform used to train DL models at IBM.” A study of the performance overhead of running in a cloud-hosted environment versus on bare-metal hardware, together with a failure analysis, reveals details of the various components of the running platform.

This study should interest those involved in setting up and enhancing DL infrastructure. It lays a clear foundation for further research on making cloud-hosted platforms for specialized domains more robust, reliable, and distributed while delivering high performance.

Reviewer:  Brijendra Singh Review #: CR147177 (2107-0185)
 
Learning (I.2.6)
Cloud Computing (C.2.4 ...)
Distributed Architectures (C.1.4 ...)
General (H.0)
General (C.0)
Information Systems Applications (H.4)
 

Reproduction in whole or in part without permission is prohibited.   Copyright 1999-2024 ThinkLoud®