Computing Reviews

FfDL: a flexible multi-tenant deep learning platform
Jayaram K., Muthusamy V., Dube P., Ishakian V., Wang C., Herta B., Boag S., Arroyo D., Tantawi A., Verma A., Pollok F., Khalaf R. Middleware 2019 (Proceedings of the 20th International Middleware Conference, Davis, CA, Dec 9-13, 2019), 82-95, 2019. Type: Proceedings
Date Reviewed: 02/03/21

Cloud computing is becoming the preferred choice in many domains because of its availability, flexibility, and scalability, among other reasons. However, adapting it to specific domains brings challenges. For deep learning (DL), these include maintaining and executing learning jobs at large scale in a cloud environment.

Precisely written, this paper covers a flexible DL platform used at IBM, which has also been open sourced to the community. The authors clearly outline the challenges of installation, configuration, and fault tolerance for DL infrastructure. At the same time, they highlight key shortcomings in managing workloads and the need for a middleware platform specialized “to support the distributed training of DL models in the cloud.”

This paper captures the architectural design and components of FfDL, which is “a cloud-hosted and multi-tenant dependable distributed DL platform used to train DL models at IBM.” A study of the performance overhead of running in a cloud-hosted environment versus on bare-metal hardware, together with a failure analysis, reveals details of various components of the running platform.

This study should interest those involved in setting up and enhancing DL infrastructure. It clearly sets the foundation for further study and research on making specialized domains more robust, reliable, and distributed, with high performance, when hosted in a cloud environment.

Reviewer:  Brijendra Singh Review #: CR147177 (2107-0185)
