Limitations of KD-based Device Heterogeneous FL
Knowledge distillation (KD) in device-heterogeneous FL facilitates knowledge transfer by using the locally updated client models of different device prototypes (collectively termed ensembles) as teachers that distill their knowledge into each prototype's server-side student model over an unlabeled public dataset.
While existing works primarily focus on same-size devices with similar capabilities, they often overlook the significant variation in device prototypes, ranging from small IoT devices to large workstations.
As shown in this figure, the logits of the different-sized prototypes' ensembles are averaged and used as a single distillation target to transfer knowledge to each server-side student model.
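For concreteness, below is a minimal PyTorch-style sketch of this vanilla averaged-logit baseline; the function and variable names (vanilla_ensemble_distillation, public_loader, the (x, _) batch format) are illustrative assumptions rather than code from any particular implementation.

import torch
import torch.nn.functional as F

def vanilla_ensemble_distillation(student_models, client_models, public_loader,
                                  temperature=3.0, lr=1e-3, device="cpu"):
    # Distill a single averaged-logit target into every server-side student.
    optimizers = [torch.optim.SGD(s.parameters(), lr=lr) for s in student_models]
    for x, _ in public_loader:  # unlabeled public data; labels are ignored
        x = x.to(device)
        with torch.no_grad():
            # One-size-fits-all target: average logits over all clients,
            # regardless of each prototype's capacity or information quality.
            teacher_logits = torch.stack([c(x) for c in client_models]).mean(dim=0)
            target = F.softmax(teacher_logits / temperature, dim=-1)
        for student, opt in zip(student_models, optimizers):
            log_probs = F.log_softmax(student(x) / temperature, dim=-1)
            loss = F.kl_div(log_probs, target, reduction="batchmean") * temperature ** 2
            opt.zero_grad()
            loss.backward()
            opt.step()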
Unfortunately, existing methods struggle to establish effective knowledge transfer in these challenging, real-world device-heterogeneous settings, primarily due to two reasons:
1. Existing methods often disregard the individual strengths and information quality of each device prototype's ensembles and integrate their logits into a single distillation target. This dilutes the richer, more informative logits from larger, more capable devices with the less informative logits from smaller, less capable ones.
2. Additionally, these methods use this single integrated distillation target to transfer knowledge to all of the different-sized student models. This one-size-fits-all approach fails to provide customized knowledge integration based on the unique learning capacity of each student and the specific helpfulness of each device prototype's ensembles.
TAKFL Overview
This figure provides an overview of the TAKFL approach, which transfers the knowledge of different-sized device prototypes' ensembles into each server student model. The scenario consists of three device prototypes: Small (S), Medium (M), and Large (L). TAKFL treats knowledge transfer from each prototype’s ensemble as a separate task and distills them independently. This ensures that the unique knowledge and contributions of each prototype’s ensembles are effectively distilled, avoiding dilution, information loss, and interference from other prototypes’ ensembles. Additionally, a KD-based self-regularization technique is introduced to guide the student through the noisy and unsupervised ensemble distillation process. Finally, the heterogeneously distilled knowledge is strategically integrated using an adaptive task arithmetic operation, allowing for customized knowledge integration based on each student prototype’s specific needs.
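To make this recipe concrete, here is a minimal PyTorch-style sketch of the three steps described above, assuming each prototype's ensemble is first distilled into a separate copy of the student and the resulting task vectors are then merged. The helper names, the self-regularization weight beta, the merging coefficients lambdas, and the (x, _) batch format are illustrative placeholders, not the paper's exact implementation.

import copy
import torch
import torch.nn.functional as F

def distill_from_prototype(student, ensemble, public_loader, beta=0.1,
                           temperature=3.0, lr=1e-3, device="cpu"):
    # Treat one prototype's ensemble as a separate distillation task:
    # distill it into a fresh copy of the student, with a KD-based
    # self-regularization term that keeps the copy close to the original
    # student's predictions during the noisy, unsupervised process.
    task_student = copy.deepcopy(student)
    opt = torch.optim.SGD(task_student.parameters(), lr=lr)
    for x, _ in public_loader:  # unlabeled public data; labels are ignored
        x = x.to(device)
        with torch.no_grad():
            ens_target = F.softmax(
                torch.stack([m(x) for m in ensemble]).mean(dim=0) / temperature, dim=-1)
            self_target = F.softmax(student(x) / temperature, dim=-1)
        log_probs = F.log_softmax(task_student(x) / temperature, dim=-1)
        kd_loss = F.kl_div(log_probs, ens_target, reduction="batchmean")
        reg_loss = F.kl_div(log_probs, self_target, reduction="batchmean")
        loss = (kd_loss + beta * reg_loss) * temperature ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
    return task_student

def takfl_server_update(student, ensembles_by_prototype, public_loader, lambdas):
    # Merge the separately distilled students via task arithmetic:
    # theta <- theta + sum_k lambda_k * (theta_k - theta),
    # where the lambdas are chosen per student prototype (adaptive integration).
    base = {n: p.detach().clone() for n, p in student.named_parameters()}
    task_vectors = []
    for ensemble in ensembles_by_prototype:
        distilled = distill_from_prototype(student, ensemble, public_loader)
        task_vectors.append({n: p.detach() - base[n]
                             for n, p in distilled.named_parameters()})
    with torch.no_grad():
        for name, param in student.named_parameters():
            for lam, tv in zip(lambdas, task_vectors):
                param.add_(lam * tv[name])
    return student

Keeping each prototype's distillation as its own task is what allows the final merge to weight each ensemble's contribution according to the student's own capacity and needs.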
Theoretical Results
We present the first theoretical framework, based on the concept of capacity dimension, for understanding the effectiveness of KD in device-heterogeneous FL.
Our theoretical results demonstrate that vanilla KD makes inefficient and inaccurate use of model capacity, causing information loss and suboptimal knowledge transfer (Remark 1 in the paper). Moreover, they show that when device prototypes have different capacities, TAKFL transfers the most informative knowledge to each prototype's student model according to its own intrinsic capacity, leading to effective knowledge transfer.
Acknowledgement
This research was partially supported by a grant from Cisco Systems, Inc. We also gratefully acknowledge the use of the computational infrastructure provided by the OP VVV funded project CZ.02.1.01/0.0/0.0/16_019/0000765, “Research Center for Informatics,” which enabled us to conduct the experiments presented in this work.
BibTeX
@misc{morafah2024diversedeviceheterogeneousfederated,
      title={Towards Diverse Device Heterogeneous Federated Learning via Task Arithmetic Knowledge Integration},
      author={Mahdi Morafah and Vyacheslav Kungurtsev and Hojin Chang and Chen Chen and Bill Lin},
      year={2024},
      eprint={2409.18461},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2409.18461},
}