Title:
COLLABORATIVE TRAINING WITH PARALLEL OPERATIONS
Document Type and Number:
WIPO Patent Application WO/2024/102137
Kind Code:
A1
Abstract:
Neural networks are collaboratively trained with parallel operations by performing operations in a plurality of consecutive time periods including a first plurality of consecutive time periods during which the server receives a set of activations, applies the server partition to a set of activations, applies a set of output instances to a loss function, and computes a set of gradient vectors, and a second plurality of consecutive time periods during which the server transmits a set of gradient vectors.

Inventors:
ZHANG ZIHAN (GB)
VARGHESE BLESSON (GB)
RODGERS PHILIP (GB)
SPENCE IVOR (GB)
KILPATRICK PETER (GB)
Application Number:
PCT/US2022/049639
Publication Date:
May 16, 2024
Filing Date:
November 11, 2022
Assignee:
RAKUTEN MOBILE INC (JP)
RAKUTEN MOBILE USA LLC (US)
International Classes:
G06N3/04; G06N3/08; G06F18/214
Attorney, Agent or Firm:
PRITCHETT, Joshua L. (US)
Claims:
WHAT IS CLAIMED IS:

1. A non-transitory computer-readable medium including instructions executable by a processor to cause the processor to perform operations comprising: training, collaboratively with a computation device through a network, a neural network model by performing, during each of a first plurality of consecutive time periods, operations of receiving, from the computation device, a current set of activations output from a device partition of a neural network model, the neural network model including a plurality of layers partitioned into the device partition and a server partition, applying the server partition to each activation among a preceding set of activations to obtain a current set of output instances, the preceding set of activations received during a preceding time period among the first plurality of consecutive time periods, applying each output instance among the current set of output instances to a loss function relating activations to output instances to obtain a current set of loss values, and computing a set of gradient vectors for each layer of the server partition, including a current set of gradient vectors of a layer bordering the device partition, based on the current set of loss values, and during each of a second plurality of consecutive time periods, operations of transmitting, to the computation device, a preceding set of gradient vectors of the layer bordering the device partition computed during a preceding time period among any of the first plurality of consecutive time periods and the second plurality of consecutive time periods and a preceding set of loss values of the loss function obtained in the preceding time period among any of the first plurality of consecutive time periods and the second plurality of consecutive time periods.

2. The computer-readable medium of claim 1, wherein the first plurality of consecutive time periods overlaps with the second plurality of consecutive time periods, such that the operations of both the first plurality of consecutive time periods and the second plurality of consecutive time periods are performed in at least two consecutive time periods among both the first plurality of consecutive time periods and the second plurality of consecutive time periods.

3. The computer-readable medium of claim 1, wherein each set of activations corresponds to a mini-batch of data samples, wherein a batch of data samples includes a number of mini-batches, and the server partition is applied to the activation corresponding to each data sample in the batch once during the training.

4. The computer-readable medium of claim 3, wherein the operations further comprise: transmitting the neural network model to the computation device; training the neural network model; receiving a plurality of estimation values from the computation device; estimating an overall collaborative training value representing a duration of time for collaboratively training the neural network model with the computation device; and determining the partition value and the mini-batch value based on the estimating.

5. The computer-readable medium of claim 4, wherein the plurality of estimation values includes a plurality of device forward pass values, each device forward pass value representing a duration of time consumed by the computation device for activation computing of a corresponding layer among the plurality of layers of the neural network model, and a plurality of device backward pass values, each device backward pass value representing a duration of time consumed by the computation device for gradient vector computing of a corresponding layer among the plurality of layers of the neural network model; and the estimating the overall collaborative training value is based on the plurality of device forward pass values, the plurality of device backward pass values, a plurality of activation volume values, each activation volume value representing a volume of data output by activation computing of a corresponding layer among the plurality of layers of the neural network model, a plurality of gradient vector volume values, each gradient vector volume value representing a volume of data output by gradient vector computing of a corresponding layer among the plurality of layers of the neural network model, an uplink bandwidth value representing an uplink bandwidth between the device and the server, a downlink bandwidth value representing a downlink bandwidth between the device and the server, a partition value representing the number of layers of the device partition, and a mini-batch value representing the number of mini-batches in the plurality of mini-batches.

6. The computer-readable medium of claim 4, wherein the operations further comprise: updating, after the training, the weight values of the server partition based on the set of gradient vectors for each layer of the server partition computed during each of the first plurality of consecutive time periods.

7. The computer-readable medium of claim 6, wherein the operations further comprise: partitioning the plurality of layers of the neural network model into the device partition and the server partition based on the partition value; transmitting, before the training, the device partition to the computation device; performing a plurality of iterations of the training and the updating the weight values; receiving the device partition from the computation device; and combining the device partition with the server partition to obtain an updated neural network model.

8. The computer-readable medium of claim 1, wherein during the receiving of each of the first plurality of consecutive time periods, receiving a current set of labels from the computation device.

9. A non-transitory computer-readable medium including instructions executable by a processor to cause the processor to perform operations comprising: receiving, from the server, a device partition of a neural network model, the neural network model including a plurality of layers partitioned into the device partition and a server partition; training, collaboratively with a server through a network, a neural network model by performing, during each of a first plurality of consecutive time periods, operations of applying a device partition of a neural network model, the neural network model including a plurality of layers partitioned into the device partition and a server partition, to each data sample among a current set of data samples to obtain a current set of activations, transmitting, to the server, a preceding set of activations of a layer bordering the server partition obtained during a preceding time period among any of the first plurality of consecutive time periods, and during each of a second plurality of consecutive time periods, operations of receiving, from the server, a current set of gradient vectors of a layer of the server partition bordering the device partition and a current set of loss values of a loss function relating activations to output instances, computing a set of gradient vectors for each layer of the device partition, based on a preceding set of gradient vectors of the layer of the server partition bordering the device partition and a preceding set of loss values, the preceding set of gradient vectors and the preceding set of loss values received during a preceding time period among any of the first plurality of consecutive time periods and the second plurality of consecutive time periods.

10. The computer-readable medium of claim 9, wherein, during each of the first plurality of consecutive time periods, the operation of applying the device partition consumes a duration of time substantially similar to a duration of time consumed by the operation of transmitting the preceding set of activations of the layer bordering the server partition.

11. The computer-readable medium of claim 9, wherein, during each of the second plurality of consecutive time periods, the operation of receiving the current set of gradient vectors of the layer of the server partition bordering the device partition and the current set of loss values consumes a duration of time substantially similar to a duration of time consumed by the operation of computing the set of gradient vectors for each layer of the device partition.

12. The computer-readable medium of claim 9, wherein the first plurality of consecutive time periods immediately precedes the second plurality of consecutive time periods.

13. The computer-readable medium of claim 9, wherein each set of data samples corresponds to a mini-batch of data samples, wherein a batch of data samples includes a number of mini-batches, and the device partition is applied to each data sample in the batch once during training.

14. The computer-readable medium of claim 9, wherein the operations further comprise: updating, after the training, the weight values of the device partition based on the set of gradient vectors for each layer of the device partition computed during each of the second plurality of consecutive time periods.

15. The computer-readable medium of claim 14, wherein the operations further comprise: performing a plurality of iterations of the training and the updating the weight values; and transmitting the device partition to the server.

16. The computer-readable medium of claim 9, wherein during the transmitting of each of the second plurality of consecutive time periods, transmitting a current set of labels to the server.

17. A method comprising: training, collaboratively with a computation device through a network, a neural network model by performing, during each of a first plurality of consecutive time periods, operations of receiving, from the computation device, a current set of activations output from a device partition of a neural network model, the neural network model including a plurality of layers partitioned into the device partition and a server partition, applying the server partition to each activation among a preceding set of activations to obtain a current set of output instances, the preceding set of activations received during a preceding time period among the first plurality of consecutive time periods, applying each output instance among the current set of output instances to a loss function relating activations to output instances to obtain a current set of loss values, and computing a set of gradient vectors for each layer of the server partition, including a current set of gradient vectors of a layer bordering the device partition, based on the current set of loss values, and during each of a second plurality of consecutive time periods, operations of transmitting, to the computation device, a preceding set of gradient vectors of the layer bordering the device partition computed during a preceding time period among any of the first plurality of consecutive time periods and the second plurality of consecutive time periods and a preceding set of loss values of the loss function obtained in the preceding time period among any of the first plurality of consecutive time periods and the second plurality of consecutive time periods.

18. The method of claim 17, wherein the first plurality of consecutive time periods overlaps with the second plurality of consecutive time periods, such that the operations of both the first plurality of consecutive time periods and the second plurality of consecutive time periods are performed in at least two consecutive time periods among both the first plurality of consecutive time periods and the second plurality of consecutive time periods.

19. The method of claim 18, wherein each set of activations corresponds to a mini-batch of data samples, wherein a batch of data samples includes a number of mini-batches, and the server partition is applied to the activation corresponding to each data sample in the batch once during the training.

20. The method of claim 19, further comprising: transmitting the neural network model to the computation device; training the neural network model; receiving a plurality of estimation values from the computation device; estimating an overall collaborative training value representing a duration of time for collaboratively training the neural network model with the computation device; and determining the partition value and the mini-batch value based on the estimating.

Description:
COLLABORATIVE TRAINING WITH PARALLEL OPERATIONS

BACKGROUND

[0001] TECHNICAL FIELD

[0002] This description relates to collaborative training techniques.

[0003] BACKGROUND

[0004] Collaborative machine learning (CML) techniques, such as federated learning, are used to collaboratively train neural network models using multiple computation devices, such as end-user devices, and a server. CML techniques preserve the privacy of end-users because they do not require user data to be transferred to the server. Instead, local models are trained and shared with the server.

SUMMARY

[0005] According to at least some embodiments of the subject disclosure, neural networks are collaboratively trained with parallel operations by performing operations in a plurality of consecutive time periods including a first plurality of consecutive time periods during which the server receives a set of activations, applies the server partition to a set of activations, applies a set of output instances to a loss function, and computes a set of gradient vectors, and a second plurality of consecutive time periods during which the server transmits a set of gradient vectors.

[0006] Some embodiments include the instructions in a computer program, the method performed by the processor executing the instructions of the computer program, and an apparatus that performs the method. In some embodiments, the apparatus includes a controller including circuitry configured to perform the operations in the instructions.

BRIEF DESCRIPTION OF THE DRAWINGS

[0007] Aspects of the present disclosure are best understood from the following detailed description when read with the accompanying figures. It is noted that, in accordance with the standard practice in the industry, various features are not drawn to scale. In fact, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion.

[0008] FIG. 1 is a schematic diagram of a system for collaborative training with parallel operations, according to at least some embodiments of the subject disclosure.

[0009] FIG. 2 is an operational flow for collaborative training with parallel operations, according to at least some embodiments of the subject disclosure.

[0010] FIG. 3 is an operational flow of a server performing device profiling, according to at least some embodiments of the subject disclosure.

[0011] FIG. 4 is an operational flow of a computation device performing device profiling, according to at least some embodiments of the subject disclosure.

[0012] FIG. 5 is an operational flow for an epoch of training in collaboration with a computation device, according to at least some embodiments of the subject disclosure.

[0013] FIG. 6 is an operational flow for an epoch of training in collaboration with a server, according to at least some embodiments of the subject disclosure.

[0014] FIG. 7 is an operational and communication flow of a computation device and server collaboratively training a neural network model using a batch of data samples, according to at least some embodiments of the subject disclosure.

[0015] FIG. 8 is an operational flow for training a neural network model using a batch of data samples in collaboration with a server, according to at least some embodiments of the subject disclosure.

[0016] FIG. 9 is an operational flow for training a neural network model using a batch of data samples in collaboration with a computation device, according to at least some embodiments of the subject disclosure.

[0017] FIG. 10 is a block diagram of a hardware configuration for collaborative training with parallel operations, according to at least some embodiments of the subject disclosure.

DETAILED DESCRIPTION

[0018] The following disclosure provides many different embodiments, or examples, for implementing different features of the provided subject matter. Specific examples of components, values, operations, materials, arrangements, or the like, are described below to simplify the present disclosure. These are, of course, merely examples and are not intended to be limiting. Other components, values, operations, materials, arrangements, or the like, are contemplated. In addition, the present disclosure may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.

[0019] Training a Deep Neural Network (DNN) involves the forward propagation pass (or forward pass) and backward propagation pass (or backward pass). In the forward pass, one sample of input data is used as input for the first input layer and the output of each layer is passed on to subsequent layers to compute the loss function. In the backward pass, the loss function is passed from the last DNN layer to the first layer to compute the gradients of the DNN model parameters.
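
By way of illustration only, the following sketch shows the two passes described above for a small, hypothetical model (the layer sizes and the use of PyTorch are assumptions for the example, not part of the disclosure): the forward pass produces an output that is applied to a loss function, and the backward pass computes the gradients of the model parameters.

```python
import torch
import torch.nn as nn

# Hypothetical three-layer DNN; sizes are arbitrary for illustration.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 10))
criterion = nn.CrossEntropyLoss()

x = torch.randn(1, 16)    # one sample of input data
y = torch.tensor([3])     # its label

out = model(x)            # forward pass: each layer's output feeds the next layer
loss = criterion(out, y)  # loss computed from the last layer's output
loss.backward()           # backward pass: gradients of all model parameters
```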

[0020] In some CML methods, the devices and server work in an alternating manner. For example, since a DNN consists of consecutive layers, the complete DNN is split into two parts at the granularity of layers, and a server partition is deployed on the server and a device partition is deployed on the devices. The devices train the initial layers of the DNN (the device partition) and the server trains the remaining layers (the server partition). First, a device runs the forward pass of the device partition on its local data. Next, the device sends the intermediate results (activations) to the server. Then, the server uses the activations to complete the forward pass of the server partition to obtain the loss. Next, the loss is used for the backward pass on the server to compute the gradients of the parameters of the server partition and the gradients of the activations. Then, the server sends the gradients of the activations back to the device. Finally, the device computes the gradients of the parameters of the device partition in the device-side backward pass. In these methods, idle time on the device exists between the forward pass and the backward pass of the device partition. In at least some embodiments, the device performs the forward pass of the next few mini-batches during the device-side idle time to fill the pipeline.

[0021] In at least some embodiments, the following operations are performed for collaborative training with parallel operations. A DNN is divided into two parts and deploys them on the server and devices. Then, the forward and backward passes are reordered for multiple mini-batches. Each device executes the forward pass for multiple mini-batches in sequence. The intermediate result of each forward pass (referred to as smashed data or activations) is transmitted to the server, which runs the forward and backward passes for the remaining layers and sends the gradients of the activations back to the device. The device then sequentially performs the backward passes for the mini-batches. The devices operate in parallel, and the local models are aggregated at a set frequency. Since the forward passes occur sequentially on the device, the communication for each forward pass overlaps the computation of the following forward passes. Also, the server and device computations occur simultaneously for different mini-batches. Thus, at least some embodiments reduce the idle time of devices and servers by overlapping server-side and device-side computations and server-device communication.
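
The exchange described in paragraphs [0020] and [0021] can be pictured with the following self-contained sketch, which keeps both partitions in one process purely for illustration (no networking, PyTorch-style modules assumed): the transmitted activations are detached on the "server" side so that the gradients of the activations can be sent back and used to finish the device-side backward pass.

```python
import torch
import torch.nn as nn

device_part = nn.Sequential(nn.Linear(16, 32), nn.ReLU())  # initial layers (device partition)
server_part = nn.Sequential(nn.Linear(32, 10))             # remaining layers (server partition)
criterion = nn.CrossEntropyLoss()

x, y = torch.randn(4, 16), torch.randint(0, 10, (4,))      # one mini-batch with labels

acts = device_part(x)                       # device-side forward pass
sent = acts.detach().requires_grad_(True)   # activations as "received" by the server
loss = criterion(server_part(sent), y)      # server-side forward pass and loss
loss.backward()                             # server-side backward pass
acts.backward(sent.grad)                    # device-side backward pass using returned gradients
```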

[0022] FIG. 1 is a schematic diagram of a system for collaborative training with parallel operations, according to at least some embodiments of the subject disclosure. The system includes a server 100, a plurality of computation devices 105A, 105B, 105C, and 105D, and a network 107.

[0023] Server 100 is a computation device capable of performing calculations to train a neural network or other machine learning function. In at least some embodiments, server 100 includes a non-transitory computer-readable medium including instructions executable by a processor to cause the processor to perform training with parallel operations in collaboration with computation devices 105A, 105B, 105C, and 105D. In at least some embodiments, server 100 is a single server, a plurality of servers, a portion of a server, a virtual instance of cloud computing, etc. In at least some embodiments where server 100 is a plurality of servers or a plurality of virtual instances of cloud computing, server 100 includes a central server working with edge servers, each edge server having a logical location that is closer to the respective computation device among computation devices 105A, 105B, 105C, and 105D with which the edge server is in communication.

[0024] Computation devices 105A, 105B, 105C, and 105D are devices capable of performing calculations to train a neural network or other machine learning function. In at least some embodiments, computation devices 105A, 105B, 105C, and 105D each include a non-transitory computer-readable medium including instructions executable by a processor to cause the processor to perform training with parallel operations in collaboration with server 100. In at least some embodiments, computation devices 105A, 105B, 105C, and 105D are heterogeneous, meaning the devices have varying computation resources, such as processing power, memory, etc. In at least some embodiments, computation devices 105A, 105B, 105C, and 105D include devices having limited computation resources, such as smart watches, fitness trackers, Internet-of-Things (IoT) devices, etc., and/or devices having computation resources for a broader range of capabilities, such as smart phones, tablets, personal computers, etc. In at least some embodiments, computation devices 105A, 105B, 105C, and 105D receive private information, either by detecting it directly, such as through onboard microphones, cameras, etc., or by receiving data through electronic communication with another device, and use the private information as training data. In at least some embodiments, the training data is not private information or is a mixture of private and non-private information.

[0025] Computation devices 105A, 105B, 105C, and 105D are in communication with server 100 through network 107. In at least some embodiments, network 107 is configured to relay communication among server 100 and computation devices 105A, 105B, 105C, and 105D. In at least some embodiments, network 107 is a local area network (LAN), a wide area network (WAN), such as the internet, a radio access network (RAN), or any combination thereof. In at least some embodiments, network 107 is a packet-switched network operating according to IPv4, IPv6, or other network protocol.

[0026] FIG. 2 is an operational flow for collaborative training with parallel operations, according to at least some embodiments of the subject disclosure. The operational flow provides a method of collaborative training with parallel operations. In at least some embodiments, the method is performed by a controller of a server including sections for performing certain operations, such as the controller and server shown in FIG. 10, which will be explained hereinafter.

[0027] At S210, a profiling section profiles each computation device. In at least some embodiments, the profiling section profiles each computation device to estimate parameters for the collaborative training. In at least some embodiments, the profiling section performs the operational flow shown in FIG. 3, which will be explained hereinafter.

[0028] At S212, a partitioning section partitions a neural network model for each computation device. In at least some embodiments, the partitioning section partitions the plurality of layers of the neural network model into the device partition and the server partition based on the partition value. In at least some embodiments, the partitioning section partitions neural network model M_k into a device partition M_Ck and a server-side model M_Sk represented as M_k = M_Sk ⊕ M_Ck (EQ. 1), where the binary operator ⊕ stacks the layers of two partitions of a deep learning model as a complete model. In at least some embodiments, there are K pairs of {M_Ck, M_Sk}, one for each of K computation devices, where M_Ck is deployed on computation device k while all of M_Sk are deployed on the server. In at least some embodiments, the complete model M_k contains Q layers, M_Ck contains the initial P layers, and M_Sk contains the remaining layers, where 1 ≤ P ≤ Q.
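
As a minimal sketch of the partitioning in EQ. 1 (assuming the model is represented as an ordered stack of layers, here a PyTorch nn.Sequential; the helper name is hypothetical), the first P layers become the device partition M_Ck and the remaining layers become the server partition M_Sk:

```python
import torch.nn as nn

def partition_model(model: nn.Sequential, P: int):
    """Split a Q-layer model into M_Ck (first P layers) and M_Sk (remaining layers)."""
    layers = list(model.children())
    assert 1 <= P <= len(layers)
    device_part = nn.Sequential(*layers[:P])   # M_Ck, deployed on computation device k
    server_part = nn.Sequential(*layers[P:])   # M_Sk, deployed on the server
    return device_part, server_part
```

Stacking the two partitions back together, for example nn.Sequential(*device_part, *server_part), reproduces the complete model, mirroring the ⊕ operator of EQ. 1.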

[0029] At S214, a training section collaboratively trains the neural network models with the computation devices. In at least some embodiments, the training section trains each instance of the neural network model collaboratively with a corresponding computation device among a plurality of computation devices. In at least some embodiments, the training section continuously updates the parameters, such as weights, of each instance of the neural network model for a number of rounds or until the parameters are satisfactory. In at least some embodiments, the training section performs, for each computation device, the operational flow shown in FIG. 5, which will be explained hereinafter.

[0030] At S216, an aggregating section aggregates the models collaboratively trained with the computation devices. In at least some embodiments, the aggregating section aggregates the updated parameters of neural network model instances received from the plurality of computation devices to generate an updated neural network model. In at least some embodiments, the aggregating section averages the gradient values across the neural network model instances, and calculates weight values of a global neural network model accordingly. In at least some embodiments, the aggregating section averages the weight values across the neural network model instances. In at least some embodiments, a global neural network model M is obtained by aggregating neural network model instances M_k using the following algorithm: M = Σ_k (|D_k| / Σ_j |D_j|) · M_k (EQ. 2), where D_k is the local dataset on device k and |·| is the function to obtain the size of the given dataset. In at least some embodiments, an epoch of collaborative training is complete when the aggregating section generates the updated global neural network model.
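
A minimal sketch of the dataset-size-weighted aggregation of EQ. 2 (assuming the instances share an identical architecture with floating-point parameters; the function name is hypothetical, and this is only one of the aggregation variants described above):

```python
def aggregate_state_dicts(models, dataset_sizes):
    """Weight each instance M_k by |D_k| / sum_j |D_j| and average the parameters."""
    total = float(sum(dataset_sizes))
    keys = models[0].state_dict().keys()
    return {
        key: sum((size / total) * m.state_dict()[key] for m, size in zip(models, dataset_sizes))
        for key in keys
    }

# Usage sketch: global_model.load_state_dict(aggregate_state_dicts(instances, sizes))
```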

[0031] At S218, the controller or a section thereof determines whether a termination condition has been met. In at least some embodiments, the termination condition is met when the model converges. In at least some embodiments, the termination condition is met after a predetermined number of epochs of collaborative training have been performed. In at least some embodiments, the termination condition is met when a time limit is exceeded. If the controller determines that the termination condition has not been met, then the operational flow returns to neural network model partitioning at S212. If the controller determines that the termination condition has been met, then the operational flow ends.

[0032] FIG. 3 is an operational flow of a server performing device profiling, according to at least some embodiments of the subject disclosure. The operational flow provides a method of device profiling by a server. In at least some embodiments, the operational flow is performed for each computation device among a plurality of computation devices. In at least some embodiments, the operational flow is performed in parallel for each computation device among the plurality of computation devices. In at least some embodiments, the method is performed by a profiling section of a server, such as the server shown in FIG. 10, which will be explained hereinafter.

[0033] At S320, the profiling section or a sub-section thereof transmits the model to the computation device. In at least some embodiments, the profiling section transmits the neural network model to the computation device.

[0034] At S321, the profiling section or a sub-section thereof trains the model. In at least some embodiments, the profiling section causes the training section to train the model. In at least some embodiments, the profiling section trains the neural network model for a number of rounds to identify the size of the output data and the training time for each layer of the neural network model. In at least some embodiments, a device forward pass value represents a duration of time consumed by computation device k for activation computing (a forward pass) of layer q in device partition M_Ck, a device backward pass value represents a duration of time consumed by computation device k for gradient vector computing (a backward pass) of layer q of device partition M_Ck, a server forward pass value represents a duration of time consumed by the server for activation computing (a forward pass) of layer q of server partition M_Sk, and a server backward pass value represents a duration of time consumed by the server for gradient vector computing (a backward pass) of layer q of server partition M_Sk. In at least some embodiments, an activation volume value represents the output data volume for a forward pass of layer q, and a gradient vector volume value represents the output data volume for a backward pass of layer q. In at least some embodiments, the profiling section records these values during training of the neural network model.
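
A minimal profiling sketch in the spirit of S321 (assuming PyTorch layers and a single profiling pass; a real implementation would average over several rounds and synchronize any accelerator before reading the clock):

```python
import time
import torch
import torch.nn as nn

def profile_layers(model: nn.Sequential, sample: torch.Tensor):
    """Record, per layer q: forward time, backward time, activation volume, gradient volume."""
    fwd_t, bwd_t, act_vol, grad_vol = [], [], [], []
    x = sample
    for layer in model:
        x = x.detach().requires_grad_(True)
        t0 = time.perf_counter()
        y = layer(x)                                   # activation computing of layer q
        fwd_t.append(time.perf_counter() - t0)
        act_vol.append(y.numel() * y.element_size())   # output data volume of the forward pass
        t0 = time.perf_counter()
        y.backward(torch.ones_like(y))                 # gradient vector computing of layer q
        bwd_t.append(time.perf_counter() - t0)
        grad_vol.append(x.grad.numel() * x.grad.element_size())
        x = y
    return fwd_t, bwd_t, act_vol, grad_vol
```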

[0035] At S323, the profiling section or a sub-section thereof determines whether training is complete. In at least some embodiments, the neural network model is trained for a predetermined number of rounds. If the profiling section determines that training is not complete, then the operational flow returns to model training at S321 for the next round (S324). If the profiling section determines that training is complete, then the operational flow proceeds to estimation value reception at S326.

[0036] At S326, the profiling section receives estimation values. In at least some embodiments, the profiling section receives a plurality of estimation values from the computation device. In at least some embodiments, the plurality of estimation values includes a plurality of device forward pass values and a plurality of device backward pass values. In at least some embodiments, each device forward pass value represents a duration of time consumed by the computation device for activation computing of a corresponding layer among the plurality of layers of the neural network model. In at least some embodiments, each device backward pass value represents a duration of time consumed by the computation device for gradient vector computing of a corresponding layer among the plurality of layers of the neural network model. In at least some embodiments, the profiling section receives the device forward pass values and device backward pass values recorded by computation device k.

[0037] At S327, the profiling section estimates overall collaborative training time. In at least some embodiments, the profiling section estimates an overall collaborative training value representing a duration of time for collaboratively training the neural network model with the computation device. In at least some embodiments, the profiling section estimates the duration of each training stage for mini-batch n of the N mini-batches in a round of collaborative training: the forward passes of device partition M_Ck by computation device k, the forward passes of server partition M_Sk by the server, the backward passes of device partition M_Ck by computation device k, and the backward passes of server partition M_Sk by the server. In at least some embodiments, the time spent uploading and downloading between computation device k and the server is estimated from the data volume for activations of layer P_k (the activations of mini-batch n uploaded from computation device k to the server) divided by the uplink bandwidth between computation device k and the server, and the data volume for gradient vectors of layer P_k (the gradient vectors of mini-batch n downloaded from the server to computation device k) divided by the downlink bandwidth between computation device k and the server.

[0038] In at least some embodiments, the profiling section estimates the overall collaborative training time based on the training time of each round using dynamic programming with the recurrence T(r) = t(r) + max over r' in prev(r) of T(r'), with T(r_first) = t(r_first), where r_first represents the first stage, r_last represents the last stage, prev(r) represents a function to obtain all stages previous to stage r according to TABLE I, T(r) represents the total time from the beginning of the training round to the end of the stage r, t(r) represents the time spent in stage r, and thus T(r_last) represents the training time of each round.
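
A minimal sketch of the dynamic-programming estimate above (the stage identifiers, the contents of TABLE I, and the function names are assumptions for illustration; stages are assumed to be listed in a valid dependency order):

```python
def estimate_round_time(stage_time, prev):
    """T(r) = t(r) + max of T(r') over r' in prev(r); the round time is T at the last stage."""
    T = {}
    for r, t_r in stage_time.items():                        # stages in dependency order
        T[r] = t_r + max((T[p] for p in prev.get(r, ())), default=0.0)
    return max(T.values())                                   # end of the last stage

# Example with hypothetical stages for a single mini-batch:
# stage_time = {"device_fwd": 2.0, "uplink": 1.0, "server_fwd_bwd": 1.5,
#               "downlink": 1.0, "device_bwd": 2.0}
# prev = {"uplink": ["device_fwd"], "server_fwd_bwd": ["uplink"],
#         "downlink": ["server_fwd_bwd"], "device_bwd": ["downlink"]}
```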

TABLE I

[0039] At S328, the profiling section determines a partition value and a mini-batch value. In at least some embodiments, the profiling section determines the partition value and the mini-batch value based on the estimating. In at least some embodiments, the profiling section estimates the overall collaborative training value based on the plurality of device forward pass values, the plurality of device backward pass values, a plurality of activation volume values, a plurality of gradient vector volume values, an uplink bandwidth value, a downlink bandwidth value, a partition value representing the number of layers of the device partition, and a mini-batch value representing the number of mini-batches in the plurality of mini-batches. In at least some embodiments, each activation volume value represents a volume of data output by activation computing of a corresponding layer among the plurality of layers of the neural network model.

In at least some embodiments, each gradient vector volume value represents a volume of data output by gradient vector computing of a corresponding layer among the plurality of layers of the neural network model. In at least some embodiments, the uplink bandwidth value represents an uplink bandwidth between the device and the server. In at least some embodiments, the downlink bandwidth value represents a downlink bandwidth between the device and the server. In at least some embodiments, the profiling section estimates the overall collaborative training value for each pair among a plurality of pairs of values N_k and P_k. In at least some embodiments, the candidate values of N_k and P_k are drawn from Z+, the set of all positive integers. In at least some embodiments, the profiling section also estimates the parallel batch number N from these values.

In at least some embodiments, the profiling section determines which pair to use based on the pair that yields the least overall collaborative training value. In at least some embodiments, the profiling section shortlists candidate pairs of N_k and P_k.
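
A minimal sketch of the selection at S328 (the search ranges and the estimator callable are assumptions; a practical implementation might shortlist candidate pairs as described above rather than search exhaustively):

```python
def choose_partition_and_batches(Q, N_max, estimate_time):
    """Return the (partition value P_k, mini-batch value N_k) pair with the least
    estimated overall collaborative training value."""
    candidates = ((P, N) for P in range(1, Q + 1) for N in range(1, N_max + 1))
    return min(candidates, key=lambda pair: estimate_time(*pair))
```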

[0040] FIG. 4 is an operational flow of a computation device performing device profiling, according to at least some embodiments of the subject disclosure. The operational flow provides a method of device profiling by a computation device. In at least some embodiments, the operational flow is performed by each computation device among a plurality of computation devices. In at least some embodiments, the operational flow is performed in parallel by each computation device among the plurality of computation devices.

[0041] At S430, the computation device receives a model from the server. In at least some embodiments, the computation device receives a neural network model from the server.

[0042] At S432, the computation device trains the model. In at least some embodiments, the computation device trains the neural network model for a number of rounds to identify the training time for each layer of the neural network model. In at least some embodiments, the computation device records the device forward pass values and device backward pass values during training of the neural network model.

[0043] At S434, the computation device determines whether training is complete. In at least some embodiments, the neural network model is trained for a predetermined number of rounds. If the computation device determines that training is not complete, then the operational flow returns to model training at S432 for the next round (S435). If the computation device determines that training is complete, then the operational flow proceeds to estimation value transmission at S437.

[0044] At S437, the computation device transmits estimation values. In at least some embodiments, the computation device transmits a plurality of estimation values to the server. In at least some embodiments, the plurality of estimation values includes a plurality of device forward pass values and a plurality of device backward pass values. In at least some embodiments, computation device k transmits the recorded device forward pass values and device backward pass values to the server.

[0045] FIG. 5 is an operational flow for an epoch of training in collaboration with a computation device, according to at least some embodiments of the subject disclosure. The operational flow provides a method of training in collaboration with one computation device for one epoch. In at least some embodiments, the operational flow is performed for each computation device among a plurality of computation devices. In at least some embodiments, the operational flow is performed in parallel for each computation device among the plurality of computation devices. In at least some embodiments, the method is performed by a training section of a server, such as the server shown in FIG. 10, which will be explained hereinafter.

[0046] At S540, the training section or a sub-section thereof transmits the device partition. In at least some embodiments, the training section transmits, before the training, the device partition to the computation device. In at least some embodiments, the training section transmits device partition M_Ck to computation device k.

[0047] At S542, the training section or a sub-section thereof collaboratively trains the model using a batch of data samples. In at least some embodiments, the training section trains, collaboratively with a computation device through a network, a neural network model. In at least some embodiments, the training section trains server partition M_Sk while computation device k trains device partition M_Ck. In at least some embodiments, the training section performs the operational flow shown in FIG. 8, which will be explained hereinafter.

[0048] At S543, the training section or a sub-section thereof updates the weight values. In at least some embodiments, the training section updates, after the training, the weight values of the server partition based on a set of gradient vectors for each layer of the server partition computed during each of a first plurality of consecutive time periods during the training. In at least some embodiments, the training section updates the parameters of server partition M_Sk at the end of the training round. In at least some embodiments, as iterations of S542 and S543 proceed, the training section performs a plurality of iterations of the training and the updating the weight values to produce updated server partition M_Sk'.

[0049] At S545, the training section or a sub-section thereof determines whether a termination condition has been met. In at least some embodiments, the training section does not stop training server partition M_Sk until a “stop epoch” signal is received from computation device k. If the training section determines that the termination condition has not been met, then the operational flow returns to collaborative training at S542 for collaborative training using the next batch (S546). If the training section determines that the termination condition has been met, then the operational flow proceeds to device partition reception at S548.

[0050] At S548, the training section or a sub-section thereof receives the device partition. In at least some embodiments, the training section receives the device partition from the computation device. In at least some embodiments, the training section receives updated device partition M_Ck' from computation device k.

[0051] At S549, the training section or a sub-section thereof combines partitions. In at least some embodiments, the training section combines the device partition with the server partition to obtain an updated neural network model. In at least some embodiments, the training section combines updated device partition M_Ck' from computation device k with updated server partition M_Sk' to produce an updated model M_k'.
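
A minimal sketch of S549 (assuming the partitions are ordered stacks of layers as in the partitioning sketch above; the function name is hypothetical):

```python
import torch.nn as nn

def combine_partitions(device_part: nn.Sequential, server_part: nn.Sequential) -> nn.Sequential:
    """Stack updated device partition M_Ck' with updated server partition M_Sk' into M_k'."""
    return nn.Sequential(*device_part, *server_part)
```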

[0052] FIG. 6 is an operational flow for an epoch of training in collaboration with a server, according to at least some embodiments of the subject disclosure. The operational flow provides a method of training by one computation device in collaboration with the server for one epoch. In at least some embodiments, the operational flow is performed for each computation device among a plurality of computation devices. In at least some embodiments, the operational flow is performed in parallel for each computation device among the plurality of computation devices.

[0053] At S650, the computation device receives a device partition. In at least some embodiments, the computation device receives, from the server, a device partition of a neural network model, the neural network model including a plurality of layers partitioned into the device partition and a server partition. In at least some embodiments, computation device k receives device partition M_Ck from the server.

[0054] At S652, the computation device collaboratively trains the model using a batch of data samples. In at least some embodiments, the computation device trains, collaboratively with a server through a network, a neural network model. In at least some embodiments, computation device k trains device partition M_Ck while the server trains server partition M_Sk. In at least some embodiments, the computation device performs the operational flow shown in FIG. 9, which will be explained hereinafter.

[0055] At S653, the computation device updates the weight values. In at least some embodiments, the computation device updates, after the training, the weight values of the device partition based on the set of gradient vectors for each layer of the device partition computed during each of the second plurality of consecutive time periods during the training. In at least some embodiments, computation device k updates the parameters of device partition M_Ck at the end of the training round. In at least some embodiments, as iterations of S652 and S653 proceed, the computation device performs a plurality of iterations of the training and the updating the weight values to produce updated device partition M_Ck'.

[0056] At S655, the computation device determines whether a termination condition has been met. In at least some embodiments, the termination condition is met when collaborative training has been performed using a predetermined number of batches. In at least some embodiments, the termination condition is met when collaborative training has been performed for a predetermined amount of time. If the computation device determines that the termination condition has not been met, then the operational flow returns to collaborative training at S652 for collaborative training using the next batch (S656). If the computation device determines that the termination condition has been met, then the operational flow proceeds to server notification at S658.

[0057] At S658, the computation device notifies the server of termination of the round of collaborative training. In at least some embodiments, the computation device transmits a “stop epoch” signal to the server.

[0058] At S659, the computation device transmits the device partition to the server. In at least some embodiments, the computation device transmits the device partition to the server. In at least some embodiments, computation device k transmits updated device partition M_Ck' to the server.

[0059] FIG. 7 is an operational and communication flow of a computation device and server collaboratively training a neural network model using a batch of data samples, according to at least some embodiments of the subject disclosure. The operational and communication flow includes a plurality of consecutive time periods t1 - t14.

[0060] In at least some embodiments, from the perspective of the computation device, the plurality of consecutive time periods includes a first plurality of consecutive time periods t2 - t7 and a second plurality of consecutive time periods t8 - t13. During time periods t2 - t7, the computation device applies a device partition to a mini-batch of data samples to obtain a set of activations, and transmits a set of activations obtained during a preceding time period. During time periods t8 - t13, the computation device receives a set of gradient vectors and a set of loss values, and computes a set of gradient vectors based on a set of gradient vectors received during a preceding time period.

[0061] For example, during time period t2, the computation device performs a forward pass operation by applying a device partition of a neural network model, the neural network model including a plurality of layers partitioned into the device partition and a server partition, to each data sample among a current set of data samples to obtain a current set of activations, and performs operation u1 by transmitting, to the server, a preceding set of activations of a layer bordering the server partition obtained during a preceding time period (t1) among any of the first plurality of consecutive time periods.

[0062] For example, during time period t8, the computation device performs operation d2 by receiving, from the server, a current set of gradient vectors of a layer of the server partition bordering the device partition and a current set of loss values of a loss function relating activations to output instances, and performs a backward pass operation by computing a set of gradient vectors for each layer of the device partition, based on a preceding set of gradient vectors of the layer of the server partition bordering the device partition and a preceding set of loss values, the preceding set of gradient vectors and the preceding set of loss values received (by d1) during a preceding time period (t7) among any of the first plurality of consecutive time periods and the second plurality of consecutive time periods.

[0063] In at least some embodiments, during each of the first plurality of consecutive time periods, the operation of applying the device partition consumes a duration of time substantially similar to a duration of time consumed by the operation of transmitting the preceding set of activations of the layer bordering the server partition. In at least some embodiments, during each of the second plurality of consecutive time periods, the operation of receiving the current set of gradient vectors of the layer of the server partition bordering the device partition and the current set of loss values consumes a duration of time substantially similar to a duration of time consumed by the operation of computing the set of gradient vectors for each layer of the device partition.

[0064] In at least some embodiments, from the perspective of the computation device, the first plurality of consecutive time periods immediately precedes the second plurality of consecutive time periods.

[0065] In at least some embodiments, from the perspective of the server, the plurality of consecutive time periods includes a first plurality of consecutive time periods t3 - t8 and a second plurality of consecutive time periods t7 - t13. During time periods t3 - t8, the server receives a set of activations, applies the server partition to a set of activations, applies a set of output instances to a loss function, and computes a set of gradient vectors. During time periods t7 - t13, the server transmits a set of gradient vectors.

[0066] For example, during time period t3, the server performs operation u2 by receiving, from the computation device, a current set of activations output from a device partition of a neural network model, the neural network model including a plurality of layers partitioned into the device partition and a server partition, performs a forward pass operation by applying the server partition to each activation among a preceding set of activations to obtain a current set of output instances, the preceding set of activations received (by u1) during a preceding time period (t2) among the first plurality of consecutive time periods, and applying each output instance among the current set of output instances to a loss function relating activations to output instances to obtain a current set of loss values, and performs a backward pass operation by computing a set of gradient vectors for each layer of the server partition, including a current set of gradient vectors of a layer bordering the device partition, based on the current set of loss values.

[0067] For example, during time period t7, the server performs operation d1 by transmitting, to the computation device, a preceding set of gradient vectors of the layer bordering the device partition computed during a preceding time period (t3) among any of the first plurality of consecutive time periods and the second plurality of consecutive time periods and a preceding set of loss values of the loss function obtained in the preceding time period (t3) among any of the first plurality of consecutive time periods and the second plurality of consecutive time periods.

[0068] In at least some embodiments, from the perspective of the server, the first plurality of consecutive time periods overlaps with the second plurality of consecutive time periods, such that the operations of both the first plurality of consecutive time periods and the second plurality of consecutive time periods are performed in at least two consecutive time periods among both the first plurality of consecutive time periods and the second plurality of consecutive time periods. For example, the server performs operations u6, d1, and a server computation operation during time period t7, and performs operations u7, d2, and a server computation operation during time period t8.

[0069] Although FIG. 7 shows an operational and communication flow where there are seven mini-batches, in other embodiments, other numbers of mini-batches are used.
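
The time periods of FIG. 7 can be tabulated with the following sketch, which reproduces only the schedule offsets stated in paragraphs [0060] through [0068] (the operation labels are illustrative, not the reference numerals of the figure):

```python
def pipeline_schedule(num_minibatches=7):
    """Map each time period to its operations, per the FIG. 7 description:
    device forward passes at t1-t7, uplinks u1-u7 at t2-t8, server forward/backward
    passes at t3-t9, downlinks d1-d7 at t7-t13, device backward passes at t8-t14."""
    schedule = {}
    for n in range(1, num_minibatches + 1):
        schedule.setdefault(n, []).append(f"device forward pass {n}")
        schedule.setdefault(n + 1, []).append(f"uplink u{n}")
        schedule.setdefault(n + 2, []).append(f"server forward/backward pass {n}")
        schedule.setdefault(n + 6, []).append(f"downlink d{n}")
        schedule.setdefault(n + 7, []).append(f"device backward pass {n}")
    return schedule

# pipeline_schedule()[7] ->
# ['downlink d1', 'server forward/backward pass 5', 'uplink u6', 'device forward pass 7']
```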

[0070] FIG. 8 is an operational flow for training a neural network model using a batch of data samples in collaboration with a server, according to at least some embodiments of the subject disclosure. The operational flow provides a method of training a neural network model by one computation device using a batch of data samples in collaboration with a server. In at least some embodiments, the operational flow is performed for each computation device among a plurality of computation devices. In at least some embodiments, the operational flow is performed in parallel for each computation device among the plurality of computation devices.

[0071] Each iteration of the operational flow represents one time period among consecutive time periods. However, not every operation is performed in every consecutive time period as is explained above.

[0072] At S860, the computation device applies a device partition to current data samples. In at least some embodiments, the computation device applies a device partition of a neural network model, the neural network model including a plurality of layers partitioned into the device partition and a server partition, to each data sample among a current set of data samples to obtain a current set of activations. In at least some embodiments, during the transmitting of each of the second plurality of consecutive time periods, the computation device also transmits a current set of labels to the server. The operation at S860 is substantially similar to operations performed during time periods t1 - t7.

[0073] At S862, the computation device transmits a preceding set of activations. In at least some embodiments, the computation device transmits, to the server, a preceding set of activations of a layer bordering the server partition obtained during a preceding time period among any of the first plurality of consecutive time periods. The operation at S862 is substantially similar to operations u1 - u7, performed during time periods t2 - t8.

[0074] At S864, the computation device receives a current set of gradient vectors and loss values. In at least some embodiments, the computation device receives, from the server, a current set of gradient vectors of a layer of the server partition bordering the device partition and a current set of loss values of a loss function relating activations to output instances. The operation at S864 is substantially similar to operations d1 - d7, performed during time periods t7 - t13.

[0075] At S866, the computation device computes a set of gradient vectors. In at least some embodiments, the computation device computes a set of gradient vectors for each layer of the device partition, based on a preceding set of gradient vectors of the layer of the server partition bordering the device partition and a preceding set of loss values, the preceding set of gradient vectors and the preceding set of loss values received during a preceding time period among any of the first plurality of consecutive time periods and the second plurality of consecutive time periods. The operation at S866 is substantially similar to operations performed during time periods t8 - t14.

[0076] At S868, the computation device determines whether the batch is complete. In at least some embodiments, each set of data samples corresponds to a mini-batch of data samples, wherein a batch of data samples includes a number of mini-batches. In at least some embodiments, the device partition is applied to each data sample in the batch once during training. If the computation device determines that the batch is incomplete, then the operational flow returns for another iteration of applicable operations in the next time period (S869). If the computation device determines that the batch is complete, then the operational flow ends.
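
A device-side sketch of the flow of FIG. 8 (the transport stubs send_to_server and recv_from_server are hypothetical, the device partition is assumed to be a PyTorch-style module, and the one-time-period offset between computing and transmitting is folded into the stubs for brevity):

```python
def device_training_round(device_part, minibatches, send_to_server, recv_from_server):
    """S860-S868: forward passes and uplinks for each mini-batch, then downlinks
    and device-side backward passes for each mini-batch."""
    activations = []
    for n, (x, labels) in enumerate(minibatches):
        acts = device_part(x)                      # S860: apply device partition
        activations.append(acts)
        send_to_server(n, acts.detach(), labels)   # S862: transmit activations (and labels)
    for n in range(len(minibatches)):
        grad, loss_values = recv_from_server(n)    # S864: receive gradient vectors and loss values
        activations[n].backward(grad)              # S866: gradient vectors for each device-partition layer
```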

[0077] FIG. 9 is an operational flow for training a neural network model using a batch of data samples in collaboration with a computation device, according to at least some embodiments of the subject disclosure. The operational flow provides a method of training a neural network model using a batch of data samples in collaboration with one computation device. In at least some embodiments, the operational flow is performed for each computation device among a plurality of computation devices. In at least some embodiments, the operational flow is performed in parallel for each computation device among the plurality of computation devices. In at least some embodiments, the method is performed by a training section of a server, such as the server shown in FIG. 10, which will be explained hereinafter.

[0078] Each iteration of the operational flow represents one time period among consecutive time periods. However, not every operation is performed in every consecutive time period as is explained above.

[0079] At S970, the training section or a sub-section thereof receives a current set of activations. In at least some embodiments, the training section receives, from the computation device, a current set of activations output from a device partition of a neural network model, the neural network model including a plurality of layers partitioned into the device partition and a server partition. In at least some embodiments, during the receiving, the training section also receives a current set of labels from the computation device. The operation at S970 is substantially similar to operations u1 - u7, performed during time periods t2 - t8.

[0080] At S972, the training section or a sub-section thereof applies the server partition to a preceding set of activations. In at least some embodiments, the training section applies the server partition to each activation among a preceding set of activations to obtain a current set of output instances, the preceding set of activations received during a preceding time period among the first plurality of consecutive time periods.

[0081] At S974, the training section or a sub-section thereof applies the current set of output instances to a loss function. In at least some embodiments, the training section applies each output instance among the current set of output instances to a loss function relating activations to output instances to obtain a current set of loss values. The operations at S972 and S974 are substantially similar to operations performed during time periods t3 - t9.

[0082] At S975, the training section or a sub-section thereof computes a current set of gradient vectors. In at least some embodiments, the training section computes a set of gradient vectors for each layer of the server partition, including a current set of gradient vectors of a layer bordering the device partition, based on the current set of loss values. The operation at S975 is substantially similar to operations performed during time periods t3 - t9.

[0083] At S977, the training section or a sub-section thereof transmits a preceding set of gradient vectors and loss values. In at least some embodiments, the training section transmits, to the computation device, a preceding set of gradient vectors of the layer bordering the device partition computed during a preceding time period among any of the first plurality of consecutive time periods and the second plurality of consecutive time periods and a preceding set of loss values of the loss function obtained in the preceding time period among any of the first plurality of consecutive time periods and the second plurality of consecutive time periods. The operation at S977 is substantially similar to operations d1 - d7, performed during time periods t7 - t13.

[0084] At S978, the training section determines whether the batch is complete. In at least some embodiments, each set of activations corresponds to a mini-batch of data samples, wherein a batch of data samples includes a number of mini-batches. In at least some embodiments, the server partition is applied to the activation corresponding to each data sample in the batch once during the training. If the training section determines that the batch is incomplete, then the operational flow returns for another iteration of applicable operations in the next time period (S979). If the training section determines that the batch is complete, then the operational flow ends.
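For illustration only, the following Python sketch shows one possible way the server-side flow described above (S970 through S979) could be pipelined, mirroring the device-side sketch given earlier. The names (server_forward, loss_function, boundary_gradient, run_server_side) and the placeholder arithmetic are assumptions introduced here and are not the claimed computation.

```python
from queue import Queue

# Hypothetical stand-ins for the server partition, the loss function, and the
# backward pass at the layer bordering the device partition.
def server_forward(activation):
    return activation + 1.0                 # S972: apply the server partition

def loss_function(output_instance, label=0.0):
    return (output_instance - label) ** 2   # S974: loss relating outputs to labels

def boundary_gradient(loss_value):
    return 2.0 * loss_value                 # S975: gradient of the bordering layer

def run_server_side(num_mini_batches, uplink: Queue, downlink: Queue):
    """Pipeline the server-side flow: the current set of activations is received (S970)
    while the preceding set is pushed through the server partition, the loss function,
    and the backward pass (S972-S975); the resulting gradients and losses are
    transmitted one time period later (S977) until the batch is complete (S978)."""
    preceding = None
    for _ in range(num_mini_batches):
        current = uplink.get()                                   # S970
        if preceding is not None:
            outputs = [server_forward(a) for a in preceding]     # S972
            losses = [loss_function(o) for o in outputs]         # S974
            grads = [boundary_gradient(l) for l in losses]       # S975
            downlink.put((grads, losses))                        # S977
        preceding = current
    outputs = [server_forward(a) for a in preceding]             # drain the final set
    losses = [loss_function(o) for o in outputs]
    grads = [boundary_gradient(l) for l in losses]
    downlink.put((grads, losses))
```

In a toy end-to-end run, the two sketches could be connected by a pair of queues (one per direction) and executed on separate threads, so that the device-side and server-side loops advance concurrently in the manner of the time periods described above.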

[0085] FIG. 10 is a block diagram of a hardware configuration for collaborative training with parallel operations, according to at least some embodiments of the subject disclosure.

[0086] The exemplary hardware configuration includes server 1000, which interacts with input device 1008, and communicates with computation devices 1005A and 1005B through network 1007. In at least some embodiments, server 1000 is a computer or other computing device that receives input or commands from input device 1008. In at least some embodiments, server 1000 is integrated with input device 1008. In at least some embodiments, server 1000 is a computer system that executes computer-readable instructions to perform operations for collaborative training with parallel operations.

[0087] Server 1000 includes a controller 1002, a storage unit 1004, an input/output interface 1006, and a communication interface 1009. In at least some embodiments, controller 1002 includes a processor or programmable circuitry executing instructions to cause the processor or programmable circuitry to perform operations according to the instructions. In at least some embodiments, controller 1002 includes analog or digital programmable circuitry, or any combination thereof. In at least some embodiments, controller 1002 includes physically separated storage or circuitry that interacts through communication. In at least some embodiments, storage unit 1004 includes a non-volatile computer-readable medium capable of storing executable and non-executable data for access by controller 1002 during execution of the instructions. Communication interface 1009 transmits data to and receives data from network 1007. Input/output interface 1006 connects to various input and output units, such as input device 1008, via a parallel port, a serial port, a keyboard port, a mouse port, a monitor port, and the like to accept commands and present information. In some embodiments, storage unit 1004 is external to server 1000.

[0088] Controller 1002 includes profiling section 1080, partitioning section 1082, training section 1084, and aggregating section 1086. Storage unit 1004 includes model parameters 1090, activations 1092, loss values 1094, and gradients 1096.
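For orientation only, the following Python sketch mirrors the block diagram of FIG. 10 as plain data structures. The class and attribute names are descriptive placeholders chosen here and are not part of the disclosed hardware configuration.

```python
from dataclasses import dataclass, field

@dataclass
class StorageUnit:                 # storage unit 1004
    model_parameters: dict = field(default_factory=dict)   # 1090
    activations: list = field(default_factory=list)        # 1092
    loss_values: list = field(default_factory=list)        # 1094
    gradients: list = field(default_factory=list)          # 1096

@dataclass
class Controller:                  # controller 1002
    profiling_section: object = None     # 1080
    partitioning_section: object = None  # 1082
    training_section: object = None      # 1084
    aggregating_section: object = None   # 1086

@dataclass
class Server:                      # server 1000
    controller: Controller = field(default_factory=Controller)
    storage_unit: StorageUnit = field(default_factory=StorageUnit)
```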

[0089] Profiling section 1080 is the circuitry or instructions of controller 1002 configured to profile computation devices. In at least some embodiments, profiling section 1080 is configured to profile each computation device to estimate parameters for collaborative training. In at least some embodiments, profiling section 1080 records information in storage unit 1004. In at least some embodiments, profiling section 1080 includes sub-sections for performing additional functions, as described in the foregoing flow charts. In at least some embodiments, such sub-sections are referred to by a name associated with a corresponding function.

[0090] Partitioning section 1082 is the circuitry or instructions of controller 1002 configured to partition neural network models. In at least some embodiments, partitioning section 1082 is configured to partition the plurality of layers of the neural network model into the device partition and the server partition based on the partition value. In at least some embodiments, partitioning section 1082 utilizes information in storage unit 1004, such as model parameters 1090. In at least some embodiments, partitioning section 1082 includes sub-sections for performing additional functions, as described in the foregoing flow charts. In at least some embodiments, such sub-sections are referred to by a name associated with a corresponding function.
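As a minimal sketch of the partitioning described above, and assuming that the partition value denotes the number of layers assigned to the device partition (consistent with the estimation values discussed in paragraph [0101]), the split could be expressed as follows. The function name and the list-of-layers representation are illustrative assumptions.

```python
def partition_layers(layers, partition_value):
    """Split an ordered list of layers into a device partition (the first
    partition_value layers) and a server partition (the remaining layers)."""
    device_partition = layers[:partition_value]
    server_partition = layers[partition_value:]
    return device_partition, server_partition

# Example: a 6-layer model with partition_value 2 places layers L0 and L1 on the device.
device, server = partition_layers(["L0", "L1", "L2", "L3", "L4", "L5"], 2)
```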

[0091] Training section 1084 is the circuitry or instructions of controller 1002 configured to train neural network models. In at least some embodiments, training section 1084 is configured to train, collaboratively with a plurality of computation devices through a network, a neural network model. In at least some embodiments, training section 1084 utilizes information from storage unit 1004, such as model parameters 1090, activations 1092, loss values 1094, and gradients 1096. In at least some embodiments, training section 1084 includes sub-sections for performing additional functions, as described in the foregoing flow charts. In at least some embodiments, such sub-sections are referred to by a name associated with a corresponding function.

[0092] Aggregating section 1086 is the circuitry or instructions of controller 1002 configured to aggregate neural network models. In at least some embodiments, aggregating section 1086 is configured to aggregate the neural network model trained collaboratively with the plurality of computation devices through a network. In at least some embodiments, aggregating section 1086 utilizes information from storage unit 1004, such as model parameters 1090 and gradients 1096. In at least some embodiments, aggregating section 1086 includes sub-sections for performing additional functions, as described in the foregoing flow charts. In at least some embodiments, such sub-sections are referred to by a name associated with a corresponding function.
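For illustration only, the following sketch shows one way aggregation and the combining operation described in paragraph [0101] could look. The element-wise averaging rule across device partitions is an assumption made here for concreteness; the disclosure does not specify a particular aggregation rule.

```python
def aggregate_device_partitions(device_partitions):
    """Hypothetical aggregation: element-wise averaging of per-layer weight values
    across the device partitions received from a plurality of computation devices."""
    n = len(device_partitions)
    return [sum(layer_values) / n for layer_values in zip(*device_partitions)]

def combine_partitions(device_partition, server_partition):
    # Concatenate the device-partition layers with the server-partition layers to
    # obtain the updated full model.
    return list(device_partition) + list(server_partition)
```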

[0093] In at least some embodiments, the apparatus is another device capable of processing logical functions in order to perform the operations herein. In at least some embodiments, the controller and the storage unit need not be entirely separate devices, but share circuitry or one or more computer-readable mediums in some embodiments. In at least some embodiments, the storage unit includes a hard drive storing both the computer-executable instructions and the data accessed by the controller, and the controller includes a combination of a central processing unit (CPU) and RAM, in which the computer-executable instructions are able to be copied in whole or in part for execution by the CPU during performance of the operations herein.

[0094] In at least some embodiments where the apparatus is a computer, a program that is installed in the computer is capable of causing the computer to function as or perform operations associated with apparatuses of the embodiments described herein. In at least some embodiments, such a program is executable by a processor to cause the computer to perform certain operations associated with some or all of the blocks of flowcharts and block diagrams described herein.

[0095] At least some embodiments are described with reference to flowcharts and block diagrams whose blocks represent (1) steps of processes in which operations are performed or (2) sections of a controller responsible for performing operations. In at least some embodiments, certain steps and sections are implemented by dedicated circuitry, programmable circuitry supplied with computer-readable instructions stored on computer-readable media, and/or processors supplied with computer-readable instructions stored on computer-readable media. In at least some embodiments, dedicated circuitry includes digital and/or analog hardware circuits, including integrated circuits (ICs) and/or discrete circuits. In at least some embodiments, programmable circuitry includes reconfigurable hardware circuits comprising logical AND, OR, XOR, NAND, NOR, and other logical operations, flip-flops, registers, memory elements, etc., such as field-programmable gate arrays (FPGA), programmable logic arrays (PLA), etc.

[0096] In at least some embodiments, the computer readable storage medium includes a tangible device that is able to retain and store instructions for use by an instruction execution device. In some embodiments, the computer readable storage medium includes, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

[0097] In at least some embodiments, computer readable program instructions described herein are downloadable to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. In at least some embodiments, the network includes copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. In at least some embodiments, a network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

[0098] In at least some embodiments, computer readable program instructions for carrying out operations described above are assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. In at least some embodiments, the computer readable program instructions are executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In at least some embodiments, in the latter scenario, the remote computer is connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection is made to an external computer (for example, through the Internet using an Internet Service Provider). In at least some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) execute the computer readable program instructions by utilizing state information of the computer readable program instructions to individualize the electronic circuitry, in order to perform aspects of the subject disclosure.

[0099] While embodiments of the subject disclosure have been described, the technical scope of any subject matter claimed is not limited to the above described embodiments. Persons skilled in the art would understand that various alterations and improvements to the above-described embodiments are possible. Persons skilled in the art would also understand from the scope of the claims that the embodiments added with such alterations or improvements are included in the technical scope of the invention.

[0100] The operations, procedures, steps, and stages of each process performed by an apparatus, system, program, and method shown in the claims, embodiments, or diagrams are able to be performed in any order as long as the order is not indicated by “prior to,” “before,” or the like and as long as the output from a previous process is not used in a later process. Even if the process flow is described using phrases such as “first” or “next” in the claims, embodiments, or diagrams, such a description does not necessarily mean that the processes must be performed in the described order.

[0101] An aspect of this description relates to a non-transitory computer-readable medium including instructions executable by a processor to cause the processor to perform operations including training, collaboratively with a computation device through a network, a neural network model by performing, during each of a first plurality of consecutive time periods, operations of receiving, from the computation device, a current set of activations output from a device partition of a neural network model, the neural network model including a plurality of layers partitioned into the device partition and a server partition, applying the server partition to each activation among a preceding set of activations to obtain a current set of output instances, the preceding set of activations received during a preceding time period among the first plurality of consecutive time periods, applying each output instance among the current set of output instances to a loss function relating activations to output instances to obtain a current set of loss values, and computing a set of gradient vectors for each layer of the server partition, including a current set of gradient vectors of a layer bordering the device partition, based on the current set of loss values, and during each of a second plurality of consecutive time periods, operations of transmitting, to the computation device, a preceding set of gradient vectors of the layer bordering the device partition computed during a preceding time period among any of the first plurality of consecutive time periods and the second plurality of consecutive time periods and a preceding set of loss values of the loss function obtained in the preceding time period among any of the first plurality of consecutive time periods and the second plurality of consecutive time periods. The first plurality of consecutive time periods overlaps with the second plurality of consecutive time periods, such that the operations of both the first plurality of consecutive time periods and the second plurality of consecutive time periods are performed in at least two consecutive time periods among both the first plurality of consecutive time periods and the second plurality of consecutive time periods. Each set of activations corresponds to a mini-batch of data samples, wherein a batch of data samples includes a number of mini-batches, and the server partition is applied to the activation corresponding to each data sample in the batch once during the training. The operations further include transmitting the neural network model to the computation device; training the neural network model; receiving a plurality of estimation values from the computation device; estimating an overall collaborative training value representing a duration of time for collaboratively training the neural network model with the computation device; and determining the partition value and the mini-batch value based on the estimating. 
The plurality of estimation values includes a plurality of device forward pass values, each device forward pass value representing a duration of time consumed by the computation device for activation computing of a corresponding layer among the plurality of layers of the neural network model, and a plurality of device backward pass values, each device backward pass value representing a duration of time consumed by the computation device for gradient vector computing of a corresponding layer among the plurality of layers of the neural network model; the estimating the overall collaborative training value is based on the plurality of device forward pass values, the plurality of device backward pass values, a plurality of activation volume values, each activation volume value representing a volume of data output by activation computing of a corresponding layer among the plurality of layers of the neural network model, a plurality of gradient vector volume values, each gradient vector volume value representing a volume of data output by gradient vector computing of a corresponding layer among the plurality of layers of the neural network model, an uplink bandwidth value representing an uplink bandwidth between the device and the server, a downlink bandwidth value representing a downlink bandwidth between the device and the server, a partition value representing the number of layers of the device partition, and a mini-batch value representing the number of mini-batches in the plurality of mini-batches. The operations further comprise updating, after the training, the weight values of the server partition based on the set of gradient vectors for each layer of the server partition computed during each of the first plurality of consecutive time periods. The operations further comprise: partitioning the plurality of layers of the neural network model into the device partition and the server partition based on the partition value; transmitting, before the training, the device partition to the computation device; performing a plurality of iterations of the training and the updating the weight values; receiving the device partition from the computation device; and combining the device partition with the server partition to obtain an updated neural network model. The operations further comprise, during the receiving of each of the first plurality of consecutive time periods, receiving a current set of labels from the computation device.
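To make the estimation concrete, the following sketch combines the listed estimation values into a per-mini-batch time and scales it by the mini-batch value. This is one plausible combination only, written under the assumption that the partition value counts device-partition layers and that the boundary transfers dominate communication; the actual combination used by the embodiments, including server-side terms and the overlap obtained from pipelining, is not reproduced here.

```python
def estimate_overall_training_time(fwd, bwd, act_vol, grad_vol,
                                   uplink_bw, downlink_bw,
                                   partition_value, mini_batch_value):
    """Hypothetical cost model for the overall collaborative training value.
    fwd[i] / bwd[i]: per-layer device forward / backward durations;
    act_vol[i] / grad_vol[i]: per-layer output volumes of activation / gradient computing;
    partition_value: number of layers in the device partition;
    mini_batch_value: number of mini-batches in the batch."""
    device_fwd = sum(fwd[:partition_value])                    # device forward pass
    device_bwd = sum(bwd[:partition_value])                    # device backward pass
    uplink_time = act_vol[partition_value - 1] / uplink_bw     # send boundary activations
    downlink_time = grad_vol[partition_value] / downlink_bw    # receive boundary gradients
    per_mini_batch = device_fwd + device_bwd + uplink_time + downlink_time
    return mini_batch_value * per_mini_batch
```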

[0102] An aspect of this description relates to a non-transitory computer-readable medium including instructions executable by a processor to cause the processor to perform operations including receiving, from the server, a device partition of a neural network model, the neural network model including a plurality of layers partitioned into the device partition and a server partition; training, collaboratively with the server through a network, a neural network model by performing, during each of a first plurality of consecutive time periods, operations of applying a device partition of a neural network model, the neural network model including a plurality of layers partitioned into the device partition and a server partition, to each data sample among a current set of data samples to obtain a current set of activations, transmitting, to the server, a preceding set of activations of a layer bordering the server partition obtained during a preceding time period among any of the first plurality of consecutive time periods, and during each of a second plurality of consecutive time periods, operations of receiving, from the server, a current set of gradient vectors of a layer of the server partition bordering the device partition and a current set of loss values of a loss function relating activations to output instances, computing a set of gradient vectors for each layer of the device partition, based on a preceding set of gradient vectors of the layer of the server partition bordering the device partition and a preceding set of loss values, the preceding set of gradient vectors and the preceding set of loss values received during a preceding time period among any of the first plurality of consecutive time periods and the second plurality of consecutive time periods. During each of the first plurality of consecutive time periods, the operation of applying the device partition consumes a duration of time substantially similar to a duration of time consumed by the operation of transmitting the preceding set of activations of the layer bordering the server partition. During each of the second plurality of consecutive time periods, the operation of receiving the current set of gradient vectors of the layer of the server partition bordering the device partition and the current set of loss values consumes a duration of time substantially similar to a duration of time consumed by the operation of computing the set of gradient vectors for each layer of the device partition. The first plurality of consecutive time periods immediately precedes the second plurality of consecutive time periods. Each set of data samples corresponds to a mini-batch of data samples, wherein a batch of data samples includes a number of mini-batches, and the device partition is applied to each data sample in the batch once during training. The operations further include updating, after the training, the weight values of the device partition based on the set of gradient vectors for each layer of the device partition computed during each of the second plurality of consecutive time periods. The operations further include performing a plurality of iterations of the training and the updating the weight values; and transmitting the device partition to the server. The operations further include, during the transmitting of each of the second plurality of consecutive time periods, transmitting a current set of labels to the server.
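To make the timing relationship concrete, the following sketch (illustrative only, with hypothetical names overlap_compute_and_transmit, compute_fn, and transmit_fn) runs a device-partition computation concurrently with transmission of the preceding activations, so that a time period is bounded by the longer of the two substantially similar durations rather than by their sum.

```python
import threading
import time

def overlap_compute_and_transmit(compute_fn, transmit_fn):
    """Run the current time period's computation while the preceding mini-batch's
    activations are being transmitted, then wait for both to finish."""
    sender = threading.Thread(target=transmit_fn)
    sender.start()
    result = compute_fn()   # forward pass on the current mini-batch
    sender.join()           # preceding activations finish sending
    return result

# Toy usage: both operations take about 0.1 s, so one period takes about 0.1 s
# instead of about 0.2 s.
out = overlap_compute_and_transmit(lambda: time.sleep(0.1) or "activations",
                                   lambda: time.sleep(0.1))
```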
[0103] An aspect of this description relates to a method including training, collaboratively with a computation device through a network, a neural network model by performing, during each of a first plurality of consecutive time periods, operations of receiving, from the computation device, a current set of activations output from a device partition of a neural network model, the neural network model including a plurality of layers partitioned into the device partition and a server partition, applying the server partition to each activation among a preceding set of activations to obtain a current set of output instances, the preceding set of activations received during a preceding time period among the first plurality of consecutive time periods, applying each output instance among the current set of output instances to a loss function relating activations to output instances to obtain a current set of loss values, and computing a set of gradient vectors for each layer of the server partition, including a current set of gradient vectors of a layer bordering the device partition, based on the current set of loss values, and during each of a second plurality of consecutive time periods, operations of transmitting, to the computation device, a preceding set of gradient vectors of the layer bordering the device partition computed during a preceding time period among any of the first plurality of consecutive time periods and the second plurality of consecutive time periods and a preceding set of loss values of the loss function obtained in the preceding time period among any of the first plurality of consecutive time periods and the second plurality of consecutive time periods. The first plurality of consecutive time periods overlaps with the second plurality of consecutive time periods, such that the operations of both the first plurality of consecutive time periods and the second plurality of consecutive time periods are performed in at least two consecutive time periods among both the first plurality of consecutive time periods and the second plurality of consecutive time periods. Each set of activations corresponds to a mini-batch of data samples, wherein a batch of data samples includes a number of mini-batches, and the server partition is applied to the activation corresponding to each data sample in the batch once during the training. The method further includes transmitting the neural network model to the computation device; training the neural network model; receiving a plurality of estimation values from the computation device; estimating an overall collaborative training value representing a duration of time for collaboratively training the neural network model with the computation device; and determining the partition value and the mini-batch value based on the estimating.

[0104] The foregoing outlines features of several embodiments so that those skilled in the art may better understand the aspects of the present disclosure. Those skilled in the art should appreciate that they may readily use the present disclosure as a basis for designing or modifying other processes and structures for carrying out the same purposes and/or achieving the same advantages of the embodiments introduced herein. Those skilled in the art should also realize that such equivalent constructions do not depart from the spirit and scope of the present disclosure, and that they may make various changes, substitutions, and alterations herein without departing from the spirit and scope of the present disclosure.