Background Data Dispatch with Custom Training Loop

Question

0 votes

I have a question regarding the training of a deep neural network with Matlab.

I have built a custom training loop for the training of a regression network on a machine with 2 GPUs.

The training loop performs fine, however it is rather slow in comparison to the automatic trainNetwork function.

The trainNetwork function does not provide the type of network progress monitor i like. The trainNetwork function also seems to error unpredictably on my machine and sometimes the network are not "finished" properly. This is why i make use of a custom training loop.

I use a parallel pool with 2 workers and the randomPatchExtraction Datastore (which is partitionable). The parallel operations

are written in an spmd block.

What would be the best way to use data dispatching in the background in a custom training loop?

I have tried to scale up the number of workers in the parallel pool. This leads to the case that some workers

cannot read data since the Datastores are only partitioned according to the number of GPUs, not the number of workers.

Which operations do i have to assign to the workers that are supposed to preload data?

Has anybody tried using a "self-written" data dispatching in a custom training loop?

Thanks in advance!

0 commentaires
Afficher -2 commentaires plus anciens Masquer -2 commentaires plus anciens

Connectez-vous pour commenter.

Connectez-vous pour répondre à cette question.

Follow Question

Answer 1

Joss Knight le 22 Nov 2020

0 votes

Use a minibatchqueue with the DispatchInBackground option.

4 commentaires
Afficher 2 commentaires plus anciens Masquer 2 commentaires plus anciens

Pascal Kutschbach le 23 Nov 2020

Ouvrir dans MATLAB Online

Hi Joss, Thanks for the answer.

The use of a minibatchqueue only partly solves my issue.

I can build a mbq with the "DispatchInBackground" option but yet i have to assign specific workers

of the spmd block to the "dispatching operation". With 2 GPUs i can use a parpool with 4 workers.

Two workers should be assigned to the training computation and two (or more) workers should be assigned to the dispatch operation. Currently with these settings i (obviously) have 4 workers working on the training computation even with the usage of a minibatchqueue and the "DispatchInBackground" option enabled. My question is how do i build an spmd block where:

spmd
    if labindex == 1
        [X,Y] = next(mbq);        
        "training computation"
    elseif labindex == 2
        [X,Y] = next(mbq);    
        "training computation"
    elseif labindex == 3
        "dispatch operation for worker 1"
    elseif labindex == 4
        "dispatch operation for worker 2"
    end   
end

How do i assign the proper sequence of workers so that worker 3 and 4 operate first and then worker 1 and 2 operate second? How do i solve data consistency so that data is available on workers 1 and 2 in the first iteration? How do i find out which workers are specifically tied to GPUs?

Thanks in advance!

Joss Knight le 23 Nov 2020

Modifié(e) : Joss Knight le 18 Déc 2020

Ouvrir dans MATLAB Online

Hi, you can't use DispatchInBackground and use SPMD, is the simple answer, not in a custom training loop. Only trainNetwork supports both parallel training and background dispatch, and it uses a bunch of clever stuff involving MPI communicators.

In a future release this will become possible using a thread pool nested inside a process pool, but not yet.

You can do this right now using some complex point-to-point communication, not pretty:

spmd
    % Workers 1 and 2 are background workers. 3 and 4
    % are compute workers.
    % Partition datastore into 2 parts. Read first part
    % and send to compute workers.
    if labindex < 3
        subds = partition(ds, 2, labindex);
        data = read(subds); % Add some batching logic here
        labSend(data, labindex+2);
    else
        data = labReceive(labindex-2);
    end
    loop = true;
    while loop
        % Background workers read next batch while compute
        % workers process the current one
        if labindex < 3
            loop = hasdata(subds);
            if loop
                data = read(subds); % Again, might need batching logic
            end
            labSend(data, labindex+2);
        else
            % Do some computation to compute gradients
            % Send and receive gradients between compute workers
            otherComputeWorker = mod(labindex-2,2)+3;
            theirGradients = labSendReceive(otherComputeWorker,otherComputeWorker,myGradients);
            % Do something with the two sets of gradients,
            % probably add them together and update a model
            % Then receive next batch
            data = labReceive(labindex-2);
        end
        % Detect when either partition is finished
        loop = gop(@and, loop);
    end
end

You can do a version of this using gop or gplus to sum gradients as in the examples, but you need to make sure the background workers also participate - every worker has to call gop. One advantage of doing that is that you'll get fast peer-to-peer data transfer for gpuArray data. At the moment labSendReceive doesn't use fast data transfer. You could also implement labSendReceive using a call to labSend then to labReceive on worker 3, and the opposite way round on the other compute worker. That will use fast GPU-GPU communication, but loses out on the asynchronicity.

I haven't actually checked this works so maybe there are issues but I'm sure you can debug them.

Pascal Kutschbach le 25 Nov 2020

This example definitely helps and will solve my issue.

I was not able to make it run yet but i can see the schematic behind the idea. I understand that i have to tell each specific worker when and what to send to the other workers in order to mimic a communication between the workers. As of now i run into "deadlock" issues where (i assume) at a certain point i want to receive data on one worker where there is no data to be received (yet). Probably this results from the usage of labSend and labReceive instead of labSendReceive to make use of NV-Link communication between the GPUs.

Thanks again for the help!

Joss Knight le 25 Nov 2020

Great! labSend is blocking, so you can't have both workers 3 and 4 call labSend at the same time. You need to choose which one goes first.

Connectez-vous pour commenter.

Background Data Dispatch with Custom Training Loop

0 commentaires
Afficher -2 commentaires plus anciens Masquer -2 commentaires plus anciens

Réponse acceptée

4 commentaires
Afficher 2 commentaires plus anciens Masquer 2 commentaires plus anciens

Plus de réponses (0)

Catégories

Produits

Version

Tags

Community Treasure Hunt

Background Data Dispatch with Custom Training Loop

0 commentaires Afficher -2 commentaires plus anciens Masquer -2 commentaires plus anciens

Réponse acceptée

4 commentaires Afficher 2 commentaires plus anciens Masquer 2 commentaires plus anciens

Plus de réponses (0)

Catégories

Produits

Version

Tags

Voir également

Community Treasure Hunt

0 commentaires
Afficher -2 commentaires plus anciens Masquer -2 commentaires plus anciens

4 commentaires
Afficher 2 commentaires plus anciens Masquer 2 commentaires plus anciens