Cluster error: Opening log file

Ceren GURKAN on 25 Feb 2013
Hi everybody,
I am running MATLAB code on my university cluster. It is basically a for loop that submits a job to the cluster, waits 2.5 hours for the results to be generated, and then moves on to the next iteration. However, suppose it completes generation 8; after 2.5 hours it starts generation 9 and completes that as well, but at the point where it is supposed to move on to generation 10, this message appears on the screen: "Opening log file: /eng/cvcluster/eggurkanc/java.log.3643", and it never moves on to the 10th generation. I have no idea how to deal with this; any help would be appreciated.
Thanks in advance.

Accepted Answer

Jason Ross on 25 Feb 2013
Edited: Jason Ross on 25 Feb 2013
Are you out of disk space? Have you exceeded a disk quota? Looks like you aren't in a normal "home" directory, so there may be more restrictive limits on the cluster.
Does the queue you are submitting to have restrictions on job time or hours of the day it runs? You might need to check with the admins.
Are you getting pre-empted by some other job that jumps the queue?
Are there any emails from the cluster about your job?
If you check the job status, what does it show? (The exact command depends on the scheduler you are using, but it might be something like qstat.)
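The disk-space question above can be checked with a few lines of code. The following is a minimal sketch in Python, used here only for illustration since the original code is MATLAB; `free_space_report` is a hypothetical helper name. It reports free space on the filesystem that holds a given directory. It does not know about per-user quotas, which usually need a cluster-specific tool or a question to the admins.

```python
import shutil


def free_space_report(path="."):
    """Return (free_bytes, fraction_free) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    # Guard against filesystems that report a total size of zero.
    fraction = usage.free / usage.total if usage.total else 0.0
    return usage.free, fraction


if __name__ == "__main__":
    # Assumption: replace "." with the directory where the log file is written.
    free_bytes, fraction = free_space_report(".")
    print(f"{free_bytes / 1e9:.2f} GB free ({fraction:.1%} of the filesystem)")
```

Running this just before each iteration would show whether space is shrinking as the generations accumulate.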
  4 Comments
Ceren GURKAN on 26 Feb 2013
I am not sure whether I understand you completely, so first of all, sorry for that :( . What I can say is that I am running just this specific code and nothing else. So I am not sure whether I could be using the log file simultaneously, and if I am, how can I detect that and prevent it from happening?
Jason Ross on 26 Feb 2013
One of the common problems on clustered systems is that something you test or prototype in a single execution, where it works, becomes a shared resource when you run it on a cluster. Since you can now have multiple threads of execution acting on the same resource, this can become a problem. For example, the following will work fine with one process:
cd to /cluster/shared/filesystem
open a file named "myresults"
write to "myresults"
close "myresults" when done.
Then you submit this to a cluster and problems start. When you had one process working on that file, everything was OK. Now you have n processes trying to write to the file simultaneously. You end up with (at best) a jumbled mess of output, and at worst you deadlock and get confused.
To get out of this, there are many solutions. One is to use the PID to make the log name unique (which it looks like is already being tried, although you can still get a clash). You can also use random numbers, the machine name, etc. to make the file names even less likely to collide, and then concatenate the files at the end of your run.
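The unique-name idea can be sketched as follows. This is an illustration in Python rather than MATLAB, and the function names, the `myresults` prefix, and the file layout are all assumptions made for the example. Each process writes to its own file, named from the host name, the PID, and a random token, and the pieces are concatenated once the run is over.

```python
import glob
import os
import socket
import tempfile
import uuid


def unique_log_path(directory, prefix="myresults"):
    """Build a path that is unlikely to clash across processes and nodes."""
    host = socket.gethostname()      # distinguishes machines
    pid = os.getpid()                # distinguishes processes on one machine
    token = uuid.uuid4().hex[:8]     # random part, in case PIDs are reused
    return os.path.join(directory, f"{prefix}.{host}.{pid}.{token}.log")


def write_result(directory, text, prefix="myresults"):
    """Write `text` to a fresh per-process file and return its path."""
    path = unique_log_path(directory, prefix)
    # Mode "x" raises FileExistsError instead of silently overwriting.
    with open(path, "x") as handle:
        handle.write(text)
    return path


def merge_logs(directory, merged_name="myresults.merged", prefix="myresults"):
    """Concatenate all per-process files into one file at the end of a run."""
    pattern = os.path.join(glob.escape(directory), f"{prefix}.*.log")
    merged_path = os.path.join(directory, merged_name)
    with open(merged_path, "w") as out:
        for part in sorted(glob.glob(pattern)):
            with open(part) as handle:
                out.write(handle.read())
    return merged_path


if __name__ == "__main__":
    # Demo in a throwaway directory; on a cluster this would be shared storage.
    with tempfile.TemporaryDirectory() as workdir:
        write_result(workdir, "result from worker 1\n")
        write_result(workdir, "result from worker 2\n")
        with open(merge_logs(workdir)) as merged:
            print(merged.read(), end="")
```

The merged file is ordered by file name, not by time; if ordering matters, each record would need its own timestamp.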
This is a pretty simple example, but I'd inspect and further instrument the code to see where it's getting to and what is stopping the execution.
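One way to instrument a loop like this is to write a timestamped line before and after each generation, so the last line in the log shows how far the run got. The sketch below is in Python for illustration only; `make_progress_logger`, `run_generations`, and the `step` callback are hypothetical names standing in for the real submit-and-wait code.

```python
import logging
import os


def make_progress_logger(log_path):
    """Create a logger that appends timestamped lines to `log_path`."""
    logger = logging.getLogger(f"progress.{os.getpid()}")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep these lines out of the root logger
    if not logger.handlers:   # avoid adding a second handler on repeat calls
        handler = logging.FileHandler(log_path)
        handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))
        logger.addHandler(handler)
    return logger


def run_generations(n_generations, step, logger):
    """Call `step(generation)` for each generation, logging around each call."""
    for generation in range(1, n_generations + 1):
        logger.info("generation %d: starting", generation)
        try:
            step(generation)  # assumption: submits the job and waits for it
        except Exception:
            # Record the traceback before re-raising, so the cause is on disk.
            logger.exception("generation %d: failed", generation)
            raise
        logger.info("generation %d: finished", generation)


if __name__ == "__main__":
    log = make_progress_logger("progress.log")
    run_generations(3, lambda g: None, log)  # placeholder step that does nothing
```

If the run stalls, a "starting" line without a matching "finished" line points at the generation where execution stopped.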
