Message boards : Technical Support : Long Running Planck Tasks

ritterm
Joined: 30 May 08
Posts: 25
Credit: 896,913
RAC: 0
Message 20813 - Posted: 3 Feb 2016, 22:33:23 UTC
Last modified: 3 Feb 2016, 22:41:54 UTC

My two Intel hosts were working on the new Planck app before I suspended them because they had been running for so long. One has run for almost 9 hours and the other one over 4 hours.

Both use app_config files to limit running one task at a time and to use 2 cores/threads. Windows Task Manager indicates that both tasks are getting the CPU time expected. I've rebooted both machines with no effect other than to reset the progress bar, which had been between 33%-45% for each job. There are no general CPU% or NCPU% restrictions set in BOINC. Neither has had much trouble with the CAMB MTd apps and both are running BOINC 7.6.22 and VBox 5.0.10.
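
For reference, a setup like the one described above could look roughly like this. It is a minimal sketch, assuming the app is named planck_param_sims (the name used later in this thread) and that its plan class is "vbox64_mt", which is a guess and should be checked against the project's applications page. The script only builds the text of an app_config.xml; the file itself would go in the project's folder inside the BOINC data directory.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, plan_class: str, ncpus: int, max_concurrent: int) -> str:
    """Return the text of a BOINC app_config.xml for a single application."""
    root = ET.Element("app_config")

    # <app> block: cap how many tasks of this app run at the same time.
    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = str(max_concurrent)

    # <app_version> block: how many CPUs the client budgets for each task.
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)

    # Pretty-print when available (ET.indent was added in Python 3.9).
    if hasattr(ET, "indent"):
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # "vbox64_mt" is an assumed plan class, not confirmed anywhere in this thread.
    print(build_app_config("planck_param_sims", "vbox64_mt", ncpus=2, max_concurrent=1))
```

After saving the output as app_config.xml, the client has to re-read its config files (or be restarted) before the limits take effect.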

My AMD hosts (BOINC 7.6.22 and VBox 5.0.10) are crunching the Plancks with no apparent problems.
____________

Marius
Project administrator
Project developer
Project scientist
Joined: 29 Jun 15
Posts: 427
Credit: 4,276
RAC: 0
Message 20814 - Posted: 4 Feb 2016, 10:45:39 UTC - in response to Message 20813.

Hi ritterm, can you Abort these jobs and Update so I can see the logs on the server? All I'm seeing from you right now is "success".

ritterm
Joined: 30 May 08
Posts: 25
Credit: 896,913
RAC: 0
Message 20815 - Posted: 4 Feb 2016, 11:38:05 UTC - in response to Message 20814.

Marius wrote:
...can you Abort these jobs and Update...

Done! I should have thought to do this first... The hosts are running other Planck jobs now and after about 25 min are at 15% progress. So, perhaps there was just some strange one-time problem?
____________

ritterm
Joined: 30 May 08
Posts: 25
Credit: 896,913
RAC: 0
Message 20816 - Posted: 4 Feb 2016, 14:00:48 UTC

Since aborting those tasks, both hosts seem to be running the Planck tasks okay and have returned valid results. My apologies for what appears to be a false alarm! :-)

MarkR
____________

ritterm
Joined: 30 May 08
Posts: 25
Credit: 896,913
RAC: 0
Message 20817 - Posted: 4 Feb 2016, 19:31:54 UTC - in response to Message 20816.

...both hosts seem to be running the Planck tasks okay...

Or, so I thought... I aborted Task 36322809 after about 4 hours of run time. Also, I noticed its progress was reset at least once.

Why is there no VM log for this task, when one of its siblings that I aborted earlier today has one?
____________

Marius
Project administrator
Project developer
Project scientist
Joined: 29 Jun 15
Posts: 427
Credit: 4,276
RAC: 0
Message 20818 - Posted: 4 Feb 2016, 20:55:15 UTC - in response to Message 20817.

Good question about the missing VM log; I'm not sure. The other two offer some hints but nothing clear, unfortunately. Can you maybe try upgrading to the very latest VirtualBox? I don't think you're the only one seeing this type of error. I've been meaning to make a pass through the logs to see if I can spot any patterns, and I will try to get to it soon.

ritterm
Joined: 30 May 08
Posts: 25
Credit: 896,913
RAC: 0
Message 20819 - Posted: 5 Feb 2016, 13:38:30 UTC - in response to Message 20818.
Last modified: 5 Feb 2016, 13:39:56 UTC

Can you maybe try upgrading to the very latest VirtualBox?

Both hosts have been upgraded to VBox 5.0.14 and have completed tasks okay with no errors. However, some have unusually long run times and one currently running task shows over 3 hours of elapsed time and 33% completion (I'm going to let that continue for now). See the valid tasks returned on 5 Feb 16 for hosts 178960 and 178962.
____________

ritterm
Joined: 30 May 08
Posts: 25
Credit: 896,913
RAC: 0
Message 20829 - Posted: 10 Feb 2016, 14:59:05 UTC
Last modified: 10 Feb 2016, 14:59:20 UTC

Things have been going mostly okay for my hosts since moving all to VBox 5.0.14. However, some tasks run for several hours and one errored out with "time limit exceeded". Examples below from different hosts:

Long running task: Task 36445199

"Time limit exceeded" error: Task 36326281

Is anyone else seeing this long running behavior?
____________

ritterm
Joined: 30 May 08
Posts: 25
Credit: 896,913
RAC: 0
Message 20855 - Posted: 16 Feb 2016, 15:07:41 UTC

Unfortunately for me, problems running Planck tasks on my hosts 178960 and 178962 have returned.

For now, I'm going to go back to running only the camb_boinc2docker tasks on these hosts and see how it goes.

On multiple occasions, I've seen the progress reset to zero and start over. Should these tasks be able to tolerate starts and stops based on BOINC manager settings?
____________

Marius
Project administrator
Project developer
Project scientist
Joined: 29 Jun 15
Posts: 427
Credit: 4,276
RAC: 0
Message 20861 - Posted: 17 Feb 2016, 12:53:04 UTC - in response to Message 20855.

Unfortunately for me, problems running Planck tasks on my hosts 178960 and 178962 have returned.

For now, I'm going to go back to running only the camb_boinc2docker tasks on these hosts and see how it goes.

On multiple occasions, I've seen the progress reset to zero and start over. Should these tasks be able to tolerate starts and stops based on BOINC manager settings?

Do you have "leave applications in memory" unchecked? If so, every time you interrupt the computation, it should start over from the beginning. It's clear from the logs that your jobs are restarting several times, but I don't know whether it's because they were suspended by BOINC or because they reset themselves for some reason.

ritterm
Joined: 30 May 08
Posts: 25
Credit: 896,913
RAC: 0
Message 20864 - Posted: 17 Feb 2016, 16:02:37 UTC - in response to Message 20861.

Do you have "leave applications in memory" unchecked? If so, every time you interrupt the computation, it should start over from the beginning.

I thought about the "leave applications in memory" option and looked to see that none of my hosts have that option checked. So, there's no checkpointing on these tasks? Maybe this is a problem for hosts with relatively slower CPUs, like the two I've been concerned about in this thread. I'll experiment with that as well as setting up the workload on these hosts so they run straight through the Planck tasks without stopping.
____________

Rasputin42
Joined: 30 Apr 09
Posts: 20
Credit: 327,276
RAC: 11
Message 20881 - Posted: 23 Feb 2016, 23:19:39 UTC

Is there checkpointing on Planck tasks?

I have a task that has already been running 3 times longer than any other Planck task; it reached and went past 100% with no indication of finishing.

Should I abort?

ritterm
Joined: 30 May 08
Posts: 25
Credit: 896,913
RAC: 0
Message 20882 - Posted: 24 Feb 2016, 0:12:59 UTC - in response to Message 20881.

Is there checkpointing on Planck tasks?

I'll let Marius or someone else more knowledgeable say for sure, but I think the answer is yes. I've seen VM log entries that seem to refer to checkpointing.

I have a task that has already been running 3 times longer than any other Planck task; it reached and went past 100% with no indication of finishing.

Should I abort?

I've had the same experience and have found that almost all eventually finish. Only one of my tasks ended with the "time exceeded" error. Another long running task ended in some kind of compute error. I would let it go.
____________

Rasputin42
Joined: 30 Apr 09
Posts: 20
Credit: 327,276
RAC: 11
Message 20883 - Posted: 24 Feb 2016, 0:42:14 UTC
Last modified: 24 Feb 2016, 0:45:51 UTC

Thanks ritterm.
I checked shared/progress and it went 0.001, 0.002, ...

0.909
0.910
0.910
0.911
1.410
1.411
1.412
1.412
1.413
1.414
It never hit the 100% mark (1.000)
I suspended the task and resumed it (with "leave applications in memory" ticked).
The "progress" file was reset and started from 0.
This indicates to me that there is no checkpointing.
I understand that this progress file is independent of BOINC, and I do not know what purpose it serves.
The BOINC progress is still at 100% and the task keeps running...

EDIT: It just finished, but it was very very long.
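
The check described in this post can be automated with a small script. This is only a sketch, assuming the progress file holds one fraction per line as in the listing above; the helper names are made up for illustration. It flags every sample that is lower than the one before it, which is what a restart from 0 looks like.

```python
from typing import Iterable, List


def parse_progress(text: str) -> List[float]:
    """Parse one fraction per line, skipping blank or malformed lines."""
    values = []
    for line in text.splitlines():
        try:
            values.append(float(line.strip()))
        except ValueError:
            # Not a number (empty line, garbled entry): ignore it.
            continue
    return values


def find_resets(samples: Iterable[float]) -> List[int]:
    """Return the indices at which progress dropped below the previous sample."""
    resets = []
    previous = None
    for index, value in enumerate(samples):
        if previous is not None and value < previous:
            resets.append(index)
        previous = value
    return resets


if __name__ == "__main__":
    log = "0.909\n0.910\n1.414\n0.001\n0.002\n"
    print(find_resets(parse_progress(log)))  # [3]
```

A value above 1.000 is not treated as an error here, since the listing above shows the file counting past 1.0 on a long job.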

Marius
Project administrator
Project developer
Project scientist
Joined: 29 Jun 15
Posts: 427
Credit: 4,276
RAC: 0
Message 20884 - Posted: 24 Feb 2016, 23:28:32 UTC - in response to Message 20882.

Is there checkpointing on Planck tasks?

I'll let Marius or someone else more knowledgeable say for sure, but I think the answer is yes. I've seen VM log entries that seem to refer to checkpointing.

There is actually no checkpointing on either planck_param_sims or camb_boinc2docker jobs. So if you don't have "leave applications in memory" checked, jobs will restart when they are suspended. For now I've tried to combat the downsides of this by keeping jobs fairly short. It's definitely on the TODO list to get checkpointing back, though (we had it in camb_boinc2docker beta testing, but at the time it was pretty unreliable).
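
To make the cost of restarts concrete, here is a rough sketch of the arithmetic. It assumes every suspension throws away all work done so far (no checkpoint, not kept in memory); the function name and the example numbers are invented for illustration and are not taken from any real task.

```python
from typing import List, Optional


def wall_hours_to_finish(task_hours: float, windows: List[float]) -> Optional[float]:
    """Total hours spent until a task with no checkpointing completes.

    `windows` lists how long the task may run before each suspension that
    discards its state. Returns None if no window is long enough.
    """
    spent = 0.0
    for window in windows:
        if window >= task_hours:
            # The task fits in this window and finishes partway through it.
            return spent + task_hours
        # Suspended before finishing: all work done in this window is lost.
        spent += window
    return None


if __name__ == "__main__":
    print(wall_hours_to_finish(3.0, [1.0, 2.5, 4.0]))  # 6.5
    print(wall_hours_to_finish(3.0, [1.0, 2.0]))       # None
```

The second case is the one that hurts slower hosts: if a job needs longer than any uninterrupted stretch the machine gets, it never finishes and eventually hits the time limit.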

EDIT: It just finished, but it was very very long.

Yeah, sorry, unfortunately there's no way to know ahead of time exactly how long these jobs will run (they run a minimizer, which takes a varying amount of time to converge depending on the simulation). Based on pre-release tests, I scaled the progress bar so that 99% of jobs finish before the progress bar ends. I guess consider yourself one of the 1%!
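
One way such a progress bar could be calibrated is sketched below. This is not the project's actual code: the nearest-rank percentile, the function names, and the sample run times are all assumptions made for illustration. The idea is to take the 99th-percentile run time from test jobs as the reference length and report elapsed time divided by that reference, which naturally goes past 1.0 for the slowest 1% of jobs.

```python
from typing import List


def percentile_nearest_rank(values: List[float], pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, clamped to at least rank 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def progress_fraction(elapsed: float, reference: float) -> float:
    """Progress as a fraction of the reference run time; may exceed 1.0."""
    return elapsed / reference


if __name__ == "__main__":
    # Invented run times in minutes from imaginary pre-release tests.
    test_runtimes = [float(m) for m in range(1, 101)]
    reference = percentile_nearest_rank(test_runtimes, 99)
    print(reference)                            # 99.0
    print(progress_fraction(49.5, reference))   # 0.5
    print(progress_fraction(140.0, reference))  # above 1.0 for an unusually long job
```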

ritterm
Joined: 30 May 08
Posts: 25
Credit: 896,913
RAC: 0
Message 20885 - Posted: 25 Feb 2016, 1:46:24 UTC - in response to Message 20884.

Yeah, sorry, unfortunately there's no way to know ahead of time exactly how long these jobs are ...

Okay. So, it's not necessarily indicative of a problem that some of these jobs are taking several hours to run?
____________

Marius
Project administrator
Project developer
Project scientist
Joined: 29 Jun 15
Posts: 427
Credit: 4,276
RAC: 0
Message 20886 - Posted: 25 Feb 2016, 11:16:07 UTC - in response to Message 20885.

Yeah, sorry, unfortunately there's no way to know ahead of time exactly how long these jobs are ...

Okay. So, it's not necessarily indicative of a problem that some of these jobs are taking several hours to run?

If that happens only once in a while and the CPU usage is still going, it's most likely fine, yeah. If you're really curious, you can check after the fact how many steps the minimizer took to finish; it's the number after "nfev" listed in the job's log. Most jobs should finish before 1500, but some go longer.
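
Pulling that number out of a saved log can be scripted. The sketch below assumes the log contains the word "nfev" followed closely by an integer (for example "nfev: 1873"); the exact log layout is not shown in this thread, so the pattern and the sample text are guesses.

```python
import re
from typing import Optional

# "nfev", then up to five non-digit characters (":", "=", spaces, quotes), then the count.
NFEV_PATTERN = re.compile(r"\bnfev\b\D{0,5}(\d+)")


def extract_nfev(log_text: str) -> Optional[int]:
    """Return the last nfev value found in the log text, or None if there is none."""
    matches = NFEV_PATTERN.findall(log_text)
    return int(matches[-1]) if matches else None


if __name__ == "__main__":
    sample_log = "status: 0\n  nfev: 1873\nsuccess: True\n"  # invented sample
    nfev = extract_nfev(sample_log)
    print(nfev)  # 1873
    if nfev is not None and nfev > 1500:
        print("Longer than most jobs, but not necessarily a problem.")
```

Taking the last match means that if a job restarted and logged more than one result, the final one is reported.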

Rasputin42
Joined: 30 Apr 09
Posts: 20
Credit: 327,276
RAC: 11
Message 20887 - Posted: 25 Feb 2016, 13:21:08 UTC

Thanks for the explanation.
Is the "nfev" proportional to the total cpu time?

Marius
Project administrator
Project developer
Project scientist
Joined: 29 Jun 15
Posts: 427
Credit: 4,276
RAC: 0
Message 20888 - Posted: 25 Feb 2016, 13:32:32 UTC - in response to Message 20887.

Thanks for the explanation.
Is the "nfev" proportional to the total cpu time?

Yep, exactly.

Rasputin42
Joined: 30 Apr 09
Posts: 20
Credit: 327,276
RAC: 11
Message 20889 - Posted: 25 Feb 2016, 14:12:47 UTC - in response to Message 20888.

So if you divide the CPU time by nfev, the bigger the number, the less efficient the calculation?
I am trying to determine what the best configuration is: 1 task with 4 CPUs and 2 GB of memory,
or 2 tasks with 2 CPUs and 1 GB of memory, or whatever combination is best.


How much dedicated memory actually makes sense?
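
The comparison asked about here could be set up as in the sketch below. The class, its field names, and every number in it are invented for illustration; real values would come from the CPU time, run time, and nfev of finished tasks. CPU seconds per evaluation measures how efficiently cores are used, while evaluations per wall-clock hour across all concurrent tasks measures what the whole host gets done.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    label: str
    cpu_seconds: float      # total CPU time reported for the task
    elapsed_seconds: float  # wall-clock run time of the task
    nfev: int               # minimizer function evaluations from the job log
    concurrent_tasks: int   # how many such tasks ran side by side


def cpu_seconds_per_eval(result: TaskResult) -> float:
    """CPU cost of one function evaluation; lower means more efficient."""
    return result.cpu_seconds / result.nfev


def host_evals_per_hour(result: TaskResult) -> float:
    """Evaluations per wall-clock hour summed over all concurrent tasks."""
    return result.concurrent_tasks * result.nfev * 3600 / result.elapsed_seconds


if __name__ == "__main__":
    # Made-up measurements for the two layouts mentioned in the post.
    configs = [
        TaskResult("1 task x 4 CPUs", 12000.0, 3600.0, 1200, 1),
        TaskResult("2 tasks x 2 CPUs", 7200.0, 4000.0, 1200, 2),
    ]
    for cfg in configs:
        print(cfg.label, cpu_seconds_per_eval(cfg), host_evals_per_hour(cfg))
```

Since nfev varies from job to job, averaging over several tasks per configuration would give a fairer comparison than a single pair.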
