1) Forums : Technical Support : Camb_boinc2docker Errors (Message 21400)
Posted 4 May 2017 by ritterm
Post:
Jim1348 wrote:
This has been the same for several years, so I expect it is normal.

Marius wrote:
It's true, this occasional failure has been around for a while. This reminds me, though, that it's actually fixable on our end, just with a bit of work, which I will try to get to shortly!

Thanks for the feedback, guys.

With regard to the access violation errors, these are happening only on my Intel-Win10 hosts. One is 2-core/4-thread with 8GB RAM and the other is 4-core with 4GB RAM. Is there anything I can do about that? Generally speaking, neither has trouble on other projects I've run recently (e.g., WCG, LHC, Asteroids, Universe).
2) Forums : Technical Support : Camb_boinc2docker Errors (Message 21395)
Posted 4 May 2017 by ritterm
Post:
Three of my hosts are throwing off computation errors. Two AMD-Linux hosts have lines similar to the following in the stderr output:

2017-05-03 17:59:37 (2320): Guest Log: Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG
2017-05-03 17:59:37 (2320): Guest Log: ERROR STOP Stopped due to parameter error
2017-05-03 17:59:37 (2320): Guest Log: Optical depth is strange. You have: 9.1423397432200005E-003
2017-05-03 17:59:38 (2320): Guest Log: boinc_app exited (1)

Are some tasks naturally going to error out like this? If so, what's the expected error rate?

The errors for an Intel-Win10 host are access violations and I'm not sure there's much I can do.
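
For anyone who wants to tally these failures across several tasks, here is a minimal sketch that pulls the exit code and any "ERROR STOP" message out of a stderr dump. It assumes every relevant line has the same shape as the excerpt above (timestamp, process ID in parentheses, then "Guest Log:"); the function name `summarize_guest_log` is just an illustrative choice, not part of BOINC or the project.

```python
import re

# Assumed line shape, copied from the excerpt above:
# "<date> <time> (<pid>): Guest Log: <message>"
GUEST_LOG = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \(\d+\): Guest Log: (?P<msg>.*)$"
)
EXIT_CODE = re.compile(r"boinc_app exited \((\d+)\)")


def summarize_guest_log(stderr_text):
    """Return (exit_code, error_messages) found in a stderr dump.

    exit_code is None if no "boinc_app exited (N)" line is present.
    """
    exit_code = None
    errors = []
    for line in stderr_text.splitlines():
        match = GUEST_LOG.match(line.strip())
        if not match:
            continue  # not a guest-log line
        msg = match.group("msg")
        exited = EXIT_CODE.search(msg)
        if exited:
            exit_code = int(exited.group(1))
        elif msg.startswith("ERROR STOP"):
            errors.append(msg)
    return exit_code, errors


sample = """\
2017-05-03 17:59:37 (2320): Guest Log: Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG
2017-05-03 17:59:37 (2320): Guest Log: ERROR STOP Stopped due to parameter error
2017-05-03 17:59:37 (2320): Guest Log: Optical depth is strange. You have: 9.1423397432200005E-003
2017-05-03 17:59:38 (2320): Guest Log: boinc_app exited (1)
"""

code, errs = summarize_guest_log(sample)
print(code, errs)
```

Counting how many tasks return a non-zero exit code with a "parameter error" message would give a rough error rate for a host, though it says nothing about what rate the project considers normal.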
3) Forums : Technical Support : Long Running Planck Tasks (Message 20911)
Posted 28 Feb 2016 by ritterm
Post:
2889 is in the top 0.1% of longest jobs it looks like!

Whoa... Task 37153848:

2016-02-28 10:14:31 (9908): Guest Log: nfev: 3523
4) Forums : Technical Support : Long Running Planck Tasks (Message 20904)
Posted 27 Feb 2016 by ritterm
Post:
Unfortunately there's no good way to award credits based on the actual number of steps the job takes...Fortunately, as long as you run more than a few of these jobs, it'll average out in the end.

Thanks for pointing that out, Marius. I hesitate to say this, but I've noted that I'm getting a little better than 40 cr/CPU/hr on average across all my hosts. Compared to CPU tasks at other projects, that's pretty good.
5) Forums : Technical Support : Long Running Planck Tasks (Message 20890)
Posted 25 Feb 2016 by ritterm
Post:
If you're really curious, you can check after the fact how many steps it took for the minimizer to finish; it's the number after "nfev" listed in the job's log. Most jobs should finish before 1500, but some go longer.

Thanks, Marius. I've seen nfev values as high as 1555 on my long tasks. On some tasks, not necessarily long ones, I don't see nfev in the output at all. Is that expected? I'm sorry I didn't note an example when I first looked, but I can dig one up, if you like.
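
In case it helps anyone checking their own tasks, here is a rough sketch for pulling the nfev value out of a task log. It assumes the value appears on a line shaped like "Guest Log: nfev: 3523", as in a log line quoted in another post in this forum, and it returns None when no such line exists, which covers the tasks where nfev does not show up at all. The function name `find_nfev` and the choice to take the last match are my own assumptions.

```python
import re

# Assumed format: "... Guest Log: nfev: <integer>"
NFEV = re.compile(r"Guest Log: nfev: (\d+)")


def find_nfev(log_text):
    """Return the last nfev value in the log, or None if none is present."""
    values = [int(m.group(1)) for m in NFEV.finditer(log_text)]
    return values[-1] if values else None


log_with_nfev = "2016-02-28 10:14:31 (9908): Guest Log: nfev: 3523\n"
log_without_nfev = "2016-02-28 10:14:31 (9908): Guest Log: boinc_app exited (0)\n"

print(find_nfev(log_with_nfev))
print(find_nfev(log_without_nfev))
```

Comparing the result against 1500 would flag the longer-than-usual jobs Marius describes.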
6) Forums : Technical Support : Long Running Planck Tasks (Message 20885)
Posted 25 Feb 2016 by ritterm
Post:
Yea sorry, unfortunately there's no way to know ahead of time exactly how long these jobs are ...

Okay. So, it's not necessarily indicative of a problem that some of these jobs are taking several hours to run?
7) Forums : Technical Support : Long Running Planck Tasks (Message 20882)
Posted 24 Feb 2016 by ritterm
Post:
Is there checkpointing on Planck tasks?

I'll let Marius or someone else more knowledgeable say for sure, but I think the answer is yes. I've seen VM log entries that seem to refer to checkpointing.

I have a task that has already been running 3 times longer than any other Planck task. It has reached and gone past 100% with no indication of finishing.

Should I abort?

I've had the same experience and have found that almost all eventually finish. Only one of my tasks ended with the "time exceeded" error. Another long running task ended in some kind of compute error. I would let it go.
8) Forums : Technical Support : Long Running Planck Tasks (Message 20864)
Posted 17 Feb 2016 by ritterm
Post:
Do you have "leave applications in memory" unchecked? If so, every time you interrupt the computation, it should start over from the beginning.

I thought about the "leave applications in memory" option and looked to see that none of my hosts have that option checked. So, there's no checkpointing on these tasks? Maybe this is a problem for hosts with relatively slower CPUs, like the two I've been concerned about in this thread. I'll experiment with that as well as setting up the workload on these hosts so they run straight through the Planck tasks without stopping.
9) Forums : Technical Support : Long Running Planck Tasks (Message 20855)
Posted 16 Feb 2016 by ritterm
Post:
Unfortunately for me, problems running Planck tasks on my hosts 178960 and 178962 have returned.

For now, I'm going to go back to running only the camb_boinc2docker tasks on these hosts and see how it goes.

On multiple occasions, I've seen the progress reset to zero and start over. Should these tasks be able to tolerate starts and stops based on BOINC manager settings?
10) Forums : Technical Support : Invalid Tasks (Message 20846)
Posted 12 Feb 2016 by ritterm
Post:
I will re-validate these jobs...so everyone will get the credits.

Okay, great. Thanks, Marius!
11) Forums : Technical Support : How do I throttle down the Planck tasks? (Message 20844)
Posted 12 Feb 2016 by ritterm
Post:
...I might have picked up the information I needed from the XML files in the projects/cosmology@home file...

Also, if you checked the messages in the BOINC manager at start up or after a forced read of the config files, you might have seen something like this:

Cosmology@Home 2/11/2016 8:04:43 PM Your app_config.xml file refers to an unknown application 'planck'. Known applications: 'lsplitsims'
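
A quick way to catch this before BOINC complains is to compare the application names in app_config.xml against the names the project reports. The sketch below is only an illustration: the XML is a made-up example using the standard `<app>`, `<name>` and `<max_concurrent>` elements, the value 1 is arbitrary, and the helper name `unknown_apps` is hypothetical. The known name 'lsplitsims' comes from the message above.

```python
import xml.etree.ElementTree as ET


def unknown_apps(app_config_xml, known_apps):
    """Return app names in the config that are not in known_apps."""
    root = ET.fromstring(app_config_xml.strip())
    names = [el.text.strip() for el in root.findall("./app/name") if el.text]
    return [name for name in names if name not in known_apps]


# Hypothetical config that would trigger the message above.
example_config = """
<app_config>
  <app>
    <name>planck</name>
    <max_concurrent>1</max_concurrent>
  </app>
</app_config>
"""

print(unknown_apps(example_config, {"lsplitsims"}))
```

If the returned list is not empty, the names in it need to be replaced with ones from the "Known applications" part of the message.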
12) Forums : Technical Support : Invalid Tasks (Message 20842)
Posted 11 Feb 2016 by ritterm
Post:
I think the replication error (discussed here) could be causing real problems when tasks for the same WU are sent to the same host (which I think might be a separate issue itself).

I've returned at least 5 tasks today that have been marked as invalid and 4 of those were from WUs where both tasks were sent to the same host. In all cases, including the one where my wingman validated, the invalid task was returned after the valid task.
13) Forums : Technical Support : Cancelled by server (Message 20839)
Posted 11 Feb 2016 by ritterm
Post:
...It's not actually wasting computing time, right? The job gets cancelled before it starts?

That's correct (at least for me). I mostly wanted to point it out in case it was a sign of bigger issues.
14) Forums : Technical Support : Cancelled by server (Message 20837)
Posted 11 Feb 2016 by ritterm
Post:
And on the replication of two, I am my own wingman...

Whoa! I didn't check that and notice now that the same happened to me in at least two cases (WU 23096095 and WU 23096108).
15) Forums : Technical Support : Cancelled by server (Message 20835)
Posted 11 Feb 2016 by ritterm
Post:
I'm seeing numerous server cancellations recently and have noticed that the initial replication for the Planck tasks has changed from one to two. What changed and why? Just curious... :-)
16) Forums : Technical Support : Long Running Planck Tasks (Message 20829)
Posted 10 Feb 2016 by ritterm
Post:
Things have been going mostly okay for my hosts since moving all to VBox 5.0.14. However, some tasks run for several hours and one errored out with "time limit exceeded". Examples below from different hosts:

Long running task: Task 36445199

"Time limit exceeded" error: Task 36326281

Is anyone else seeing this long running behavior?
17) Forums : Technical Support : Long Running Planck Tasks (Message 20819)
Posted 5 Feb 2016 by ritterm
Post:
Can you maybe try upgrading to the very latest Virtualbox?

Both hosts have been upgraded to VBox 5.0.14 and have completed tasks okay with no errors. However, some have unusually long run times and one currently running task shows over 3 hours of elapsed time and 33% completion (I'm going to let that continue for now). See the valid tasks returned on 5 Feb 16 for hosts 178960 and 178962.
18) Forums : Technical Support : Long Running Planck Tasks (Message 20817)
Posted 4 Feb 2016 by ritterm
Post:
...both hosts seem to be running the Planck tasks okay...

Or, so I thought... I aborted Task 36322809 after about 4 hours of run time. Also, I noticed its progress was reset at least once.

Why is there no VM log compared to one of its siblings that I aborted earlier today?
19) Forums : Technical Support : Long Running Planck Tasks (Message 20816)
Posted 4 Feb 2016 by ritterm
Post:
Since aborting those tasks, both hosts seem to be running the Planck tasks okay and have returned valid results. My apologies for what appears to be a false alarm! :-)

MarkR
20) Forums : Technical Support : Long Running Planck Tasks (Message 20815)
Posted 4 Feb 2016 by ritterm
Post:
Marius wrote:
...can you Abort these jobs and Update...

Done! I should have thought to do this first... The hosts are running other Planck jobs now and after about 25 min are at 15% progress. So, perhaps there was just some strange one-time problem?

