Message boards : Announcements : Beta testing the new C@H

Author Message
Crystal Pellet
Joined: 12 Feb 13
Posts: 21
Credit: 351,884
RAC: 0
Message 20358 - Posted: 18 Oct 2015, 20:54:10 UTC - in response to Message 20344.

A few of the updates which I have pushed recently:

  • I did away with check-pointing entirely for now. I would like to have it, but for now it seemed more trouble than it's worth. This should solve many of the memory / disk space / stuck job problems some were seeing.
  • I shortened the jobs (~20 min on my laptop), so there's less need for check-pointing anyway.
  • The server status page has a link to the exact version of the code which the server is currently running, for those curious.
  • No more camb_legacy jobs should be sneaking in if your host can run camb_boinc2docker.



Maybe you could add <enable_vm_savestate_usage/> to your camb_boinc2docker_0.04_vbox_job.xml file.
That would save the state of the VM instead of powering it off when a task is suspended with "Leave Application in Memory" unticked, or when BOINC is stopped/restarted.
As it is now, after a resume a task has to start from the very beginning, with the VM booting from scratch.
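
For anyone who wants to try the same tweak, here is a minimal Python sketch of adding the tag to a job file. The `<vbox_job>` root element and the sample `<memory_size_mb>` entry are assumptions made for illustration, not the contents of the real camb_boinc2docker_0.04_vbox_job.xml, and the helper name `add_savestate_flag` is hypothetical.

```python
import xml.etree.ElementTree as ET


def add_savestate_flag(xml_text: str, tag: str = "enable_vm_savestate_usage") -> str:
    """Return xml_text with an empty <tag/> child added to the root, if absent."""
    root = ET.fromstring(xml_text)
    # Only append the flag when it is not already present, so the edit is idempotent.
    if root.find(tag) is None:
        ET.SubElement(root, tag)
    return ET.tostring(root, encoding="unicode")


# Made-up sample job file; the real file will contain other settings.
sample = "<vbox_job><memory_size_mb>512</memory_size_mb></vbox_job>"
print(add_savestate_flag(sample))
```

Running the helper twice on the same text leaves a single copy of the tag, so it should be safe to apply to a file that was already edited by hand.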

CP

Profile Marius
Project administrator
Project developer
Project scientist
Joined: 29 Jun 15
Posts: 427
Credit: 4,276
RAC: 0
Message 20359 - Posted: 18 Oct 2015, 21:07:27 UTC - in response to Message 20358.

I played with that option, but suspend/resume still seemed very unstable. Plenty of times it seemed I got into a state where the task just hung indefinitely, restarted anyway, etc... Conversely the current setup has seemed very robust based on results I'm seeing on the beta server. The only drawback is, as you say, having to start over. At least the jobs are very short so you're not losing too much work.

For now I'm going to launch with the current setup. Eventually I think <enable_vm_savestate_usage/> is definitely the way to go, but after a little more debugging.

Profile [VENETO] boboviz
Joined: 28 Nov 07
Posts: 12
Credit: 26,360
RAC: 0
Message 20360 - Posted: 19 Oct 2015, 7:04:19 UTC - in response to Message 20354.

boboviz, did you use a different username on the beta server? I can't find your jobs, but if they're able to crash your computer I'd like to take a look at them right away!


http://beta.cosmologyathome.org/results.php?userid=4147

I'm crunching rosetta/denis/citizensciencegrid on CPU and poem/seti on GPU without problems.

I forgot: during crunching, the PC lags (with an R7260x GPU).

Profile [VENETO] boboviz
Joined: 28 Nov 07
Posts: 12
Credit: 26,360
RAC: 0
Message 20361 - Posted: 19 Oct 2015, 7:05:19 UTC - in response to Message 20356.

Do you still see it happen if you lower your BOINC CPU usage to say 50%?


I'll try

Crystal Pellet
Joined: 12 Feb 13
Posts: 21
Credit: 351,884
RAC: 0
Message 20362 - Posted: 19 Oct 2015, 9:19:45 UTC - in response to Message 20359.

I played with that option, but suspend/resume still seemed very unstable. Plenty of times it seemed I got into a state where the task just hung indefinitely, restarted anyway, etc... Conversely the current setup has seemed very robust based on results I'm seeing on the beta server. The only drawback is, as you say, having to start over. At least the jobs are very short so you're not losing too much work.

For now I'm going to launch with the current setup. Eventually I think <enable_vm_savestate_usage/> is definitely the way to go, but after a little more debugging.

With those short tasks it's no big issue to restart from the beginning.
I've run several tasks with the mentioned tag, suspending and resuming them.
Most of the time the VM goes into the desired saved state, but sometimes the VM doesn't save properly and ends up in a stopped state.
After resuming such a task, the VM boots from the beginning and the task then ends in an error.

This is because you reduced <rsc_disk_bound> to too low a value, so the task errored out with EXIT_DISK_LIMIT_EXCEEDED.
Example: http://beta.cosmologyathome.org/result.php?resultid=41172

I increased the disk bound myself and waited for a VM to end up in the 'stopped' state.
After the resume it did not error out: http://beta.cosmologyathome.org/result.php?resultid=41188
Note the peak disk usage is 967.30 MB.
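
To see how close a slot directory is to a given disk bound, something like the following could be used. It is only a sketch: `dir_size_bytes` and `exceeds_disk_bound` are hypothetical helper names, the bound is passed in bytes, and the way BOINC itself accounts for disk usage may differ from a plain sum of file sizes.

```python
from pathlib import Path


def dir_size_bytes(path) -> int:
    """Sum the sizes of all regular files below path, including subdirectories."""
    return sum(p.stat().st_size for p in Path(path).rglob("*") if p.is_file())


def exceeds_disk_bound(path, bound_bytes: int) -> bool:
    """True when the files below path take more space than bound_bytes."""
    return dir_size_bytes(path) > bound_bytes
```

Calling `exceeds_disk_bound(slot_dir, bound)` before and after a suspend would show whether a leftover file under Snapshots pushes the total over the limit.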

Profile Marius
Project administrator
Project developer
Project scientist
Joined: 29 Jun 15
Posts: 427
Credit: 4,276
RAC: 0
Message 20363 - Posted: 19 Oct 2015, 9:47:10 UTC - in response to Message 20362.
Last modified: 19 Oct 2015, 9:48:01 UTC

Most of the time the VM goes into the desired saved state, but sometimes the VM doesn't save properly and ends up in a stopped state.
After resuming such a task, the VM boots from the beginning and the task then ends in an error.

Yeah, I saw this several times. There is also the problem that if you quit BOINC, it gives tasks ~15 sec to shut down, and checkpointing sometimes takes longer than this, so BOINC just kills the task, which also puts it into a state that causes it to hang/crash the next time it starts.


So do I understand correctly, you edited the vbox_job.xml file to add <enable_vmsavestate/>? I'm confused though because in http://beta.cosmologyathome.org/result.php?resultid=41172 I see no mention of saving state in the log? In any case, I lowered the disk bound since without checkpointing it wasn't necessary.

Crystal Pellet
Joined: 12 Feb 13
Posts: 21
Credit: 351,884
RAC: 0
Message 20365 - Posted: 19 Oct 2015, 13:53:30 UTC - in response to Message 20363.
Last modified: 19 Oct 2015, 14:07:52 UTC

So do I understand correctly, you edited the vbox_job.xml file to add <enable_vmsavestate/>? I'm confused though because in http://beta.cosmologyathome.org/result.php?resultid=41172 I see no mention of saving state in the log? In any case, I lowered the disk bound since without checkpointing it wasn't necessary.

I added the <enable_vmsavestate/> to camb_boinc2docker_0.04_vbox_job.xml.
The stderr of the results never mentions saving the state, only "Stopping VM."
If the save is successful, that line is followed by "Successfully stopped VM."
If that line is missing, the VM ended up in the 'Stopped' state.
During the save a file like "2015-10-19T14-01-31-048797600Z.sav" is written into the slot-sub/Snapshots directory.
Sometimes this is a very big file, which causes the disk bound to be exceeded.
When the "Stopped" state occurs, that .sav file seems not to be deleted after the resume. That's why BOINC gets an error.
After a good 'Save state', that file is deleted after the resume.
Doubling the disk bound, at least, should be enough to reduce that kind of error, I think.
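
The stderr pattern described above can be checked mechanically. This sketch assumes each log line ends with the message text (for example after a timestamp prefix), which is an assumption about the log layout; `classify_vm_stops` is a hypothetical helper, not part of BOINC or vboxwrapper.

```python
def classify_vm_stops(stderr_text: str) -> list:
    """For each "Stopping VM." line, report whether the save looks successful.

    Returns "saved" when the next line ends with "Successfully stopped VM.",
    and "stopped" when that confirmation is missing.
    """
    lines = [ln.strip() for ln in stderr_text.splitlines()]
    outcomes = []
    for i, line in enumerate(lines):
        if line.endswith("Stopping VM."):
            following = lines[i + 1] if i + 1 < len(lines) else ""
            if following.endswith("Successfully stopped VM."):
                outcomes.append("saved")
            else:
                outcomes.append("stopped")
    return outcomes
```

A result log containing any "stopped" entry would be a candidate for the leftover .sav file and the disk bound error.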

Profile [VENETO] boboviz
Joined: 28 Nov 07
Posts: 12
Credit: 26,360
RAC: 0
Message 20372 - Posted: 20 Oct 2015, 19:04:18 UTC

My tests.
As Marius suggested, I started with CPU time usage at 50% and slowly moved up to 60, 70, 80... and so on. At 75% my PC starts to lag. At 85% it crashes. The temperature is near the limit (83°), but if I crunch Rosetta@home (which stresses the CPU very well) at 100% I have no problem.
My CPU is an AMD FX6300 with the stock cooler.
This is the ONLY project where I have these problems. It's an atomic application!!! :-)

Profile C@H Sceptic
Joined: 23 Jan 15
Posts: 17
Credit: 101,772
RAC: 0
Message 20491 - Posted: 23 Oct 2015, 9:32:47 UTC - in response to Message 20348.

I think generally I don't completely understand the experience people are having with the camb_boinc2docker app who also run other projects concurrently, since I myself am testing only with C@H. Maybe some of you could explain a bit more how you would like things to run vs. how they do run?

This is no use at all - you must add several other BOINC projects to see the effect.
Some suggestions are:
    * Restrict the number of active CPU cores per host, defaulting to (maybe) one half, configurable on the pref page
    * Work with the VBOX Wrapper authors to set the CPU Priorities to "lower than default".


Without these changes, this project wipes out most other uses of the computer on which it is running.

Profile Marius
Project administrator
Project developer
Project scientist
Joined: 29 Jun 15
Posts: 427
Credit: 4,276
RAC: 0
Message 20492 - Posted: 23 Oct 2015, 10:20:55 UTC - in response to Message 20491.

Restrict the number of active CPU cores per host, defaulting to (maybe) one half, configurable on the pref page

I'm sure you've seen, but it's possible. Having it be configurable on the project page or the client is on the todo list.

Work with the VBOX Wrapper authors to set the CPU Priorities to "lower than default".

This is a limitation of VirtualBox, not of vboxwrapper. I'm not sure about the prognosis for resolving it, unfortunately.

Jacob Klein
Joined: 28 May 12
Posts: 2
Credit: 587,278
RAC: 0
Message 20521 - Posted: 26 Oct 2015, 12:13:03 UTC

On the new Cosmology@Home website, how can I log off or log out of the current account? I manage a few different BOINC accounts, but I don't see any option to log off or log out on Firefox, like I do on my other BOINC projects.

Am I missing something, or did you forget to implement it?

Profile Marius
Project administrator
Project developer
Project scientist
Joined: 29 Jun 15
Posts: 427
Credit: 4,276
RAC: 0
Message 20523 - Posted: 26 Oct 2015, 23:23:18 UTC - in response to Message 20521.

On the new Cosmology@Home website, how can I log off or log out of the current account? I manage a few different BOINC accounts, but I don't see any option to log off or log out on Firefox, like I do on my other BOINC projects.

Am I missing something, or did you forget to implement it?

Fixed now.

Profile Phil1966
Joined: 10 Jul 13
Posts: 26
Credit: 3,212,542
RAC: 0
Message 20570 - Posted: 8 Nov 2015, 18:16:27 UTC

Hello,

Had to abort this WU after 7h10 running time :/

http://www.cosmologyathome.org/result.php?resultid=34122249

Would it be possible to add an automatic "abandon" function?

i.e. after x minutes.

Thank You

Best

Phil1966

Profile Marius
Project administrator
Project developer
Project scientist
Joined: 29 Jun 15
Posts: 427
Credit: 4,276
RAC: 0
Message 20573 - Posted: 8 Nov 2015, 20:20:13 UTC - in response to Message 20570.

Hi Phil1966, hmm, thanks for pointing me to this. I thought the problem of jobs hanging after the computation is over was fixed in camb_boinc2docker 0.08, but this one seems not to be. I will look into it. Let me know if you notice any other patterns of jobs hanging like this.

Profile Phil1966
Joined: 10 Jul 13
Posts: 26
Credit: 3,212,542
RAC: 0
Message 20574 - Posted: 9 Nov 2015, 5:04:57 UTC - in response to Message 20573.

Dear Marius,

Thank You for your answer.

Just had another one last night (> 8 hours) : http://www.cosmologyathome.org/result.php?resultid=34122301

Best Regards,

Phil1966

Profile Phil1966
Joined: 10 Jul 13
Posts: 26
Credit: 3,212,542
RAC: 0
Message 20578 - Posted: 9 Nov 2015, 19:36:50 UTC - in response to Message 20574.

I've stopped crunching these WUs. Maybe there is something wrong with my crunchbox.

The first one I launched tonight was still running after 48 minutes => manual abandon.

Back to camb_legacy.

Phil1966

Jim1348
Joined: 17 Nov 14
Posts: 48
Credit: 2,358,299
RAC: 0
Message 20579 - Posted: 9 Nov 2015, 20:26:59 UTC - in response to Message 20578.

I recall seeing someone mention problems with VirtualBox 4.3.12. I would try VirtualBox 5.0.8.
https://www.virtualbox.org/wiki/Downloads

Profile Phil1966
Joined: 10 Jul 13
Posts: 26
Credit: 3,212,542
RAC: 0
Message 20581 - Posted: 10 Nov 2015, 4:58:30 UTC - in response to Message 20579.

Hello Jim1348,

VirtualBox 5 was installed on my crunchbox, but as it was not working
when running C@H, I re-installed the standard 4.3.12 version.

Will give it another try this evening.

Thank You.

Crystal Pellet
Joined: 12 Feb 13
Posts: 21
Credit: 351,884
RAC: 0
Message 20582 - Posted: 10 Nov 2015, 10:33:30 UTC - in response to Message 20573.

Hi Phil1966, hmm thanks for pointing me to this, these jobs hanging after the computation is over ...

That's the problem... the computation is not over, but the presence of the VM completion file is detected.
The VM can't be cleaned up, because it's still in use for the calculation.

Normally this completion file should come from your machine.
Is it created too early or coming from elsewhere?

Profile Marius
Project administrator
Project developer
Project scientist
Joined: 29 Jun 15
Posts: 427
Credit: 4,276
RAC: 0
Message 20583 - Posted: 10 Nov 2015, 10:56:08 UTC
Last modified: 10 Nov 2015, 10:57:22 UTC

Actually, Phil1966's error logs show that the computation is over and that the VM is powered off. Several cleanup steps also complete, but it seems to hang right before the point where a successful run would say "Removing VM from VirtualBox". I'm continuing to look into it. What's clear now, at least, is that this is definitely a different problem from the hung-jobs problem that was indeed correctly fixed by camb_boinc2docker 0.08.

Re: VirtualBox version, the only problem I know of is that Windows 10 RTM build 10240 requires VirtualBox 5.0.8, but that doesn't seem to be the case here.
