Forums : Announcements : Beta testing the new C@H
Author | Message |
---|---|
Crystal Pellet Joined: 12 Feb 13 Posts: 23 Credit: 363,354 RAC: 0 |
"A few of the updates which I have pushed recently:" Maybe you could add <enable_vm_savestate_usage/> to your camb_boinc2docker_0.04_vbox_job.xml file. That would save the state of the VM instead of powering it off when a task is suspended with "Leave Application in Memory" unticked, or when BOINC is stopped/restarted. Currently, after a resume, a task has to start from the very beginning because the VM boots from scratch. CP |
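As a rough illustration of the suggestion above, the following Python sketch adds the tag to a job file if it is not already there. Only the tag name `<enable_vm_savestate_usage/>` comes from the post; the sample file contents and the helper name `enable_savestate` are assumptions made for the example, and the real camb_boinc2docker job file will contain other settings.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal job file. Only <enable_vm_savestate_usage/> is taken
# from the post; the other elements are illustrative assumptions.
JOB_XML = """<vbox_job>
    <os_name>Linux26_64</os_name>
    <memory_size_mb>512</memory_size_mb>
</vbox_job>"""

SAVESTATE_TAG = "enable_vm_savestate_usage"


def enable_savestate(xml_text: str) -> str:
    """Return the job XML with the savestate tag present exactly once."""
    root = ET.fromstring(xml_text)
    if root.find(SAVESTATE_TAG) is None:
        # Append an empty element; it serializes as a self-closing tag.
        ET.SubElement(root, SAVESTATE_TAG)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(enable_savestate(JOB_XML))
```

Running the helper twice leaves a single copy of the tag, so it should be safe to apply to a file that has already been edited by hand.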
Project administrator Project developer Project scientist Joined: 29 Jun 15 Posts: 470 Credit: 4,276 RAC: 0 |
I played with that option, but suspend/resume still seemed very unstable. Plenty of times it seemed I got into a state where the task just hung indefinitely, restarted anyway, etc... Conversely the current setup has seemed very robust based on results I'm seeing on the beta server. The only drawback is, as you say, having to start over. At least the jobs are very short so you're not losing too much work. For now I'm going to launch with the current setup. Eventually I think |
Joined: 28 Nov 07 Posts: 12 Credit: 26,360 RAC: 0 |
"boboviz, did you use a different username on the beta server? I can't find your jobs, but if they're able to crash your computer I'd like to take a look at them right away!" http://beta.cosmologyathome.org/results.php?userid=4147 I'm crunching rosetta/denis/citizensciencegrid on CPU and poem/seti on GPU without problems. I forgot to mention: while crunching, the PC lags (with an R7 260X GPU). |
Joined: 28 Nov 07 Posts: 12 Credit: 26,360 RAC: 0 |
"Do you still see it happen if you lower your BOINC CPU usage to, say, 50%?" I'll try. |
Crystal Pellet Joined: 12 Feb 13 Posts: 23 Credit: 363,354 RAC: 0 |
"I played with that option, but suspend/resume still seemed very unstable. Plenty of times it seemed I got into a state where the task just hung indefinitely, restarted anyway, etc... Conversely the current setup has seemed very robust based on results I'm seeing on the beta server. The only drawback is, as you say, having to start over. At least the jobs are very short so you're not losing too much work." With those short tasks it's no big issue to restart from the beginning. I've run several tasks with the mentioned tag, suspending and resuming them. Most of the time the VM goes into the desired saved state, but sometimes the VM doesn't save properly and ends up in a stopped state. After resuming such a task, the VM reboots from the beginning and the task then ends in an error. This is because you reduced <rsc_disk_bound> to too low a value, so the task errored out with EXIT_DISK_LIMIT_EXCEEDED. Example: http://beta.cosmologyathome.org/result.php?resultid=41172 I've increased the disk bound myself and waited for a VM in the 'stopped' state. After the resume it did not error out: http://beta.cosmologyathome.org/result.php?resultid=41188 Note the peak disk usage is 967.30 MB. |
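To make the disk bound reasoning concrete, here is a small Python sketch that sums the size of everything under a directory and compares it with a bound. It only mimics the idea of a disk limit check; how the BOINC client actually accounts for disk usage is not shown in this thread, and the file names, sizes, and helper names below are made up for the example.

```python
import os
import tempfile


def dir_size_bytes(path: str) -> int:
    """Sum the sizes of all regular files below path."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total


def exceeds_disk_bound(slot_dir: str, disk_bound_bytes: int) -> bool:
    """True if the directory currently uses more space than the bound allows."""
    return dir_size_bytes(slot_dir) > disk_bound_bytes


def demo() -> tuple:
    # Toy sizes in bytes, chosen only for illustration (not real task sizes).
    with tempfile.TemporaryDirectory() as slot:
        snapshots = os.path.join(slot, "Snapshots")
        os.makedirs(snapshots)
        with open(os.path.join(slot, "vm_image.vdi"), "wb") as f:
            f.write(b"\0" * 1000)
        # A leftover save-state file, as described for the 'stopped' case.
        with open(os.path.join(snapshots, "leftover.sav"), "wb") as f:
            f.write(b"\0" * 800)
        bound = 1500
        return exceeds_disk_bound(slot, bound), exceeds_disk_bound(slot, 2 * bound)


if __name__ == "__main__":
    print(demo())
```

In the toy case the leftover save-state file pushes usage over the original bound, while the same directory fits comfortably once the bound is doubled.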
Project administrator Project developer Project scientist Joined: 29 Jun 15 Posts: 470 Credit: 4,276 RAC: 0 |
"Most of the time the VM goes into the desired saved state, but sometimes the VM doesn't save properly and ends up in a stopped state." Yeah, I saw this several times. There is also the problem that if you quit BOINC, it gives tasks ~15 sec to shut down, and checkpointing sometimes takes longer than this, so BOINC just kills the task, which also puts it into a state that causes it to hang/crash the next time it starts up. So do I understand correctly, you edited the vbox_job.xml file to add |
Crystal Pellet Joined: 12 Feb 13 Posts: 23 Credit: 363,354 RAC: 0 |
"So do I understand correctly, you edited the vbox_job.xml file to add <enable_vmsavestate/>? I'm confused though because in http://beta.cosmologyathome.org/result.php?resultid=41172 I see no mention of saving state in the log? In any case, I lowered the disk bound since without checkpointing it wasn't necessary." I added <enable_vmsavestate/> to camb_boinc2docker_0.04_vbox_job.xml. The stderr of the results never mentions saving the state, only "Stopping VM." If the save is successful, that line is followed by "Successfully stopped VM." If that line is missing, the VM ended up in the 'Stopped' state. During the save, a file like "2015-10-19T14-01-31-048797600Z.sav" is written into the Snapshots directory under the slot directory. Sometimes this is a very big file, which causes the disk bound to be exceeded. When the 'Stopped' state occurs, that .sav file does not seem to be deleted after the resume. That's why BOINC reports an error. After a good save state, the file is deleted after the resume. At least doubling the disk bound should be enough to reduce that kind of error, I think. |
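The log pattern described above can be checked mechanically. The Python sketch below looks for the two messages quoted in the post, "Stopping VM." and "Successfully stopped VM."; the timestamps in the sample logs, the function name, and the returned labels are assumptions for illustration, not the wrapper's real output format.

```python
STOPPING = "Stopping VM."
STOPPED_OK = "Successfully stopped VM."

# Hypothetical log excerpts; only the two message texts come from the post.
GOOD_LOG = (
    "2015-10-19 16:01:30: Stopping VM.\n"
    "2015-10-19 16:01:41: Successfully stopped VM.\n"
)
BAD_LOG = "2015-10-19 16:01:30: Stopping VM.\n"


def classify_stop(stderr_text: str) -> str:
    """Classify the most recent stop attempt found in a stderr log."""
    lines = stderr_text.splitlines()
    stop_idx = None
    for i, line in enumerate(lines):
        if STOPPING in line:
            stop_idx = i  # remember the most recent stop attempt
    if stop_idx is None:
        return "no stop attempted"
    # A confirmation only counts if it appears after the stop attempt.
    if any(STOPPED_OK in line for line in lines[stop_idx + 1:]):
        return "saved"
    return "stopped without confirmation"


if __name__ == "__main__":
    print(classify_stop(GOOD_LOG))
    print(classify_stop(BAD_LOG))
```

A result of "stopped without confirmation" would correspond to the case where the .sav file may be left behind and count against the disk bound.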
Joined: 28 Nov 07 Posts: 12 Credit: 26,360 RAC: 0 |
My tests: as Marius suggested, I started with CPU time usage at 50% and slowly increased it to 60, 70, 80... and so on. At 75% my PC starts to lag. At 85% it crashes. The temperature is near the limit (83°), but if I crunch Rosetta@home (which stresses the CPU very well) at 100% I have no problems. My CPU is an AMD FX-6300 with the stock cooler. This is the ONLY project where I have these problems. It's an atomic application!!! :-) |
Joined: 23 Jan 15 Posts: 17 Credit: 101,772 RAC: 0 |
"I think generally I don't completely understand the experience people are having with the camb_boinc2docker app who also run other projects concurrently, since I myself am testing only with C@H. Maybe some of you could explain a bit more how you would like things to run vs. how they do run?" This is no use at all - you must add several other BOINC projects to see the effect. Some suggestions are:
* Work with the VBOX Wrapper authors to set the CPU Priorities to "lower than default".
|
Project administrator Project developer Project scientist Joined: 29 Jun 15 Posts: 470 Credit: 4,276 RAC: 0 |
"Restrict the number of active CPU cores per host, defaulting to (maybe) one half, configurable on the pref page" I'm sure you've seen, but it's possible. Having it be configurable on the project page or the client is on the todo list. "Work with the VBOX Wrapper authors to set the CPU Priorities to 'lower than default'." This is a limitation of VirtualBox, not of vboxwrapper. Not sure of the prognosis for resolving it, unfortunately. |
Jacob Klein Joined: 28 May 12 Posts: 2 Credit: 587,861 RAC: 0 |
On the new Cosmology@Home website, how can I log off or log out of the current account? I manage a few different BOINC accounts, but I don't see any option to log off or log out on Firefox, like I do on my other BOINC projects. Am I missing something, or did you forget to implement it? |
Project administrator Project developer Project scientist Joined: 29 Jun 15 Posts: 470 Credit: 4,276 RAC: 0 |
"On the new Cosmology@Home website, how can I log off or log out of the current account? I manage a few different BOINC accounts, but I don't see any option to log off or log out on Firefox, like I do on my other BOINC projects." Fixed now. |
Joined: 10 Jul 13 Posts: 26 Credit: 3,547,685 RAC: 0 |
Hello, I had to abort this WU after 7h10 of running time :/ http://www.cosmologyathome.org/result.php?resultid=34122249 Would it be possible to add an automatic "abandon" function, i.e. after x minutes? Thank you. Best, Phil1966 |
Project administrator Project developer Project scientist Joined: 29 Jun 15 Posts: 470 Credit: 4,276 RAC: 0 |
Hi Phil1966, hmm, thanks for pointing me to this. I thought the problem of jobs hanging after the computation is over was fixed in camb_boinc2docker 0.08, but this one seems not to be. I will look into it. Let me know if you notice any other patterns of jobs hanging like this. |
Joined: 10 Jul 13 Posts: 26 Credit: 3,547,685 RAC: 0 |
Dear Marius, Thank you for your answer. I just had another one last night (> 8 hours): http://www.cosmologyathome.org/result.php?resultid=34122301 Best regards, Phil1966 |
Joined: 10 Jul 13 Posts: 26 Credit: 3,547,685 RAC: 0 |
I've stopped crunching these WUs. Maybe there is something wrong with my crunchbox. The first one I launched tonight was still running after 48 minutes => manual abort. Back to camb_legacy. Phil1966 |
Jim1348 Joined: 17 Nov 14 Posts: 135 Credit: 5,412,499 RAC: 0 |
I recall seeing someone mention problems with VirtualBox 4.3.12. I would try VirtualBox 5.0.8. https://www.virtualbox.org/wiki/Downloads |
Joined: 10 Jul 13 Posts: 26 Credit: 3,547,685 RAC: 0 |
Hello Jim1348, VirtualBox 5 was installed on my crunchbox, but as it was not working when running C@H, I re-installed the standard 4.3.12 version. I'll give it another try this evening. Thank you. |
Crystal Pellet Joined: 12 Feb 13 Posts: 23 Credit: 363,354 RAC: 0 |
"Hi Phil1966, hmm thanks for pointing me to this, these jobs hanging after the computation is over ..." That's the problem: the computation is not over, but the presence of the VM completion file is detected. The VM can't be cleaned up, because it's still in use for the calculation. Normally this completion file should come from your machine. Is it created too early, or is it coming from elsewhere? |
Project administrator Project developer Project scientist Joined: 29 Jun 15 Posts: 470 Credit: 4,276 RAC: 0 |
Actually, Phil1966's error logs show that the computation is over and that the VM is powered off. Several cleanup steps are also completed, but it seems to hang right before the point where a successful run would say "Removing VM from VirtualBox". Continuing to look into it. What's clear now, at least, is that this is definitely a different problem from the one with hung jobs that was indeed correctly fixed by camb_boinc2docker 0.08. Re: VirtualBox version, the only problem I know of is that Windows 10 RTM build 10240 requires VirtualBox 5.0.8, but that doesn't seem to be the case here. |