Message boards : Announcements : Beta testing the new C@H
> A few of the updates which I have pushed recently:

Maybe you could add <enable_vm_savestate_usage/> to your camb_boinc2docker_0.04_vbox_job.xml file. That would save the state of the VM instead of powering it off when a task is suspended with "Leave Application in Memory" unchecked, or when BOINC is stopped/restarted. Currently, after a resume a task has to start from the very beginning: the VM boots from scratch.
CP
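For anyone who wants to try this on a beta host, the suggested change amounts to adding one empty element to the job file. A minimal sketch in Python, assuming the job file's root element is `<vbox_job>`; the function name `enable_savestate` and the sample document are hypothetical and only illustrate the edit:

```python
import xml.etree.ElementTree as ET

# The empty element suggested in the post above.
TAG = "enable_vm_savestate_usage"


def enable_savestate(xml_text: str) -> str:
    """Return the job XML with an empty <enable_vm_savestate_usage/> element.

    Assumes the root element is <vbox_job>. If the element is already
    present, no second copy is added.
    """
    root = ET.fromstring(xml_text)
    if root.find(TAG) is None:
        ET.SubElement(root, TAG)
    return ET.tostring(root, encoding="unicode")


# Hypothetical minimal job file, for illustration only.
sample = "<vbox_job><memory_size_mb>2048</memory_size_mb></vbox_job>"
print(enable_savestate(sample))
```

To patch a real file, read its text, pass it through `enable_savestate`, and write the result back. ElementTree discards comments and the XML declaration when re-serializing, so keep a backup of the original file.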
ID: 20358
I played with that option, but suspend/resume still seemed very unstable. Plenty of times I seemed to get into a state where the task just hung indefinitely, restarted anyway, etc. Conversely, the current setup has been very robust, based on the results I'm seeing on the beta server. The only drawback is, as you say, having to start over. At least the jobs are very short, so you're not losing too much work.
ID: 20359
> boboviz, did you use a different username on the beta server? I can't find your jobs, but if they're able to crash your computer I'd like to take a look at them right away!

http://beta.cosmologyathome.org/results.php?userid=4147
I'm crunching Rosetta/DENIS/CitizenScienceGrid on CPU and POEM/SETI on GPU without problems. I forgot to mention: while crunching, the PC lags (with an R7 260X GPU).
ID: 20360
> Do you still see it happen if you lower your BOINC CPU usage to, say, 50%?

I'll try.
ID: 20361
> The only drawback is, as you say, having to start over. At least the jobs are very short so you're not losing too much work.

With tasks that short, restarting from the beginning is no big issue. I've run several tasks with the mentioned tag, suspending and resuming them. Most of the time the VM goes into the desired saved state, but sometimes it doesn't save properly and ends up in a stopped state. After resuming such a task, the VM rebooted from scratch and the task then ended in an error. This is because you reduced <rsc_disk_bound> to too low a value, so the task errored out with EXIT_DISK_LIMIT_EXCEEDED. Example: http://beta.cosmologyathome.org/result.php?resultid=41172

I increased the disk bound myself and waited for a VM to end up in the 'Stopped' state. After the resume it did not error out: http://beta.cosmologyathome.org/result.php?resultid=41188 Note that the peak disk usage is 967.30 MB.
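The failure above is a budget check: BOINC aborts a task with EXIT_DISK_LIMIT_EXCEEDED when its disk usage goes over the workunit's rsc_disk_bound, which is given in bytes. A rough sketch of the arithmetic in Python, assuming the "MB" on the result page means 2^20 bytes and using a hypothetical headroom factor of 2 (in line with the "at least doubling" suggested later in the thread); the helper names are made up for illustration:

```python
MIB = 1024 ** 2  # assumption: "MB" on BOINC result pages means 2**20 bytes


def exceeds_bound(peak_usage_mb: float, rsc_disk_bound_bytes: int) -> bool:
    """True if a task with this peak usage would go over the disk bound."""
    return peak_usage_mb * MIB > rsc_disk_bound_bytes


def suggested_bound(peak_usage_mb: float, headroom: float = 2.0) -> int:
    """Hypothetical sizing rule: peak usage times a safety factor, in bytes."""
    return int(peak_usage_mb * MIB * headroom)


peak = 967.30  # peak disk usage reported for result 41188
bound = suggested_bound(peak)
print(bound, exceeds_bound(peak, bound))
```

The headroom factor matters because, as described in the thread, a leftover .sav snapshot can sit on top of the task's normal usage.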
ID: 20362
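> Most of the time the VM goes into the desired saved state, but sometimes it doesn't save properly and ends up in a stopped state.

Yeah, I saw this several times. There is also the problem that if you quit BOINC, it gives tasks ~15 seconds to shut down, and checkpointing sometimes takes longer than this, so BOINC just kills the task, which also leaves it in a state that causes it to hang or crash the next time it starts. So do I understand correctly, you edited the vbox_job.xml file to add <enable_vmsavestate/>? I'm confused though, because in http://beta.cosmologyathome.org/result.php?resultid=41172 I see no mention of saving state in the log. In any case, I lowered the disk bound since without checkpointing it wasn't necessary.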
ID: 20363
> So do I understand correctly, you edited the vbox_job.xml file to add <enable_vmsavestate/>? I'm confused though because in http://beta.cosmologyathome.org/result.php?resultid=41172 I see no mention of saving state in the log? In any case, I lowered the disk bound since without checkpointing it wasn't necessary.

I added <enable_vmsavestate/> to camb_boinc2docker_0.04_vbox_job.xml. The stderr of the results never mentions saving the state, only "Stopping VM." If the save succeeds, that line is followed by "Successfully stopped VM."; if that line is missing, the VM has gone into the 'Stopped' state. During the save, a file such as "2015-10-19T14-01-31-048797600Z.sav" is written to the Snapshots subdirectory of the slot. Sometimes this file is very large, causing the disk bound to be exceeded. When the 'Stopped' state occurs, the .sav file does not seem to be deleted after the resume, which is why BOINC reports an error. After a successful save, the file is deleted after the resume. I think at least doubling the disk bound should be enough to reduce that kind of error.
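The rule of thumb for telling a good save from a bad one can be written down directly. A minimal sketch in Python, assuming the task's stderr is available as a string and that "Successfully stopped VM." appears on the line immediately after "Stopping VM." when the save works; the function name and return labels are hypothetical:

```python
def classify_stop(stderr_text: str) -> str:
    """Classify VM stops in a task's stderr log.

    Returns "saved" if every "Stopping VM." line is immediately followed by
    "Successfully stopped VM.", "stopped" if any is not, and "none" if the
    log contains no "Stopping VM." line at all.
    """
    # endswith() tolerates a timestamp/PID prefix before the message.
    lines = [line.strip() for line in stderr_text.splitlines()]
    stops = 0
    for i, line in enumerate(lines):
        if line.endswith("Stopping VM."):
            stops += 1
            nxt = lines[i + 1] if i + 1 < len(lines) else ""
            if not nxt.endswith("Successfully stopped VM."):
                return "stopped"
    return "saved" if stops else "none"
```

A "stopped" result corresponds to the case described above, where the snapshot file may be left behind and count against the disk bound after the resume.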
ID: 20365
My tests.
ID: 20372
> I think generally I don't completely understand the experience of people who run other projects concurrently with the camb_boinc2docker app, since I myself am testing only with C@H. Maybe some of you could explain a bit more how you would like things to run vs. how they do run?

That is no use at all: you must add several other BOINC projects to see the effect. Some suggestions are:
* Work with the VBOX Wrapper authors to set the CPU Priorities to "lower than default".
ID: 20491
> Restrict the number of active CPU cores per host, defaulting to (maybe) one half, configurable on the pref page

I'm sure you've seen, but it's possible. Having it be configurable on the project page or in the client is on the to-do list.

> Work with the VBOX Wrapper authors to set the CPU Priorities to "lower than default".

This is a limitation of VirtualBox, not of vboxwrapper. Unfortunately, I'm not sure what the prognosis is for resolving it.
ID: 20492
On the new Cosmology@Home website, how can I log off or log out of the current account? I manage a few different BOINC accounts, but I don't see any option to log off or log out in Firefox, like I do on my other BOINC projects.
ID: 20521
> On the new Cosmology@Home website, how can I log off or log out of the current account?

Fixed now.
ID: 20523
Hello,
ID: 20570
Hi Phil1966, thanks for pointing me to this. I thought these jobs hanging after the computation is over were fixed in camb_boinc2docker 0.08, but this one seems not to be. I will look into it. Let me know if you notice any other patterns of jobs hanging like this.
ID: 20573
Dear Marius,
ID: 20574
I'm stopping crunching these WUs. Maybe something is wrong with my crunchbox.
ID: 20578
I recall seeing someone mention problems with VirtualBox 4.3.12. I would try VirtualBox 5.0.8.
ID: 20579
Hello Jim348,
ID: 20581
> Hi Phil1966, hmm thanks for pointing me to this, these jobs hanging after the computation is over ...

That's the problem: the computation is not over, but the presence of the VM completion file is detected. The VM can't be cleaned up because it's still in use for the calculation. Normally this completion file should be created on your machine. Is it created too early, or is it coming from elsewhere?
ID: 20582
Actually, Phil1966's error logs show that the computation is over and that the VM is powered off. Several cleanup steps have also completed, but the task seems to hang right before the point where a successful run would say "Removing VM from VirtualBox". I'm continuing to look into it. What's clear now, at least, is that this is definitely a different problem from the hung-jobs problem that was indeed correctly fixed by camb_boinc2docker 0.08.
ID: 20583