
Message boards : Announcements : Beta testing the new C@H

Author Message
Profile Marius
Project administrator
Project developer
Project scientist
Avatar
Send message
Joined: 29 Jun 15
Posts: 418
Credit: 4,276
RAC: 0
Message 20282 - Posted: 24 Aug 2015, 21:49:28 UTC
Last modified: 24 Aug 2015, 22:20:05 UTC

Hi all,

Today is an exciting day, as it's a big step in the process of completely revamping the C@H server. After a lot of work over the last couple of months, we're ready to open up the new server to public beta testing. It is currently up and running at beta.cosmologyathome.org. If you would like to help, please point your clients to this server, run a few jobs, then report back in this thread about your experience.

On Aug 17 we dumped the user database into this server, so if you registered before that, your regular C@H login will work there. Otherwise you will need to create a new account. Once beta testing is done we will attempt to transfer over any credits you earned on the beta server; however, please note this will be on a best-effort basis with no guarantee. All other aspects of the beta server, including message boards, etc., will be deleted. The new server is beta software and may crash, be reset without notice, etc. However, if you are excited or curious about trying out new software, we could use your help ironing out all the bugs!

So what's new?


  • A new app, "camb_boinc2docker". This updates us to the very latest version of CAMB. More importantly perhaps, it runs in an entirely new way, using software I developed for BOINC called "boinc2docker". To run these jobs, you will need a 64-bit OS and Virtualbox installed (for the last year or so, Virtualbox has been an optional part of the install for the BOINC client itself; if you did not install it then you should download version 4.x.x here). The advantage of boinc2docker is that it makes it drop-dead easy for us as developers to upgrade and deploy new apps, so that it's not 5 years in between updating app versions :) It also adds the following:
      • Mac OS support
      • Multi-threaded support
      • Full pause/resume/checkpoint support
      • An actual progress bar
  • The new default "third" BOINC credit system
  • A very recent version of the BOINC server software, which adds some slightly nicer forum functions and a new forum-based news format.
  • For 32-bit users or users who don't have Virtualbox installed, the existing camb app, now called "camb_legacy", is still supported.
  • The server code is (almost) entirely public on github.



Many thanks to the BOINC developers for their help along the way, especially Rom Walton for his help with vboxwrapper, on which boinc2docker is heavily based.

So give it a spin, and don't hesitate to post here. I'll be keeping a close eye on this thread!

Crystal Pellet
Send message
Joined: 12 Feb 13
Posts: 21
Credit: 351,882
RAC: 16
Message 20283 - Posted: 25 Aug 2015, 7:27:08 UTC - in response to Message 20282.
Last modified: 25 Aug 2015, 8:07:09 UTC

1. Please send for this BETA test only the new application: camb_boinc2docker

2. The disk limit is set too low. The slot needs more than 1000000000 bytes.

cosmohome 25 Aug 09:10:48 Aborting task camb_boinc2docker_3266_1440455267.684832_1: exceeded disk limit: 1135.79MB > 953.67MB

cosmohome 25 Aug 10:00:52 Aborting task camb_boinc2docker_3776_1440459354.579906_0: exceeded disk limit: 1162.56MB > 953.67MB


Task aborted at about 80% done.

3. Don't allocate all available threads, but max minus 1.
VBoxHeadless.exe always runs at standard priority, and that priority can't be lowered by either vboxwrapper or the user, due to Oracle's security policy concerning VirtualBox.
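
The numbers in point 2 line up if the client reports sizes in units of 1024 × 1024 bytes: a bound of 1000000000 bytes then shows up as 953.67MB in the log. A minimal Python sketch of that arithmetic follows; the function names are made up for illustration, and the unit convention is an assumption inferred from the log lines, not taken from the BOINC source.

```python
# Assumption: the "MB" in the client log means 1024 * 1024 bytes.
MIB = 1024 * 1024


def bytes_to_mib(n_bytes: int) -> float:
    """Convert a byte count to the unit the log message appears to use."""
    return n_bytes / MIB


def exceeds_limit(used_mib: float, bound_bytes: int) -> bool:
    """True if the slot usage (in MiB) is above the job's disk bound (in bytes)."""
    return used_mib > bytes_to_mib(bound_bytes)


if __name__ == "__main__":
    disk_bound = 1_000_000_000  # the bound mentioned in point 2
    print(f"bound: {bytes_to_mib(disk_bound):.2f} MB")
    # Slot sizes taken from the two log lines above.
    for used in (1135.79, 1162.56):
        print(f"{used} MB over the bound: {exceeds_limit(used, disk_bound)}")
```

Both aborted tasks were roughly 180 to 210 MB over the bound, so any fix needs noticeably more headroom than a small bump.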

Crystal Pellet
Send message
Joined: 12 Feb 13
Posts: 21
Credit: 351,882
RAC: 16
Message 20284 - Posted: 25 Aug 2015, 8:43:01 UTC - in response to Message 20283.
Last modified: 25 Aug 2015, 9:31:49 UTC

Hi Marius,

The disk limit is exceeded because you are using snapshots.
The snapshot alone needs almost 1 GB on disk.
The development of vboxwrapper led to the conclusion that working with VMs gives fewer problems when not using snapshots.
For jobs lasting very long (days and days), snapshots are useful, but for your very short running tasks, you would do better without them.

You could use the following camb_boinc2docker_vbox_job.xml to achieve that:

<vbox_job>
    <!-- Set as desired -->
    <memory_size_mb>2048</memory_size_mb>
    <!-- This is the VBox guest OS, not the host OS, so it stays this for all app_versions. -->
    <os_name>Linux26_64</os_name>
    <!-- These are all needed for boinc2docker -->
    <enable_isocontextualization>1</enable_isocontextualization>
    <enable_cache_disk>1</enable_cache_disk>
    <enable_shared_directory/>
    <enable_scratch_directory/>
    <enable_network/>
    <completion_trigger_file>completion_trigger_file</completion_trigger_file>
    <!-- -->
    <fraction_done_filename>results/progress</fraction_done_filename>
    <minimum_checkpoint_interval>60</minimum_checkpoint_interval>
    <enable_vm_savestate_usage/>
    <disable_automatic_checkpoints/>
</vbox_job>
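
For anyone editing such a file by hand, it can help to check that it is well-formed and see which options are actually set before deploying it. Below is a small Python sketch that does this with the standard library. The `summarize_job` helper is hypothetical and not part of vboxwrapper; it only parses the XML and does not know which tags vboxwrapper accepts.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the job file above, embedded so the sketch is self-contained.
VBOX_JOB_XML = """<vbox_job>
    <!-- Set as desired -->
    <memory_size_mb>2048</memory_size_mb>
    <os_name>Linux26_64</os_name>
    <enable_shared_directory/>
    <minimum_checkpoint_interval>60</minimum_checkpoint_interval>
    <enable_vm_savestate_usage/>
    <disable_automatic_checkpoints/>
</vbox_job>"""


def summarize_job(xml_text: str) -> dict:
    """Parse a vbox_job document and return {tag: value}.

    Empty flag-style elements such as <enable_network/> map to True.
    Raises xml.etree.ElementTree.ParseError if the XML is malformed.
    """
    root = ET.fromstring(xml_text)
    if root.tag != "vbox_job":
        raise ValueError(f"unexpected root element: {root.tag}")
    settings = {}
    for child in root:  # comments are dropped by the default parser
        text = (child.text or "").strip()
        settings[child.tag] = text if text else True
    return settings


if __name__ == "__main__":
    for tag, value in summarize_job(VBOX_JOB_XML).items():
        print(f"{tag}: {value}")
```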


You include vbox_job.xml twice, once named vbox_job.xml and once named camb_boinc2docker_vbox_job.xml.
The vbox_job.xml can be deleted. It's confusing, because it is not used.

Crystal Pellet
Send message
Joined: 12 Feb 13
Posts: 21
Credit: 351,882
RAC: 16
Message 20285 - Posted: 25 Aug 2015, 9:49:19 UTC - in response to Message 20284.
Last modified: 25 Aug 2015, 9:51:50 UTC

When using the 'save state' method, the task still requires more than the 1000000000 bytes you allow.
When a user suspends the task with 'leave application in memory' unticked, or when BOINC is stopped, a save file of the VM, also almost 1 GB in size, is temporarily created in the slot directory.

Profile pschoefer
Send message
Joined: 26 Mar 08
Posts: 9
Credit: 1,946,775
RAC: 23
Message 20286 - Posted: 25 Aug 2015, 10:32:19 UTC - in response to Message 20282.

After setting "Request tasks to checkpoint at most every xx seconds" to a value higher than the runtime of the task, I'm able to complete tasks on Win7 x64 with vbox 5.0.0. So the 'maximum disk limit exceeded' error is definitely caused by the checkpoint snapshots.

I agree with Crystal Pellet's point that a vbox_mt app should not use all available CPU cores, because normal priority can cause problems for other processes (e.g. GPU tasks) running in parallel. Unless you need fast results or plan to send out very large tasks which would take days on a single core, I don't see an advantage of a multi-threaded app at all.

Another remark: If there is any good measure for the actual computing power used to complete a task, you may consider using this for granting credit instead of CreditNew. With any credit system, there may be complaints that credits are too low, too high or both compared to project X (I personally don't care; cross-project comparison will always be apples and oranges, because the projects are calculating different things), but CreditNew has often led to a credit lottery, leaving everybody unhappy who cares about their stats.

Crystal Pellet
Send message
Joined: 12 Feb 13
Posts: 21
Credit: 351,882
RAC: 16
Message 20287 - Posted: 25 Aug 2015, 16:15:22 UTC

MT-tasks with 8 threads: Elapsed time avg: 947.76 sec - CPU used 6700.41 seconds on average.

MT-tasks with 7 threads: Elapsed time avg: 1013.74 sec - CPU used 6371.32 seconds on average, while a camb_legacy v2.16 task was running on the 8th thread.
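
One way to read these averages is to divide CPU time by elapsed time, which gives the average number of cores the task kept busy. The Python sketch below does that for the two measurements above; the helper names are made up for illustration, and this is only a rough reading of two averages, not a benchmark.

```python
def effective_cores(cpu_seconds: float, elapsed_seconds: float) -> float:
    """Average number of cores kept busy over the run."""
    return cpu_seconds / elapsed_seconds


def efficiency(cpu_seconds: float, elapsed_seconds: float, threads: int) -> float:
    """Fraction of the allocated threads that was actually used."""
    return effective_cores(cpu_seconds, elapsed_seconds) / threads


# threads -> (average CPU seconds, average elapsed seconds), from the post above
RUNS = {8: (6700.41, 947.76), 7: (6371.32, 1013.74)}

if __name__ == "__main__":
    for threads, (cpu, elapsed) in RUNS.items():
        print(
            f"{threads} threads: {effective_cores(cpu, elapsed):.2f} cores busy, "
            f"{efficiency(cpu, elapsed, threads):.0%} of allocation"
        )
    slowdown = RUNS[7][1] / RUNS[8][1] - 1
    print(f"7-thread run takes {slowdown:.0%} longer in wall-clock time")
```

By this measure the 7-thread run keeps about 6.3 cores busy against about 7.1 for the 8-thread run, and takes about 7% longer, while leaving the 8th thread free for other work.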

Profile Marius
Project administrator
Project developer
Project scientist
Avatar
Send message
Joined: 29 Jun 15
Posts: 418
Credit: 4,276
RAC: 0
Message 20288 - Posted: 25 Aug 2015, 16:32:34 UTC
Last modified: 25 Aug 2015, 16:33:28 UTC

Awesome, thanks for the useful comments! Replies below.


  • Disk space: Yep, several of the jobs are coming back with this disk space error, and it's clearly related to checkpointing. I believe the problem is on my end though: I only specified 1GB max disk space for the jobs, whereas it should be about 2GB to accommodate the checkpoint file. (The checkpoint time is somewhat randomized, so this error may not be 100% reproducible if you manage to complete the task before checkpointing.) The desired behaviour is that if your computing preferences limit disk space to <2GB, your client should never even download the job in the first place. Note that this 2GB is not permanently used on your disk; the checkpoint is deleted once the job finishes. I'm open to nixing checkpointing completely, since these jobs are not that long, but for now let's see how this fix works (I'll post back here when the fix is in place).

  • Multi threaded: By default BOINC is going to allocate all free CPUs to the job. If you have 4 CPUs and in your computing preferences you tell BOINC to use 50% CPU time, it'll run it as a 2-CPU job. Is this a solution to what you guys are talking about, or am I misunderstanding? The advantage of multi-threaded apps is memory. A single 4-CPU multi-threaded app and 4 jobs running in parallel are going to give 4 results in roughly the same amount of time, but the latter will take 4 times the memory!

  • Beta testing only camb_boinc2docker: We're really testing the entire server since the whole configuration is drastically different, so we'd like to keep this on there.

  • Comment about credits: Noted, let me get back to you on this.

  • Crystal Pellet: Thanks, good catch, there's an unnecessary vbox_job.xml in there. Btw, what is the <enable_vm_savestate_usage> tag? I'm not seeing that in the docs for vboxwrapper.

Crystal Pellet
Send message
Joined: 12 Feb 13
Posts: 21
Credit: 351,882
RAC: 16
Message 20289 - Posted: 25 Aug 2015, 18:11:35 UTC - in response to Message 20288.

* Multi threaded: By default BOINC is going to allocate all free CPUs to the job. If you have 4 CPUS and in your computing preferences you tell BOINC to use 50% CPU time, it'll run it as 2 CPU job. Is this a solution to what you guys are talking about, or am I misunderstanding?

1. The problem is that VBoxHeadless.exe runs at 'normal' priority, whereas normal BOINC tasks run at the lowest 'idle' priority.
So your task is competing with the user himself.
Setting CPUs to e.g. 50% is only a partial solution, because most crunchers want to use all cores, but at the lowest priority for BOINC.
There is a cmdline parameter --nthreads. Maybe you could use that, taking ncpus - 1 for --nthreads.

2. When your mt-task starts, it pushes all other already running BOINC tasks into a waiting state. Those tasks may even lose a lot of computing time when 'Leave application in memory' (LAIM) is not set, or get swapped to disk when LAIM is set but the system is low on memory. Your VM needs about 1.5GB RAM.
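
The "ncpus - 1" suggestion in point 1 can be written as a tiny rule. The Python sketch below shows one way to compute such a value; `vm_thread_count` is a hypothetical helper, and how the result would actually be passed on through --nthreads is not shown in this thread, so that part is left out.

```python
import os


def vm_thread_count(ncpus=None, reserve=1):
    """Threads to give the VM: all logical CPUs minus `reserve`, but at least 1.

    ncpus defaults to the number of logical CPUs Python can see on this host,
    which may differ from what the BOINC client is allowed to use.
    """
    if ncpus is None:
        ncpus = os.cpu_count() or 1
    return max(1, ncpus - reserve)


if __name__ == "__main__":
    for n in (1, 2, 4, 8):
        print(f"{n} CPUs -> {vm_thread_count(n)} VM threads")
```

The floor of 1 matters on single-core hosts, where "max minus 1" would otherwise leave the VM with no threads at all.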

* Crystal Pellet: Thanks good catch, there's an unnecessary vbox_job.xml in there. Btw, what is the <enable_vm_savestate_usage> tag, I'm not seeing that in the docs for vboxwrapper?

If you set that tag in your *job.xml file, together with the likewise undocumented disable_automatic_checkpoints tag, the VM will save its state immediately when a user suspends the task (LAIM off) or BOINC stops.
The VM is saved and not powered off (although of course it is not running anymore).
After a resume nothing is lost, because the VM restores from the very last point where the user suspended it. In your setup the whole task could be lost when no checkpoint was made, or at least the time since the last checkpoint.
That is also why my setup uses a minimum checkpoint interval of 60 seconds: no real checkpoints are needed, but the checkpoint file is now updated more regularly.
That file is also used for restoring the CPU seconds after a task resume.

Profile Marius
Project administrator
Project developer
Project scientist
Avatar
Send message
Joined: 29 Jun 15
Posts: 418
Credit: 4,276
RAC: 0
Message 20290 - Posted: 25 Aug 2015, 21:27:34 UTC

I upped the job disk bound to 3 GB; let me know if anyone still sees the disk errors. The 3 GB should be an overestimate, and I will work on tweaking the exact space / memory requirements.

Rapture
Avatar
Send message
Joined: 27 Oct 07
Posts: 85
Credit: 638,604
RAC: 0
Message 20293 - Posted: 26 Aug 2015, 19:09:41 UTC - in response to Message 20282.

Thanks for the long-awaited update! I was wondering what had happened with this great project. I am looking forward to the changes being planned. Keep up the good work!

Crystal Pellet
Send message
Joined: 12 Feb 13
Posts: 21
Credit: 351,882
RAC: 16
Message 20296 - Posted: 27 Aug 2015, 9:37:03 UTC

You removed a needed file from the download directory:

cosmohome 27 Aug 11:34:07 Giving up on download of camb_boinc2docker_boinc_app: permanent HTTP error

STE\/E
Volunteer tester
Send message
Joined: 12 Jun 07
Posts: 375
Credit: 16,522,388
RAC: 0
Message 20297 - Posted: 27 Aug 2015, 9:37:45 UTC

WUs seem to stop after 10 minutes with this message:

PBT99

6005 cosmohome 8/27/2015 5:13:37 AM task postponed 86400.000000 sec: VM Hypervisor failed to enter an online state in a timely fashion.
6006 cosmohome 8/27/2015 5:13:37 AM Starting task camb_boinc2docker_26880_1440646584.135051_0

Then another WU starts and runs for 10 minutes, then stops, and so on. I have vbox 4.3.12 installed on a Win 8 laptop. All WUs that have run for 10 minutes just stay suspended, waiting to run ...

Profile Marius
Project administrator
Project developer
Project scientist
Avatar
Send message
Joined: 29 Jun 15
Posts: 418
Credit: 4,276
RAC: 0
Message 20298 - Posted: 27 Aug 2015, 16:56:40 UTC

Crystal Pellet: Yea, I noticed the file was gone and re-added it. It might have gotten deleted again at some other points too; I'll look into why the file deleter is getting it. Btw, your suggestion with the vm_save_state looks really great, I'm testing it now. Thanks!

STEVE: In the DB I see a number of failed jobs from you due to the same file getting deleted error, which is now hopefully back. That doesn't sound like the error you're describing though. I also see several in progress, not sure if those are this 10min thing? Can you hit Update to see if it sends back any error logs that might help debugging the problem?

STE\/E
Volunteer tester
Send message
Joined: 12 Jun 07
Posts: 375
Credit: 16,522,388
RAC: 0
Message 20299 - Posted: 27 Aug 2015, 20:07:50 UTC

I've updated the project several times now ... The deleted WUs were actually download errors. I've returned no successful WUs yet, as they all hang/suspend themselves after 10 minutes ...

Crystal Pellet
Send message
Joined: 12 Feb 13
Posts: 21
Credit: 351,882
RAC: 16
Message 20300 - Posted: 27 Aug 2015, 22:03:59 UTC - in response to Message 20298.
Last modified: 27 Aug 2015, 22:11:53 UTC

Crystal Pellet: Yea, I noticed the file was gone and re-added it. It might have gotten deleted again at some other points too; I'll look into why the file deleter is getting it. Btw, your suggestion with the vm_save_state looks really great, I'm testing it now. Thanks!

Hi Marius,

That camb_boinc2docker_boinc_app-file is gone again. At least it's not in the download-dir.

I've successfully tested an option that lets users reduce the number of cores for the Virtual Machine themselves.
You don't have to do anything; the user just places the following file, named app_config.xml, in the project directory:

<app_config>
    <project_max_concurrent>1</project_max_concurrent>
    <app>
        <name>camb_boinc2docker</name>
        <max_concurrent>1</max_concurrent>
    </app>
    <app_version>
        <app_name>camb_boinc2docker</app_name>
        <plan_class>vbox64_mt</plan_class>
        <avg_ncpus>7.000000</avg_ncpus>
        <max_ncpus>7.000000</max_ncpus>
    </app_version>
</app_config>


In the example I've reduced the number of cores to 7 on my 8-threaded machines. The VM is created and running with 7 cores.
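
For users with a different core count, the same file can be generated rather than edited by hand. The Python sketch below rebuilds the structure of the app_config.xml above for a chosen number of cores. `build_app_config` is a hypothetical helper; the tag names are copied from the example in this post and are not checked against the BOINC client documentation.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, plan_class: str, ncpus: int) -> str:
    """Return app_config.xml text mirroring the example above for `ncpus` cores."""
    root = ET.Element("app_config")
    ET.SubElement(root, "project_max_concurrent").text = "1"

    app = ET.SubElement(root, "app")
    ET.SubElement(app, "name").text = app_name
    ET.SubElement(app, "max_concurrent").text = "1"

    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    # Same six-decimal format as in the hand-written example.
    ET.SubElement(version, "avg_ncpus").text = f"{ncpus:.6f}"
    ET.SubElement(version, "max_ncpus").text = f"{ncpus:.6f}"

    ET.indent(root)  # pretty-printing; requires Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # 7 cores on an 8-thread machine, as in the example above.
    print(build_app_config("camb_boinc2docker", "vbox64_mt", 7))
```

The output would then be saved as app_config.xml in the project directory, as described above.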

Results with 6 cores:

http://beta.cosmologyathome.org/result.php?resultid=1951
http://beta.cosmologyathome.org/result.php?resultid=1939

Result with 7 cores:

http://beta.cosmologyathome.org/result.php?resultid=1897

Profile Marius
Project administrator
Project developer
Project scientist
Avatar
Send message
Joined: 29 Jun 15
Posts: 418
Credit: 4,276
RAC: 0
Message 20301 - Posted: 27 Aug 2015, 22:25:17 UTC

STEVE: That camb_boinc2docker_boinc_app file (http://beta.cosmologyathome.org/download/2b0/camb_boinc2docker_boinc_app) is and should have been present for at least the last four hours. But I do still see your client giving errors downloading it. Can you try a project reset? Maybe a remove / add too? Other clients have been able to complete the exact same workunit after your client gave a download error on them, so my guess is the workunits / files are fine on the server.


Crystal Pellet: Very useful, thanks. To make sure I understand, the difference between this, and say, just lowering the "Use at most" CPU time option is that this targets camb_boinc2docker specifically, leaving other apps to use that last 8th core?

Crystal Pellet
Send message
Joined: 12 Feb 13
Posts: 21
Credit: 351,882
RAC: 16
Message 20302 - Posted: 28 Aug 2015, 5:34:53 UTC - in response to Message 20301.
Last modified: 28 Aug 2015, 5:37:42 UTC

STEVE: That camb_boinc2docker_boinc_app file (http://beta.cosmologyathome.org/download/2b0/camb_boinc2docker_boinc_app) is and should have been present for at least the last four hours.
...

Shouldn't that file be one directory higher, in the download dir itself and not in /2b0/?
It looks like it is deleted from the user's machine after every task.

Crystal Pellet: Very useful, thanks. To make sure I understand, the difference between this, and say, just lowering the "Use at most" CPU time option is that this targets camb_boinc2docker specifically, leaving other apps to use that last 8th core?

That's correct!
This last core could be left free for GPU-task support or another single-core CPU-task could use it.
That app_config.xml should be placed in the Cosmology project directory on the user's machine (for now, of course, the beta directory).

STE\/E
Volunteer tester
Send message
Joined: 12 Jun 07
Posts: 375
Credit: 16,522,388
RAC: 0
Message 20303 - Posted: 28 Aug 2015, 17:54:55 UTC - in response to Message 20301.

STEVE: That camb_boinc2docker_boinc_app file (http://beta.cosmologyathome.org/download/2b0/camb_boinc2docker_boinc_app) is and should have been present for at least the last four hours. But I do still see your client giving errors downloading it. Can you try a project reset? Maybe a remove / add too? Other clients have been able to complete the exact same workunit after your client gave a download error on them, so my guess is the workunits / files are fine on the server


I haven't been able to get the camb_boinc2docker_boinc_app to run for more than 10 minutes before it suspends and another task starts on the Win 8 laptop. I did get it to run on another PC, though, which has Win 7 Pro installed ...

Question: are the camb_legacy WUs I'm getting multi-threaded too? They only run one at a time.

fzs600
Send message
Joined: 7 May 10
Posts: 1
Credit: 227,009
RAC: 0
Message 20305 - Posted: 1 Sep 2015, 15:43:54 UTC - in response to Message 20303.

Hello,

Is it possible to choose the application in the account preferences?

That is, to choose between camb_legacy and camb_boinc2docker.


Thank you

Profile Marius
Project administrator
Project developer
Project scientist
Avatar
Send message
Joined: 29 Jun 15
Posts: 418
Credit: 4,276
RAC: 0
Message 20307 - Posted: 2 Sep 2015, 0:57:05 UTC

Steve: Yes, camb_legacy is single threaded. We have no plans to modify this app in the future, so it will stay single threaded, but if you've got the RAM, it's just as efficient to run multiple copies of it as if it were multithreaded.

fzs600: Ah, it should be possible by going to Your Account -> Cosmology@Home preferences, but currently it's not. I'll work on fixing this and post back here when it's live.
