Joined: 28 Feb 18
Since I started running Docker jobs for Cosmology@Home, I have been experiencing a very annoying behaviour: every few hours, a task runs into "VM job unmanageable". Then all other VM jobs either show a computation error or run endlessly. Even worse, the status of these unmanageable jobs seems to stop the server from sending new WUs. So almost every morning when I check my PC, it is idle or crunching another (backup) project.
I think I have read through all related threads here and on other BOINC projects, like the LHC checklist (https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4161), but could not find a solution. I am running a 12-core (24 threads) Ryzen 3900 with 32 GB RAM (90% usable by BOINC), and RAM usage is always way below this limit.
Using 21 threads for C@H (2 threads for feeding my 2 GPUs and 1 for maintenance), I set the max no. of CPUs in C@H to 3 so that I am running 7 Docker jobs in parallel. The reason is that the ratio of container setup time to crunching time gets worse the more CPUs are used for one VM job, because the jobs get crunched so quickly. 1 minute to set up/finalize a job and 2 minutes to crunch it with 21 threads leads to just too much idle time.
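To put rough numbers on that trade-off, here is a minimal sketch. It assumes the 1 minute of setup/finalize time is fixed per job and that crunch time scales perfectly linearly with thread count (so 2 minutes on 21 threads means 42 thread-minutes of work). Both are simplifications, and the function name busy_fraction is made up for illustration.

```python
def busy_fraction(threads_per_job, setup_min=1.0, work_thread_min=42.0):
    """Fraction of a job's wall-clock time spent crunching.

    Assumptions (not measured): setup/finalize time is constant per job,
    and crunch time is work_thread_min divided by the thread count.
    """
    crunch_min = work_thread_min / threads_per_job
    return crunch_min / (setup_min + crunch_min)


# Compare one 21-thread job, three 7-thread jobs, and seven 3-thread jobs.
for n in (21, 7, 3):
    print(f"{n:2d} threads per job: {busy_fraction(n):.0%} of the time crunching")
```

Under these assumptions, a 21-thread job keeps its threads busy about two thirds of the time, while a 3-thread job keeps them busy about 93% of the time, which is why smaller jobs look attractive as long as the machine can manage that many VMs.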
I understand that the project prefers Docker jobs, but they seem to be much less stable and to use my CPU less than legacy jobs or other BOINC projects, which is a pity. I do like C@H a lot, and despite high electricity costs in Germany I am crunching 24 hours a day, but looking at an idle CPU every morning is demotivating.
I am running VirtualBox 6.1.10; AMD-V (aka SVM) is activated and works, and Hyper-V should not be relevant since I am running Linux (Mint 20, kernel 5.4.0-54 generic, 64-bit). More details about my machine are available here: http://www.cosmologyathome.org/show_host_detail.php?hostid=430811
To summarize, I have 2 questions:
1. Any ideas what I could do other than going back to the legacy application? Do other users also have this problem?
2. Could multicore WUs in Docker be sliced "thicker" so that the ratio of setup time to computation time improves and idle time is reduced?
Any comments are much appreciated, thanks!
Joined: 27 Sep 17
You would need to cut down on the number of concurrently running virtual machines. Use the app_config.xml method to control the camb_boinc2docker application. It is explained a bit in the FAQ.
Your computer is just too busy dealing with all the virtual machines under VirtualBox, so the BOINC / VirtualBox wrapper isn't able to communicate in a timely manner, and that causes the errors.
My app_config.xml is attached below. I usually run two concurrent four-core work units. I have run three concurrently, but then it runs into errors every so often.
<app_config>
  <app>
    <name>camb_boinc2docker</name>
    <max_concurrent>2</max_concurrent>
  </app>
  <app_version>
    <app_name>camb_boinc2docker</app_name>
    <plan_class>vbox64_mt</plan_class>
    <avg_ncpus>4</avg_ncpus>
  </app_version>
</app_config>