1) Forums : Technical Support : Project causing hang-ups to BOINC. (Message 19980)
Posted 28 Mar 2014 by Profile Ananas
Post:
Not about the communication problem, but about your first post in this thread:

Cosmo WUs often need 800+ MB of RAM.

As soon as the configured percentage "Use at most xx% of memory ..." is exceeded, BOINC stops the running workunits and might try to start another one, which is plain nonsense if the stopped applications stay in memory, but that's BOINC.

At least one of your boxes seems to have insufficient RAM to run two concurrent Cosmo results.

You can try to increase the percentage value and/or add more RAM sticks - 1GB per BOINC task is the minimum here - e.g. a dual core should not have less than 2GB with a percentage setting of 90%.
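
A rough sketch of that arithmetic in Python, assuming the 800 MB per Cosmo task mentioned above. The function name and the example values are only for illustration; this is not how BOINC itself does the check.

```python
def max_concurrent_tasks(total_ram_mb: float, mem_percent: float,
                         per_task_mb: float = 800.0) -> int:
    """Return how many tasks fit into the share of RAM BOINC is allowed to use.

    per_task_mb defaults to the ~800 MB a Cosmo workunit often needs
    (an assumption taken from the post above, not a measured constant).
    """
    usable_mb = total_ram_mb * mem_percent / 100.0
    return int(usable_mb // per_task_mb)


# Dual core with 2 GB: both cores can only be fed at a high percentage.
print(max_concurrent_tasks(2048, 90))  # 2 tasks fit
print(max_concurrent_tasks(2048, 50))  # only 1 task fits
```

With 2 GB and 90% there is room for two 800 MB tasks; with the same RAM and 50% only one fits, so the second core's task gets stopped.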
2) Forums : Technical Support : No heartbeat from core client in 30 sec - exiting calling boinc finish (Message 13755)
Posted 8 Feb 2014 by Profile Ananas
Post:
I currently have this problem on a Xeon L5520 (HT enabled) with a not too fast HDD.

In my case it is caused by Cosmo itself. As soon as a workunit reaches 95%, creating the final result files keeps the BOINC core client so busy that it does not update the heartbeat timestamp in time.

All workunits of all projects, including the one that caused the issue, run into the heartbeat bug. The one that caused the problem restarts at about 90%, continues to 95% and causes the same problem again, endlessly, until RSC_FPOPS_BOUND is violated.

Currently the only solution I see is disabling heartbeat polling in the Cosmo client. As this cannot be configured on client side but requires a change in the Cosmo client source code *), the only chance for me is to disable Cosmo on that box.

The heartbeat bug is sometimes triggered by other events - a delayed project or name server communication can cause it, Malaria checkpoints (they write huge checkpoints using the gzopen libraries) can cause it, unzipping huge project files (especially with subdirectories like CPDN and RNA) can also cause it. The heartbeat mechanism itself is the worst flaw in the BOINC concept, it causes way more trouble than it solves - but those who introduced it love it as a necessary feature.

Increasing the fault tolerance for missing heartbeat timestamps to maybe 30 minutes would fix the problem too, without losing the protective feature against stuck core clients.

There are other concepts to check if a parent process is still active, the intelligent ones would even be backwards compatible to old core clients - but smashing your head into a solid rock has about the same effect as making proposals to the BOINC developers.

*) It can be done, there is a documented method to disable the heartbeat polling in the BOINC project API. Afaik. the only project using it is RNA - with success!
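
To show what a larger tolerance would change, here is a minimal Python sketch of a timeout check. It is not the BOINC implementation; the function name and the 120-second stall are assumptions, and only the 30-second limit (from the error message) and the 30-minute proposal come from the post above.

```python
def heartbeat_lost(last_heartbeat: float, now: float,
                   tolerance_s: float = 30.0) -> bool:
    """Return True if the last heartbeat is older than the tolerance.

    Times are plain seconds; 30 s mirrors the limit named in the
    "No heartbeat from core client in 30 sec" message.
    """
    return (now - last_heartbeat) > tolerance_s


# Hypothetical case: the core client is busy with disk I/O for 120 s.
stall_s = 120.0
print(heartbeat_lost(0.0, stall_s))                      # True: task exits
print(heartbeat_lost(0.0, stall_s, tolerance_s=1800.0))  # False: task survives
```

A core client that is really stuck would still be detected after 30 minutes, while a short I/O stall would no longer kill every running task.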
3) Forums : Technical Support : Shortage of Memory (Message 13753)
Posted 7 Feb 2014 by Profile Ananas
Post:
800 MB for one result is not unusual; shorter ones sometimes stay below 300 MB, though.
4) Forums : Technical Support : Getting "Download failed" errors (Message 13752)
Posted 6 Feb 2014 by Profile Ananas
Post:
Not a problem on your side. Have a look here. The files simply aren't there, and this happens with every workunit that requires a second (3rd, 4th, 5th) delivery because the first one had any kind of error. Most failures are caused by timeouts, from people using far too large a cache. Each timeout causes 4 download errors.
5) Forums : General Topics : Download error. (Message 13734)
Posted 16 Jan 2014 by Profile Ananas
Post:
Many of the download errors are caused by people with too large a cache setting, which causes the scheduler to pick up the result again once the deadline has passed.

When the result comes back shortly after the deadline, the scheduler already has the result in the queue, but it is deleted when the first result is sent out (or on delayed return?). Each delayed result can cause up to 4 download errors.

Actually this is even better than really sending out the results a second time, as they do not validate.

An example : This host returns nearly all results after the deadline and nearly all of its workunits are scheduled again causing download errors (check the valid results, most of them have caused 4 download errors when the deadline was over!).

In order to catch those hosts, a lower limit of concurrent workunits would help. 10 per core would be a good value for the combination of avg. runtime and deadline in this project.

p.s.: A good solution would be to make the scheduler more deadline tolerant, but this would require a server-side BOINC code patch: tell the host the deadline is 2 weeks, but tell the BOINC server the deadline is 3 weeks. Some years back I proposed this somewhere in the BOINC project itself, but they weren't interested.
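
The idea from the p.s. as a small Python sketch. This is not BOINC server code; the names are invented for illustration, and the 2-week and 3-week values are just the ones from the example above.

```python
from datetime import datetime, timedelta

# Assumed values from the example: the host sees the shorter deadline,
# the server only gives up after the longer one.
CLIENT_DEADLINE = timedelta(weeks=2)
SERVER_DEADLINE = timedelta(weeks=3)


def deadline_shown_to_host(sent: datetime) -> datetime:
    """Deadline the host is told to meet."""
    return sent + CLIENT_DEADLINE


def should_resend(sent: datetime, now: datetime, reported: bool) -> bool:
    """The server schedules a replacement only after its own, later deadline."""
    return (not reported) and now > sent + SERVER_DEADLINE


sent = datetime(2014, 1, 1)
print(deadline_shown_to_host(sent))                          # 2014-01-15 00:00:00
print(should_resend(sent, sent + timedelta(days=15), False))  # False: grace period
print(should_resend(sent, sent + timedelta(days=22), False))  # True: really late
```

A result that comes back one day after the host's deadline falls into the grace week, so no replacement is sent out and no download errors are triggered.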
6) Forums : General Topics : Song of the day (Message 12000)
Posted 10 Oct 2012 by Profile Ananas
Post:
Not Garmarna and not In Extremo either ... I bet it will make a lot of you smile, this mannelig guy ;-)

Great one from Mano Negra : Sidi'h'bibi

This is very intense, Hunok csatája (Battle Of The Huns) by Vágtázó Halottkéme / Die Rasenden Leichenbeschauer / Galloping Coroners (on some CDs they have the name in 3 languages)

This used to be somewhat hard to find for some time but now even their old stuff is available again : Berurier Noir play Et Hop - afaik. their drum machine is an official band member *g

And now turn up the volume even more ... Toxicity as an electric violin cover
7) Forums : General Topics : Song of the day (Message 11993)
Posted 10 Oct 2012 by Profile Ananas
Post:
Reactivating two (probably forgotten) oldies

Blonker - Indigo
For the Germans, who think this sounds familiar ... it has been the intro of a TV show

Grobschnitt - Silent Movie
Not typical for this band. They have always been great on stage and it seems that they are back, playing in the Grugahalle next Saturday (Krautrock-Festival, 13 Oct. 2012)
8) Forums : General Topics : Song of the day (Message 11937)
Posted 2 Oct 2012 by Profile Ananas
Post:
@Faik : I do not speak Swedish (yet) either but I need to learn at least some basic conversation quite soon. They have the English version of Carolus Rex on YT too, plus one with the lyrics. The band likes themes about fighting and killing though, so the not understood Swedish version might be more like the real thing ;-)


Edgar Winter Group – Frankenstein plus - Extended version

Single backside of the short version has been "Hombre Al Athetcho" (sp? - it translates "Undercovered Man" or so), two great tracks!
9) Forums : General Topics : Song of the day (Message 11924)
Posted 1 Oct 2012 by Profile Ananas
Post:
Heard this one on Bandit Rock not long ago, it took me a while to figure out what it was :-)
10) Forums : Technical Support : URGENT PROBLEMS THREAD (2009 and after) (Message 10290)
Posted 24 Feb 2012 by Profile Ananas
Post:
Teams seem to be back :-)
11) Forums : Technical Support : Team Problem (Message 10250)
Posted 21 Feb 2012 by Profile Ananas
Post:
Same here; the sched_reply.xml does not contain the team information anymore either.
12) Forums : Technical Support : URGENT PROBLEMS THREAD (2009 and after) (Message 10216)
Posted 20 Feb 2012 by Profile Ananas
Post:
...What are you saying? Can a download error show upp both as Status = "In progress" and Status "Error while downloading"?????


Not both, no.

"In progress" means the server knows nothing about the result state yet.

After the download error occurs, only the BOINC client on user side knows about the problem. As long as that host does not contact the project at least once (to report and/or fetch work) after the download error, the result stays "In progress".
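
A toy model of this in Python, to make the timing visible. The class and method names are made up and this is not BOINC code; only the two status strings are the ones shown on the result pages.

```python
class ResultView:
    """Toy model: what the client knows vs. what the server shows."""

    def __init__(self) -> None:
        self.client_state = "downloading"
        self.server_status = "In progress"

    def download_fails(self) -> None:
        # Only the client learns about the failure at this point.
        self.client_state = "download error"

    def contact_scheduler(self) -> None:
        # On the next scheduler contact the client reports its state.
        if self.client_state == "download error":
            self.server_status = "Error while downloading"


r = ResultView()
r.download_fails()
print(r.server_status)  # still "In progress"
r.contact_scheduler()
print(r.server_status)  # now "Error while downloading"
```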
13) Forums : General Topics : Can't upload, upload server out of disk space (Message 10213)
Posted 20 Feb 2012 by Profile Ananas
Post:
The assimilator isn't running, no idea where that disk space came from. It must just have been enough for very few results, my new results refuse to upload too, just the expired and nearly expired ones went through.

Btw., I was wrong with my assumption about this "file not found" thing, it affects a lot of workunits but for others the second download does work.
14) Forums : Technical Support : URGENT PROBLEMS THREAD (2009 and after) (Message 10208)
Posted 20 Feb 2012 by Profile Ananas
Post:
"In progress" means that the client has not reported anything yet. The next contact of that host will show what's going on :-)

p.s.: This is what it usually looks like. One timeout (or crashed or aborted), then a few download failures.


p.p.s.: Hmmm, I checked a few more, it seems that both situations can occur now, I have downloaded one re-delivered result successfully. A bit surprising for me :-/
15) Forums : Technical Support : URGENT PROBLEMS THREAD (2009 and after) (Message 10199)
Posted 19 Feb 2012 by Profile Ananas
Post:
There seems to be no "over and over again" in this project. No workunit will be processed twice as the workunit source file seems to be removed after the first delivery (that's where all those download errors on replacement deliveries come from).

"No response" results (timeout) might fail with "too many total/error results" though, caused by the results with 404 error.
16) Forums : General Topics : Can't upload, upload server out of disk space (Message 10197)
Posted 19 Feb 2012 by Profile Ananas
Post:
All uploads done. About the timed-out ones :

I still could report them without problems but found that some of them already triggered a second delivery.

Imo. that doesn't matter too much, no CPU time wasted as a second delivery always results in a "file not found" error on downloading the .ini file.


edit : There is one risk though ... if you're not fast enough with the reporting, the result might get invalidated having "Too many error results" and/or "Too many total results". Afaik. a result that is in this state does not accept any successful report anymore.
17) Forums : General Topics : Can't upload, upload server out of disk space (Message 10178)
Posted 18 Feb 2012 by Profile Ananas
Post:
Some went through last night, others still get the same error message, strange.
18) Forums : Wish list : Most workunits give download failed (Message 10078)
Posted 12 Feb 2012 by Profile Ananas
Post:
This is a known issue here. When the first result of a workunit is sent out, the result files seem to be deleted on server side.

As the "max # of total" is 4, BOINC sends another 4 tasks for that workunit (an ancient bug, should send only 3), which all fail with this download error.


As long as you don't get a download error as the first receiver of a workunit, the problem didn't get worse ;-)
19) Forums : General Topics : Work Unit of 80+ hours?? (Message 9631)
Posted 2 Nov 2011 by Profile Ananas
Post:
Workunits here usually run much shorter. Between 4 and 12 hours would be normal (lately more of the shorter ones occur), and I haven't had such a long-running one so far. I guess there has been something wrong with the workunit, or the host somehow had a problem with it.

Normal Cosmo result credits are not too low here on average and deadlines are quite comfortable.

If your next one runs that long again, I would rather suspect that something on your host collides with the Cosmo resource requirements.
20) Forums : General Topics : Stats XML export is messed up. (Message 9316)
Posted 4 Feb 2011 by Profile Ananas
Post:
Today the export seems to have worked; BOINCstats has updated the lists.

