Forums : Technical Support : Looping result

Profile Jayargh
Volunteer moderator
Volunteer tester
Joined: 25 Jun 07
Posts: 508
Credit: 2,282,158
RAC: 0
Message 5487 - Posted: 26 Mar 2008, 23:50:43 UTC
Last modified: 26 Mar 2008, 23:55:43 UTC

I had wu 1679668 sent to me which at completion kept going from 60-95%. The same issue from before.

Why are these old units being resent out when they are faulty?

This causes a big lack of faith in anything that was re-generated with old work ID's, and means I have to trawl through my hosts' results lists and abort anything that looks old unless I want a host to be found looping like this one was.
ID: 5487
Profile Ananas

Joined: 19 Jan 08
Posts: 180
Credit: 2,500,290
RAC: 0
Message 5491 - Posted: 27 Mar 2008, 2:05:02 UTC
Last modified: 27 Mar 2008, 2:08:30 UTC

Maybe it would be a good thing to have the WU series coded in the names.

This would make it much easier to cancel certain faulty series on the server side, or to reduce their quota to 1, so that random incoming results are still accepted but no new ones are issued.

It would also allow a list of known faulty series which the crunchers would better abort if they still have some uncrunched results of that type.


The numbers might already contain such a scheme, but a combination of 2 or 3 letters as a signature would be easier to recognize in the task list and easier to filter by SQL.
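As a rough illustration of the "filter by SQL" idea, here is a minimal Python sketch using an in-memory SQLite database. Everything about the schema is an assumption made for the example: the table name `workunit`, the column `name`, and the sample rows are invented, and the project's real database may look different. It also assumes the first numeric field of the name identifies the series, as later posts in this thread suggest. The one detail worth noting is that `_` is a single-character wildcard in `LIKE`, so the underscores in the names have to be escaped.

```python
import sqlite3


def demo_connection():
    """Build an in-memory stand-in for a work-unit table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE workunit (name TEXT)")
    conn.executemany(
        "INSERT INTO workunit (name) VALUES (?)",
        [
            ("wu_030908_123456_0_1",),
            ("wu_031008_000001_1_1",),
            ("wu_032508_101010_2_0",),
            ("wuX030908Y123456_0_1",),  # decoy: would match an unescaped LIKE
        ],
    )
    return conn


def names_in_series(conn, series):
    """Return work-unit names whose first numeric field equals `series`."""
    if not series.isdigit():
        raise ValueError("series must contain digits only")
    # "_" is a single-character wildcard in LIKE, so each literal underscore
    # is escaped with the backslash declared in the ESCAPE clause.
    pattern = "wu\\_" + series + "\\_%"
    rows = conn.execute(
        "SELECT name FROM workunit WHERE name LIKE ? ESCAPE '\\' ORDER BY name",
        (pattern,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(names_in_series(demo_connection(), "030908"))
```

A short letter signature, as suggested above, would make the pattern simpler still, since it would not need to rely on the date field at all.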
ID: 5491
STE\/E
Volunteer tester

Joined: 12 Jun 07
Posts: 375
Credit: 16,539,257
RAC: 0
Message 5492 - Posted: 27 Mar 2008, 2:11:15 UTC

I Aborted about 200 Wu's this morning that were dated before 3/09/08 & 3/10/08, I got tired of opening up BOINCView & seeing 4-6 Wu's that had erred out after 2 or 3 hr's of running.

I haven't had 1 error out since then ... :)
ID: 5492
Profile Jayargh
Volunteer moderator
Volunteer tester
Joined: 25 Jun 07
Posts: 508
Credit: 2,282,158
RAC: 0
Message 5493 - Posted: 27 Mar 2008, 2:19:14 UTC - in response to Message 5492.  

I Aborted about 200 Wu's this morning that were dated before 3/09/08 & 3/10/08, I got tired of opening up BOINCView & seeing 4-6 Wu's that had erred out after 2 or 3 hr's of running.

I haven't had 1 error out since then ... :)


Better keep an eagle eye out for newly downloaded ones until they are canceled or they get too many error results ;)
ID: 5493
STE\/E
Volunteer tester

Joined: 12 Jun 07
Posts: 375
Credit: 16,539,257
RAC: 0
Message 5494 - Posted: 27 Mar 2008, 2:29:38 UTC - in response to Message 5493.  

I Aborted about 200 Wu's this morning that were dated before 3/09/08 & 3/10/08, I got tired of opening up BOINCView & seeing 4-6 Wu's that had erred out after 2 or 3 hr's of running.

I haven't had 1 error out since then ... :)


Better keep an eagle eye out for newly downloaded ones until they are canceled or they get too many error results ;)


I have my PCs set to NNW since I did that, so I wouldn't just get some of them back. For now I can last a few days that way, then I'll get more and delete what I have to ... :)
ID: 5494
Brian Silvers

Joined: 11 Dec 07
Posts: 420
Credit: 270,580
RAC: 0
Message 5495 - Posted: 27 Mar 2008, 2:42:37 UTC
Last modified: 27 Mar 2008, 2:42:55 UTC

For now, I think the major issues have passed with the lensing tasks, but I do feel for you folks that are seeing these again...

Anyway, in my Quest To Irritate Project Staff (Q-TIPS, as it were :smile:), I would like to again mention that unit testing and other Quality Control / Quality Assurance steps should be used in the future so as not to get into this kind of situation...

[/Q-TIPS PSA]

:-)
ID: 5495
Phoneman1

Joined: 5 Nov 07
Posts: 113
Credit: 3,100,327
RAC: 0
Message 5497 - Posted: 27 Mar 2008, 7:17:04 UTC
Last modified: 27 Mar 2008, 7:27:44 UTC

I've had five of these failures in the last five days. I aborted them whenever I noticed them looping, but yesterday I suspended the tasks first so I could look in the slots folders. The parameter files didn't have any warning messages at the top.

All five failures have been dated either 9th or 10th of March, but I have successfully crunched many more wu's with these dates in the last few days. I haven't been able to find anything that would predict which ones out of these batches of wu's will fail.

On a positive note, I've not yet had any failures since March 22nd on wu's with id's like wu_032x08_xxxxxx_x_x

As March 9th and 10th wu's are still being downloaded today I will be aborting all I see coming through as suggested by PoorBoy.

Phoneman1
ID: 5497
STE\/E
Volunteer tester

Joined: 12 Jun 07
Posts: 375
Credit: 16,539,257
RAC: 0
Message 5498 - Posted: 27 Mar 2008, 8:46:27 UTC - in response to Message 5492.  

I Aborted about 200 Wu's this morning that were dated before 3/09/08 & 3/10/08, I got tired of opening up BOINCView & seeing 4-6 Wu's that had erred out after 2 or 3 hr's of running.

I haven't had 1 error out since then ... :)


That post of mine should read (I Aborted about 200 Wu's this morning that were dated 3/09/08 & 3/10/08) just to clarify it ... :)
ID: 5498
Phoneman1

Joined: 5 Nov 07
Posts: 113
Credit: 3,100,327
RAC: 0
Message 5500 - Posted: 27 Mar 2008, 10:51:37 UTC

After my earlier post I increased the "maintain enough work" item in the general preferences to 7 days and reported some completed tasks. In the flood of tasks that downloaded over the next couple of hours the majority had wu id's like

wu_032x08_xxxxxx_x_x

that is to say they were created in the last 7 or 8 days (if my understanding of the wu numbering system is correct).

I also downloaded a few with

wu_030908_xxxxxx_x_x
wu_031008_xxxxxx_x_x

which I aborted due to the high likelihood of them looping.

In addition I've also downloaded some in these ranges:

wu_031108_xxxxxx_x_x
wu_030608_xxxxxx_x_x
wu_022708_xxxxxx_x_x
wu_022208_xxxxxx_x_x

Are these units also likely to loop?

I haven't been watching out for wu id's that were downloaded before today, but is this normal??

- Some wu's downloaded today were created over a month ago (if I'm reading the numbering system right).

Perplexed,

Phoneman1
ID: 5500
STE\/E
Volunteer tester

Joined: 12 Jun 07
Posts: 375
Credit: 16,539,257
RAC: 0
Message 5501 - Posted: 27 Mar 2008, 10:58:06 UTC

wu_032x08_xxxxxx_x_x

that is to say they were created in the last 7 or 8 days (if my understanding of the wu numbering system is correct)


Yes, the 03 is the Month, the 2x is the Day & the 08 is the Year the WU was created. The Numbers that follow are the actual Time of day it was created ... :)
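For anyone who wants to sort a task list by creation date rather than eyeball it, here is a small Python sketch of the naming scheme as described above. It assumes the layout `wu_MMDDYY_HHMMSS_x_x`, with the second numeric field being a six-digit time of day; that reading of the time field, and the function names `wu_created` and `created_before`, are assumptions for the example. The meaning of the two trailing fields is not covered here, and no time zone is implied.

```python
from datetime import datetime


def wu_created(task_name):
    """Parse 'wu_MMDDYY_HHMMSS_x_x' into a naive datetime.

    Raises ValueError if the name does not follow the assumed layout.
    """
    parts = task_name.split("_")
    if len(parts) < 3 or parts[0] != "wu":
        raise ValueError(f"unexpected task name: {task_name!r}")
    date_part, time_part = parts[1], parts[2]
    if not (len(date_part) == 6 and date_part.isdigit()
            and len(time_part) == 6 and time_part.isdigit()):
        raise ValueError(f"unexpected date/time fields in: {task_name!r}")
    # %y maps the two-digit year 08 to 2008.
    return datetime.strptime(date_part + time_part, "%m%d%y%H%M%S")


def created_before(task_name, cutoff):
    """True if the work unit's encoded creation time is earlier than `cutoff`."""
    return wu_created(task_name) < cutoff


if __name__ == "__main__":
    name = "wu_031808_051257_4_3"
    print(name, "->", wu_created(name))
    print("older than 20 Mar 2008:", created_before(name, datetime(2008, 3, 20)))
```

A check like `created_before(name, datetime(2008, 3, 20))` is one way to pick out the older work discussed in this thread, though the creation date alone does not say whether a given unit will loop.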
ID: 5501
Profile Westsail and *Pyxey*
Joined: 19 Dec 07
Posts: 24
Credit: 889,050
RAC: 0
Message 5521 - Posted: 28 Mar 2008, 4:53:18 UTC

Just had to cancel two on different machines. Same errors as before.

http://cosmologyathome.org/workunit.php?wuid=1649140
http://cosmologyathome.org/workunit.php?wuid=1656096
ID: 5521
Honza
Volunteer tester

Joined: 21 May 07
Posts: 26
Credit: 5,222,146
RAC: 0
Message 5525 - Posted: 28 Mar 2008, 7:51:27 UTC
Last modified: 28 Mar 2008, 7:57:14 UTC

I had about 10 of them again last day. Perhaps more that I'm not aware of.
Damn, 11 more - seems like the rate is increasing.
BOINC Project specifications and hardware requirements
ID: 5525
Phoneman1

Joined: 5 Nov 07
Posts: 113
Credit: 3,100,327
RAC: 0
Message 5531 - Posted: 28 Mar 2008, 11:11:08 UTC

To answer in part the question I asked here yesterday, wu's in the range wu_031108_xxxxxx_x_x do loop with the DLL message! So far 5 have got to 95%, 3 of those have gone on and uploaded OK and 2 have looped. Fail rate on a small sample is 40%!

Interestingly, all wu's which downloaded yesterday that I have looked at in the range wu_032x08_xxxxxx_x_x have the param do_lensing = F.

All the ones in the ranges wu_031x08_xxxxxx_x_x and wu_030x08_xxxxxx_x_x have the param do_lensing = T. Some in these ranges also have the "warning messages" before the parameters, exactly as before. I should emphasise again these all downloaded yesterday.

Of the two that have failed today one had the warning messages in the param file, one didn't.

It remains the case that since March 22nd none of my wu's in the range wu_032x08_xxxxxx_x_x has failed and only those in the ranges wu_031008_xxxxxx_x_x and wu_030x08_xxxxxx_x_x have had any failure or displayed any sign of looping and have been aborted by me. The common denominator seems to be do_lensing = ?. Having it set to T increases the chances of a failure from zero to up to 40%, if my experience is typical.
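To make that check repeatable, here is a rough Python sketch that reads a parameter file's text and reports the `do_lensing` setting, plus the arithmetic behind the 40% figure (2 looped out of 5). It assumes the parameter file uses plain `key = value` lines with `T`/`F` values, as quoted above; the sample text in the example is invented, and the function names are hypothetical. Lines without an `=` (such as warning messages at the top) are simply skipped.

```python
def read_flag(param_text, key="do_lensing"):
    """Return True/False for a 'key = T' / 'key = F' line, or None if absent."""
    for line in param_text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name.strip() == key:
            flag = value.strip().upper()
            if flag in ("T", "F"):
                return flag == "T"
    return None


def failure_rate(failed, total):
    """Fraction of tasks that failed; raises ValueError for an empty sample."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


if __name__ == "__main__":
    # Invented sample text, only to show the shape of the input.
    sample = "some warning message\ndo_lensing = T\nother_param = 1\n"
    print("do_lensing:", read_flag(sample))
    print("fail rate: {:.0%}".format(failure_rate(2, 5)))
```

With only five tasks in the sample the rate is a very loose estimate, so it is best read as "some of them loop" rather than as a firm percentage.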

Phoneman1
ID: 5531
Profile Conan
Joined: 28 Aug 07
Posts: 169
Credit: 1,438,854
RAC: 2,507
Message 5610 - Posted: 31 Mar 2008, 0:49:46 UTC - in response to Message 5531.  

To answer in part the question I asked here yesterday, wu's in the range wu_031108_xxxxxx_x_x do loop with the DLL message! So far 5 have got to 95%, 3 of those have gone on and uploaded OK and 2 have looped. Fail rate on a small sample is 40%!

Interestingly, all wu's which downloaded yesterday that I have looked at in the range wu_032x08_xxxxxx_x_x have the param do_lensing = F.

All the ones in the ranges wu_031x08_xxxxxx_x_x and wu_030x08_xxxxxx_x_x have the param do_lensing = T. Some in these ranges also have the "warning messages" before the parameters, exactly as before. I should emphasise again these all downloaded yesterday.

Of the two that have failed today one had the warning messages in the param file, one didn't.

It remains the case that since March 22nd none of my wu's in the range wu_032x08_xxxxxx_x_x has failed and only those in the ranges wu_031008_xxxxxx_x_x and wu_030x08_xxxxxx_x_x have had any failure or displayed any sign of looping and have been aborted by me. The common denominator seems to be do_lensing = ?. Having it set to T increases the chances of a failure from zero to up to 40%, if my experience is typical.

Phoneman1


G'Day Phoneman1,
Of the ranges that you quoted and the question you asked about looping, I have found that "wu_030908_" is the best chance for getting a loop.
I have caught this type about 3 times now just as it loops from 95% back to about 56%.
I have aborted a number since as well on suspicion.

The other ranges 030608, 031108, 022708, 022208, 031008, 031808, 022808 and a few others all seem to work OK for me, the 030608 one could be another looper but I have caught it doing this.

Unsure why this old work has been re-badged and sent back out to us.

ID: 5610
Profile Sou'westerly

Joined: 1 Jul 07
Posts: 37
Credit: 208,284
RAC: 0
Message 5647 - Posted: 1 Apr 2008, 14:10:17 UTC - in response to Message 5610.  


The other ranges 030608, 031108, 022708, 022208, 031008, 031808, 022808 and a few others all seem to work OK for me, the 030608 one could be another looper but I have caught it doing this.

Unsure why this old work has been re-badged and sent back out to us.


Hi Conan, Since an exception proves the rule, I had this WU in the range 031808 loop for just over 8 hours today before I caught it. It also locked up BOINC and took out the Einstein unit on the other core when I aborted it; luckily Einstein is a bit more stable and was able to restart from the checkpoint. Dave.
ID: 5647
Profile Westsail and *Pyxey*
Joined: 19 Dec 07
Posts: 24
Credit: 889,050
RAC: 0
Message 5659 - Posted: 1 Apr 2008, 20:49:52 UTC

Same thing here...been too busy to babysit.
Just caught this WU looping for over 24 hours:
http://cosmologyathome.org/result.php?resultid=4331722
Pages and pages of:
Tue 01 Apr 2008 10:27:43 AM HST|Cosmology@Home|Task wu_031808_051257_4_3 exited with zero status but no 'finished' file
Tue 01 Apr 2008 10:27:43 AM HST|Cosmology@Home|If this happens repeatedly you may need to reset the project.
Tue 01 Apr 2008 10:27:43 AM HST|Cosmology@Home|Restarting task wu_031808_051257_4_3 using camb version 212

ARRGGG!
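For those who cannot babysit their hosts, a looping task could in principle be spotted from the message log instead. The Python sketch below counts how often each task produced the "exited with zero status but no 'finished' file" message and lists tasks at or above a threshold. It assumes the log has been saved as plain text in the pipe-separated form quoted above; the threshold of 3 and the function names are arbitrary choices for the example, not anything defined by BOINC.

```python
import re
from collections import Counter

# Matches the restart message quoted above and captures the task name.
LOOP_RE = re.compile(r"Task (\S+) exited with zero status but no")


def restart_counts(log_text):
    """Count 'exited with zero status' messages per task name."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOOP_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def looping_tasks(log_text, threshold=3):
    """Return sorted task names seen restarting at least `threshold` times."""
    counts = restart_counts(log_text)
    return sorted(task for task, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    line = ("Tue 01 Apr 2008 10:27:43 AM HST|Cosmology@Home|"
            "Task wu_031808_051257_4_3 exited with zero status "
            "but no 'finished' file")
    print(looping_tasks("\n".join([line] * 4)))
```

A single occurrence of the message can be harmless, which is why the sketch only flags repeats.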
ID: 5659
Profile Jayargh
Volunteer moderator
Volunteer tester
Joined: 25 Jun 07
Posts: 508
Credit: 2,282,158
RAC: 0
Message 5676 - Posted: 2 Apr 2008, 2:24:07 UTC

Scott- Old work dated before the 20th is still being sent out. Can this be looked at? These are re-releases of cxl work. Thanks
ID: 5676
