Forums : Technical Support : URGENT Problems Discussion Thread

Phoneman1

Send message
Joined: 5 Nov 07
Posts: 113
Credit: 3,100,327
RAC: 0
Message 5661 - Posted: 1 Apr 2008, 21:54:45 UTC
Last modified: 1 Apr 2008, 22:01:49 UTC

A few questions for the experienced crunchers on this project...

Have you noticed that reporting completed tasks on this project has been rather hit and miss for the last four hours or so?

Have you seen the message "Server can't open database" in the last, say, 4 hours (since 18:30 UTC)?

Have you seen timeout messages on reporting work in this period?

Have you seen "failed to upload" messages (backing off for 1 min, etc.) in this period?

Have you looked at the service status page on the web site and noticed that the number of work units to validate has remained around the 42,000 mark for the last few hours, despite falling from 52,000 earlier today?

Have you noticed that the transitioner backlog has crept up to 2 hours? - It was 0 earlier.

If you answered yes to all that lot you might be thinking... this project looks like it is having a relapse - time to page Dr Scott.

But no one has updated the urgent thread to report an urgent problem.

So my final question is: if you have seen all that, is this normal for a project recovering from big problems?

It's getting late here and I've an early start tomorrow, so I can't babysit it any longer tonight, but I thought I'd draw folks' attention to some strange happenings server side.

Phoneman1

EDIT - Having typed that lot in, the nth automatic retry report on both my machines worked. It is still a concern that it took so long to report work. It could cause credit to be lost on machines that report every 24 hours, for example.
ID: 5661 · Report as offensive
Profile Rebirther
Volunteer tester
Avatar

Send message
Joined: 22 May 07
Posts: 23
Credit: 277,369
RAC: 0
Message 5662 - Posted: 1 Apr 2008, 22:04:04 UTC

Reporting WUs and validation are very slow, and I'm only getting "no work" messages.
ID: 5662 · Report as offensive
Profile Jayargh
Volunteer moderator
Volunteer tester
Avatar

Send message
Joined: 25 Jun 07
Posts: 508
Credit: 2,282,158
RAC: 0
Message 5663 - Posted: 1 Apr 2008, 22:06:49 UTC - in response to Message 5661.  
Last modified: 1 Apr 2008, 22:10:09 UTC

Yes to all of the above, as well as isolated MySQL errors and a bad response from the server when trying to open my pending list.

As far as "normal" for a project recovering goes, I would say no, most of the problems remain... and normal is different for every project.

Since I am only dipping my toe back in the Cosmo waters (again)... nothing is urgent anymore... Scott's posting said it went wacky over the weekend, but in reality it started Tues/Wed and no one in project admin even noticed... so for me it is no longer urgent... but for those trying to run this project as primary, I would encourage you to post there, such as you, Phoneman1 :)
ID: 5663 · Report as offensive
Profile Labbie
Avatar

Send message
Joined: 8 Nov 07
Posts: 64
Credit: 859,370
RAC: 0
Message 5664 - Posted: 1 Apr 2008, 22:13:38 UTC

Personally, I think they should shut down everything that is not required for the validation process until the number of WUs waiting for validation is down to a reasonable level. I think that the (now) 43,000 WUs waiting for validation are causing some of the DB issues.

ID: 5664 · Report as offensive
STE\/E
Volunteer tester

Send message
Joined: 12 Jun 07
Posts: 375
Credit: 16,522,388
RAC: 0
Message 5665 - Posted: 1 Apr 2008, 22:23:07 UTC - in response to Message 5655.  

Thank you Scott for fixing the problem quickly when you returned.


If you think the Problems are fixed I'd hate to be around when you think things are Broke ... ;)

Yup, things still are not running the way they should be ... o_0


ID: 5665 · Report as offensive
Stwainer

Send message
Joined: 21 Jun 07
Posts: 18
Credit: 536,245
RAC: 0
Message 5666 - Posted: 1 Apr 2008, 22:23:57 UTC

All I'm saying is that maybe Scott has something else to do now and again. He is an undergrad - maybe he has to attend class from time to time or have a beer with friends occasionally.

Some of the posts have been very "it's the end of the project/world" sounding to me. It's a bump in the road - I'm sure we'll all get over it.

I do agree with Labbie though. Maybe give the validator some time to catch up before releasing more WUs.

ID: 5666 · Report as offensive
Profile Jayargh
Volunteer moderator
Volunteer tester
Avatar

Send message
Joined: 25 Jun 07
Posts: 508
Credit: 2,282,158
RAC: 0
Message 5667 - Posted: 1 Apr 2008, 22:32:36 UTC - in response to Message 5666.  
Last modified: 1 Apr 2008, 22:46:43 UTC

All I'm saying is that maybe Scott has something else to do now and again. He is an undergrad - maybe he has to attend class from time to time or have a beer with friends occasionally.

Some of the posts have been very "it's the end of the project/world" sounding to me. It's a bump in the road - I'm sure we'll all get over it.

I do agree with Labbie though. Maybe give the validator some time to catch up before releasing more WUs.



Stwainer, you are absolutely correct... however, I might add that projects based at universities usually have some students to help admin, so that problems don't go unnoticed for much more than a day. Participants that were crunching only Cosmo felt like it was the end of the world because they had formerly productive machines doing nothing... especially if the machine has to be configured remotely. Everyone's situation is different: some have 100s of hosts, some just 1, and if the last month or so has taught the users anything, it is that you can't trust Cosmo (anymore) to run like, say, Einstein.
ID: 5667 · Report as offensive
Profile Scott
Volunteer moderator
Project administrator
Project developer
Avatar

Send message
Joined: 1 Apr 07
Posts: 662
Credit: 13,742
RAC: 0
Message 5668 - Posted: 1 Apr 2008, 22:55:40 UTC

We are actually considering hiring another person to help out, especially considering that I will either be leaving UIUC this next semester or I will be a grad student here. Either way, I'll have a lot less time than even now to work on C@H after this semester.
Scott Kruger
Project Administrator, Cosmology@Home
ID: 5668 · Report as offensive
Roland_f

Send message
Joined: 21 Mar 08
Posts: 10
Credit: 281,620
RAC: 0
Message 5672 - Posted: 2 Apr 2008, 0:01:36 UTC - in response to Message 5668.  

We are actually considering hiring another person to help out, especially considering that I will either be leaving UIUC this next semester or I will be a grad student here. Either way, I'll have a lot less time than even now to work on C@H after this semester.

OK, so you are only working part time (hobby-wise) as admin and scaling down your study for it - it must be a horrible job with all the crunchers here making angry postings whenever there are problems.

So thanks for your effort and system support.

After more 'server not reached' messages and download errors last evening, I was just able to upload some finished WUs crunched overnight, after C@H had another Myserv problem that prevented upload of results an hour ago.
Strange that my fallback project that I hooked up on Monday (Lattice) also ran out of work last night, and I see Lattice has 55k WUs for deletion and 112k results for deletion. What a waste - so I really hope at least the C@H results which finally went through are not wasted content-wise.
ID: 5672 · Report as offensive
Brian Silvers

Send message
Joined: 11 Dec 07
Posts: 420
Credit: 270,580
RAC: 0
Message 5673 - Posted: 2 Apr 2008, 1:09:43 UTC - in response to Message 5667.  

you can't trust Cosmo (anymore) to run like, say, Einstein.


I should hope not, at least not in all cases. Part of what you do not know about me is that I laid into Bernd and Bruce very heavily when they released S5R2 as essentially an unannounced public beta. Sure, they said it was coming, but they never told anyone it was going to have a multitude of validation problems, tasks crashing, compiler-induced penalties for AMD systems, tasks that ran 50-60 hours on my class of machine (AMD 3700+ overclocked to the equivalent of nearly an FX-57, which is the fastest single-core out there unless you use exotic cooling), etc, etc, etc... I was very hard on them. I raised the same issue about doing testing. I provided them with a relatively simple tip, enabling runtime checks in Visual C++ while doing a debug build, that helped them track down multiple issues.

I'm willing to not pursue the issue if you'll meet me halfway...
ID: 5673 · Report as offensive
Profile Jayargh
Volunteer moderator
Volunteer tester
Avatar

Send message
Joined: 25 Jun 07
Posts: 508
Credit: 2,282,158
RAC: 0
Message 5675 - Posted: 2 Apr 2008, 1:52:22 UTC - in response to Message 5673.  
Last modified: 2 Apr 2008, 2:05:10 UTC



I'm willing to not pursue the issue if you'll meet me halfway...


When one earns and deserves respect, one gets respect in return :)
ID: 5675 · Report as offensive
tekwyzrd

Send message
Joined: 4 Mar 08
Posts: 14
Credit: 73,910
RAC: 0
Message 5677 - Posted: 2 Apr 2008, 2:45:22 UTC - in response to Message 5609.  

I'm out of work too. AMD Athlon XP 2600+ running Sabayon Linux (Gentoo based distribution) keeps getting the dreaded

Sun 30 Mar 2008 07:48:14 PM EDT|Cosmology@Home|Message from server: No work sent
Sun 30 Mar 2008 07:48:14 PM EDT|Cosmology@Home|Message from server: (there was work but it was committed to other platforms)

I guess it's time to suspend cosmology@home and start running something else until the problems are solved.


WooHoo! I got SIX TASKS today before I again got "Message from server: (there was work but it was committed to other platforms)"

Not enough to make it through the night. Time to search for another new project.
ID: 5677 · Report as offensive
Phoneman1

Send message
Joined: 5 Nov 07
Posts: 113
Credit: 3,100,327
RAC: 0
Message 5678 - Posted: 2 Apr 2008, 5:56:03 UTC - in response to Message 5661.  

Thanks for the responses on those questions I posed last night.

The picture is looking much better this morning. The number to validate is down by about 11,000 compared to 8 hours ago. The transitioner backlog is back at 0 and WUs are being reported as complete on the first attempt!

It looks like some issues with not getting work out for all machines (mainly AMD) remain, but it is looking much more like the project it was before last week's major problems.

All typed with fingers firmly crossed!!!

Phoneman1

ID: 5678 · Report as offensive
Profile kevint

Send message
Joined: 30 Aug 07
Posts: 46
Credit: 6,502,980
RAC: 0
Message 5686 - Posted: 2 Apr 2008, 15:26:05 UTC



More problems ???

Seems that uploading is fine, but reporting of WU is not working.

Getting the same issues as last weekend.

4/2/2008 8:30:43 AM|Cosmology@Home|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 23 completed tasks
4/2/2008 8:30:52 AM|Cosmology@Home|Scheduler request succeeded: got 0 new tasks
4/2/2008 8:30:52 AM|Cosmology@Home|Using the wrong URL can cause problems in some cases.
4/2/2008 8:30:52 AM|Cosmology@Home|When convenient, detach this project, then reattach to http://www.cosmologyathome.org/
4/2/2008 8:30:52 AM|Cosmology@Home|Message from server: Server can't open database
4/2/2008 8:30:52 AM|Cosmology@Home|New host venue:



AND -

Am seeing this - seems there are still some older WUs out there that won't complete...

4/2/2008 8:31:57 AM|Cosmology@Home|Task wu_030608_110824_3_6 exited with a DLL initialization error.
4/2/2008 8:31:57 AM|Cosmology@Home|If this happens repeatedly you may need to reboot your computer.
4/2/2008 8:31:57 AM|Cosmology@Home|Restarting task wu_030608_110824_3_6 using camb version 212
4/2/2008 8:34:18 AM|Cosmology@Home|Task wu_030608_110824_3_6 exited with a DLL initialization error.
4/2/2008 8:34:18 AM|Cosmology@Home|If this happens repeatedly you may need to reboot your computer.
4/2/2008 8:34:18 AM|Cosmology@Home|Restarting task wu_030608_110824_3_6 using camb version 212
4/2/2008 8:36:35 AM|Cosmology@Home|Task wu_030608_110824_3_6 exited with a DLL initialization error.
4/2/2008 8:36:35 AM|Cosmology@Home|If this happens repeatedly you may need to reboot your computer.
4/2/2008 8:36:35 AM|Cosmology@Home|Restarting task wu_030608_110824_3_6 using camb version 212
4/2/2008 8:38:51 AM|Cosmology@Home|Task wu_030608_110824_3_6 exited with a DLL initialization error.
4/2/2008 8:38:51 AM|Cosmology@Home|If this happens repeatedly you may need to reboot your computer.
4/2/2008 8:38:51 AM|Cosmology@Home|Restarting task wu_030608_110824_3_6 using camb version 212
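
For anyone who wants to tally how often each message shows up rather than eyeballing it, here is a minimal Python sketch. It assumes client messages copied to text in the "timestamp|project|message" layout shown above; the function names parse_line and summarize are made up for illustration and are not part of BOINC. The sample lines are taken from the log in this post.

```python
from collections import Counter


def parse_line(line):
    """Split one 'timestamp|project|message' line into its three fields.

    Returns None for lines that do not have that shape.
    """
    parts = line.rstrip("\n").split("|", 2)
    if len(parts) != 3:
        return None
    return tuple(parts)


def summarize(lines):
    """Count how many times each distinct message text appears."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip blank or malformed lines
        _timestamp, _project, message = parsed
        counts[message] += 1
    return counts


# Sample lines copied from the log above.
sample = [
    "4/2/2008 8:30:52 AM|Cosmology@Home|Message from server: Server can't open database",
    "4/2/2008 8:31:57 AM|Cosmology@Home|Task wu_030608_110824_3_6 exited with a DLL initialization error.",
    "4/2/2008 8:34:18 AM|Cosmology@Home|Task wu_030608_110824_3_6 exited with a DLL initialization error.",
]

for message, n in summarize(sample).most_common():
    print(n, message)
```

A message that keeps repeating for the same task, like the DLL initialization error here, floats to the top of the tally.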

ID: 5686 · Report as offensive
tekwyzrd

Send message
Joined: 4 Mar 08
Posts: 14
Credit: 73,910
RAC: 0
Message 5687 - Posted: 2 Apr 2008, 15:50:33 UTC

Problems continue. Six tasks since March 31st.

Wed 02 Apr 2008 11:29:14 AM EDT|Cosmology@Home|Message from server: No work sent
Wed 02 Apr 2008 11:29:14 AM EDT|Cosmology@Home|Message from server: (there was work but it was committed to other platforms)
Wed 02 Apr 2008 11:30:14 AM EDT|Cosmology@Home|Sending scheduler request: To fetch work. Requesting 234198 seconds of work, reporting 0 completed tasks
Wed 02 Apr 2008 11:30:52 AM EDT|Cosmology@Home|Scheduler request succeeded: got 0 new tasks
Wed 02 Apr 2008 11:31:53 AM EDT|Cosmology@Home|Sending scheduler request: To fetch work. Requesting 234375 seconds of work, reporting 0 completed tasks
Wed 02 Apr 2008 11:31:59 AM EDT|Cosmology@Home|Scheduler request succeeded: got 0 new tasks
Wed 02 Apr 2008 11:31:59 AM EDT|Cosmology@Home|Message from server: Server can't open database
ID: 5687 · Report as offensive
Phoneman1

Send message
Joined: 5 Nov 07
Posts: 113
Credit: 3,100,327
RAC: 0
Message 5700 - Posted: 3 Apr 2008, 12:29:19 UTC

Re this item in the Urgent Problem thread

http://www.cosmologyathome.org/forum_thread.php?id=65&nowrap=true#5694

It has been 7 hours since Honza's post and still no completed work is able to be reported. There must be quite a number to be reported from all the Cosmology crunchers when it is restarted - there's a hundred plus on each of my two machines for a start!

According to the status page the number to validate remains at zero and the number of work units and results to delete remain as they were three hours ago.

It seems the validator has finished its queue but the deleter has stopped working, however the status page still shows all green!

Looks to me like the status page is fibbing....:-)

I do hope I'm wrong and there is a better explanation!

Phoneman1

P.S. Originally posted in the wrong thread - too many tabs open on the browser.
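
The reasoning above (a backlog that is not zero and has not moved for hours suggests the daemon behind it is not running, whatever colour the status page shows) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name looks_stalled is made up, the readings are noted by hand from the status page, and the numbers in the usage lines are placeholder values, not real figures from this project.

```python
def looks_stalled(samples, min_samples=3):
    """Guess whether a queue has stopped draining.

    samples: backlog counts noted from the status page, oldest first.
    Returns True only if the last `min_samples` readings are identical
    and non-zero. An unchanged zero just means the queue is empty.
    """
    recent = samples[-min_samples:]
    if len(recent) < min_samples:
        return False  # not enough readings to judge
    first = recent[0]
    return first > 0 and all(value == first for value in recent)


# Placeholder readings, for illustration only.
print(looks_stalled([0, 0, 0]))           # empty queue, nothing to drain
print(looks_stalled([900, 900, 900]))     # non-zero and not moving
print(looks_stalled([900, 700, 500]))     # still draining
```

By this rule, a validator queue sitting at zero would not be flagged, while a delete queue stuck at the same non-zero figure would be.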
ID: 5700 · Report as offensive
Profile kevint

Send message
Joined: 30 Aug 07
Posts: 46
Credit: 6,502,980
RAC: 0
Message 5701 - Posted: 3 Apr 2008, 14:44:38 UTC
Last modified: 3 Apr 2008, 14:45:35 UTC

I am just saying..... And this is just on 1 of many machines..

4/3/2008 8:41:40 AM|Cosmology@Home|Sending scheduler request: Requested by user. Requesting 953866 seconds of work, reporting 183 completed tasks
4/3/2008 8:41:55 AM|Cosmology@Home|Scheduler request succeeded: got 0 new tasks
4/3/2008 8:41:55 AM|Cosmology@Home|Message from server: Project is temporarily shut down for maintenance
4/3/2008 8:42:55 AM|Cosmology@Home|Fetching scheduler list
4/3/2008 8:43:00 AM|Cosmology@Home|Master file download succeeded
4/3/2008 8:43:05 AM|Cosmology@Home|Sending scheduler request: Requested by user. Requesting 955810 seconds of work, reporting 183 completed tasks
4/3/2008 8:43:15 AM|Cosmology@Home|Scheduler request succeeded: got 0 new tasks
4/3/2008 8:43:15 AM|Cosmology@Home|Message from server: Project is temporarily shut down for maintenance

ID: 5701 · Report as offensive
Profile m4rtyn
Avatar

Send message
Joined: 23 Aug 07
Posts: 18
Credit: 372,460
RAC: 0
Message 5702 - Posted: 3 Apr 2008, 15:47:57 UTC

At this rate, by the time they get round to restarting the project there will be such a flood of WUs to be reported that the validator will jam up again.
m4rtyn
ID: 5702 · Report as offensive
Buster Gunn

Send message
Joined: 21 Jul 07
Posts: 5
Credit: 1,702,164
RAC: 0
Message 5704 - Posted: 3 Apr 2008, 16:23:34 UTC

I guess making sure the daemons were actually running correctly was not part of the deal. The validator finished its cleanup 18 hours ago and the deleters should have finished long before now, but they are not running. Those numbers haven't changed in those same 18 hours.

Come on folks, at least look in every 12 hours to make sure things are running. Aside from the science, this project has dropped in ratings to the bottom. Time for me to move on.

ID: 5704 · Report as offensive
STE\/E
Volunteer tester

Send message
Joined: 12 Jun 07
Posts: 375
Credit: 16,522,388
RAC: 0
Message 5707 - Posted: 3 Apr 2008, 19:52:39 UTC - in response to Message 5702.  

At this rate, by the time they get round to restarting the project there will be such a flood of WUs to be reported that the validator will jam up again.


I agree. I know I had over 730 at last count waiting to turn in a while ago, or maybe that's just another wild exaggeration on my part ... ;)
ID: 5707 · Report as offensive