URGENT Problems Thread
Author | Message |
---|---|
Volunteer moderator Project administrator Project developer Joined: 1 Apr 07 Posts: 662 Credit: 13,742 RAC: 0 |
I had to temporarily turn down the frequency of the server status page update, so it's not showing the most recent statistics. The database was acting up again earlier, so I had to play around with it a bit. It looks to be up and running now, though. Scott Kruger Project Administrator, Cosmology@Home |
[ESL Brigade] Redbill Joined: 29 Mar 08 Posts: 9 Credit: 790,400 RAC: 0 |
"I had to temporarily turn down the frequency of the server status page update, so it's not showing the most recent statistics." My pending and non-validated WUs are still increasing... I don't think the problem is fixed yet. Redbill |
Joined: 11 Jan 08 Posts: 9 Credit: 49,099 RAC: 0 |
"I had to temporarily turn down the frequency of the server status page update, so it's not showing the most recent statistics." Well, at least on my side there is NOTHING up & running ATM!! Regards, Kurt |
STE\/E Volunteer tester Joined: 12 Jun 07 Posts: 375 Credit: 16,539,257 RAC: 0 |
I don't think anything is fixed either, Scott; the workunits waiting for validation are at 122,985 now and still climbing. The Server Status shows all green, but that probably is not the case ... :) |
Joined: 31 Oct 07 Posts: 11 Credit: 2,215,741 RAC: 0 |
I'm pretty sure it's not fixed yet. 7/10/2008 4:36:49 AM|Cosmology@Home|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 6 completed tasks 7/10/2008 4:36:55 AM|Cosmology@Home|Scheduler request succeeded: got 0 new tasks 7/10/2008 4:36:55 AM|Cosmology@Home|Message from server: Server error: can't attach shared memory |
Warped Joined: 18 Dec 07 Posts: 7 Credit: 77,399 RAC: 0 |
The validator server needs a severe kick up the derrière. The problem has been evident since the weekend and is definitely not resolved. |
Volunteer moderator Project administrator Project developer Joined: 1 Apr 07 Posts: 662 Credit: 13,742 RAC: 0 |
The validator is not the problem; it's been happily churning away at results, even as we speak. It seems that there's been a huge bump in file I/O recently and the server is having trouble dealing with it. Even working from a terminal, it can take 10 seconds to read from or write to a file. We will most likely have to do some more hardware upgrades to handle the load. For now, I want to shut down the daemons for a couple of hours and see if the validation rate improves. Scott Kruger Project Administrator, Cosmology@Home |
Brian Silvers Joined: 11 Dec 07 Posts: 420 Credit: 270,580 RAC: 0 |
"The validator is not the problem; it's been happily churning away at results, even as we speak." I don't know if "validation rate" is the metric you need to be looking at. There are some really ODD things happening with the results that are coming in. I have some results that are part of WUs that have reached quorum but are still in Validate State = Initial, while newer reports will kick off and validate. Then some other new reports will get hung up at Validate State = Initial, just like the results where quorum was met 2 or 3 days ago that are still pending with VS=Initial... I don't know how the validator is coded to handle problems. Perhaps, if the validation process has a problem, results go into a FIFO-like queue, so the oldest result that has met quorum should be the first result in the "to be validated" stack...? It's either that or somewhere in the code, someone has assigned the value of "bar" to the "foo" variable... |
Volunteer moderator Volunteer tester Joined: 25 Jun 07 Posts: 508 Credit: 2,282,158 RAC: 0 |
The validation rate seems to have gone to zero over the last few hours. |
Joined: 19 Jan 08 Posts: 180 Credit: 2,500,290 RAC: 0 |
Well, maybe it would validate more if someone would start the validator ;-) |
Joined: 8 Aug 07 Posts: 54 Credit: 527,780 RAC: 0 |
Well, I see we're starting to get some green back on the status board. Probably a good idea to keep the feeder off until the validation issue is fixed. Hope it works now. Edit: so far so good, I see the numbers starting to drop. Keep up the good work! A clear conscience is usually the sign of a bad memory |
Joined: 13 Mar 08 Posts: 4 Credit: 184,450 RAC: 0 |
Getting the following error: Server can't open log file (../log_darkmatter/scheduler.log). I can't send results, nor can I receive new ones. |
Robert Joined: 6 May 08 Posts: 1 Credit: 65,330 RAC: 0 |
7/11/2008 4:37:51 PM|Cosmology@Home|Started upload of wu_070808_102210_0_0_1 7/11/2008 4:38:13 PM||Project communication failed: attempting access to reference site 7/11/2008 4:38:13 PM|Cosmology@Home|Temporarily failed upload of wu_070808_102210_0_0_1: system connect 7/11/2008 4:38:13 PM|Cosmology@Home|Backing off 30 min 12 sec on upload of wu_070808_102210_0_0_1 7/11/2008 4:38:14 PM||Access to reference site succeeded - project servers may be temporarily down. 7/11/2008 4:38:23 PM|Cosmology@Home|Started upload of wu_070808_102210_0_0_0 7/11/2008 4:38:46 PM||Project communication failed: attempting access to reference site 7/11/2008 4:38:46 PM|Cosmology@Home|Temporarily failed upload of wu_070808_102210_0_0_0: system connect 7/11/2008 4:38:46 PM|Cosmology@Home|Backing off 1 hr 41 min 6 sec on upload of wu_070808_102210_0_0_0 7/11/2008 4:38:47 PM||Access to reference site succeeded - project servers may be temporarily down. 7/11/2008 4:43:14 PM|Cosmology@Home|Started upload of wu_070808_110655_0_0_3 7/11/2008 4:43:28 PM|Cosmology@Home|Started upload of wu_070808_110655_0_0_0 7/11/2008 4:43:38 PM|Cosmology@Home|[error] Error on file upload: can't open log file 7/11/2008 4:43:38 PM|Cosmology@Home|Temporarily failed upload of wu_070808_110655_0_0_3: transient upload error 7/11/2008 4:43:38 PM|Cosmology@Home|Backing off 23 min 56 sec on upload of wu_070808_110655_0_0_3 7/11/2008 4:43:43 PM|Cosmology@Home|[error] Error on file upload: can't open log file 7/11/2008 4:43:43 PM|Cosmology@Home|Temporarily failed upload of wu_070808_110655_0_0_0: transient upload error 7/11/2008 4:43:43 PM|Cosmology@Home|Backing off 22 min 46 sec on upload of wu_070808_110655_0_0_0 7/11/2008 4:44:34 PM|Cosmology@Home|Started upload of wu_070808_110655_0_0_2 7/11/2008 4:44:54 PM|Cosmology@Home|[error] Error on file upload: can't open log file 7/11/2008 4:44:54 PM|Cosmology@Home|Temporarily failed upload of wu_070808_110655_0_0_2: transient upload error 7/11/2008 4:44:54 PM|Cosmology@Home|Backing off 36 min 58 sec on upload of wu_070808_110655_0_0_2 7/11/2008 4:46:16 PM|Cosmology@Home|Started upload of wu_070808_110655_0_0_1 7/11/2008 4:46:43 PM|Cosmology@Home|[error] Error on file upload: can't open log file 7/11/2008 4:46:43 PM|Cosmology@Home|Temporarily failed upload of wu_070808_110655_0_0_1: transient upload error 7/11/2008 4:46:43 PM|Cosmology@Home|Backing off 7 min 9 sec on upload of wu_070808_110655_0_0_1 7/11/2008 4:51:08 PM|Cosmology@Home|Started upload of wu_070808_100118_1_0_3 7/11/2008 4:51:30 PM||Project communication failed: attempting access to reference site 7/11/2008 4:51:30 PM|Cosmology@Home|Temporarily failed upload of wu_070808_100118_1_0_3: system connect 7/11/2008 4:51:30 PM|Cosmology@Home|Backing off 14 min 16 sec on upload of wu_070808_100118_1_0_3 7/11/2008 4:51:31 PM||Access to reference site succeeded - project servers may be temporarily down. 7/11/2008 4:53:53 PM|Cosmology@Home|Started upload of wu_070808_110655_0_0_1 7/11/2008 4:54:13 PM|Cosmology@Home|[error] Error on file upload: can't open log file 7/11/2008 4:54:13 PM|Cosmology@Home|Temporarily failed upload of wu_070808_110655_0_0_1: transient upload error 7/11/2008 4:54:13 PM|Cosmology@Home|Backing off 8 min 3 sec on upload of wu_070808_110655_0_0_1 7/11/2008 4:54:30 PM||Project communication failed: attempting access to reference site I can't upload. |
Joined: 25 Sep 07 Posts: 17 Credit: 1,471,530 RAC: 0 |
Server Status @ 11 Jul 2008 16:31:12 UTC = 118378 workunits waiting for validation. Server Status @ 11 Jul 2008 20:57:18 UTC = 114267 workunits waiting for validation. Going down at roughly 1k/hour; 114k remaining = 114 hours = 4 days + ??? |
Andres Melo Joined: 27 Jun 08 Posts: 1 Credit: 20,130 RAC: 0 |
12/07/2008 12:23:23 a.m.|Cosmology@Home|Fetching scheduler list 12/07/2008 12:23:43 a.m.|Cosmology@Home|Master file download succeeded 12/07/2008 12:23:48 a.m.|Cosmology@Home|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 13 completed tasks 12/07/2008 12:23:53 a.m.|Cosmology@Home|Scheduler request succeeded: got 0 new tasks 12/07/2008 12:23:53 a.m.|Cosmology@Home|Message from server: Server can't open log file (../log_darkmatter/scheduler.log) 12/07/2008 12:24:44 a.m.||Project communication failed: attempting access to reference site 12/07/2008 12:24:44 a.m.|Cosmology@Home|Temporarily failed upload of wu_070708_185703_1_1_1: http error 12/07/2008 12:24:44 a.m.|Cosmology@Home|Backing off 1 hr 41 min 41 sec on upload of wu_070708_185703_1_1_1 12/07/2008 12:24:46 a.m.||Access to reference site succeeded - project servers may be temporarily down. :( |
Joined: 22 Dec 07 Posts: 13 Credit: 115,740 RAC: 0 |
From information gathered from here & there (please correct me if I'm wrong), the current situation is this: (a) The scheduler has been turned off to allow the validator to reduce the backlog. (b) It IS reducing the backlog, albeit pretty slowly (seeing that there is no new input). (c) Stats are being produced and exported to the various stats sites. (d) When the scheduler is turned back on, the server will be overwhelmed with finished work units that are, as I write this, in "Uploading" status on most crunchers' BOINC Managers. (e) IMO, this has to be a problem with the number of SQL connections that Scott's using. Far too few. (f) Therefore, with limited connections and floodgates that will burst when Scott turns the scheduler on again, we're probably looking at another major server crash in the near future. If I've missed anything, or made any errors, please don't hesitate to tell me (if the web site is up). Cheers, Chris |
Honza Volunteer tester Joined: 21 May 07 Posts: 26 Credit: 5,222,146 RAC: 0 |
Well, there are not only 250k results in progress but also 180k results ready to send. Once the scheduler is turned on, not only will uploads and reports come in, but new results will also be downloaded to clients. I would let the validator catch up, then slowly let the results now being uploaded and reported flow into the validator, and only then send new work. But I may be wrong, or it may be too difficult to implement in a short time... BOINC Project specifications and hardware requirements |
rroonnaalldd Joined: 10 Apr 08 Posts: 18 Credit: 147,580 RAC: 0 |
Cosmology@Home 13.07.2008 12:32:15 [file_xfer] Temporarily failed upload of wu_071008_025543_1_1_0: transient upload error Cosmology@Home 13.07.2008 12:32:15 [error] Error on file upload: can't open log file Cosmology@Home 13.07.2008 12:31:21 [file_xfer] Started upload of file wu_071008_025543_1_1_0 Cosmology@Home 13.07.2008 12:27:51 Access to reference site succeeded - project servers may be temporarily down. Cosmology@Home 13.07.2008 12:27:50 Backing off 2 hr 31 min 3 sec on upload of file wu_070908_072824_3_0_0 Cosmology@Home 13.07.2008 12:27:50 [file_xfer] Temporarily failed upload of wu_070908_072824_3_0_0: connect() failed Cosmology@Home 13.07.2008 12:27:50 Project communication failed: attempting access to reference site Cosmology@Home 13.07.2008 12:27:29 [file_xfer] Started upload of file wu_070908_072824_3_0_0 Cosmology@Home 13.07.2008 12:27:02 Backing off 22 min 19 sec on upload of file wu_070908_001542_0_0_2 Cosmology@Home 13.07.2008 12:27:02 [file_xfer] Temporarily failed upload of wu_070908_001542_0_0_2: transient upload error Cosmology@Home 13.07.2008 12:27:02 [error] Error on file upload: can't open log file Cosmology@Home 13.07.2008 12:26:46 [file_xfer] Started upload of file wu_070908_001542_0_0_2 |
WHRoeder Joined: 4 Nov 07 Posts: 6 Credit: 56,200 RAC: 0 |
"Cosmology@Home 13.07.2008 12:32:15 [file_xfer] Temporarily failed upload of wu_071008_025543_1_1_0: transient upload error" Me also: can't open log file. |
Stefan Joined: 1 Nov 07 Posts: 4 Credit: 373,000 RAC: 0 |
"Cosmology@Home 13.07.2008 12:32:15 [file_xfer] Temporarily failed upload of wu_071008_025543_1_1_0: transient upload error" Same here :/ |
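
The backlog estimate posted earlier in the thread (118378 workunits waiting for validation at 16:31:12 UTC and 114267 at 20:57:18 UTC on 11 Jul 2008) can be rechecked with a few lines of Python. This is only a rough sketch: it assumes the backlog keeps draining at a constant rate and that no new results arrive, which the thread suggests will not hold once the scheduler is turned back on. The function name `drain_estimate` is made up for illustration and is not part of BOINC.

```python
from datetime import datetime


def drain_estimate(t0, n0, t1, n1):
    """Estimate the drain rate and time to clear a backlog from two samples.

    Assumes a constant drain rate between and after the two samples.
    Returns (units_per_hour, hours_remaining).
    """
    elapsed_hours = (t1 - t0).total_seconds() / 3600.0
    if elapsed_hours <= 0:
        raise ValueError("second sample must be later than the first")
    rate = (n0 - n1) / elapsed_hours
    if rate <= 0:
        raise ValueError("backlog is not shrinking between the samples")
    return rate, n1 / rate


# The two Server Status readings quoted in the thread (UTC).
first = datetime(2008, 7, 11, 16, 31, 12)
second = datetime(2008, 7, 11, 20, 57, 18)

rate, hours_left = drain_estimate(first, 118378, second, 114267)
print(f"drain rate: {rate:.0f} workunits/hour")
print(f"time to clear: {hours_left:.0f} hours ({hours_left / 24:.1f} days)")
```

With these two readings the rate comes out somewhat below 1k per hour, so the "4 days + ???" guess is, if anything, a little optimistic.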