Forums : Technical Support : URGENT Problems Thread

Scott
Volunteer moderator
Project administrator
Project developer
Joined: 1 Apr 07
Posts: 662
Credit: 13,742
RAC: 0
Message 6463 - Posted: 10 Jul 2008, 4:22:10 UTC

I had to temporarily turn down the frequency of the server status page update, so it's not showing the most recent statistics.

The database was acting up again earlier, so I had to play around with it a bit. It looks to be up and running now, though.
Scott Kruger
Project Administrator, Cosmology@Home
ID: 6463
[ESL Brigade] Redbill
Joined: 29 Mar 08
Posts: 9
Credit: 790,400
RAC: 0
Message 6464 - Posted: 10 Jul 2008, 5:41:08 UTC - in response to Message 6463.  

I had to temporarily turn down the frequency of the server status page update, so it's not showing the most recent statistics.

The database was acting up again earlier, so I had to play around with it a bit. It looks to be up and running now, though.


My pending and non-validated WUs are still increasing...
I don't think the problem is fixed yet.

Redbill
ID: 6464
the silver surfer
Joined: 11 Jan 08
Posts: 9
Credit: 48,970
RAC: 0
Message 6465 - Posted: 10 Jul 2008, 6:10:28 UTC - in response to Message 6463.  

I had to temporarily turn down the frequency of the server status page update, so it's not showing the most recent statistics.

The database was acting up again earlier, so I had to play around with it a bit. It looks to be up and running now, though.



Well, at least on my side there is NOTHING up & running ATM !!

Regards
Kurt

ID: 6465
STE\/E
Volunteer tester
Joined: 12 Jun 07
Posts: 375
Credit: 16,522,388
RAC: 0
Message 6466 - Posted: 10 Jul 2008, 11:25:49 UTC

I don't think anything is fixed either, Scott. The workunits waiting for validation are at 122,985 now and still climbing. The Server Status shows all green, but that probably is not the case ... :)
ID: 6466
Steve Dodd
Joined: 31 Oct 07
Posts: 10
Credit: 1,967,483
RAC: 511
Message 6467 - Posted: 10 Jul 2008, 11:42:32 UTC

I'm pretty sure it's not fixed yet.

7/10/2008 4:36:49 AM|Cosmology@Home|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 6 completed tasks
7/10/2008 4:36:55 AM|Cosmology@Home|Scheduler request succeeded: got 0 new tasks
7/10/2008 4:36:55 AM|Cosmology@Home|Message from server: Server error: can't attach shared memory
ID: 6467
Warped
Joined: 18 Dec 07
Posts: 7
Credit: 71,500
RAC: 0
Message 6470 - Posted: 10 Jul 2008, 16:28:43 UTC

The validator server needs a severe kick up the derrière.

The problem has been evident since the weekend and is definitely not resolved.
ID: 6470
Scott
Volunteer moderator
Project administrator
Project developer
Joined: 1 Apr 07
Posts: 662
Credit: 13,742
RAC: 0
Message 6471 - Posted: 10 Jul 2008, 17:30:40 UTC

The validator is not the problem; it's been happily churning away at results, even as we speak.

It seems that there's been a huge bump in file I/O recently and the server is having trouble dealing with it. Even working from a terminal, it can take 10 seconds to read or write to a file. We will most likely have to do some more hardware upgrades to handle the load.

For now, I want to shut down the daemons for a couple of hours and see if the validation rate improves.
Scott Kruger
Project Administrator, Cosmology@Home
ID: 6471
Brian Silvers
Joined: 11 Dec 07
Posts: 420
Credit: 270,580
RAC: 0
Message 6472 - Posted: 10 Jul 2008, 22:23:33 UTC - in response to Message 6471.  

The validator is not the problem; it's been happily churning away at results, even as we speak.

It seems that there's been a huge bump in file I/O recently and the server is having trouble dealing with it. Even working from a terminal, it can take 10 seconds to read or write to a file. We will most likely have to do some more hardware upgrades to handle the load.

For now, I want to shut down the daemons for a couple of hours and see if the validation rate improves.


I don't know if "validation rate" is the metric you need to be looking at.

Some really odd things happen with incoming results. I have results in WUs that reached quorum 2 or 3 days ago and are still at Validate State = Initial. Meanwhile, some newer reports validate right away, while other new reports get stuck at Validate State = Initial as well.

I don't know how the validator is coded to handle problems. Perhaps results that hit a validation problem go into a FIFO-like queue, in which case the oldest result that has met quorum should be the first one in the "to be validated" queue...?

It's either that or, somewhere in the code, someone has assigned the value of "bar" to the "foo" variable...
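
To make the FIFO idea above concrete, here is a minimal Python sketch of what "oldest quorum first" ordering would look like. It is purely illustrative and is not the actual BOINC validator code; `PendingWorkunit` and `validate_in_fifo_order` are hypothetical names invented for this example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class PendingWorkunit:
    """Hypothetical record for a workunit that has reached quorum."""
    name: str
    quorum_met_at: float  # illustrative timestamp, e.g. seconds since epoch


def validate_in_fifo_order(pending):
    """Return workunit names in the order a strict FIFO queue would process them."""
    # Order by the time quorum was met so the longest-waiting workunit is first.
    queue = deque(sorted(pending, key=lambda wu: wu.quorum_met_at))
    processed = []
    while queue:
        wu = queue.popleft()  # always take the oldest entry
        processed.append(wu.name)
    return processed
```

If the validator behaved like this, a result whose quorum was met days ago could never sit behind one that met quorum today, which is the opposite of what is described above.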

ID: 6472
Jayargh
Volunteer moderator
Volunteer tester
Joined: 25 Jun 07
Posts: 508
Credit: 2,282,158
RAC: 0
Message 6473 - Posted: 11 Jul 2008, 1:12:11 UTC

The validation rate seems to have gone to zero over the last few hours.
ID: 6473
Ananas
Joined: 19 Jan 08
Posts: 180
Credit: 2,500,290
RAC: 0
Message 6474 - Posted: 11 Jul 2008, 1:46:53 UTC

Well, maybe it would validate more if someone would start the validator ;-)
ID: 6474
caspr
Joined: 8 Aug 07
Posts: 54
Credit: 527,780
RAC: 0
Message 6475 - Posted: 11 Jul 2008, 3:19:26 UTC
Last modified: 11 Jul 2008, 3:39:02 UTC

Well, I see we're starting to get some green back on the status board. It's probably a good idea to keep the feeder off until the validation issue is fixed. Hope it works now.


Edit: So far so good; I see the numbers starting to drop. Keep up the good work!
A clear conscience is usually the sign of a bad memory
ID: 6475
Dany
Joined: 13 Mar 08
Posts: 4
Credit: 184,450
RAC: 0
Message 6478 - Posted: 11 Jul 2008, 19:00:16 UTC

I'm getting the following error: Server can't open log file (../log_darkmatter/scheduler.log)

I can't send results, nor can I receive new ones.
ID: 6478
Robert
Joined: 6 May 08
Posts: 1
Credit: 65,330
RAC: 0
Message 6479 - Posted: 11 Jul 2008, 21:02:04 UTC

7/11/2008 4:37:51 PM|Cosmology@Home|Started upload of wu_070808_102210_0_0_1
7/11/2008 4:38:13 PM||Project communication failed: attempting access to reference site
7/11/2008 4:38:13 PM|Cosmology@Home|Temporarily failed upload of wu_070808_102210_0_0_1: system connect
7/11/2008 4:38:13 PM|Cosmology@Home|Backing off 30 min 12 sec on upload of wu_070808_102210_0_0_1
7/11/2008 4:38:14 PM||Access to reference site succeeded - project servers may be temporarily down.
7/11/2008 4:38:23 PM|Cosmology@Home|Started upload of wu_070808_102210_0_0_0
7/11/2008 4:38:46 PM||Project communication failed: attempting access to reference site
7/11/2008 4:38:46 PM|Cosmology@Home|Temporarily failed upload of wu_070808_102210_0_0_0: system connect
7/11/2008 4:38:46 PM|Cosmology@Home|Backing off 1 hr 41 min 6 sec on upload of wu_070808_102210_0_0_0
7/11/2008 4:38:47 PM||Access to reference site succeeded - project servers may be temporarily down.
7/11/2008 4:43:14 PM|Cosmology@Home|Started upload of wu_070808_110655_0_0_3
7/11/2008 4:43:28 PM|Cosmology@Home|Started upload of wu_070808_110655_0_0_0
7/11/2008 4:43:38 PM|Cosmology@Home|[error] Error on file upload: can't open log file
7/11/2008 4:43:38 PM|Cosmology@Home|Temporarily failed upload of wu_070808_110655_0_0_3: transient upload error
7/11/2008 4:43:38 PM|Cosmology@Home|Backing off 23 min 56 sec on upload of wu_070808_110655_0_0_3
7/11/2008 4:43:43 PM|Cosmology@Home|[error] Error on file upload: can't open log file
7/11/2008 4:43:43 PM|Cosmology@Home|Temporarily failed upload of wu_070808_110655_0_0_0: transient upload error
7/11/2008 4:43:43 PM|Cosmology@Home|Backing off 22 min 46 sec on upload of wu_070808_110655_0_0_0
7/11/2008 4:44:34 PM|Cosmology@Home|Started upload of wu_070808_110655_0_0_2
7/11/2008 4:44:54 PM|Cosmology@Home|[error] Error on file upload: can't open log file
7/11/2008 4:44:54 PM|Cosmology@Home|Temporarily failed upload of wu_070808_110655_0_0_2: transient upload error
7/11/2008 4:44:54 PM|Cosmology@Home|Backing off 36 min 58 sec on upload of wu_070808_110655_0_0_2
7/11/2008 4:46:16 PM|Cosmology@Home|Started upload of wu_070808_110655_0_0_1
7/11/2008 4:46:43 PM|Cosmology@Home|[error] Error on file upload: can't open log file
7/11/2008 4:46:43 PM|Cosmology@Home|Temporarily failed upload of wu_070808_110655_0_0_1: transient upload error
7/11/2008 4:46:43 PM|Cosmology@Home|Backing off 7 min 9 sec on upload of wu_070808_110655_0_0_1
7/11/2008 4:51:08 PM|Cosmology@Home|Started upload of wu_070808_100118_1_0_3
7/11/2008 4:51:30 PM||Project communication failed: attempting access to reference site
7/11/2008 4:51:30 PM|Cosmology@Home|Temporarily failed upload of wu_070808_100118_1_0_3: system connect
7/11/2008 4:51:30 PM|Cosmology@Home|Backing off 14 min 16 sec on upload of wu_070808_100118_1_0_3
7/11/2008 4:51:31 PM||Access to reference site succeeded - project servers may be temporarily down.
7/11/2008 4:53:53 PM|Cosmology@Home|Started upload of wu_070808_110655_0_0_1
7/11/2008 4:54:13 PM|Cosmology@Home|[error] Error on file upload: can't open log file
7/11/2008 4:54:13 PM|Cosmology@Home|Temporarily failed upload of wu_070808_110655_0_0_1: transient upload error
7/11/2008 4:54:13 PM|Cosmology@Home|Backing off 8 min 3 sec on upload of wu_070808_110655_0_0_1
7/11/2008 4:54:30 PM||Project communication failed: attempting access to reference site

I can't upload.
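
For anyone who wants to summarize a log like the one above rather than read it line by line, here is a small Python sketch that tallies the reasons given on "Temporarily failed upload" lines. It assumes the pipe-delimited `timestamp|project|message` layout shown in this post; `tally_upload_failures` is a made-up helper name, not part of BOINC.

```python
from collections import Counter


def tally_upload_failures(log_text):
    """Count failure reasons on 'Temporarily failed upload' lines of a client log."""
    prefix = "Temporarily failed upload of "
    reasons = Counter()
    for line in log_text.splitlines():
        # Expect three fields: timestamp, project name (may be empty), message.
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue
        message = parts[2]
        if message.startswith(prefix):
            # The reason is whatever follows the last ": " after the file name.
            _, _, reason = message[len(prefix):].rpartition(": ")
            reasons[reason.strip()] += 1
    return reasons
```

Run over the log in this post, it should count five "transient upload error" failures and three "system connect" failures, which suggests two separate symptoms: the server sometimes refuses connections, and when it does accept them the upload handler cannot open its log file.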
ID: 6479
Copycat-Digital for WCG*
Joined: 25 Sep 07
Posts: 17
Credit: 1,471,530
RAC: 0
Message 6480 - Posted: 11 Jul 2008, 21:37:51 UTC

Server Status @ 11 Jul 2008 16:31:12 UTC = 118378 Workunits waiting for validation

Server Status @ 11 Jul 2008 20:57:18 UTC = 114267 Workunits waiting for validation

Going down @ +- 1k / hour
114k remain = 114 hours = 4 days +

???
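
The back-of-the-envelope estimate above can be checked with a few lines of Python. This sketch uses only the two Server Status samples quoted in this post and assumes the backlog keeps draining at a constant rate with no new input; `estimate_drain` is a hypothetical helper name.

```python
from datetime import datetime


def estimate_drain(first_time, first_count, second_time, second_count):
    """Return (workunits validated per hour, hours until the backlog is empty)."""
    elapsed_hours = (second_time - first_time).total_seconds() / 3600
    rate = (first_count - second_count) / elapsed_hours
    if rate <= 0:
        raise ValueError("backlog is not shrinking between the two samples")
    # Linear extrapolation: assumes the rate stays constant.
    hours_left = second_count / rate
    return rate, hours_left


fmt = "%d %b %Y %H:%M:%S"
t1 = datetime.strptime("11 Jul 2008 16:31:12", fmt)
t2 = datetime.strptime("11 Jul 2008 20:57:18", fmt)
rate, hours_left = estimate_drain(t1, 118378, t2, 114267)
print(f"{rate:.0f} workunits/hour, about {hours_left / 24:.1f} days to clear")
```

With these two samples the rate comes out a little under 1k per hour, so the time to clear is closer to five days than four.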
ID: 6480
Andres Melo
Joined: 27 Jun 08
Posts: 1
Credit: 20,130
RAC: 0
Message 6484 - Posted: 12 Jul 2008, 3:43:51 UTC - in response to Message 6479.  

12/07/2008 12:23:23 a.m.|Cosmology@Home|Fetching scheduler list
12/07/2008 12:23:43 a.m.|Cosmology@Home|Master file download succeeded
12/07/2008 12:23:48 a.m.|Cosmology@Home|Sending scheduler request: Requested by user. Requesting 0 seconds of work, reporting 13 completed tasks
12/07/2008 12:23:53 a.m.|Cosmology@Home|Scheduler request succeeded: got 0 new tasks
12/07/2008 12:23:53 a.m.|Cosmology@Home|Message from server: Server can't open log file (../log_darkmatter/scheduler.log)
12/07/2008 12:24:44 a.m.||Project communication failed: attempting access to reference site
12/07/2008 12:24:44 a.m.|Cosmology@Home|Temporarily failed upload of wu_070708_185703_1_1_1: http error
12/07/2008 12:24:44 a.m.|Cosmology@Home|Backing off 1 hr 41 min 41 sec on upload of wu_070708_185703_1_1_1
12/07/2008 12:24:46 a.m.||Access to reference site succeeded - project servers may be temporarily down.

:(.
ID: 6484
Seejay
Joined: 22 Dec 07
Posts: 13
Credit: 115,740
RAC: 0
Message 6492 - Posted: 13 Jul 2008, 0:58:39 UTC

From information gathered from here & there, please correct me if I'm wrong, the current situation is this:

a. The scheduler has been turned off to allow the validator to reduce backlog.
b. It IS reducing backlog, albeit pretty slowly (seeing that there is no new input).
c. Stats are being produced and exported to the various stats sites.
d. When the scheduler is turned back on, the server will be overwhelmed with finished work units that are, as I write this, in "Uploading" status on most crunchers' BOINC Managers.
e. IMO, this has to be a problem with the number of SQL connections that Scott's using. Far too few.
f. Therefore, with limited connections and floodgates that will burst when Scott turns the scheduler on again, we're probably looking at another major server crash in the near future.

If I've missed anything, or made any errors, please don't hesitate to tell me (if the web site is up)

Cheers,
Chris
ID: 6492
Honza
Volunteer tester
Joined: 21 May 07
Posts: 26
Credit: 5,222,146
RAC: 0
Message 6494 - Posted: 13 Jul 2008, 7:28:17 UTC

Well, there are not only 250k results in progress but also 180k results ready to send. Once the scheduler is turned on, not only will uploads and reports come in, but new results will also be downloaded to clients.

I would let the validator catch up, then slowly let the results now waiting to be uploaded and reported flow into the validator, and only then send new work. But I may be wrong, or it may be too difficult to implement in a short time...
BOINC Project specifications and hardware requirements
ID: 6494
rroonnaalldd
Joined: 10 Apr 08
Posts: 18
Credit: 147,580
RAC: 0
Message 6496 - Posted: 13 Jul 2008, 11:01:42 UTC

Cosmology@Home 13.07.2008 12:32:15 [file_xfer] Temporarily failed upload of wu_071008_025543_1_1_0: transient upload error
Cosmology@Home 13.07.2008 12:32:15 [error] Error on file upload: can't open log file
Cosmology@Home 13.07.2008 12:31:21 [file_xfer] Started upload of file wu_071008_025543_1_1_0
Cosmology@Home 13.07.2008 12:27:51 Access to reference site succeeded - project servers may be temporarily down.
Cosmology@Home 13.07.2008 12:27:50 Backing off 2 hr 31 min 3 sec on upload of file wu_070908_072824_3_0_0
Cosmology@Home 13.07.2008 12:27:50 [file_xfer] Temporarily failed upload of wu_070908_072824_3_0_0: connect() failed
Cosmology@Home 13.07.2008 12:27:50 Project communication failed: attempting access to reference site
Cosmology@Home 13.07.2008 12:27:29 [file_xfer] Started upload of file wu_070908_072824_3_0_0
Cosmology@Home 13.07.2008 12:27:02 Backing off 22 min 19 sec on upload of file wu_070908_001542_0_0_2
Cosmology@Home 13.07.2008 12:27:02 [file_xfer] Temporarily failed upload of wu_070908_001542_0_0_2: transient upload error
Cosmology@Home 13.07.2008 12:27:02 [error] Error on file upload: can't open log file
Cosmology@Home 13.07.2008 12:26:46 [file_xfer] Started upload of file wu_070908_001542_0_0_2

ID: 6496
WHRoeder
Joined: 4 Nov 07
Posts: 6
Credit: 56,200
RAC: 0
Message 6498 - Posted: 13 Jul 2008, 19:29:00 UTC - in response to Message 6496.  

Cosmology@Home 13.07.2008 12:32:15 [file_xfer] Temporarily failed upload of wu_071008_025543_1_1_0: transient upload error
Cosmology@Home 13.07.2008 12:32:15 [error] Error on file upload: can't open log file
Cosmology@Home 13.07.2008 12:31:21 [file_xfer] Started upload of file wu_071008_025543_1_1_0

Me too. Can't open log file.
ID: 6498
Stefan
Joined: 1 Nov 07
Posts: 4
Credit: 373,000
RAC: 0
Message 6499 - Posted: 13 Jul 2008, 20:20:14 UTC - in response to Message 6498.  

Cosmology@Home 13.07.2008 12:32:15 [file_xfer] Temporarily failed upload of wu_071008_025543_1_1_0: transient upload error
Cosmology@Home 13.07.2008 12:32:15 [error] Error on file upload: can't open log file
Cosmology@Home 13.07.2008 12:31:21 [file_xfer] Started upload of file wu_071008_025543_1_1_0

Me too. Can't open log file.


Same here :/
ID: 6499