Forums : News : Recent outage explanation
Marius
Project administrator
Project developer
Project scientist
Joined: 29 Jun 15
Posts: 467
Credit: 4,276
RAC: 0
Message 21924 - Posted: 11 Sep 2018, 13:40:41 UTC

Hi all,

Over the last week we suffered database corruption caused by disk errors. I've spent the last several days recovering the database from backups and from the corrupted files. Unfortunately, records of workunits from the last several weeks were lost, which means you will not receive credit for any of these jobs. I sincerely apologize for this, and we've taken steps to make sure it doesn't happen again. The good news is that this was the only thing that could not be recovered; everything else is fine.

We're continuing to monitor things as the server comes back online. Please report any problems you find here.

Marius
ID: 21924
Marius
Project administrator
Project developer
Project scientist
Joined: 29 Jun 15
Posts: 467
Credit: 4,276
RAC: 0
Message 21925 - Posted: 11 Sep 2018, 14:43:40 UTC - in response to Message 21924.  
Last modified: 11 Sep 2018, 17:14:39 UTC

Note that work generation will likely struggle to keep up with the demand over the next several hours as everyone's computers are requesting work. This may cause you to receive a message that C@H has no available workunits, which should be temporary.
ID: 21925
Tim Kunz

Joined: 20 Dec 07
Posts: 12
Credit: 6,640,611
RAC: 1,723
Message 21927 - Posted: 13 Sep 2018, 21:51:49 UTC - in response to Message 21924.  

Really? I lost thousands in credit...wasted CPU and power. We should have been notified so we could have switched to other projects in the interim.
ID: 21927
Jonathan

Joined: 27 Sep 17
Posts: 60
Credit: 3,911,243
RAC: 2,487
Message 21938 - Posted: 21 Sep 2018, 1:34:24 UTC - in response to Message 21925.  

Marius, how is everything running? The server status page doesn't show many work units waiting to be sent, and the transitioner backlog seems to have grown to almost ten hours now.
ID: 21938
Marius
Project administrator
Project developer
Project scientist
Joined: 29 Jun 15
Posts: 467
Credit: 4,276
RAC: 0
Message 21940 - Posted: 24 Sep 2018, 12:11:42 UTC - in response to Message 21927.  

Tim Kunz wrote:
Really? I lost thousands in credit...wasted CPU and power. We should have been notified so we could have switched to other projects in the interim.

I really do apologize for this. While working on this I was hopeful we were not going to lose anything, although in the end we obviously did. In the future, I'll make sure to alert users earlier if there's any worry of losing work, so they can pause workunits (although hopefully this will never be the case, after the changes we've put in place).

Jonathan wrote:
Marius, how is everything running? The server status page doesn't show many work units in waiting to be sent and the transitioner backlog seems to be growing to almost ten hours now.

The workload is definitely reduced; otherwise the server gets overloaded. One of the changes we made was switching to a more reliable but slightly slower database driver, which is one of the things you're seeing. We're actually in the process of upgrading hardware, so things should improve in the next several weeks.
ID: 21940
xii5ku

Joined: 1 May 17
Posts: 29
Credit: 8,255,516
RAC: 121,738
Message 21942 - Posted: 1 Oct 2018, 10:37:37 UTC

Something is amiss again:
- No tasks available.
- free-dc and boincstats pulled negative stats updates yesterday.
ID: 21942
xii5ku

Joined: 1 May 17
Posts: 29
Credit: 8,255,516
RAC: 121,738
Message 21943 - Posted: 1 Oct 2018, 23:19:35 UTC - in response to Message 21942.  

xii5ku wrote:
- No tasks available.

OK, almost no tasks available, and this was explained by the 2nd part of message 21940.
ID: 21943
Variable

Joined: 21 Oct 17
Posts: 12
Credit: 9,952,098
RAC: 16,551
Message 21952 - Posted: 10 Oct 2018, 12:39:28 UTC

Still having issues, I take it? There is a noticeable shortage of work units...
ID: 21952
xii5ku

Joined: 1 May 17
Posts: 29
Credit: 8,255,516
RAC: 121,738
Message 21953 - Posted: 10 Oct 2018, 17:57:57 UTC
Last modified: 10 Oct 2018, 18:29:32 UTC

While I had difficulty receiving tasks during the last few days, I nevertheless got quite a few, as my RAC indicates.

But I have not received a single task since today, 10:30:24 UTC. (Talking about camb_boinc2docker only; I have camb_legacy checked off.)

I looked at the top 100 hosts. None of them has received tasks after 10:30:24 UTC either, neither camb_boinc2docker nor camb_legacy, with the following exceptions: hosts 356704, 330383, 349863, 348148, and 338259 received 7 tasks between them, all of which are resends of previously failed tasks, not newly generated WUs.

server_status.php claims that camb_boinc2docker_work_generator and camb_legacy_work_generator are both running (status from 10 Oct 2018, 17:49:32 UTC), but there are clearly no new WUs.
ID: 21953
Greg_BE

Joined: 16 Jun 13
Posts: 23
Credit: 371,772
RAC: 131
Message 21957 - Posted: 10 Oct 2018, 23:08:19 UTC - in response to Message 21953.  

My credits are 0 now.
It's been a month since your outage, what's going on?
I guess I should just abandon your project and find another one to take its place. I'll do that this weekend I guess.
ID: 21957
Trotador

Joined: 9 May 17
Posts: 10
Credit: 10,014,199
RAC: 54,761
Message 21960 - Posted: 11 Oct 2018, 12:40:29 UTC

It seems that the new server is on duty. Great difference! Lightning fast!

Congratulations!
ID: 21960
xii5ku

Joined: 1 May 17
Posts: 29
Credit: 8,255,516
RAC: 121,738
Message 21962 - Posted: 11 Oct 2018, 21:06:19 UTC - in response to Message 21960.  

@Trotador, do you think so? Maybe the only difference is that the server was finally done validating its backlog of almost 80,000 camb_legacy WUs, and now both the processor load from the validator and the database size should have gone down.

If there was a server upgrade, it was performed without server downtime. Hence I doubt it.

Besides, looking up computer results lists, let alone user results lists, on the Cosmo web server is still an extremely slow undertaking.
ID: 21962
Trotador

Joined: 9 May 17
Posts: 10
Credit: 10,014,199
RAC: 54,761
Message 21963 - Posted: 12 Oct 2018, 5:48:06 UTC - in response to Message 21962.  

Yeah, it was really fast for a couple of hours, then it worsened again :) but it is still faster than in the last few days.

Now the validator is up and down intermittently, and the number of PVs is skyrocketing.
ID: 21963
Greg_BE

Joined: 16 Jun 13
Posts: 23
Credit: 371,772
RAC: 131
Message 21964 - Posted: 12 Oct 2018, 10:50:25 UTC

I'm out of here. If you can't keep your technology working and create work, then why should I stay on this project?
ID: 21964
