Forums : News : Recent outage explanation

Marius
Project administrator
Project developer
Project scientist
Joined: 29 Jun 15
Posts: 470
Credit: 4,276
RAC: 0
Message 21924 - Posted: 11 Sep 2018, 13:40:41 UTC

Hi all,

Over the last week we suffered a database corruption due to some disk errors. I've spent the last several days recovering the database from backups and from the corrupted files. Unfortunately, records of workunits from the last several weeks were lost, which means you will not receive credit for any of these jobs. I sincerely apologize for this, and we've taken steps to make sure this doesn't happen again. The good news is that this was the only thing that could not be recovered; everything else is fine.

We're continuing to monitor things as the server comes back online. Please report any problems you find here.

Marius
ID: 21924
Marius
Project administrator
Project developer
Project scientist
Joined: 29 Jun 15
Posts: 470
Credit: 4,276
RAC: 0
Message 21925 - Posted: 11 Sep 2018, 14:43:40 UTC - in response to Message 21924.  
Last modified: 11 Sep 2018, 17:14:39 UTC

Note that work generation will likely struggle to keep up with the demand over the next several hours as everyone's computers are requesting work. This may cause you to receive a message that C@H has no available workunits, which should be temporary.
ID: 21925
Tim Kunz
Joined: 20 Dec 07
Posts: 13
Credit: 8,689,743
RAC: 6,217
Message 21927 - Posted: 13 Sep 2018, 21:51:49 UTC - in response to Message 21924.  

Really? I lost thousands in credit...wasted CPU and power. We should have been notified so we could have switched to other projects in the interim.
ID: 21927
Jonathan
Joined: 27 Sep 17
Posts: 108
Credit: 5,209,864
RAC: 3,040
Message 21938 - Posted: 21 Sep 2018, 1:34:24 UTC - in response to Message 21925.  

Marius, how is everything running? The server status page doesn't show many work units waiting to be sent, and the transitioner backlog seems to be growing to almost ten hours now.
ID: 21938
Marius
Project administrator
Project developer
Project scientist
Joined: 29 Jun 15
Posts: 470
Credit: 4,276
RAC: 0
Message 21940 - Posted: 24 Sep 2018, 12:11:42 UTC - in response to Message 21927.  

Tim Kunz wrote:
Really? I lost thousands in credit...wasted CPU and power. We should have been notified so we could have switched to other projects in the interim.

I really do apologize for this. While working on this, I was hopeful we were not going to lose anything, although in the end we obviously did. In the future, I'll make sure to alert users earlier if there's any worry of losing work, so they can pause workunits (although hopefully this will never be the case, after the changes we've put in place).

Jonathan wrote:
Marius, how is everything running? The server status page doesn't show many work units in waiting to be sent and the transitioner backlog seems to be growing to almost ten hours now.

The workload is definitely reduced; otherwise the server gets overloaded. One of the changes we made was switching to a more reliable but slightly slower database driver, which is one of the things you're seeing. We're actually in the process of upgrading hardware, so things should improve in the next several weeks.
ID: 21940
xii5ku
Joined: 1 May 17
Posts: 34
Credit: 13,568,154
RAC: 0
Message 21942 - Posted: 1 Oct 2018, 10:37:37 UTC

Something is amiss again:
- No tasks available.
- free-dc and boincstats pulled negative stats updates yesterday.
ID: 21942
xii5ku
Joined: 1 May 17
Posts: 34
Credit: 13,568,154
RAC: 0
Message 21943 - Posted: 1 Oct 2018, 23:19:35 UTC - in response to Message 21942.  

xii5ku wrote:
- No tasks available.

OK, almost no tasks available, and this was explained by the 2nd part of message 21940.
ID: 21943
Variable
Joined: 21 Oct 17
Posts: 12
Credit: 14,784,177
RAC: 0
Message 21952 - Posted: 10 Oct 2018, 12:39:28 UTC

Still having issues I take it? There is a noticeable shortage of work units...
ID: 21952
xii5ku
Joined: 1 May 17
Posts: 34
Credit: 13,568,154
RAC: 0
Message 21953 - Posted: 10 Oct 2018, 17:57:57 UTC
Last modified: 10 Oct 2018, 18:29:32 UTC

While I had difficulty receiving tasks during the last few days, I nevertheless got quite a few, as my RAC indicates.

But I have not received a single task since 10:30:24 UTC today. (Talking about camb_boinc2docker only; I have camb_legacy checked off.)

I looked at the top 100 hosts. None of them has received tasks after 10:30:24 UTC either; neither camb_boinc2docker nor camb_legacy. With the following exceptions: Hosts 356704, 330383, 349863, 348148, 338259 received 7 tasks together, all of which are resends of previously failed tasks, not newly generated WUs.

server_status.php claims that camb_boinc2docker_work_generator and camb_legacy_work_generator are both running (status from 10 Oct 2018, 17:49:32 UTC), but there are clearly no new WUs.
ID: 21953
Greg_BE
Joined: 16 Jun 13
Posts: 23
Credit: 371,772
RAC: 0
Message 21957 - Posted: 10 Oct 2018, 23:08:19 UTC - in response to Message 21953.  

My credits are 0 now.
It's been a month since your outage, what's going on?
I guess I should just abandon your project and find another one to take its place. I'll do that this weekend I guess.
ID: 21957
Trotador
Joined: 9 May 17
Posts: 10
Credit: 10,170,097
RAC: 0
Message 21960 - Posted: 11 Oct 2018, 12:40:29 UTC

It seems that the new server is on duty. Great difference! Lightning fast!

Congratulations!
ID: 21960
xii5ku
Joined: 1 May 17
Posts: 34
Credit: 13,568,154
RAC: 0
Message 21962 - Posted: 11 Oct 2018, 21:06:19 UTC - in response to Message 21960.  

@Trotador, do you think so? Maybe the only difference is that the server was finally done validating its backlog of almost 80,000 camb_legacy WUs, and now both the processor load from the validator and the database size should have gone down.

If there was a server upgrade, it was performed without server downtime. Hence I doubt it.

Besides, looking up computer results lists, let alone user results lists, at the Cosmo web server is still an extremely slow undertaking.
ID: 21962
Trotador
Joined: 9 May 17
Posts: 10
Credit: 10,170,097
RAC: 0
Message 21963 - Posted: 12 Oct 2018, 5:48:06 UTC - in response to Message 21962.  

Yeah, it was really fast for a couple of hours, then it worsened again :) but it's still faster than it was these last few days.

Now the validator is up and down intermittently and the number of PVs is skyrocketing.
ID: 21963
Greg_BE
Joined: 16 Jun 13
Posts: 23
Credit: 371,772
RAC: 0
Message 21964 - Posted: 12 Oct 2018, 10:50:25 UTC

I'm out of here. If you can't keep your technology working and create work, then why should I stay on this project?
ID: 21964
lilyjame
Joined: 1 Aug 19
Posts: 1
Credit: 0
RAC: 0
Message 22197 - Posted: 1 Aug 2019, 4:47:19 UTC

Introduce yourself LA

Hello, I’m Lan Anh.
I am from Thanh Hoa. I am 25 years old, and I'm studying at ABC University. My favorite hobby is listening to romantic music in my free time.
Nice to meet you
http:///bcn.com
ID: 22197
Quan Nguyen
Joined: 11 Aug 19
Posts: 1
Credit: 0
RAC: 0
Message 22202 - Posted: 11 Aug 2019, 14:56:04 UTC

Thank you
ID: 22202
FurryGuy
Joined: 22 Jul 11
Posts: 4
Credit: 1,137,971
RAC: 870
Message 22265 - Posted: 27 Sep 2019, 3:04:49 UTC
Last modified: 27 Sep 2019, 3:05:06 UTC

This "recent outage" explanation is a year old now, yet there have been several extended outages without any notification.

Doing "behind the scenes" maintenance is one thing, but when the entire Cosmology@home website and the BOINC servers are unfindable by internet domain name servers, the outages are a major problem, both from a science standpoint and as a public relations issue.

A couple of hours totally dark on the internet is understandable. Frustrating, but understandable.

But days without any follow-up explanation is really unprofessional.

I freely volunteer my computers to perform what I see as a worthy scientific endeavor. Informing users when there are problems with the project is just simple courtesy.

I am seriously debating deleting Cosmology@home as an active project if the unexplained extended outages keep happening.
ID: 22265
.clair.
Joined: 4 Nov 07
Posts: 604
Credit: 8,955,219
RAC: 4,081
Message 22266 - Posted: 27 Sep 2019, 12:21:38 UTC - in response to Message 22265.  

What is going on with this project?
Outages with no explanation.
Just a quick note of what happened will do.
I know it takes time to fix servers that have gone into meltdown,
but it only needs five minutes to put a note in the news section.
Total silence from the admin; this is not good.
ID: 22266
Bill F
Joined: 21 Dec 07
Posts: 12
Credit: 213,646
RAC: 100
Message 22267 - Posted: 29 Sep 2019, 2:28:55 UTC - in response to Message 22266.  

Communication from above on this project has always been, and probably always will be, like this: when and if the Admin wants to say something he will, and if he does not want to he doesn't.

I suspect that he figures that it is an all-volunteer operation with users who will come and go, and he doesn't cater to them and probably can't be shamed into it.

I have been with the Project a while, and when other Projects need extra clock cycles I take them from here and other good projects where the Admins treat Users as commodities.

Bill F
ID: 22267