Recent outage explanation
Author | Message |
---|---|
Project administrator Project developer Project scientist Joined: 29 Jun 15 Posts: 470 Credit: 4,276 RAC: 0 |
Hi all, Over the last week we suffered database corruption caused by disk errors. I've spent the last several days recovering the database from backups and from the corrupted files. Unfortunately, records of workunits from the last several weeks were lost, which means you will not receive credit for any of these jobs. I sincerely apologize for this, and we've taken steps to make sure it doesn't happen again. The good news is that this was the only thing that could not be recovered; everything else is fine. We're continuing to monitor things as the server comes back online. Please report any problems you find here. Marius |
Project administrator Project developer Project scientist Joined: 29 Jun 15 Posts: 470 Credit: 4,276 RAC: 0 |
Note that work generation will likely struggle to keep up with the demand over the next several hours as everyone's computers are requesting work. This may cause you to receive a message that C@H has no available workunits, which should be temporary. |
Tim Kunz Joined: 20 Dec 07 Posts: 19 Credit: 15,083,087 RAC: 10,198 |
Really? I lost thousands in credit...wasted CPU and power. We should have been notified so we could have switched to other projects in the interim. |
Jonathan Joined: 27 Sep 17 Posts: 185 Credit: 8,306,392 RAC: 5,376 |
Marius, how is everything running? The server status page doesn't show many work units waiting to be sent, and the transitioner backlog seems to be growing to almost ten hours now. |
Project administrator Project developer Project scientist Joined: 29 Jun 15 Posts: 470 Credit: 4,276 RAC: 0 |
Tim Kunz wrote: "Really? I lost thousands in credit...wasted CPU and power. We should have been notified so we could have switched to other projects in the interim." I really do apologize for this. While working on this I was hopeful that we would not lose anything, although in the end we obviously did. In the future, I'll make sure to alert users earlier if there's any risk of losing work, so they can pause workunits (although hopefully this will never be necessary after the changes we've put in place). Jonathan wrote: "Marius, how is everything running? The server status page doesn't show many work units waiting to be sent, and the transitioner backlog seems to be growing to almost ten hours now." The workload is definitely reduced; otherwise the server gets overloaded. One of the changes we made was switching to a more reliable but slightly slower database driver, which is one of the things you're seeing. We're in the process of upgrading hardware, so things should improve over the next several weeks. |
xii5ku Joined: 1 May 17 Posts: 36 Credit: 48,351,964 RAC: 0 |
Something is amiss again: - No tasks available. - free-dc and boincstats pulled negative stats updates yesterday. |
xii5ku Joined: 1 May 17 Posts: 36 Credit: 48,351,964 RAC: 0 |
xii5ku wrote: - No tasks available. OK, almost no tasks available, and this was explained by the 2nd part of message 21940. |
Variable Joined: 21 Oct 17 Posts: 12 Credit: 14,784,230 RAC: 0 |
Still having issues I take it? There is a noticeable shortage of work units... |
xii5ku Joined: 1 May 17 Posts: 36 Credit: 48,351,964 RAC: 0 |
While I had difficulty receiving tasks during the last few days, I nevertheless used to get quite a few, as my RAC indicates. But I have not received a single task since 10:30:24 UTC today. (Talking about camb_boinc2docker only; I have camb_legacy checked off.) I looked at the top 100 hosts. None of them has received tasks after 10:30:24 UTC either, neither camb_boinc2docker nor camb_legacy, with the following exceptions: hosts 356704, 330383, 349863, 348148, and 338259 received 7 tasks between them, all of which are resends of previously failed tasks, not newly generated WUs. server_status.php claims that camb_boinc2docker_work_generator and camb_legacy_work_generator are both running (status from 10 Oct 2018, 17:49:32 UTC), but there are clearly no new WUs. |
Greg_BE Joined: 16 Jun 13 Posts: 23 Credit: 371,772 RAC: 0 |
My credits are 0 now. It's been a month since your outage. What's going on? I guess I should just abandon your project and find another one to take its place. I'll do that this weekend, I guess. |
Trotador Joined: 9 May 17 Posts: 10 Credit: 11,131,899 RAC: 0 |
It seems that the new server is on duty. Great difference! Lightning fast! Congratulations! |
xii5ku Joined: 1 May 17 Posts: 36 Credit: 48,351,964 RAC: 0 |
@Trotador, do you think so? Maybe the only difference is that the server was finally done validating its backlog of almost 80,000 camb_legacy WUs, and now both the processor load from the validator and the database size should have gone down. If there was a server upgrade, it was performed without server downtime. Hence I doubt it. Besides, looking up computer results lists, let alone user results lists, at the Cosmo web server is still an extremely slow undertaking. |
Trotador Joined: 9 May 17 Posts: 10 Credit: 11,131,899 RAC: 0 |
Yeah, it was really fast for a couple of hours, then it worsened again :) but it is still faster than in the last few days. Now the validator is going up and down intermittently and the number of PVs is skyrocketing. |
Greg_BE Joined: 16 Jun 13 Posts: 23 Credit: 371,772 RAC: 0 |
I'm out of here. If you can't keep your technology working and create work, then why should I stay on this project? |
Joined: 1 Aug 19 Posts: 1 Credit: 0 RAC: 0 |
Introduce yourself LA Hello, I'm Lan Anh. I am from Thanh Hoa. I am 25 years old, and I'm studying at ABC University. My favorite hobby is listening to romantic music in my free time. Nice to meet you http:///bcn.com |
Joined: 11 Aug 19 Posts: 1 Credit: 0 RAC: 0 |
Thank you |
FurryGuy Joined: 22 Jul 11 Posts: 5 Credit: 1,294,252 RAC: 0 |
This "recent outage" explanation is a year old now, yet there have been several extended outages without any notification. Doing "behind the scenes" maintenance is one thing, but when the entire Cosmology@home website and the BOINC servers cannot be found by internet domain name servers, the outages are a major problem, from a science standpoint as well as a public relations one. A couple of hours totally dark on the internet is understandable. Frustrating, but understandable. But days without any follow-up explanation is really unprofessional. I freely volunteer my computers to perform what I see as a worthy scientific endeavor. Informing users when there are problems with the project is just simple courtesy. I am seriously considering deleting Cosmology@home as an active project if the unexplained extended outages keep happening. |
.clair. Joined: 4 Nov 07 Posts: 626 Credit: 12,068,402 RAC: 0 |
What is going on with this project? Outages with no explanation. Just a quick note of what happened will do. I know it takes time to fix servers that have gone into meltdown, but it only needs five minutes to put a note in the news section. Total silence from the admin; this is not good. |
Joined: 21 Dec 07 Posts: 19 Credit: 240,704 RAC: 0 |
Communication from above on this project has always been, and probably always will be, like this: when and if the Admin wants to say something he will, and if he does not want to, he doesn't. I suspect he figures that it is an all-volunteer operation with users who will come and go, so he doesn't cater to them and probably can't be shamed into it. I have been with the Project a while, and when other Projects need extra clock cycles I take them from here and other good projects where the Admins treat Users as commodities. Bill F |
UBT - Timbo Volunteer tester Joined: 8 Jun 07 Posts: 17 Credit: 1,212,717 RAC: 0 |
Hi all, Looks like another outage, which knocked the project off the internet. Glad to see that it's come back...but some questions must be asked about the issues that seem to be occurring all too often. If the project doesn't want to keep its volunteer crunchers "happy", then it will slowly fall into disuse as members grow unhappy with the way their "donations" of free electricity and free processing power are being abused. Perhaps the "admin" needs to be a bit more proactive to keep everyone "on board"? Regards, Tim |