1) Forums : Technical Support : Long Running Planck Tasks (Message 20922)
Posted 7 Mar 2016 by Gary Charpentier
Post:
Ouch! http://www.cosmologyathome.org/results.php?hostid=245939
Found a job at 99.99% done and just about to pass its time deadline; however, the CPU was doing no work at all. For some reason the job would not finish. Then I looked at the log and saw many jobs that failed because they ran over the time limit, yet other jobs are finishing and validating. Is there some error in the data generation that is causing some jobs to not be able to finish?

I aborted a batch because I'm not getting credit, the jobs aren't doing science since they don't finish, and they are wasting electricity!

Hi Gary, long running jobs will happen sometimes, but if the CPU has stopped working and the job is still going that means something is actually wrong. Do you have any specific examples where that happened you might be able to point me to?

http://www.cosmologyathome.org/result.php?resultid=36896285
<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
exceeded elapsed time limit 11812.98 (897654.00G/75.99G)
</message>
<stderr_txt>
2016-02-22 20:33:55 (8428): vboxwrapper (7.7.26182): starting

</stderr_txt>
]]>

Or any others from this computer with 11K plus run times and zero CPU times that finished in error.
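
For anyone reading that message: a rough sketch of where the 11812.98 figure seems to come from. This assumes the two numbers in parentheses are the task's operation bound and the host's estimated speed, both scaled by 1e9; the variable names are mine, not BOINC's, and the log itself does not label the numbers.

```python
# Figures copied from the error message; what they mean is an assumption.
fpops_bound_g = 897654.00   # assumed: upper bound on the task's operations, in 1e9 ops
host_speed_g = 75.99        # assumed: estimated host speed, in 1e9 ops per second
logged_limit_s = 11812.98   # elapsed time limit reported in the message


def elapsed_limit_seconds(bound_g, speed_g):
    """Time limit implied by dividing the operation bound by the host speed."""
    return bound_g / speed_g


computed = elapsed_limit_seconds(fpops_bound_g, host_speed_g)
print(f"computed limit: {computed:.2f} s ({computed / 3600:.2f} h)")
print(f"logged limit:   {logged_limit_s:.2f} s")
```

The computed value lands within a second of the logged one, which is consistent with the logged figures being rounded to two decimals. If that reading is right, a task that sits idle is stopped after a little over three hours of wall-clock time no matter how much CPU time it has used.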
2) Forums : Technical Support : Long Running Planck Tasks (Message 20892)
Posted 26 Feb 2016 by Gary Charpentier
Post:
Ouch! http://www.cosmologyathome.org/results.php?hostid=245939
Found a job at 99.99% done and just about to pass its time deadline; however, the CPU was doing no work at all. For some reason the job would not finish. Then I looked at the log and saw many jobs that failed because they ran over the time limit, yet other jobs are finishing and validating. Is there some error in the data generation that is causing some jobs to not be able to finish?

I aborted a batch because I'm not getting credit, the jobs aren't doing science since they don't finish, and they are wasting electricity!
3) Forums : Technical Support : URGENT PROBLEMS THREAD (2009 and after) (Message 10015)
Posted 8 Feb 2012 by Gary Charpentier
Post:
When my computer attempts to download
2/7/2012 5:48:51 PM Cosmology@Home Started download of params_012312_091752_2.ini
this crashes the BOINC manager and all processing for all projects stops.
4) Forums : Technical Support : URGENT PROBLEMS THREAD (2009 and after) (Message 8184)
Posted 21 Apr 2009 by Gary Charpentier
Post:
When attempting to update community preferences:
http://www.cosmologyathome.org/edit_forum_preferences_action.php
Database Error

Warning: Cannot modify header information - headers already sent by (output started at /home/boincadm/projects/cosmohome/html/inc/db_conn.inc:44) in /home/boincadm/projects/cosmohome/html/user/edit_forum_preferences_action.php on line 168
5) Forums : Technical Support : URGENT Problems Thread (Message 8141)
Posted 19 Apr 2009 by Gary Charpentier
Post:
Are you saying that they already had a regular saved backup or do
you just assume that the 'second RAID drive' could be used as a backup?

Exactly. Most likely the 'second RAID disk' contains only parts or fragments
of the data; if they are lucky they may restore some of it, but it can hardly be
used as a backup.

We restored all our project files from the second RAID drive so obviously the data on that drive is also corrupt.

Drive, not disk. All files, not parts or fragments of files. As it was phrased by Admin it's logically a disk if they run a simple RAID 1 which most likely wouldn't be the case given other projects that are run by Universities. This would show a severe lack of foresight. I run a RAID 0 containing 2 disks but my computer sees it as a single drive. I doubt this website and forum, with all its threads/posts, and our acct info would be here exactly as it was if there wasn't a regular saved backup to recover from. And having a backup for the website but not the BOINC work wouldn't make any sense. I would say the assumption is yours and my mathematics are correct.

I was thinking more along the lines of RAID 1+1, but then I'm anal about my data.

In any case what happened is the backup got overwritten with corrupt data from the source. They need to go to an older backup.
6) Forums : Technical Support : URGENT Problems Thread (Message 8134)
Posted 18 Apr 2009 by Gary Charpentier
Post:
Update!
First off I am VERY sorry for leaving all of you hanging in the dust. Last week and this week have been midterm weeks for me and since Scott hasn't been replying to email or phone calls I take it he is probably as swamped with work as I am. The executable and a whole bunch of files are corrupt. We restored all our project files from the second RAID drive so obviously the data on that drive is also corrupt.
If anyone has a WORKING copy of camb_2.16(or 2.15)_windows_intelx86.exe please email it to me (akanaki2@illinois.edu). Given the variety of failed/passed WU results I have seen over the past couple of days, it seems like the Linux app is broken too, so if anyone has a WORKING copy of any of the following apps:

camb_2.16_windows_intelx86.exe
camb_2.15_i686-pc-linux-gnu
camb_2.16_x86_64-pc-linux-gnu
camb_2.15_windows_intelx86.ex
camb_2.15_x86_64-pc-linux-gnu
camb_2.16_i686-pc-linux-gnu

PLEASE send it to me.

We've got a little conundrum on our hands here because NONE of the apps you downloaded AFTER the project was restarted will work. We need apps from people who have not attached to cosmo@home after the server crash, but it's not likely any of those people monitor the forums regularly.

On the bright side, I received a lot of help from the BOINC admin mailing list so I am implementing those ideas as we speak to see if they fix some/all of our issues.
I need to recover the old code sign public key to get rid of that signature verification error - problem is we might have lost those due to the system failure too.
Another plan of action recommended to me was to create a whole new copy of the project and rename the app to see if this issue goes away (Thank you Paul!).

All-in-all, thank you everyone for waiting this long. The server crash was something we should have been prepared for but were not, lesson learned. You will see the project intermittently shut down and start up again over the next few days but that is because Scott and I will both be working on trying to fix it. I have 1 last midterm tomorrow and then pretty much nothing else so I can dedicate all my time to Cosmo@home. Thanks again. Stay tuned for updates.




You should consider this post very carefully. Your executables may be fine. It could be parts of the BOINC server code that are what is trashed.
7) Forums : Technical Support : URGENT Problems Thread (Message 8133)
Posted 18 Apr 2009 by Gary Charpentier
Post:

Minus this:
We restored all our project files from the second RAID drive so obviously the data on that drive is also corrupt.


Equals this.


Are you saying that they already had a regular saved backup or do
you just assume that the 'second RAID drive' could be used as a backup?

We don't know. RAID without a level is rather meaningless. In any case it is obvious that with the hardware failure of the controller all data connected to that controller got trashed, or must be assumed to be trashed until proven otherwise. All backups of data off the drives connected to that controller are also suspect until the time of failure can be pinpointed.

Lesson learned (again): allow no single point of failure.

8) Forums : Technical Support : URGENT Problems Thread (Message 8132)
Posted 18 Apr 2009 by Gary Charpentier
Post:
The problem is that if they build a new executable they need a new signing key if I understand the security system of BOINC correctly. And there is the nub. If the binary got corrupted and does not match the keys ... well ... whole new ball game...

The problem is that it is difficult to get from here to there and if you have not lived through one of these messes as a systems guy it is easy to assume how easy it is to get everything squared away.

As far as back-ups go... I made a religion out of making back ups in the old days and every time I tried to use one of the back-ups that had been so carefully made (with validation turned on), I could not recover much of anything at all off the tapes.

But you obviously skipped one important step: before putting the system online, make a backup, then purposefully format the drives to prove that the backup/restore facility is in working order and the procedures are correct. It almost reminds me of a person I knew who had a good backup but kept the only copy of the restore program on the backup itself!
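
A minimal sketch of that kind of restore drill, assuming the live data and the restored copy are both plain directory trees. The function names and the tiny demo files are made up for illustration; a real drill would restore from the actual backup media instead of copying a folder.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def restore_problems(original, restored):
    """Return relative paths that are missing from, or differ in, the restored tree."""
    want, got = tree_digests(original), tree_digests(restored)
    return sorted(path for path in want if got.get(path) != want[path])


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        live, restored = Path(tmp, "live"), Path(tmp, "restored")
        live.mkdir()
        (live / "app.bin").write_bytes(b"executable bytes")
        (live / "keys.txt").write_bytes(b"signing key")
        shutil.copytree(live, restored)                    # stand-in for backup + restore
        (restored / "app.bin").write_bytes(b"corrupted")   # simulate a bad restore
        print(restore_problems(live, restored))
```

An empty list means every original file came back byte for byte. In the demo only app.bin is reported, because that is the one file deliberately damaged after the copy.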
9) Forums : Technical Support : URGENT Problems Thread (Message 8119)
Posted 17 Apr 2009 by Gary Charpentier
Post:
Is the project open source?

10) Forums : Technical Support : Problem getting work (failed downloads) (Message 8107)
Posted 16 Apr 2009 by Gary Charpentier
Post:
There needs to be a message on the front page of the project! Unless they can't even get to the front page, but then I'd think they should just take the whole thing down rather than not even bother to post something in the message boards.

As far as I understand, the news on the front page isn't easily edited. That's a whole hassle, which needs to be done on the machine itself. So someone needs to physically be behind the keyboard of that server to type in that bit of news, while they can easily do everything else from home, on the road, or wherever they are.

If their machine is set up so they can't remote log in to edit the news then I would suspect it is also set up so they can't remote login to replace the bad MD5 checksum file for the executable which means they can't work on a fix at least to my understanding of the BOINC server software.
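
For illustration only, here is roughly what checking a downloaded file against an expected MD5 digest looks like. This is a generic sketch, not BOINC's actual code; the function names are hypothetical and nothing here reflects how the project server stores its checksums.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to keep memory use flat."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_md5):
    """True if the file's MD5 equals the expected digest (case and whitespace ignored)."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

If the recorded digest is stale or corrupt, a check like this rejects a perfectly good executable, which would look the same to a volunteer as a failed download.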

And yes I could see a setup where they need to at least be on the campus LAN to remote login, but that should come up often enough campus wide that IT would have VPN or some such set up to tunnel through the firewall.

And I could see where they are on vacation and at conferences away from campus for an extended period. But silence isn't a good strategy to communicate information to volunteers.

The blasts here in the boards are here because the admins have forgotten that there are human beings out here.

That is your assumption which isn't true. I would appreciate it if you stopped saying that here on these boards.

Then they should stop acting that way.

And they will continue as long as the admins continue to forget that.

On the one hand, that sounds like a threat.

Not a threat, just an observation of human nature.

On the other hand, that may be where I come in to earn my pay, then. :-)

There is an adage: The sooner and in more detail you announce the bad news the better. Forget it at your own peril.
11) Forums : Technical Support : Problem getting work (failed downloads) (Message 8105)
Posted 15 Apr 2009 by Gary Charpentier
Post:
Anshul is busy with fixing the problems. Those of you who are subscribed to the BOINC Projects email list know about this. Please keep it civil when posting your frustration, I am sure the admins here are as frustrated about the whole problem.

They also have their own lives away from the project, plus "away time" due to school holidays.

But it's being worked on. Just a little more patience, please.

Jord,
Is the BOINC Projects e-mail list one of the ones that says:
Note: These email lists do not provide tech support for SETI@home or other BOINC projects.
which raises the question: why would anyone who is a regular cruncher subscribe?

There needs to be a message on the front page of the project! Unless they can't even get to the front page, but then I'd think they should just take the whole thing down rather than not even bother to post something in the message boards.

The blasts here in the boards are here because the admins have forgotten that there are human beings out here. And they will continue as long as the admins continue to forget that. How hard is it to write a news item that says: "We have a problem downloading the app. We are working on a fix but we are on spring break. We don't know how long it will take to fix. We will post here when it is fixed; check back once a day." If they can't get that on the front page, then they must enjoy the abuse.

And finally Jord, you shouldn't be the one telling the world about this, although I do appreciate it.

12) Forums : Technical Support : URGENT Problems Thread (Message 8057)
Posted 10 Apr 2009 by Gary Charpentier
Post:
Updating community preferences:
Database Error

Warning: Cannot modify header information - headers already sent by (output started at /home/boincadm/projects/cosmohome/html/inc/db_conn.inc:44) in /home/boincadm/projects/cosmohome/html/user/edit_forum_preferences_action.php on line 168