Forums : Technical Support : Some information on cosmology@home

Frank Lassowski

Joined: 18 Feb 08
Posts: 5
Credit: 89,420
RAC: 0
Message 7027 - Posted: 15 Aug 2008, 10:27:10 UTC - in response to Message 7018.  

Scott actually tells me that C@H is functional now.

Ben


No, Ben, it isn't. Not for me, running 32-bit Debian etch Linux and CAMB 2.15 on BOINC 5.10.21. I still get a SIGSEGV in all WUs; the last one is this one.

What is happening there?

greets
Frank
Jord
Volunteer moderator
Volunteer tester

Joined: 15 Jun 07
Posts: 345
Credit: 50,500
RAC: 0
Message 7028 - Posted: 15 Aug 2008, 12:04:53 UTC - in response to Message 7027.  

What is happening there?

Since you've been having these errors since the 23rd of July, have you ruled out your own computer? Have you checked your RAM with memtest86+ and tried moving your swap file to another drive or partition?

You could also try updating BOINC, at least to 5.10.45, or perhaps to one of the 6.2 versions.
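
As a rough illustration of the version check suggested above, here is a minimal Python sketch. The helper names `parse_version` and `needs_update` are made up for this example and are not part of BOINC. The point is only that version numbers should be compared as tuples of integers, because comparing them as plain strings would rank "5.10.21" below "5.9".

```python
def parse_version(text):
    """Turn a dotted version string such as '5.10.21' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def needs_update(installed, minimum="5.10.45"):
    """Return True if the installed client is older than the suggested minimum."""
    return parse_version(installed) < parse_version(minimum)


# Versions mentioned in this thread.
print(needs_update("5.10.21"))  # the client that shows the SIGSEGV errors
print(needs_update("5.10.45"))  # the suggested minimum
print(needs_update("6.2"))      # the 6.2 series
```

If the installed client still crashes after the update, the client version is probably not the cause and the RAM and swap checks become more interesting.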
Stephen Balch 2

Joined: 19 Jul 08
Posts: 9
Credit: 49,950
RAC: 0
Message 7031 - Posted: 15 Aug 2008, 18:42:30 UTC

Ben,

First, I did not "flock" to this project because you "would be very generous with handing out credit" either; I tend to work on physics/astronomy/cosmology-related projects. I also fall into the category of those who work on a project because of the science, not because of the credit given out. I understand the meaning of the term 'beta'.

Second, let me try to establish my credentials. I was professionally active in data processing/computers for over 35 years. I have worked on both mainframe and personal computers, both Mac and PC, starting with the pre-PC MITS Altair 8800b (an early S-100 bus system built from a kit). Previously, I was an electronics tech in the U.S. Air Force. I am now retired, but over those 35 years I held positions ranging from junior Programmer through senior Systems Programmer (the last 15 years), specializing in IBM mainframe data communications (SNA/SNI networks using VTAM and NCP) on MVS/ESA and OS/390 systems. I've always done technical support. Systems Programmers are generally considered to be in the top 10% of the profession (he says, trying not to break his arm patting himself on the back <GRIN>). I have recently returned to processing BOINC projects. I started with classic SETI@home and early BOINC, so I've been around for a while. You can find my previous work by removing the '2' from my current user-ID and looking.

I have to agree with Brian Silvers and others on the project management of this project.

A few things I've learned over the years:
1) communicate with your users and coworkers;
2) document changes well so if someone else needs to take over they can know what is going on;
3) test, test, test, then test again;
4) never make changes immediately before a long weekend, a vacation, or even if you are going to be "out of pocket" for the day;
5) if you are understaffed, don't release new applications/versions until you can properly support them unless it's an "emergency" fix, then stand by to support the change (and see 4 above);
6) don't annoy your users, they can make life miserable if you do;
7) the user wouldn't be complaining about the issue if they didn't perceive a problem exists (there are some exceptions to this on BOINC projects <GRIN>);
8) work with the user to resolve the issue since they may know more about the symptoms of the perceived problem than you do (and they are living with the results of your actions);
9) I don't know everything (!!!);
10) even a "mature" mainframe operating system, with all of IBM's resources behind it to find and fix problems, has something on the order of 10K significant bugs; that's what PTFs ("Program Temporary Fixes", sometimes referred to as "Permanent Temporary Fixes") are for.

I also question the functionality of the project if you need to "blanket cancel" "In progress" WUs to clear problems with the system. I don't believe your "extremely strict validation system" is responsible for all of the following, which I'm trying to understand. Were these due to the "blanket cancel"? Around August 17th 2008, I have 5 WUs that have an outcome of "Didn't need" (and a "minimum quorum 2"). The five WUs show the following:
1) WU 4862924 - after 2 "Success" outcomes and 2 "No reply" outcomes were received, the WU was sent out again after the 2nd "Success" was received (quorum received but WU sent out again);
2) WU 4940069 - 2 outcomes were marked "Didn't need" when 2 previous outcomes were "No reply" (no quorum received);
3) WU 4940068 - after 1 "Success" outcome and 2 "No reply" outcomes were received, the WU was marked "Didn't need" (no quorum received);
4) WU 4920607 - after 1 "No reply", 2 "Client error", and 2 "Didn't need" outcomes, the WU was sent out twice again and is currently showing "In progress" (no quorum received when the WUs were marked "Didn't need");
5) WU 4920725 - after 1 "Success" and 1 "Redundant result" (Canceled by server) outcomes, the WU was sent out again, then marked "Didn't need" (no quorum received when the WU was marked "Didn't need").
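
The check applied to the five WUs above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the outcome lists are transcribed from the list above, the names `quorum_reached` and `unexplained_didnt_need` are hypothetical, and counting "Success" outcomes against the minimum quorum is a simplification (the real validator also has to find the successful results in agreement, which this sketch ignores).

```python
from collections import Counter

MIN_QUORUM = 2  # "minimum quorum 2", as shown on the workunit pages

# Outcomes per workunit, transcribed from the list above.
workunits = {
    4862924: ["Success", "Success", "No reply", "No reply"],
    4940069: ["No reply", "No reply", "Didn't need", "Didn't need"],
    4940068: ["Success", "No reply", "No reply", "Didn't need"],
    4920607: ["No reply", "Client error", "Client error",
              "Didn't need", "Didn't need", "In progress", "In progress"],
    4920725: ["Success", "Redundant result", "Didn't need"],
}


def quorum_reached(outcomes, min_quorum=MIN_QUORUM):
    """Simplified check: enough "Success" outcomes to meet the minimum quorum."""
    return Counter(outcomes)["Success"] >= min_quorum


def unexplained_didnt_need(wus):
    """Workunits with a "Didn't need" outcome although the quorum was never met."""
    return [
        wu_id
        for wu_id, outcomes in wus.items()
        if "Didn't need" in outcomes and not quorum_reached(outcomes)
    ]


for wu_id, outcomes in workunits.items():
    status = "quorum met" if quorum_reached(outcomes) else "no quorum"
    print(wu_id, status)

print("Unexplained:", unexplained_didnt_need(workunits))
```

Under these assumptions only WU 4862924 meets the quorum, and the other four carry a "Didn't need" outcome that the quorum rule alone does not explain.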

According to the "Outcome" explanation: "Didn't need The task wasn't sent to a computer because enough other tasks were completed for this workunit." But the WU obviously was sent out again, or it wouldn't have been linked to my computer. Also, "Redundant result" is not defined in the "Outcome" explanations, and "Cancelled by server", while self-explanatory, is not defined in the "Client state" explanations. You may need to pass these on to BOINC development.

I don't know if I actually processed these WUs before they were marked "Didn't need" (since the server does not show a "Received" date or "application version", "CPU time" and "Claimed credit" are 0, and I don't watch every WU being processed) or if the server just didn't record the information because of the "Didn't need" outcome.

Cheers,

Stephen



Nothing But Idle Time

Joined: 27 Aug 07
Posts: 84
Credit: 148,380
RAC: 0
Message 7032 - Posted: 15 Aug 2008, 19:16:05 UTC

Rosetta@Home experienced increasingly severe operational difficulties from its inception around Oct 2005; by Christmas it was bad enough that they called in Rom Walton to help them solve their issues and get the project rolling. I'd say that Cosmo is in a similar pickle. Any chance of someone knowledgeable like Rom helping Cosmo get its HR and work fetching problems solved and back on track? (...just a thought)
Brian Silvers

Joined: 11 Dec 07
Posts: 420
Credit: 270,580
RAC: 0
Message 7033 - Posted: 15 Aug 2008, 21:52:43 UTC - in response to Message 7032.  

Rosetta@Home experienced increasingly severe operational difficulties from its inception around Oct 2005; by Christmas it was bad enough that they called in Rom Walton to help them solve their issues and get the project rolling. I'd say that Cosmo is in a similar pickle. Any chance of someone knowledgeable like Rom helping Cosmo get its HR and work fetching problems solved and back on track? (...just a thought)


I had previously suggested that they contact Bruce Allen. I think I remember that Scott went up to visit the University of Wisconsin campus sometime over the past year... Einstein doesn't use HR, but I'm sure Bruce could help or at least point him in the direction of someone who can (like Rom)...
Frank Lassowski

Joined: 18 Feb 08
Posts: 5
Credit: 89,420
RAC: 0
Message 7036 - Posted: 16 Aug 2008, 0:16:42 UTC - in response to Message 7028.  

What is happening there?

Since you've been having these errors since the 23rd of July, have you ruled out your own computer? Have you checked your RAM with memtest86+ and tried moving your swap file to another drive or partition?

You could also try updating BOINC, at least to 5.10.45, or perhaps to one of the 6.2 versions.


Same procedure as last time. I did an update to 5.10.45, swap is ok, RAM is ok, all other BOINC projects are running as expected without any errors. Tomorrow I will test my other two computers, both are running fresh installations of Debian etch.

Frank
Nothing But Idle Time

Joined: 27 Aug 07
Posts: 84
Credit: 148,380
RAC: 0
Message 7040 - Posted: 16 Aug 2008, 3:12:31 UTC - in response to Message 7033.  

Rosetta@Home experienced increasingly severe operational difficulties from its inception around Oct 2005; by Christmas it was bad enough that they called in Rom Walton to help them solve their issues and get the project rolling. I'd say that Cosmo is in a similar pickle. Any chance of someone knowledgeable like Rom helping Cosmo get its HR and work fetching problems solved and back on track? (...just a thought)


I had previously suggested that they contact Bruce Allen. I think I remember that Scott went up to visit the University of Wisconsin campus sometime over the past year... Einstein doesn't use HR, but I'm sure Bruce could help or at least point him in the direction of someone who can (like Rom)...

Errr forgive me... I thought Bruce moved to Germany to work at AEI, or was that temporary?
Brian Silvers

Joined: 11 Dec 07
Posts: 420
Credit: 270,580
RAC: 0
Message 7041 - Posted: 16 Aug 2008, 4:24:41 UTC - in response to Message 7040.  

Rosetta@Home experienced increasingly severe operational difficulties from its inception around Oct 2005; by Christmas it was bad enough that they called in Rom Walton to help them solve their issues and get the project rolling. I'd say that Cosmo is in a similar pickle. Any chance of someone knowledgeable like Rom helping Cosmo get its HR and work fetching problems solved and back on track? (...just a thought)


I had previously suggested that they contact Bruce Allen. I think I remember that Scott went up to visit the University of Wisconsin campus sometime over the past year... Einstein doesn't use HR, but I'm sure Bruce could help or at least point him in the direction of someone who can (like Rom)...

Errr forgive me... I thought Bruce moved to Germany to work at AEI, or was that temporary?


Yes, he is still over there, but you and I are talking over this "internet" thing... ;-)

The campus in Wisconsin still has a role in the project (there are still 2 job postings listed on the main Einstein web page, although the links are giving 404 errors, probably due to the server code upgrade) and Dr. Allen is still an adjunct faculty member...
Nothing But Idle Time

Joined: 27 Aug 07
Posts: 84
Credit: 148,380
RAC: 0
Message 7043 - Posted: 16 Aug 2008, 16:16:07 UTC - in response to Message 7041.  

Yes, he [Bruce] is still over there [Germany], but you and I are talking over this "internet" thing... ;-)
True, but it doesn't seem practical to solve Cosmo's problems via the internet. A more hands-on or on-location avenue would seem necessary, but then maybe I'm simply deficient in the use of these tools where others are not.