State-of-Health
This page describes how to use the State-of-Health (SoH) machines (tigsoh01, etc.). It includes descriptions of how to monitor the system, how to configure it, and how to respond to notification events, warnings, critical events, and super-critical events. The page is ordered from the most common and least critical operations to the least common and most critical.
This page is very much under development.
Routine Operations
Monitoring
Monitoring can be done by anyone with web access.
To view the state of health, open a browser to http://tigsoh01.triumf.ca:8081. Use the History button to view a history of state-of-health parameters.
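For a quick check that the monitoring page is answering before opening a browser, a minimal sketch follows. It assumes only the URL given above; the function name `soh_page_reachable`, the 5-second timeout, and the injectable `opener` argument are illustrative choices, not part of the SoH system.

```python
import urllib.error
import urllib.request

SOH_URL = "http://tigsoh01.triumf.ca:8081"  # monitoring URL given on this page


def soh_page_reachable(url=SOH_URL, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the page answers with HTTP 200 within `timeout` seconds.

    `opener` is a hypothetical hook so the check can be exercised without
    network access; by default it is the standard library's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, HTTP error status, etc.
        return False
```

Calling `soh_page_reachable()` from a machine with web access returns `True` when the page loads and `False` otherwise. A `False` result only says the web page could not be fetched; it says nothing about the state of the shack itself.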
Configuration
Configuration is to be done by TIGRESS local staff.
Notifications
Notifications require no immediate response. They may indicate an ongoing or growing problem that can be rectified on the time scale of a week.
Warnings
Warnings require immediate response. First response to warnings comes from TIGRESS local staff. During normal operating hours, all staff are expected to respond. Off-hours, the Cryogenics Officer is responsible for first response. Warning-level events are expected to be handled by TIGRESS staff before the situation becomes critical.
Critical Problems
Critical problems involve automatic responses. These normally occur when a situation exceeds a higher-level threshold than warnings. Fail-safe procedures will automatically engage to protect equipment. If these procedures fail, the situation becomes super-critical.
Super-Critical Problems
Super-critical problems are those that have escalated beyond the notification, warning, and critical (automatic) response levels. These errors are relayed to ISAC operations through the EPICS alarm handler and require ISAC operator response. ISAC operators are authorized to take appropriate actions to correct the issue.
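The four escalation levels described on this page can be summarized as a small lookup. The sketch below is illustrative only: the `Level` and `RESPONSE` names and the `classify` helper are hypothetical, and the threshold values in `EXAMPLE_THRESHOLDS_C` are made-up placeholders, NOT the thresholds configured on the SoH system.

```python
from enum import IntEnum


class Level(IntEnum):
    OK = 0
    NOTIFICATION = 1
    WARNING = 2
    CRITICAL = 3
    SUPER_CRITICAL = 4


# Who or what responds at each level, as described on this page.
RESPONSE = {
    Level.OK: "no action",
    Level.NOTIFICATION: "no immediate response; rectify on a time scale of a week",
    Level.WARNING: "immediate response by TIGRESS local staff "
                   "(off-hours: the Cryogenics Officer)",
    Level.CRITICAL: "automatic fail-safe procedures engage",
    Level.SUPER_CRITICAL: "ISAC operators respond via the EPICS alarm handler",
}

# PLACEHOLDER values for illustration only (deg C); not the real settings.
EXAMPLE_THRESHOLDS_C = {
    Level.NOTIFICATION: 26.0,
    Level.WARNING: 30.0,
    Level.CRITICAL: 35.0,
    Level.SUPER_CRITICAL: 40.0,
}


def classify(value, thresholds):
    """Return the highest level whose threshold `value` exceeds, else OK."""
    exceeded = [level for level, limit in thresholds.items() if value > limit]
    return max(exceeded, default=Level.OK)
```

With the placeholder thresholds, `classify(31.0, EXAMPLE_THRESHOLDS_C)` gives `Level.WARNING`, and `RESPONSE[Level.WARNING]` names the responder. On the real system a super-critical event also implies that the automatic critical response has failed, which a simple threshold comparison like this does not capture.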
Shack Temperature Super-Critical
The temperature in the shack has exceeded a high threshold. At this point it is likely that the air conditioner has failed, the automatic shutdown system has failed, and TIGRESS staff have been unable to respond (possibly due to incapacitation).
- First, try to contact Greg Hackman.
- Verify that the temperature inside the shack is indeed uncomfortably hot, i.e. ~40 deg C. If so,
- Notify any experimenters on duty at ext. 6887 that you are about to turn off all power to the TIGRESS shack. Do NOT take any crap from them.
- Turn off power to the three red CAEN HV modules at the top of the racks (add picture)
- Turn off power to all VME and VXI crates (add picture)
- Remove the bungee cords from the doors to the shack and open them wide.
- Turn off the air conditioner. (add picture)
- The shack temperature should stabilize around 30 deg C.
- If the shack temperature is NOT uncomfortably hot, then there might simply be a failure of the temperature sensors. This requires a repair. Contact Greg Hackman immediately.
State-of-Health Watchdog Failure
The EPICS watchdog has lost communication with the shack. This is probably a software error. While it does not put the equipment at immediate risk, it should be addressed as soon as possible.
- Contact Greg Hackman. If you can reach Greg, he will (probably) either fix it himself or talk you through how to do it. The instructions below apply if you cannot reach Greg.
- Reset the computer labeled TIGSOH01. (add picture)
- Push the reset button.
- Wait 15 minutes.
- If the EPICS alarms have not cleared in 15 minutes, repeat the steps above. If you still cannot make it work
- If at any time the temperature becomes uncomfortably hot (above 30 deg C), immediately treat the problem as #Shack Temperature Super-Critical and follow those steps.
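The decision logic in the steps above can be sketched as a small function. This is only an illustration for readers who prefer to see the order of checks spelled out: the name `next_action` and the returned strings are hypothetical, and the function deliberately does not say what to do when repeated resets fail, because this page does not yet specify that.

```python
WAIT_MINUTES = 15    # wait time after a reset, from the steps above
HOT_LIMIT_C = 30.0   # "uncomfortably hot" limit quoted in the steps above


def next_action(alarms_cleared, minutes_since_reset, shack_temp_c):
    """Return the next step as a short string (hypothetical helper).

    alarms_cleared      -- True once the EPICS alarms have cleared
    minutes_since_reset -- minutes elapsed since TIGSOH01 was last reset
    shack_temp_c        -- current shack temperature in deg C
    """
    # The temperature check takes priority at any time.
    if shack_temp_c > HOT_LIMIT_C:
        return "follow Shack Temperature Super-Critical steps"
    if alarms_cleared:
        return "done"
    if minutes_since_reset < WAIT_MINUTES:
        return "keep waiting"
    # Alarms still present after the full wait: repeat the reset.
    return "reset TIGSOH01 again"
```

For example, `next_action(False, 5, 25.0)` returns `"keep waiting"`, while any call with a temperature above 30 deg C returns the super-critical instruction regardless of the other arguments.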
Super-Critical Contact
If there is a super-critical event, contact Greg Hackman.
- Try to contact Greg by phone first. Please attempt all these numbers in order. Do NOT wait for a callback; if you cannot reach Greg at all, simply proceed per the instructions for the super-critical event.
- ext. 7441
- Site-wide page (operator or MCR)
- 604-324-9668
- 778-788-1869
- If you cannot reach Greg at any of these numbers, DO NOT wait for a callback. Proceed immediately to the next step in the super-critical error response.

