I have 2 questions about Solaris SMF. (I am an SMF newbie.)
I set up the Oracle RDBMS service in SMF as per https://docs.oracle.com/cd/E37838_01/html/E61677/odbstartstop.html
The database part works entirely as expected, so I added a listener as another service instance, since the method script accepts 'listener' as an argument instead of 'db'. With that argument it runs lsnrctl start ${LISTENER} instead of using sqlplus to connect and then start or stop a database instance.
Running svcadm enable and svcadm disable on the service starts and stops the listener as expected. The issue is that the framework senses whether lsnrctl is running but does nothing to restart it once it has stopped. See below:
svc:/site/oracle/db/oracle12lsnr:LISTENER4 (?)
State: maintenance since May 21, 2020 03:25:39 PM BST
Reason: Method failed.
See: http://support.oracle.com/msg/SMF-8000-8Q
See: /var/svc/log/site-oracle-db-oracle12lsnr:LISTENER4.log
Impact: This service is not running.
The "Reason: Method failed." line does not square with the fact that invoking the method via svcadm enable (or disable) works just fine.
Further investigation: I killed the lsnrctl process as root and got this from svcs -Lv:
[ May 22 14:13:30 Executing stop method ("/lib/svc/method/svc-oracle12-database lsnr stop LISTENER4"). ]
LSNRCTL for Solaris: Version 12.1.0.2.0 - Production on 22-MAY-2020 14:13:30
Copyright (c) 1991, 2016, Oracle. All rights reserved.
Connecting to (DESCRIPTION=(ADDRESS=(PROTOCOL=TCP)(HOST=orahost.some.domain)(PORT=1521)))
TNS-12541: TNS:no listener
TNS-12560: TNS:protocol adapter error
TNS-00511: No listener
Solaris Error: 146: Connection refused
[ May 22 14:13:30 Method "stop" exited with status 95. ]
So the first question has changed and is now: Why would it run the stop method? The db version of this service runs the start method when the database service goes down.
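Answer to Q1: when the process dies, the service framework runs the stop method followed by the start method. Once this was established, a fresh look at the method script revealed a flaw: the stop method exited with an error if it couldn't contact the tnslsnr process. (Logic fail. If the tnslsnr process has been killed, you can't test a connection to it!) That failed stop method (status 95 in the log above) is what left the service in maintenance instead of letting the start method run.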
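For anyone hitting the same thing, here is a minimal sketch of a stop routine that tolerates the listener already being gone. It is not the code from the Oracle example script. The function names (listener_running, listener_stop_cmd, stop_listener) are made up for illustration, and the pgrep pattern assumes the listener shows up in the process table as "tnslsnr LISTENER4 ...". The two exit codes are the values defined in /lib/svc/share/smf_include.sh.

```shell
#!/bin/sh
# Exit codes as defined in /lib/svc/share/smf_include.sh
SMF_EXIT_OK=0
SMF_EXIT_ERR_FATAL=95

# Hypothetical helper: succeeds if a tnslsnr process for the named
# listener exists. Assumes the command line looks like
# "tnslsnr <LISTENER_NAME> ...".
listener_running() {
    pgrep -f "tnslsnr $1" >/dev/null 2>&1
}

# Hypothetical helper: asks the listener to shut down cleanly.
listener_stop_cmd() {
    lsnrctl stop "$1"
}

# Stop routine for the method script. If the listener is already dead
# (for example because it was killed), there is nothing to stop, so
# report success and let the framework go on to the start method.
stop_listener() {
    listener_running "$1" || {
        echo "listener $1 is not running; nothing to stop"
        return $SMF_EXIT_OK
    }
    if listener_stop_cmd "$1"; then
        return $SMF_EXIT_OK
    fi
    # The listener is running but refused to stop: a genuine failure.
    return $SMF_EXIT_ERR_FATAL
}
```

In the method script's stop branch this would be called as stop_listener "$LISTENER" followed by exit $?, so that only a listener that is actually running and refuses to stop produces the fatal status.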
To be honest I am struggling with the sheer volume of information to get through. I am currently reading through the pdf version of the URL above. I had a quick look here at Moellenkamp's blog http://blog.moellenkamp.org/archives/18-Auditing-a-single-SMF-service-revisted.html but I've not implemented that auditing service yet - assuming it would help anyway. If anyone has any thoughts as to why this isn't working I'd be really grateful.
The second question is this:
In the example the manifest is stored in /lib/svc/manifest/site/oracle/db and first time around I changed this to /lib/svc/manifest/site/oracle12db since 2 subdirectories (after .../site) seemed a little over the top. This resulted in the service just failing to work in any way (always in maintenance). I had adjusted the manifest XML file to match the changed directory structure. I was baffled, and after fiddling around I simply changed the XML files and directory structure to match the example and it all worked. Why would that be? Is there some formula to the layers in the service_name or service_bundle?
I haven't yet read anything that says the directory structure has to be extended as per the example. I had not typo'd the XML file as far as I can tell, especially as reverting the changes to match the original example simply meant altering the service_name and service_bundle lines to match the extended directory structure.
To diagnose the reason for a service failure, always start with the service log. Its path is shown in the svcs output (the second "See:" line above). Or just use "svcs -Lv" with the service name to display the log directly.