Monitoring Platform Common Issues

1. The document lists common errors encountered on the APM platform and steps to resolve them. 2. Issues include site scope URL availability, APMDB maintenance job failures, high CPU utilization, disk space alerts, memory alerts, opctrapi service issues, event sync buffer issues, OM agent alerts, backup job failures, and slow event delivery from NNMi to OMi. 3. Resolving the issues involves restarting services, checking server resources, clearing caches, rerunning jobs, and contacting appropriate support teams.

Uploaded by

raj0809

S. No  Common Errors

1   Site scope URL Availability
2   Platform Availability
3   APMDB maintenance job issues ([APM Platform] Database Job Failed)
4   CPU Utilization
5   Disk Space Alerts for C: Drive
6   Physical Memory or Virtual Memory
7   opctrapi
8   Sync_Buffer_issue
9   Event delay (sitescope/BSMC/NNMi) to OMi console
10  OM Agent alerts
11  Execute the Downtime Service launcher (weekly once) to avoid the suppression issues
12  ROD/Boomi issues
13  Atrium to UCMDB integration job
14  UCMDB to RTSM integration job
15  Backup jobs failure
16  NNMi Events are not coming from 15 mins
17  NNM Health Status : error

Steps to be taken

1. Restart the SiteScope service.
2. Check the URL availability and send a mail with the result.
1. There are 3 checks: WDE, LoadBalancerVerify and HealthCheck.
2. If a WDE error is received, check the WDE status for the GW.
3. The following are the links to the specific checks for the GWs:
1) WDE : http://<GW>:8181/ext/mod_mdrv_wrap.dll?type=test
2) HealthCheck : https://<GW>/healthcheck.html
3) LoadBalancerVerify : https://<GW>/topaz/topaz_api/loadBalancerVerify_centers.jsp
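
The three checks above can be scripted. The Python sketch below is only an illustration: the function names `gateway_check_urls` and `probe` are hypothetical, the gateway host name is supplied by the caller, and port 8181 for the WDE link is an assumption read from the link listed above. It builds the three URLs for one gateway and reports whether a URL answers with an HTTP 2xx status.

```python
from urllib.error import URLError
from urllib.request import urlopen


def gateway_check_urls(gw: str) -> dict:
    """Build the three check URLs for one gateway host (hypothetical helper)."""
    return {
        # Port 8181 is assumed from the WDE link in the runbook.
        "WDE": f"http://{gw}:8181/ext/mod_mdrv_wrap.dll?type=test",
        "HealthCheck": f"https://{gw}/healthcheck.html",
        "LoadBalancerVerify": (
            f"https://{gw}/topaz/topaz_api/loadBalancerVerify_centers.jsp"
        ),
    }


def probe(url: str, timeout: float = 10.0) -> bool:
    """Return True only if the URL answers with an HTTP 2xx status."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (URLError, OSError, ValueError):
        # Covers HTTP errors, connection failures, timeouts and malformed URLs.
        return False
```

Calling `probe(url)` for each value returned by `gateway_check_urls("<GW>")` gives one pass/fail result per check. A gateway with an internally signed certificate may fail the HTTPS checks here even when it is healthy, so a failure should be confirmed in a browser.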

[APM Platform] Database Log Space Alert [P] :

1. Note that these emails sent to the mailbox require action on our part.
2. The log attached to the email shows, at the bottom, the reason why the job failed:
3. "Msg 9002, Sev 17, State 2, Line 25 : The transaction log for database 'BSM_PROFILE_EU' is full.
To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases [SQLSTATE 42000]"
4. If you see such a message, connect to the SQL Server and open a new query.
-> Execute the following in the context of the database for which the failure was reported:
sp_helpfile
5. The output shows the maximum size to which the LOG file is allowed to grow (160GB in this case).
6. Execute the following query to check the current log space usage:
DBCC SQLPERF(LOGSPACE)
7. Once the space is cleared, follow the steps below to run the job again:
-> Right-click SQL Server Agent -> select Job Activity Monitor -> right-click and select View Job Activity.
-> Go to the profile job that failed, right-click it and select Start Job at Step to rerun it.
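
When many of these mails arrive, the database name can be pulled out of the attached log automatically. The following Python sketch is an assumption-laden illustration, not part of the platform: `full_log_database` is a hypothetical helper, and the pattern is based only on the Msg 9002 text quoted above.

```python
import re
from typing import Optional

# Matches the "transaction log ... is full" message quoted in the runbook.
LOG_FULL_PATTERN = re.compile(
    r"Msg 9002.*?transaction log for database '([^']+)' is full",
    re.DOTALL,
)


def full_log_database(job_log: str) -> Optional[str]:
    """Return the database named in a Msg 9002 error, or None if absent."""
    match = LOG_FULL_PATTERN.search(job_log)
    return match.group(1) if match else None
```

The returned name is the database in whose context `sp_helpfile` and `DBCC SQLPERF(LOGSPACE)` should then be run.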

1. Log in to the respective server via RDP.
2. Open Task Manager and check the Performance tab.
3. If the utilization is continuously high, restart the SiteScope service.

1. Log in to the respective server via RDP.
2. Go to <C:\ProgramData\HP\HP BTO Software\datafiles>
3. Check for the coda files.
4. Execute the following commands in the command prompt.
ovc -stop coda
ovc -start coda
5. Check whether any other highly utilized folder is not system or application related, and clean it if it is not important fo
6. Check the disk space and reply to the mail.
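
Finding the "highly utilized" folders mentioned in step 5 can be done with a short script. The Python sketch below is a hypothetical aid (the names `folder_sizes` and `largest_folders` are not from the platform). It only measures sizes and deletes nothing, so the decision about what to clean stays with the engineer.

```python
import os


def folder_sizes(root: str) -> dict:
    """Return {top-level folder name: total bytes of the files below it}."""
    sizes = {}
    with os.scandir(root) as entries:
        for entry in entries:
            if not entry.is_dir(follow_symlinks=False):
                continue
            total = 0
            for dirpath, _dirs, files in os.walk(entry.path):
                for name in files:
                    try:
                        total += os.path.getsize(os.path.join(dirpath, name))
                    except OSError:
                        pass  # file vanished or is not readable; skip it
            sizes[entry.name] = total
    return sizes


def largest_folders(root: str, top: int = 5) -> list:
    """Return the `top` biggest top-level folders as (name, bytes) pairs."""
    sizes = folder_sizes(root)
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:top]
```

For example, `largest_folders("C:\\")` lists the five largest folders directly under the C: drive. Folders the account cannot read are counted as smaller than they are.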
If the alert is for a SiteScope server:
1. Stop the SiteScope service.
2. Observe the memory and CPU for 5 minutes.
3. If the usage is reduced, start the service.
4. Observe the server for another 5 minutes.
5. If the usage is not reduced, raise an INC to the Wintel team.
6. If it is not a SiteScope server, reach out to the Wintel team.
1. Log in to the respective server via RDP.
2. Type ovc in the command prompt.
3. If the opctrapi service is aborted, type ovc -start opctrapi in the command prompt.
4. Type ovc in the command prompt again to confirm the status.
5. Go to <C:\ProgramData\HP\HP BTO Software\log> and check the system.txt file for error messages.
6. Go to <C:\ProgramData\HP\HP BTO Software\log\OpC> and check that the logs are up to date.
7. Take a screenshot and reply to the mail.

1. Log in to PHCHBS-SQ350001 and execute the following queries.

1. SELECT COUNT(1) FROM BSM_EVENT.dbo.EVENT_SYNC_BUFFER (NOLOCK) - To verify the event count.
2. SELECT IDENTIFIER, COUNT(1) FROM BSM_EVENT.dbo.EVENT_SYNC_BUFFER (NOLOCK) GROUP BY IDENTIFIER - To verify
3. SELECT * FROM BSM_EVENT.dbo.EVENT_SYNC_LOAD_BALANCE_LOCK (NOLOCK) WHERE TARGET_IDENTIFIER LIKE 'extran
(GW Details)

2. If you observe events stuck, or the event count increasing, on any one of the GWs based on the above queries, please proceed th
a. Go to the GW server with your ADM account.
b. Move the health check file from C:\inetpub\wwwroot to another path.
c. Stop the GW services.
d. Clear the cache data from the paths below.
i. D:\HPBSM\EJBContainer\server\mercury\tmp\sessions
ii. D:\HPBSM\EJBContainer\server\mercury\tmp\deploy
iii. D:\HPBSM\EJBContainer\server\mercury\work\jboss.web
e. Start the GW services.
f. Put the health check file back in C:\inetpub\wwwroot.
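
Whether the count is "increasing" for a gateway is easiest to judge by comparing two runs of the GROUP BY IDENTIFIER query taken a few minutes apart. The following Python sketch assumes the two result sets have already been copied into dictionaries of identifier to count. The function name `growing_identifiers` is hypothetical and nothing here connects to the database.

```python
def growing_identifiers(previous: dict, current: dict) -> list:
    """Return identifiers whose buffered event count rose between two samples.

    An identifier missing from the earlier sample is treated as having had
    a count of 0, so a newly appearing backlog is reported as growing.
    """
    return sorted(
        ident
        for ident, count in current.items()
        if count > previous.get(ident, 0)
    )
```

An identifier that appears in the returned list on several consecutive comparisons is a candidate for the cache-clearing steps above.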

1. Log in to the primary DPS server.
2. Click All Programs > Aurea > Sonic Management Console.
3. Change the connection URL value from "localhost" to the primary DPS and click OK.
4. Go to Managed Objects -> Containers -> phuseh-SXXXX (the gateway server where the events are blocked) -> phuseh-SXXXX
5. The Opr_gateway_queue1 message count should be 0.
6. If any messages are queued, follow the steps below:
a. Log in to the gateway server phuseh-SXXXX console URL using admin credentials
http://phuseh-sXXXX.Company.net:11021/invoke?operation=showServiceInfoAsHTML&objectname=Foundations%3Atype%3D
b. Click the restart button for hpbsm_opr-scripting-host.
7. Repeat steps 4, 5 and 6 for all the other gateway servers.
8. Then check the queue size, and check the event console to see whether events are arriving in OMi.

If identified that the GWs are queuing up events, please run the following command multiple times from both GW and DPS ser
required for getting the required analysis completed from HPE.
%topaz_home%\opr\support\opr-support-utils.sh -gtd OPR
Failed to report data to HP OM Agent:
1. Log in to the server where the issue is reported.
2. Execute the following commands:
ovc -kill
ovpacmd stop (specific to Windows servers)
Remove the contents of the location: %ovdatadir%/tmp/OpC/
Remove the coda* files from the location: %ovdatadir%/datafiles
ovc -start
ovpacmd start
ovc -status (all processes should be running)
perfstat -p (all processes should be running)
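
The two "remove" steps in the middle of this sequence are the error-prone part, since they must leave every other file in place. The Python sketch below shows one way to script only those two steps. It is an illustration under assumptions: `clean_agent_data` is a hypothetical helper, the caller passes the directory that %ovdatadir% resolves to, and it must run only after `ovc -kill` has stopped the agent.

```python
import shutil
from pathlib import Path


def clean_agent_data(ovdatadir: Path) -> list:
    """Empty tmp/OpC and delete coda* files from datafiles.

    Returns the list of paths that were removed. Run this only while
    the agent processes are stopped.
    """
    removed = []

    tmp_opc = ovdatadir / "tmp" / "OpC"
    if tmp_opc.is_dir():
        for entry in list(tmp_opc.iterdir()):
            if entry.is_dir():
                shutil.rmtree(entry)
            else:
                entry.unlink()
            removed.append(entry)

    datafiles = ovdatadir / "datafiles"
    if datafiles.is_dir():
        # Only files whose names start with "coda" are touched.
        for path in list(datafiles.glob("coda*")):
            if path.is_file():
                path.unlink()
                removed.append(path)

    return removed
```

The tmp/OpC folder itself is kept and only its contents are deleted, matching the wording of the step.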

http://<DPS server where HAC services are running>:8080/jmx-console/HtmlAdaptor?action=inspectMBean&name=Topaz%3A


Step 1: Click void stop() and wait for 1 to 2 minutes.
Step 2: Click void start().

For incident creation or update issues, contact the Remedy and Boomi teams.
Resolver Group: GL_RoD-Incidents, Group Mail ID: <remedy.support@Company.com>; <is.support@Company.com>
Verification of Atrium - Full Sync Job
Atrium Integration Job:
1. Log on to https://ucmdb.Company.net/ucmdb using admin credentials.
2. Go to Data Flow Management > Integration Studio and select the integration point "Atrium".
3. Verify that all the integration jobs have completed successfully.
4. If an integration job has failed, perform the steps below:
a. Click the failed integration job ("Import Delta" only) and review the job errors.
b. Re-run the Import Delta integration job by clicking the Delta Synchronization option.
c. If the job still fails after the re-run, go to the probe log path - \\phchbs-sp320157.Company.net\d$\HP\UCMDB\DataFlowProbe\runtime\log
d. Open the log file RemoteProcesses.log and search for the error message:
Remote process DS_Atrium_Import Delta finished with success status false
e. If you see this message, the issue is likely on the Atrium side; raise a Service Request for it.
For reference, use the Incident # INC000003772435
f. Provide the username "HPUCMDB" if the Atrium/ROD team requests it.
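
The log search in step d can be automated once the content of RemoteProcesses.log has been read into a string. This Python sketch is a minimal illustration. `atrium_import_failed` is a hypothetical name, and the only thing taken from the runbook is the error line itself.

```python
# Error line quoted in the runbook for a failed Import Delta run.
FAILURE_LINE = (
    "Remote process DS_Atrium_Import Delta finished with success status false"
)


def atrium_import_failed(log_text: str) -> bool:
    """Return True if any line of the log contains the failure message."""
    return any(FAILURE_LINE in line for line in log_text.splitlines())
```

The text can be obtained with `pathlib.Path(log_path).read_text(errors="replace")`, where `log_path` points at RemoteProcesses.log in the probe log folder named in step c.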

If any integration job fails, verify the following:
1. Log in to http://phuseh-s3241:8080/status/ using admin credentials.
2. Verify the Default Client status.
3. The status should be "UP" for one of the servers and "Up: Reader" for the other.
4. Verify the status of the 2 URLs below.
http://phuseh-s3324.Company.net:8080/ping/?restrictToWriter=true
http://phuseh-s3241.Company.net:8080/ping/?restrictToWriter=true
5. One URL should show the status "UP" and the other should show the status "DOWN".
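
The rule in step 5 (exactly one "UP" and one "DOWN") is easy to misread when checked by eye. The Python sketch below encodes that rule. It assumes the two status words have already been read from the ping pages, and the name `writer_check_ok` is hypothetical.

```python
def writer_check_ok(ping_results: dict) -> bool:
    """Return True when exactly one server reports UP and the other DOWN.

    `ping_results` maps a server name to the status text shown by its
    /ping/?restrictToWriter=true page.
    """
    statuses = sorted(status.strip().upper() for status in ping_results.values())
    return statuses == ["DOWN", "UP"]
```

Two "UP" results or two "DOWN" results both return False, which are the two cases that need follow-up.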

If three consecutive backup failures are seen on the APM servers, an INC should be created for the B&R team.
Resolver Group: GL_Backup-Recovery_HCL
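
The "three consecutive failures" rule can be expressed as a small check over the recent backup history. This Python sketch is an illustration that assumes the history is available as an oldest-first list of booleans (True for a successful backup). The name `needs_backup_incident` is hypothetical, and the rule is read here as counting only the most recent unbroken run of failures.

```python
def needs_backup_incident(results: list, threshold: int = 3) -> bool:
    """Return True if the latest `threshold` backups all failed.

    `results` is ordered oldest first; True means the backup succeeded.
    """
    streak = 0
    for succeeded in reversed(results):
        if succeeded:
            break  # a success ends the current run of failures
        streak += 1
    return streak >= threshold
```

A success after earlier failures resets the count, so an old run of three failures that has since recovered does not trigger an INC.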
1) Log in to the NNMi console and check the latest events.
2) If events are arriving in NNMi but not in OMi, follow the steps below.
To check the NNMi services using the NNMi command, follow the steps below:
Log in to the NNMi target server as administrator or with an ADM account using RDP.
Open a command prompt and run the following command:
D:\>ovstatus -c
Sample Output
Name               PID   State    Last Message(s)
OVsPMD             1324  RUNNING  -
nnmaction          6988  RUNNING  Initialization complete.
nnmtrapreceivermd  1408  RUNNING  Initialization complete. Trap Receiver is running.
nmsdbmgr           1396  RUNNING  Database available.
ovjboss            5064  RUNNING  Initialization complete.
D:\>
1.1 Steps to restart NNMi services:
To restart the NNMi services using the NNMi commands, follow the steps below:
1) Log in to the NNMi active server as administrator using RDP.
2) Open a command prompt and enter the commands:
a) ovstop [To stop NNMi services]
b) ovstart [To start NNMi services]
Note: Do not use the -c option for ovstart and ovstop, as application failover is configured for all RNM and GNM servers.
The current status of the services can be checked with the ovstatus -c command as shown below.

To check the NNMi services using the NNMi command, follow the steps below:
Log in to the NNMi target server as administrator or with an ADM account using RDP.
Open a command prompt and run the following command:
D:\>ovstatus -c
Sample Output
Name               PID   State    Last Message(s)
OVsPMD             1324  RUNNING  -
nnmaction          6988  RUNNING  Initialization complete.
nnmtrapreceivermd  1408  RUNNING  Initialization complete. Trap Receiver is running.
nmsdbmgr           1396  RUNNING  Database available.
ovjboss            5064  RUNNING  Initialization complete.
D:\>
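
Output in the form of the sample above can be checked mechanically for services that are not in the RUNNING state. The Python sketch below parses text in that layout. The name `not_running` is hypothetical, and the handling of a "-" in the PID column for a stopped service is an assumption, since the sample only shows running services.

```python
def not_running(ovstatus_output: str) -> list:
    """Return the names of services whose State column is not RUNNING."""
    problems = []
    for line in ovstatus_output.splitlines():
        # Data rows look like: <name> <pid> <state> <last message>
        parts = line.split(None, 3)
        if len(parts) < 3:
            continue  # blank line, prompt or other short line
        name, pid, state = parts[:3]
        if not (pid.isdigit() or pid == "-"):
            continue  # header row ("Name PID State ...") or unrelated text
        if state != "RUNNING":
            problems.append(name)
    return problems
```

An empty list means every listed service reported RUNNING. Any name in the list is a candidate for the ovstop and ovstart restart described in section 1.1.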
Point of Contact

1   APM Support
2   APM Support
3   APM Support
4   APM Support
5   APM Support
6   Wintel
7   APM Support
8   APM Support
9   APM Support
10  APM Support
11  APM Support
12  ROD/Boomi Team
13  APM Support
14  APM Support
15  B&R Team
16  APM Support
17  APM Support
