Sunday, June 19, 2011

Problem Determination Steps

The following questions should be considered, and answered where possible, when diagnosing a problem:

1. What is the problem?
2. Where did it occur?
3. When did it begin happening?
4. What action was being performed?
5. Were any messages issued?

Check the server activity log for error messages.

If error messages are in the server activity log, check the entries from 30 minutes before to 30 minutes after the time the error message was issued. The reported problem is often a symptom of another problem, and the other error messages issued around the same time may help to isolate the root cause.

Did the Explanation or User Response section of the TSM message offer any suggestions on how to resolve the problem?

6. How frequently does this error occur?
7. Check any system error logs:

On Windows(R)
Check the application log.
On AIX(R) and other UNIX(R) platforms
Check the error report.

8. Check with others who may have made changes in the environment that could affect TSM. In a typical IT environment, these include:

SAN Administrator
Network Administrator
Database Administrator
Client or machine owners

9. Check the TSM error logs, which include the following:

dsmserv.err - Server error file. This is located on the same machine as the server. The dsmserv.err file is typically in the server install directory. Note that the storage agent may also create a dsmserv.err file to report errors.
dsmerror.log - Client error log. This is located on the same machine as the client.
dsmsched.log - Client log for scheduled client operations. This is located on the same machine as the client.
db2diag.log, db2alert.log, userexit.log - DB2(R) log files. These are useful when troubleshooting a problem with backing up a DB2 database using Tivoli Data Protection for DB2. They are located on the same machine where DB2 is installed.

tdpess.log - Default error log file used by the Data Protection for Enterprise Storage Server(R) client.
tdpexc.log - Default error log file used by the Data Protection for Exchange client.
dsierror.log - Default error log for the client API.
tdpoerror.log - Default error log for the Data Protection for Oracle client.
tdpsql.log - Default error log for the Data Protection for SQL client.

10. Verify that devices are still accessible to the system and to TSM.
11. Search the online Knowledge Base for matching error messages or problem descriptions.
12. Test other operations to better determine the scope and impact of the problem. This may also help to determine if it is a specific sequence of events that causes the problem.
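The advice under question 5, to read the activity log from 30 minutes before to 30 minutes after an error, can be scripted once the log has been saved to a text file (for example, QUERY ACTLOG output). The following is a minimal Python sketch, not an IBM tool. It assumes each entry starts with an MM/DD/YYYY HH:MM:SS timestamp, which depends on your server's date format, and messages_near is a hypothetical helper name.

```python
from datetime import datetime, timedelta

# Assumption: each saved activity-log line starts with "MM/DD/YYYY HH:MM:SS",
# as in QUERY ACTLOG output on a server using that date format.
TIMESTAMP_FORMAT = "%m/%d/%Y %H:%M:%S"


def messages_near(lines, error_time, window_minutes=30):
    """Return the lines stamped within +/- window_minutes of error_time."""
    window = timedelta(minutes=window_minutes)
    nearby = []
    for line in lines:
        parts = line.split(maxsplit=2)
        if len(parts) < 2:
            continue  # blank or too-short line
        try:
            stamp = datetime.strptime(parts[0] + " " + parts[1], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # wrapped continuation lines carry no timestamp
        if abs(stamp - error_time) <= window:
            nearby.append(line)
    return nearby
```

For example, messages_near(lines, datetime(2011, 6, 19, 10, 0)) returns the entries logged between 09:30 and 10:30 on June 19, 2011.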

Understanding TSM Backups: Selective vs Incremental



Selective: The selective command backs up the files that you specify. If these files become damaged or lost, you can replace them with backup versions from the server. When you run a selective backup, TSM backs up all of the specified files unless they are excluded from backup in your include-exclude list or they do not meet the management class requirements for serialization.

During a selective backup, TSM sends copies of the files to the server even if they have not changed since the last backup. This can result in more than one copy of the same file on the server, so you might not have as many different down-level versions of the file as you intended: your version limit might be filled with identical copies. To avoid this, use the incremental command to back up only new and changed files.

You can selectively back up single files or directories, and you can use wildcard characters to back up groups of related files. During a selective backup, a directory path may be backed up even if the specific file targeted for backup is not found. For example, dsmc selective "/dir1/dir2/bogus.txt" still backs up dir1 and dir2 even if the file bogus.txt does not exist.
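To see why repeated selective backups can fill the version limit with identical copies, here is a toy Python model of one file's backup versions. It is a simplification, not TSM's actual logic. It assumes the version limit behaves roughly like the backup copy group's VEREXISTS setting, so the oldest version is dropped when the limit is reached. It treats "changed" as "content differs from the newest stored version". run_backups is a hypothetical name.

```python
from collections import deque


def run_backups(contents_per_run, versions_limit, selective):
    """Toy model of one file's backup versions on the server.

    contents_per_run: the file's content at each backup run.
    versions_limit: how many versions are kept (oldest dropped first).
    selective=True sends a copy every run; False sends only changed content.
    """
    versions = deque(maxlen=versions_limit)  # a full deque discards its oldest item
    for content in contents_per_run:
        changed = not versions or versions[-1] != content
        if selective or changed:
            versions.append(content)
    return list(versions)
```

Consider four runs where the file changes only once (v1, v2, v2, v2) with a limit of three. The selective model keeps three identical copies of v2, while the incremental model keeps both v1 and v2.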

If the selective command is retried because of a communication failure or session loss, the transfer statistics will display the number of bytes TSM attempts to transfer during all command attempts. Therefore, the statistics for bytes transferred may not match the file statistics, such as those for file size.

Incremental: The incremental command backs up all new or changed files or directories in the default client domain or from file systems, directories, or files you specify that are not excluded from backup services. To incrementally back up selected files or directories, specify the file specification in the command. The default is to back up files or directories in the default domain.

The following attributes in the management class assigned to the file or directory affect whether the data is actually backed up:
Frequency - The number of days that must elapse between successive backups for the file. This attribute is only used during a full incremental backup.
Mode - Permits you to back up only files that changed since the last backup (modified), or to back up the files whether they changed or not (absolute).
Serialization - Permits or denies backup of files or directories according to the following values:

static: In order to be backed up, data must not be modified during backup or archive.
shrstatic: If data in the file or directory changes during each of the four attempts to back up or archive it, it is not backed up or archived.
dynamic: The object is backed up or archived on the first attempt whether or not data changes during the process.
shrdynamic: The object is backed up or archived on the last attempt, even if data changes during the process.
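The four serialization values differ only in how many attempts are made and what happens when the data keeps changing. The toy Python function below encodes the rules as described in the list above. The attempt count defaults to four to match the text; check the changingretries client option for how retries are configured in your environment. serialization_outcome is a hypothetical name.

```python
def serialization_outcome(mode, changed_during_attempt, attempts=4):
    """Toy model of the four serialization values.

    changed_during_attempt[i] is True if the data changed while attempt i+1 ran;
    for the shr* modes it needs at least `attempts` entries.
    Returns the 1-based attempt whose copy is kept, or None if not backed up.
    """
    if mode == "static":
        # One attempt only; any change means the object is skipped.
        return None if changed_during_attempt[0] else 1
    if mode == "dynamic":
        # The first attempt is kept whether or not the data changed.
        return 1
    if mode in ("shrstatic", "shrdynamic"):
        for attempt in range(1, attempts + 1):
            if not changed_during_attempt[attempt - 1]:
                return attempt  # a clean copy was obtained on this attempt
        # Every attempt saw changes: shrdynamic keeps the last one, shrstatic gives up.
        return attempts if mode == "shrdynamic" else None
    raise ValueError(f"unknown serialization mode: {mode}")
```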

You can assign the default management class to a file, or you can assign a specific management class to a file using the include option in an include-exclude list.

You can perform either a full incremental backup or an incremental-by-date backup. The default is a full incremental backup. You can also use the selective command to perform a selective backup that backs up only the files, directories, or empty directories that you specify.

A full incremental backup backs up all files or directories that are new or that have changed since the last incremental backup. During a full incremental backup, the client queries the server to determine the exact condition of your storage. TSM uses this information to:

Back up new files or directories.
Back up files or directories whose contents have changed.
Mark inactive backup versions on the server for files or directories that are deleted from the workstation.
Rebind backup versions to management classes if the management class assignments change.

An incremental-by-date backup backs up new and changed files with a modification date later than the date of the last incremental backup stored at the server, unless the files are excluded from backup by an exclude statement. If an incremental-by-date is performed on only part of a file system, the date of the last full incremental is not updated, and the next incremental-by-date will back up these files again. Changes to access control lists (ACLs) are not backed up during an incremental-by-date, because such changes do not update a file's modification date. Use the query filespace command to determine the date and time of the last incremental backup of the entire file system.

To perform an incremental-by-date backup, use the -incrbydate option with the incremental command.

Unlike a full incremental, an incremental-by-date does not maintain current server storage of all your workstation files because:

It does not expire backup versions of files that are deleted from the workstation.
It does not rebind backup versions to a new management class if the management class has changed.
It does not back up files with attributes that have changed, unless the modification dates and times have also changed.
It ignores the copy group frequency attribute of management classes.

For these reasons, if you have limited time during the week to perform backups but extra time on weekends, you can run incremental-by-date backups on weekdays and a full incremental backup on weekends to keep the server's copy of your workstation files current.

If the incremental command is retried because of a communication failure or session loss, the transfer statistics display the number of bytes TSM attempted to transfer during all command attempts. Therefore, the statistics for bytes transferred may not match the file statistics, such as those for file size.
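The difference between the two incremental modes comes down to what each one compares against. In this simplified Python sketch, incremental-by-date looks only at modification times relative to the last incremental date. The full incremental compares against the server's inventory and can therefore also expire deleted files. Real TSM compares more attributes than a single timestamp, and both function names are hypothetical.

```python
from datetime import datetime


def incrbydate_candidates(local_files, last_incremental):
    """Files an incremental-by-date would send: modified after last_incremental.

    local_files maps path -> modification datetime.
    """
    return sorted(p for p, mtime in local_files.items() if mtime > last_incremental)


def full_incremental_actions(local_files, server_inventory):
    """Simplified full incremental: compare against the server's inventory.

    server_inventory maps path -> modification datetime of the active backup version.
    Returns (new, changed, expire) lists of paths.
    """
    new = sorted(p for p in local_files if p not in server_inventory)
    changed = sorted(p for p, mtime in local_files.items()
                     if p in server_inventory and mtime != server_inventory[p])
    expire = sorted(p for p in server_inventory if p not in local_files)
    return new, changed, expire
```

Suppose a file's modification time predates the last incremental, for example one copied in with its original timestamp preserved. The by-date check skips it, while the inventory comparison still flags it as changed. Only the full incremental reports deleted files for expiration.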


The following are examples of tasks you might perform using the incremental command.

Task: Run an incremental backup of the default client domain specified in your client options file.
Command: Incremental

Task: Run an incremental backup for the /home, /usr, and /proj file systems.
Command: Incremental /home /usr /proj

Task: Run an incremental backup for the /proj/test directory.
Command: Incremental /proj/test/

Task: Run an incremental-by-date backup for the /home file system.
Command: Incremental -incrbydate /home

Task: Run an incremental backup of all files in the /fs/dir1 directory that begin with the string abc.
Command: Incremental -subdir=yes "/fs/dir1/abc*"

Task: Run an incremental backup of the abc file in the /fs/dir1 directory.
Command: Incremental -subdir=yes /fs/dir1/abc

Task: Run an incremental backup of the directory object /fs/dir1, but not any of the files in the /fs/dir1 directory.
Command: Incremental -subdir=yes /fs/dir1

Task: Run an incremental backup of the directory object /fs/dir1 and all of the files in the /fs/dir1 directory.
Command: Incremental -subdir=yes /fs/dir1/

Restore commands:

Task: Restore a single file named budget.
Command: restore /home/devel/projecta/budget

Task: Restore a single file named budget that resides in the current directory.
Command: restore file budget

Task: Restore all files with a file extension of .c from the /home/devel/projecta directory.
Command: restore "/home/devel/projecta/*.c"

Task: Restore files in the /user/project directory. Use the pick and inactive options to select active and inactive backup versions.
Command: restore "/user/project/*" -pick -inactive

Task: Restore all files with a .c extension from the /home/devel/projecta directory to the /home/newdevel/projectn/projecta directory. If the projectn or projectn/projecta directory does not exist, it is created.
Command: restore "/home/devel/projecta/*.c" /home/newdevel/projectn/

Task: Restore all files in the /home/mydir directory to their state as of 1:00 PM on August 17, 1998.
Command: res -pitd=8/17/1998 -pitt=13:00:00 /home/mydir/

Task: Restore all objects in the /home/myid/ directory. Because this restore is fully wildcarded, a restartable restore session is created if the restore process is interrupted. Use the restart restore command to restart a restartable restore session, or the cancel restore command to cancel it.
Command: res /home/myid/*
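Conceptually, -pitd and -pitt pick, for each file, the newest backup version that existed at the given date and time. The toy Python function below shows that selection for a single file. It ignores details a real point-in-time restore handles, such as files that had already been deleted before that time. version_at is a hypothetical name.

```python
from datetime import datetime


def version_at(versions, point_in_time):
    """Toy model of point-in-time selection for one file.

    versions: list of (backup_time, label) tuples for a single file.
    Returns the label of the newest version backed up at or before
    point_in_time, or None if no version existed yet.
    """
    eligible = [v for v in versions if v[0] <= point_in_time]
    if not eligible:
        return None
    return max(eligible, key=lambda v: v[0])[1]
```

Suppose versions were taken at 22:00 on August 16, 12:30 on August 17, and 22:00 on August 17, 1998. A point in time of 13:00 on August 17 then selects the 12:30 version.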

How much data (in GB) has been backed up in a given timeframe?

Use the following select statement, changing 24 hours to the timeframe you need:

select cast(float(sum(bytes))/1024/1024/1024 as decimal(10,2)) Backed_up_in_GB from summary where start_time>=current_timestamp - 24 hours and activity='BACKUP'
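Converting bytes to GB means dividing by 1024 three times, and where the rounding happens matters. If each summary row is converted to two decimals before summing, a session of a few megabytes becomes 0.00 and drops out of the total. Summing the raw bytes first and rounding once keeps it. The Python sketch below contrasts the two orders; the function names are hypothetical and the data in the example is made up for illustration.

```python
GB = 1024 ** 3  # bytes per GB, matching the /1024/1024/1024 in the query


def backed_up_gb(session_bytes):
    """Sum the raw byte counts first, then convert to GB and round once."""
    return round(sum(session_bytes) / GB, 2)


def backed_up_gb_rounded_per_row(session_bytes):
    """For comparison: round each session to two decimals, then sum."""
    return round(sum(round(b / GB, 2) for b in session_bytes), 2)
```

For 1,000 sessions of 4 MB each, the first function reports 3.91 GB, while the per-row version reports 0.0 GB because every 4 MB session rounds to 0.00 GB.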

Troubleshooting Unix MISSED backups

A four-part strategy:

1) LOGON AND CHECK FOR A SCHEDULER DAEMON
Log on to the TSM Backup/Archive (TSM B/A) client host via SSH. To find the host's IP address, run the following command on the TSM server:
q node xxxx f=d
Look for the TCP/IP Address value.
If the IP address does not show up here, try this command:
q actlog begind=-3 endd=today msgno=0406 search=xxxx
Look for the IP address from which the host is connecting to the TSM server.
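When the detailed node query returns a lot of output, the address can also be pulled out programmatically. This Python sketch uses a hypothetical helper, detail_field. It assumes the "Label: value" layout of FORMAT=DETAILED query output and splits each line at its first colon, so values that themselves contain colons, such as times, stay intact.

```python
def detail_field(detail_output, name):
    """Return the value for `name` in 'Label: value' detailed output, or None."""
    for line in detail_output.splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep and key.strip() == name:
            return value.strip()
    return None
```

For example, detail_field(output, "TCP/IP Address") returns the address string from saved q node ... f=d output.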
Once you are logged on to the host, switch to root:
sudo su -

Find out what type of Unix OS you are dealing with:

uname -a

Find out if the TSM B/A scheduler daemon is running:

Non-Linux UNIX systems:

ps -ef | grep dsm

Linux systems:

ps -ef | grep tsm


You should see something similar to the following output:

Non-Linux UNIX systems:

root 2608 1 0 Aug 13 ? 58:58 /opt/tivoli/tsm/client/ba/bin/dsmc schedule


Linux systems:

You may see multiple daemons returned by the 'ps -ef | grep tsm' command. This is OK, as there should be one 'master' daemon and 4-5 'child' daemons.


2) CHECK FOR A HUNG SCHEDULER DAEMON

View the dsm.sys configuration file to see where the dsmerror.log and dsmsched.log files are being written (on AIX, replace /opt/ with /usr/):

more /opt/tivoli/tsm/client/ba/bin/dsm.sys


Find the SCHEDLOGName entry - this typically points to /var/adm/dsmsched.log

Find the ERRORLOGName entry - this typically points to /var/adm/dsmerror.log


Change to the directory where the two log files are written:

cd /var/adm


Find out when files were last updated:

ls -ltr | grep dsm


Find out the current time of this host:

date


If the dsmerror.log file has a timestamp close to the current host time (within a couple of hours), look at the last few entries to see what is happening:

tail -500 dsmerror.log


Sometimes this shows the TSM B/A client continuously trying, and failing, to establish a connection to the TSM server. If this is the case, the scheduler daemon is probably hung and needs to be killed and restarted.


If no errors indicate that the scheduler daemon is hung, move on to the next check.


Check the last few entries in dsmsched.log:

tail -100 dsmsched.log


If the last few entries indicate that a backup is still running, yet the date/time stamps are old (i.e., not near the current time), the scheduler daemon is probably hung and needs to be killed and restarted.
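The timestamp comparison in this step can be automated. The Python sketch below reads the last timestamped line of dsmsched.log or dsmerror.log and reports whether it is older than a threshold. The two-hour default mirrors the "couple of hours" rule of thumb used in this step. The MM/DD/YY HH:MM:SS format is taken from the sample dsmsched.log output in this post, and the function names are hypothetical. A stale log is only a hint: you still need to read the last entries, because a scheduler idly waiting for its next window may also write nothing for hours.

```python
from datetime import datetime, timedelta

# Assumption: entries start with "MM/DD/YY HH:MM:SS" (17 characters),
# as in the sample dsmsched.log output; other formats need a different pattern.
LOG_TIME_FORMAT = "%m/%d/%y %H:%M:%S"


def last_log_time(lines):
    """Return the timestamp of the last line that starts with one, else None."""
    for line in reversed(lines):
        try:
            return datetime.strptime(line[:17], LOG_TIME_FORMAT)
        except ValueError:
            continue  # blank or untimestamped line
    return None


def log_is_stale(lines, now, max_age=timedelta(hours=2)):
    """True if the newest entry is older than max_age (or there is none)."""
    last = last_log_time(lines)
    return last is None or now - last > max_age
```

Typical use is log_is_stale(open("/var/adm/dsmsched.log").read().splitlines(), datetime.now()), adjusting the path to wherever SCHEDLOGName points.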


3) KILLING/RESTARTING A HUNG SCHEDULER DAEMON

Get the daemon process ID of the TSM B/A scheduler daemon that is running:


Non-Linux UNIX systems:

ps -ef | grep dsm


Linux systems:

ps -ef | grep tsm


You should see something similar to the following output:


Non-Linux UNIX systems:

root 2608 1 0 Aug 13 ? 58:58 /opt/tivoli/tsm/client/ba/bin/dsmc schedule

Linux systems:


You may see multiple daemons returned by the 'ps -ef | grep tsm' command. This is OK, as there should be one 'master' daemon and 4-5 'child' daemons.


Kill the TSM B/A scheduler daemon:

kill -9 2608


In this example, 2608 is the PID taken from the ps -ef output above. Your PID will be different.

Be sure you are killing the correct PID!
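Picking the right PID from ps -ef output is the error-prone part of this step. This Python sketch uses a hypothetical helper, scheduler_pids. It scans ps -ef output for a dsmc process whose next argument starts with "sched" and skips grep processes. It locates the command by token rather than by column, because the STIME field can contain a space (as in "Aug 13"). Treat its result as a cross-check, not a replacement for reading the output yourself.

```python
def scheduler_pids(ps_output):
    """Return PIDs of 'dsmc schedule' (or 'dsmc sched') processes in ps -ef output."""
    pids = []
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        # Basenames of the tokens after UID and PID, e.g. ".../bin/dsmc" -> "dsmc".
        names = [tok.rsplit("/", 1)[-1] for tok in fields[2:]]
        if "grep" in names:
            continue  # the grep command itself, not the daemon
        for i, name in enumerate(names[:-1]):
            if name == "dsmc" and names[i + 1].startswith("sched"):
                pids.append(int(fields[1]))  # PID is the second column
                break
    return pids
```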

Run the ps command again to verify that the daemon automatically restarted itself. You should see a new PID:


Non-Linux UNIX systems:

root 3512 1 0 Aug 13 ? 58:58 /opt/tivoli/tsm/client/ba/bin/dsmc schedule


On Linux, you may see multiple daemons returned by the 'ps -ef | grep tsm' command. This is OK, as there should be one 'master' daemon and 4-5 'child' daemons.


If you see output similar to the above example, verify that the TSM B/A scheduler daemon successfully retrieved its next scheduled event from the TSM server. If the daemon did not restart, start it manually as described in step 4.

tail -20 /var/adm/dsmsched.log

You should see output similar to:

08/17/06 13:41:12 Querying server for next scheduled event.
08/17/06 13:41:12 Node Name: server
08/17/06 13:41:12 Session established with server server: AIX-RS/6000
08/17/06 13:41:12 Server Version 5, Release 2, Level 2.0
08/17/06 13:41:12 Server date/time: 08/17/06 12:24:57 Last access: 08/17/06 12:16:03
08/17/06 13:41:12 --- SCHEDULEREC QUERY BEGIN
08/17/06 13:41:12 --- SCHEDULEREC QUERY END
08/17/06 13:41:12 Next operation scheduled:
08/17/06 13:41:12 ------------------------------------------------------------
08/17/06 13:41:12 Schedule Name: 0000MST
08/17/06 13:41:12 Action: Incremental
08/17/06 13:41:12 Objects:
08/17/06 13:41:12 Options:
08/17/06 13:41:12 Server Window Start: 00:00:00 on 08/18/06
08/17/06 13:41:12 ------------------------------------------------------------
08/17/06 13:41:12 Command will be executed in 11 hours and 36 minutes.
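After a restart, the fields of the "Next operation scheduled" block are what you want to confirm. The Python sketch below pulls them out of dsmsched.log lines. The regular expressions are assumptions based only on the sample output in this step, including the "N hours and M minutes" wording, and next_schedule is a hypothetical name.

```python
import re

# Assumed layout, taken from the sample output: "MM/DD/YY HH:MM:SS Label: value".
FIELD = re.compile(
    r"^\d\d/\d\d/\d\d \d\d:\d\d:\d\d\s+"
    r"(Schedule Name|Action|Objects|Options|Server Window Start):\s*(.*)$"
)
DELAY = re.compile(r"Command will be executed in (\d+) hours? and (\d+) minutes?")


def next_schedule(lines):
    """Collect the next-schedule fields and the wait time (in minutes)."""
    info = {}
    for line in lines:
        field = FIELD.match(line)
        if field:
            info[field.group(1)] = field.group(2).strip()
        delay = DELAY.search(line)
        if delay:
            info["minutes_until_run"] = int(delay.group(1)) * 60 + int(delay.group(2))
    return info
```

Run against the sample above, it reports schedule 0000MST, action Incremental, and a wait of 696 minutes.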



4) MANUALLY STARTING A SCHEDULER DAEMON

Ensure the TSM B/A client can communicate with the TSM server:

dsmc query sched

If the command completes successfully and returns no errors, the TSM B/A client can communicate with the TSM server. Proceed to starting the TSM B/A scheduler daemon.


Start the TSM B/A client scheduler daemon (on AIX, replace /opt/ with /usr/). Running it under nohup keeps it from being stopped by a hangup signal when you log out:

nohup /opt/tivoli/tsm/client/ba/bin/dsmc schedule >/dev/null 2>&1 &

Ensure the TSM B/A client scheduler daemon is running:

Non-Linux UNIX systems:

ps -ef | grep -v grep | grep dsm


Linux systems:

ps -ef | grep -v grep | grep tsm


You should see output similar to:

root 4561 1 0 Aug 13 ? 58:58 /opt/tivoli/tsm/client/ba/bin/dsmc schedule

Tuesday, June 14, 2011

Other Cool stuff...

Total client data stored (TB)
tsm: SERVER1> SELECT CAST(FLOAT(SUM(logical_mb)) / 1024 / 1024 AS DEC(8,2)) FROM occupancy
Unnamed[1]
----------
73.04
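The query converts megabytes to terabytes by dividing by 1024 twice. This small Python sketch reproduces that arithmetic. It also shows why converting to floating point before dividing, as the FLOAT() call does, matters when the inputs are integers: integer division drops the fractional terabytes. The function names are hypothetical.

```python
def total_tb(logical_mb_values):
    """Sum occupancy in MB and convert to TB (1 TB = 1024 * 1024 MB), two decimals."""
    return round(float(sum(logical_mb_values)) / 1024 / 1024, 2)


def total_tb_integer_division(logical_mb_values):
    """Integer-only arithmetic, for comparison: the fractional TB is truncated."""
    return sum(logical_mb_values) // 1024 // 1024
```

For 1.5 TB of data (1,572,864 MB), the first function returns 1.5 and the integer version returns 1.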
Some TSM Server information
tsm: SERVER1> SELECT server_name, platform, -
VARCHAR(version)||'.'||VARCHAR(release)||'.'||VARCHAR(level)||'-'||VARCHAR(sublevel), -
server_hla, server_lla, server_url, logmode, crossdefine, licensecompliance FROM status
SERVER_NAME: TSM-SERVER1
PLATFORM: AIX-RS/6000
Unnamed[3]: 5.3.3-2
SERVER_HLA: 10.10.10.5
SERVER_LLA: 1500
SERVER_URL:
LOGMODE: NORMAL
CROSSDEFINE: ON
LICENSECOMPLIANCE: VALID
SQL Table Catalog
tsm: SERVER1>SELECT tabschema,tabname,remarks FROM tables
TABSCHEMA  TABNAME             REMARKS
---------  ------------------  ------------------------------------
ADSM       ACTLOG              Server activity log
ADSM       ADMINS              Server administrators
ADSM       ADMIN_SCHEDULES     Administrative command schedules
ADSM       ARCHIVES            Client archive files
ADSM       AR_COPYGROUPS       Management class archive copy groups
ADSM       ASSOCIATIONS        Client schedule associations
ADSM       AUDITOCC            Server audit occupancy results
ADSM       BACKUPS             Client backup files
ADSM       BACKUPSETS          Backup Set
ADSM       BU_COPYGROUPS       Management class backup copy
...


DRM Info....

DRM
Information about DRM volumes
tsm: SERVER1> SELECT drmedia.volume_name, volumes.stgpool_name, drmedia.state, drmedia.voltype, volumes.status, -
volumes.pct_utilized FROM drmedia, volumes WHERE drmedia.volume_name=volumes.volume_name ORDER BY drmedia.state
VOLUME_NAME         STGPOOL_NAME        STATE               VOLTYPE       STATUS              PCT_UTILIZED
------------------  ------------------  ------------------  ------------  ------------------  ------------
tape06              OFFSITE             COURIERRETRIEVE     CopyStgPool   EMPTY               0.0
tape18              OFFSITE             VAULT               CopyStgPool   FILLING             50.6
tape38              OFFSITE             VAULT               CopyStgPool   FILLING             80.9
tape79              OFFSITE             VAULT               CopyStgPool   FILLING             91.0
...
Information about DRM volumes in the library
tsm: SERVER1> SELECT drmedia.volume_name, drmedia.state, drmedia.voltype FROM drmedia, libvolumes WHERE -
drmedia.volume_name=libvolumes.volume_name ORDER BY voltype
VOLUME_NAME         STATE               VOLTYPE
------------------  ------------------  ------------
tape48              MOUNTABLE           CopyStgPool
tape59              MOUNTABLE           CopyStgPool
...
Information about DRM volumes in the library (another way)
tsm: SERVER1> SELECT volume_name, state, voltype FROM drmedia WHERE -
volume_name IN ( SELECT volume_name FROM libvolumes ) ORDER BY voltype
VOLUME_NAME         STATE               VOLTYPE
------------------  ------------------  ------------
tape48              MOUNTABLE           CopyStgPool
tape59              MOUNTABLE           CopyStgPool
...
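The join and the IN subquery in the two library queries should return the same rows as long as each volume name appears only once in libvolumes, since a join repeats a row for every duplicate match. The Python sketch below checks this with the standard sqlite3 module on cut-down stand-in tables holding only the columns the queries use. It runs on SQLite, not the TSM server's SQL engine. The sample rows are borrowed from the output in this post, and drm_in_library and demo_connection are hypothetical names. A secondary sort on volume_name is added so the row order is deterministic.

```python
import sqlite3


def drm_in_library(conn, use_join):
    """Run either the join form or the IN-subquery form of the library query."""
    if use_join:
        sql = ("SELECT drmedia.volume_name, drmedia.state, drmedia.voltype "
               "FROM drmedia, libvolumes "
               "WHERE drmedia.volume_name = libvolumes.volume_name "
               "ORDER BY drmedia.voltype, drmedia.volume_name")
    else:
        sql = ("SELECT volume_name, state, voltype FROM drmedia "
               "WHERE volume_name IN (SELECT volume_name FROM libvolumes) "
               "ORDER BY voltype, volume_name")
    return conn.execute(sql).fetchall()


def demo_connection():
    """In-memory stand-ins for the drmedia and libvolumes tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE drmedia (volume_name TEXT, state TEXT, voltype TEXT)")
    conn.execute("CREATE TABLE libvolumes (volume_name TEXT)")
    conn.executemany("INSERT INTO drmedia VALUES (?, ?, ?)", [
        ("tape48", "MOUNTABLE", "CopyStgPool"),
        ("tape59", "MOUNTABLE", "CopyStgPool"),
        ("tape18", "VAULT", "CopyStgPool"),  # offsite, so not in the library
    ])
    conn.executemany("INSERT INTO libvolumes VALUES (?)", [("tape48",), ("tape59",)])
    return conn
```

Both forms return tape48 and tape59 for this data. If a duplicate tape48 row is added to libvolumes, the join returns three rows while the IN form still returns two.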
Information about DRM volumes in the library with a state other than "MOUNTABLE"
tsm: SERVER1> SELECT drmedia.volume_name, drmedia.state, drmedia.voltype FROM drmedia, libvolumes WHERE -
drmedia.volume_name=libvolumes.volume_name AND drmedia.state<>'MOUNTABLE'
VOLUME_NAME         STATE               VOLTYPE
------------------  ------------------  ------------
tape36              COURIER             CopyStgPool
tape82              COURIER             CopyStgPool
...
DRM volumes with TSM database backups
tsm: SERVER1> SELECT volume_name, state, upd_date, location, voltype FROM drmedia -
WHERE voltype='DBBackup' OR voltype='DBSnapshot'
VOLUME_NAME         STATE               UPD_DATE            LOCATION            VOLTYPE
------------------  ------------------  ------------------  ------------------  ------------
tape10              VAULT               2008-03-05          Iron Mountain       DBBackup
                                        11:00:00.000000
tape15              VAULT               2008-03-04          Iron Mountain       DBBackup
                                        11:00:00.000000
tape45              VAULT               2008-03-03          Iron Mountain       DBBackup
...