Recovery Plan (example)

> I am very interested in rolling a database forward at a remote site, to
> recover as much of the database as possible.
> 
> Can you tell me in more detail how you do it? Do you take a checkpoint
> once a day with journaling turned on, then copy the journal files to the
> remote node during the day, so that you can roll the database forward on
> the other machine if anything happens?
> 
> I have been thinking about doing this recently but have not figured out
> a really good way to do it. Would you mind sending me a mail with some
> information to help? I have talked to CA about a duplicate server.
> We run OpenIngres 1.2/01 on OpenVMS, but CA said their replicator
> server uses an event log that will generate a lot of locking issues,
> which does not sound like a good idea.
> 
> Thanks . . . .
> 
> tyyang


    To recover a database to a remote host, using a checkpoint taken on your
    current host, is actually quite straightforward. However, it is totally
    unsupported by Ingres.

    The main problem to overcome is that to use a checkpoint/dump and journals
    to recover your database you need a configuration file associated with
    that database that correctly states the checkpoint requirements. Remember
    that a configuration file contains many hardwired paths, eg the checkpoint,
    journal, dump and database area(s) for the database. Furthermore, it
    also contains pointers to your current host's log file that will have no
    meaning to your remote host. This last point is worth remembering, as an
    attempt to use this configuration file may screw the log on your remote
    host! But we'll go into that later.
    
    Note that you CANNOT edit a configuration file. Hence it is necessary to
    fool its hardwired paths using symbolic links! I can do this on UNIX, but
    as you are using VMS I'm not sure what the equivalent is. Something like a
    logical name comes to mind, but I'm no expert in VMS.

    Using a symbolic link is not too difficult. However, experience has shown
    that you are better off putting your symbolic links at the highest level in
    the path, not the lowest. This can become critical when attempting to
    recover several databases from several hosts onto a single host.

    A real example to clarify the matter.
    I wish to recover a database from my host avalon to my host bucket. The
    critical paths on avalon are all prefixed /avalon/prod. The equivalent
    paths on bucket are all prefixed /bucket/drp. As a complication, there are
    /bucket/prod directories used by a separate installation that I don't wish
    to get confused with.
    Hence on bucket I make the directory /avalon (with suitable permissions).
    In this directory I place a symbolic link prod -> /bucket/drp.
    Now any reference to say /avalon/prod/ingres/ingres/data/default/merlin
    will actually traverse the path
    /bucket/drp/ingres/ingres/data/default/merlin.
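
    On UNIX the setup itself is only a couple of commands. A minimal sketch,
    assuming the avalon/bucket paths above (ownership and permissions are
    illustrative, adjust them to your site):

        # On bucket, as a user allowed to create /avalon:
        mkdir /avalon
        chown ingres /avalon
        # Point avalon's path prefix at bucket's DRP area:
        ln -s /bucket/drp /avalon/prod
        # Any reference to /avalon/prod/... now resolves under /bucket/drp/...
        ls -l /avalon/prod        # shows the link pointing at /bucket/drp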

    Having set up the symbolic links once, they can be kept in place. After all,
    why reinvent the wheel?

    Now, the fun bit! On your remote host:

    1. Create a database named after the database you wish to recover. It must
    	be owned by the same username.
    2. Spread the database to all extended locations that the lost database
    	used.
    3. Checkpoint this database, turning journaling on. Turning journaling on
    	creates all the journal and dump locations for you (a command sketch
    	follows this list).
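
    A minimal sketch of those three steps, assuming the lost database is called
    merlin and is owned by dbowner (the names and the -u usage are illustrative):

        createdb -udbowner merlin        # 1. same name, same owner
        # 2. extend merlin to every location the lost database used,
        #    via accessdb (or extenddb if your release has it)
        ckpdb +j -udbowner merlin        # 3. checkpoint with journaling on;
                                         #    this creates the journal/dump areas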

    NOTE: To do this you must make sure that at every checkpoint on your
    production host you prepare a file (we call it a Basics file) with enough
    detail to be able to do this. Having created the Basics file, copy it to
    your remote host. When you need it, it may as well be there waiting for you!

    I suggest the following should be included in your Basics file (a script
    sketch follows the list):

    1. from iidbdb taking a snapshot of the tables:
	    select * from iiuser;
	    select * from iilocations;
	    select * from iidbaccess;
	    select * from iiusergroup;
	    select * from iiextend;
	    select * from iidatabase;
	    select * from iirole;
    2. From each of your important databases take:
	    select * from iifile_info;
	    the copy.in prepared by copydb -c dbname
    3. As you're running OI I also suggest putting your config.dat into the
    	Basics file.
    4. There may be many other site-dependent files that you will need on
    	your remote host to get your recovered database back to the point where
    	your users can really use it.
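
    As a sketch only, a script along the following lines could build the Basics
    file after each checkpoint. The database names, paths and the use of the
    'sql' terminal monitor and copydb -c are illustrative; adapt them to your
    site (and to DCL on VMS):

        #!/bin/sh
        # Hypothetical basics.sh - gather recovery details after a checkpoint.
        # Assumes II_SYSTEM is set in the environment.
        BASICS=/tmp/basics.`date +%Y%m%d`
        mkdir $BASICS

        # 1. Snapshot the iidbdb catalogs
        for t in iiuser iilocations iidbaccess iiusergroup iiextend iidatabase iirole
        do
            printf '%s\n' "select * from $t\g" | sql iidbdb >> $BASICS/iidbdb.lis
        done

        # 2. Per-database details (list your important databases here)
        for db in merlin
        do
            printf '%s\n' "select * from iifile_info\g" | sql $db > $BASICS/$db.files.lis
            mkdir $BASICS/$db.copydb
            ( cd $BASICS/$db.copydb && copydb -c $db )   # leaves copy.in/copy.out here
        done

        # 3. The installation configuration file (OpenIngres and later)
        cp $II_SYSTEM/ingres/files/config.dat $BASICS

        # Finally ship it to the remote host, eg:
        # rcp -r $BASICS bucket:/bucket/drp/basics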

    You now have an empty database of the same name/owner as the lost
    database. We now just have to get the data back into it. To do this:

    1. Take a copy of the configuration files of the newly created database.
    	These can be very handy. Remember there are two: one in the database
    	area and one in the dump area of the database.
    2. Take the configuration file of the lost database and copy it into the
    	database area and dump area of the newly created database (sketched
    	below).
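
    A sketch of those two steps, following the avalon/bucket example and
    assuming the usual aaaaaaaa.cnf name for the configuration file. The
    recovery staging path is illustrative; I describe where I actually stage
    these files further below:

        DATA=/avalon/prod/ingres/ingres/data/default/merlin
        DUMP=/avalon/prod/ingres/ingres/dmp/default/merlin
        # 1. keep the new database's own configuration files
        cp $DATA/aaaaaaaa.cnf $DATA/aaaaaaaa.cnf.new
        cp $DUMP/aaaaaaaa.cnf $DUMP/aaaaaaaa.cnf.new
        # 2. drop in the lost database's configuration file shipped from avalon
        cp /bucket/drp/ingres/ingres/data/recovery/avalon/merlin/aaaaaaaa.cnf $DATA
        cp /bucket/drp/ingres/ingres/data/recovery/avalon/merlin/aaaaaaaa.cnf $DUMP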

    If you now perform an infodb -udbowner dbname you should see that the
    database is described as if it were from the lost host! Magic.

    We now simply have to copy all the journals, dumps and saved configuration
    files from the lost host into the appropriate areas of the new database.

    If you are using a disk checkpoint then copy this in too.

    NOTE: You are correct in surmising that the journals get copied to my
    remote host. Ditto the configuration file, and the dump area comprising
    both the dump files themselves and the saved configuration files.
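
    For example, following the staging layout I describe below (all paths are
    illustrative):

        JNL=/avalon/prod/ingres/ingres/jnl/default/merlin
        DMP=/avalon/prod/ingres/ingres/dmp/default/merlin
        CKP=/avalon/prod/ingres/ingres/ckp/default/merlin
        cp /bucket/drp/ingres/ingres/jnl/recovery/avalon/merlin/* $JNL
        cp /bucket/drp/ingres/ingres/dmp/recovery/avalon/merlin/* $DMP
        cp /bucket/drp/ingres/ingres/ckp/recovery/avalon/merlin/* $CKP   # disk checkpoint, if used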

    Now CHECK what you have got. You should use the hardcoded paths in the
    infodb output (ie the lost host paths that should be symlinked to their new
    host equivalents) to check that all data locations exist, and that only the
    default location has any data files in it (these are your system
    catalogs). Furthermore, check that all required journals for the recovery
    from the checkpoint exist and are uncompressed. Ditto the dumps and saved
    configuration files.
    
    Now you can do your rollforwarddb.
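
    For example (on the releases I have used, +c restores from the checkpoint
    and +j applies the journals, and both are the defaults; check your own
    release):

        rollforwarddb +c +j -udbowner merlin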

    Remember I said that the configuration file contains pointers to the log
    files that may cause grief. This is where you're going to find out whether
    they will or not. I have found that nine times out of ten there is no
    problem. However, every now and then the log file corrupts hopelessly, at
    which point you will have to do an rcpconfig -force_init. HENCE, DURING THIS
    RECOVERY MAKE DAMN SURE NO ONE ELSE IS USING THE INSTALLATION. SET SERVER
    CLOSED is a damn useful option!

    Hence I suggest you practice on an installation you are quite prepared to
    lose for an extended period while you reformat the log file.

    Normally, all is okay and your database will now simply rollforward and
    recover quite happily.

    This simply recovers your database. It ain't necessarily usable! What about
    your user access rights, application code, any specific files your users
    require to access the system, any daemon processes that need to be running,
    etc? A lot of this is site-dependent and I'll let you fill in the gaps.

    BUT, the critical thing is to test it! Get used to the procedure so that
    it becomes second nature to you. Try a real test, ie recover a production
    database to some host and then ask your users to test that database and
    see if they can access it and use it as if it were on the real host.

    The GOLDEN RULE in Disaster Recovery Planning is if you need to access
    information from your lost host you FAIL!


    One thing I forgot to mention was that the process that transfers
    your journals, dumps, and current and saved configuration files from your
    production host(s) to your backup host should be run from cron (or
    whatever the VMS equivalent is) at a frequency that gives acceptable data
    loss. I run my transfer script at 15 minute intervals. A further
    complication is that your Archiver process must be firing up more
    regularly than this interval. I set my Archiver to wake up at 1% of the log
    file. This normally gives me pretty good journalling.
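
    As an illustration, the crontab entry on my backup host looks something
    like this (the script name and log file are hypothetical):

        # run the DRP transfer every 15 minutes
        0,15,30,45 * * * * /usr/local/ingres/bin/drp_transfer >> /tmp/drp_transfer.log 2>&1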

    The frequency you choose is a site-dependent issue. My people were happy
    with a 15 minute data loss. I could go more frequent, as the transfer
    process normally takes very little time to run.
    
    Furthermore, the script should execute on your backup host and
    fire off remote jobs on your production host to accomplish its job. I also
    have a script that monitors the progress of this transfer script and
    will send me warnings if the journals or current configuration file for
    any database are over two hours old.

    I suggest that you transfer the data to equivalent areas on your backup
    host. This is for two reasons. One is that these areas will probably be
    set up with sufficient disk space to handle the job. The second is that it
    makes sense to store like with like, as you'll probably remember where to
    find it and it's easier to explain to others.
    ie Journals --> II_JOURNAL/recovery/hostname/dbname
       Dumps/Saved Configuration files --> II_DUMP/recovery/hostname/dbname
       Current Config File --> II_DATABASE/recovery/hostname/dbname

    eg On my Disaster Recovery Host (DRH) named bucket I store data from my
    three production hosts (avalon, electra and cascade). Hence the Journal
    area on bucket has directories:
        II_JOURNAL/recovery/avalon
        II_JOURNAL/recovery/electra
        II_JOURNAL/recovery/cascade
    Under each of these is a separate directory for each of my production
    databases.
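
    Setting the staging areas up is a one-off job, something like the following
    (II_JOURNAL etc here stand for the actual area paths, eg
    /bucket/drp/ingres/ingres/jnl; the database names are illustrative):

        for host in avalon electra cascade
        do
            for db in merlin
            do
                mkdir -p $II_JOURNAL/recovery/$host/$db
                mkdir -p $II_DUMP/recovery/$host/$db
                mkdir -p $II_DATABASE/recovery/$host/$db
            done
        done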


    PS. One other thing. It is essential that your Disaster Recovery Plan (DRP)
    be documented, and that the document be stored offsite. It should be written
    in such a manner that any Ingres DBA could walk into the smouldering
    remains of your office, sit down and commence the recovery of your
    databases. I don't know if I've achieved that last bit yet, but I am
    working on it!

    Martin Bowes.


Hi Tyyang,
> 
> 1. Where did you set up your archiver to wake up every 15 minutes? Is it in
>    the configuration files ? 
> 

    You set the archiver wake-up by percentage of log file, not by time interval.
    This is a configurable parameter in your logging system. It defaults to 5%.
    To see its current value, run 'showrcp' if you use 6.4, or go into CBF
    Logging Parameters and check the appropriate value. If you wish to change
    it, then you really have to shut down the installation and then reconfigure
    your logging system.

    You can get a feel for what timeframe X% of the log file translates to by
    monitoring the II_ACP.LOG file.

> 2. How do you decide which journal files to copy to the remote machine
>    every 15 minutes? Can it happen that a journal file is being written by
>    the archiver at that point? Or do you just copy all the journal files
>    every 15 minutes, regardless of whether they have been copied before?

    My process first checks its current set of journals/configuration files,
    determines what the last successful transfer time was, and then
    accesses the production hosts looking for anything that has changed since
    then.

    Since I wrote this script a few years ago, some kind soul beat me around the
    ears and told me of this wonderful utility called 'rdist' which would do
    the job a lot more easily!!! I haven't had a chance to recode it yet. I
    don't know of a VMS equivalent to rdist.
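
    If you are on UNIX, a distfile along these lines might do the journal
    transfer. This is a hypothetical sketch only; note that rdist pushes from
    the production host, the opposite direction to my pull script:

        # /usr/local/ingres/drp.dist - run on avalon as: rdist -f /usr/local/ingres/drp.dist
        HOSTS = ( bucket )
        JNLS  = ( /avalon/prod/ingres/ingres/jnl/default/merlin )
        ${JNLS} -> ${HOSTS}
                install /bucket/drp/ingres/ingres/jnl/recovery/avalon/merlin ;
        # similar stanzas for the dump area and the current configuration file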

    Journal files are not flocked by the archiver and may be examined at will.
    If the archiver happens to be busy writing to one end of a journal, this is
    not a problem, as whatever you miss this time will be picked up next time.
    If the machine goes down in the meantime, at least the journal you've got is
    still usable and heaps better than no journal at all.

> 
> 3. Do you do an on-line checkpoint ($ckpdb dbname) or an off-line checkpoint
>    ($ckpdb -l dbname) every night? Journaling is turned on for a database
>    with $ckpdb +j dbname, which does an off-line checkpoint and starts the
>    journal file. After that, do I need to run $ckpdb +j dbname every night,
>    or just $ckpdb dbname?
> 

    All my checkpoints are online. The only time I do an offline checkpoint is
    when a dire emergency forces me to do so!

    Having turned journaling on with ckpdb +j there is no need to do it again.
    All subsequent checkpoints may be with just ckpdb dbname. After all,
    turning journalling off/on requires an offline checkpoint, and the loss of
    access to the database for your users may piss them off!
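
    In other words (with -udbowner naming the database owner, and merlin as an
    illustrative database name):

        ckpdb +j -udbowner merlin        # once only: turns journaling on
        ckpdb -udbowner merlin           # thereafter: nightly online checkpoint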

    One further thing you could consider for your checkpoints is checkpointing
    directly to your backup host from your production host. This can be very
    useful and save considerable time, both by giving you instant access to the
    checkpoint file(s) on your backup host and by removing the chance of not
    correctly positioning a tape during the recovery phase.

    To do this I export a disk from my backup host and mount it on my
    production hosts. I then replace the checkpoint directory of a production
    database with a symbolic link to an appropriately named directory on this
    disk.
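
    A sketch of that setup on UNIX (the mount details, paths and names are
    illustrative and very site specific):

        # On avalon: mount the disk exported by bucket
        mount bucket:/bucket/drp/ckp /drp_ckp
        # Replace merlin's checkpoint directory with a link onto that disk
        cd /avalon/prod/ingres/ingres/ckp/default
        mv merlin merlin.local            # keep the old directory just in case
        ln -s /drp_ckp/avalon/merlin merlin
        # ckpdb now writes merlin's checkpoint files straight onto bucket's disk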

    I still use tape checkpoints of all my production databases; furthermore,
    these tape checkpoints occur at a much higher frequency than my disk
    checkpoints. But the disk checkpoints are always my first resort when
    attempting to recover a database. The extra time required to process more
    journals from an older checkpoint is not enough to worry me.

    Martin Bowes



Hi Con,

    My ex-manager has actually met people whose recovery strategy is based on
    the statement 'Oh, we do a checkpoint about once every fortnight, isn't
    that enough?'.

    Ay caramba, someone's dogs are barking tonight!

    I absolutely agree with both points you have raised, namely:
    1. Having completed your checkpoint to tape, you MUST back up the II_DUMP,
       II_JOURNAL and II_SYSTEM areas to tape, preferably the same tape! Other
       areas could also be backed up, eg the home areas of all users,
       particularly the ingres user. At my site we do a full system dump of
       everything except II_DATABASE after the checkpoint.
    2. The first part of testing a new release should be an attempt to
       recover a database. Any decent database for such a test must be
       multilocation, have many tables of all structures, some of which
       should be multilocation and some of which should have secondary
       indices. I like to test that I can recover the correct number of
       tables, rows in the tables, users, grants, usergroups, roles, indices
       etc. A good test should also include the recovery of dumping and
       journaling activity.

    I think we should all remember the famous multilocation BTREE recovery
    failure of a few years ago. Ouch!

    Martin Bowes