> From: Roger Hill
> Subject: Re: Replicating without replicator
>
> On Wed, 3 Mar 1999, Mark Crowder wrote:
>
> > The powers that be in my organisation have decreed that they want to
> > replicate an OI 1.2 database between DEC Unix machines. Some of the
> > table names are too long, and the chance of getting these changed is
> > virtually nil. I have a recollection of other people doing clever
> > things to achieve similar effects on systems, e.g. copying journal
> > files. Does anyone have any suggestions?
>
> Mark, what I am doing at the moment is something like this:
>
> Run a checkpoint.
> Back up the dump, journal, and checkpoint areas (I actually just
> create a tar file on disk).
> FTP the tar file to machine #2, which has an IDENTICAL structure in
> terms of directories, locations etc. for Ingres.
> Delete all files from the dump, journal and checkpoint areas on
> machine #2, after taking Ingres down.
> Untar, bring up Ingres.
> Rollforward each DB.
>
> This gives me a hot backup, albeit up to about 24 hours old, that I
> can use for recovery if the main machine crashes. I use it for running
> test reports, and for development as well.
>
> Works for me.
>
> (Thanks to Mike Leo for the original suggestion.)
>
> Roger
> --
> Roger Hill, McEnearney Alstons (Barbados) Ltd    E-mail: rhill@mcalbds.com
> Tel: 246-426-5764/230-9596    Fax: 246-228-5796

Hi,

I use this method in my Disaster Recovery Plan and have done so for some
considerable time, with great success. In short, I can recover any of our
administrative databases, from a checkpoint taken on any of their hosts, to a
single site. Databases here are of the order of 5 GB; a recovery typically
takes about three hours.

It is very easy to automate the transfer of data from your production hosts
to their designated Disaster Recovery Host. I run a transfer of journals,
dumps and configuration files from each database of interest on a 15 minute
cycle, although at special times of year I increase this to a 5 minute cycle.
I also have the checkpoints of these databases going, via a symlink, to an
NFS-mounted disk exported from the Disaster Recovery Host. Hence, at all
times, my Disaster Recovery Host has access to the most recent disk
checkpoint (tape checkpoints will do just as well), and, to within 15
minutes, the current configuration file, journals, dumps and saved
configuration files (c*.dmp).

Even this alone is not quite enough to guarantee a recovery. You also need to
know the details of each database you wish to recover: for instance the
database owner, what locations it uses, who is permitted to access the
database, and so on. It is to help in this area that I prepare what I refer
to as a 'Basics File'. It simply contains basic details of the databases and
installation, and is simply the result of the following queries run against
the production host's iidbdb:

    select * from iiuser;
    select * from iilocations;
    select * from iidbaccess;
    select * from iidbpriv;
    select * from iiusergroup;
    select * from iiextend;
    select * from iidatabase;
    select * from iirole;

For completeness, I also include the full iifile_info listing from each
database in the installation, as well as the copy.in (from unloaddb) for each
database. Also throw in the config.dat, symbol.tbl, termcaps and maps (it
won't hurt).
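Purely as an illustration, a minimal sketch of how such a Basics File might
be gathered on a UNIX production host follows. The script name, output path
and database names are assumptions, not part of the procedure described
above, and II_SYSTEM is assumed to be set in the ingres user's environment:

    #!/bin/sh
    # make_basics.sh -- hypothetical sketch: gather the 'Basics File' details
    # described above into one file for shipping to the Disaster Recovery Host.
    BASICS=/tmp/basics.`hostname`.`date +%Y%m%d`     # illustrative location

    # Snapshot of the iidbdb catalogues, via the SQL terminal monitor.
    printf '%s\n' '
    select * from iiuser;      select * from iilocations;
    select * from iidbaccess;  select * from iidbpriv;
    select * from iiusergroup; select * from iiextend;
    select * from iidatabase;  select * from iirole;
    \g' | sql iidbdb > $BASICS

    # Per-database details: the file/table mapping and a reload script.
    for db in merlin payroll                      # illustrative database names
    do
        printf '%s\n' 'select * from iifile_info; \g' | sql $db >> $BASICS
        copydb -c $db                             # leaves copy.in in the cwd
        cat copy.in >> $BASICS
    done

    # Installation files that will not hurt to keep copies of alongside.
    cp $II_SYSTEM/ingres/files/config.dat `dirname $BASICS`/
    cp $II_SYSTEM/ingres/files/symbol.tbl `dirname $BASICS`/

The same script can then hand the results to whatever mechanism transfers the
checkpoint, as described next.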
The Basics File is prepared at every checkpoint and transferred automatically
by the checkpoint script to the Disaster Recovery Host.

You do have to be careful with the values of II_DATABASE, II_CHECKPOINT,
II_JOURNAL, II_DUMP and II_WORK on each of the separate hosts. If you have a
semi-decent naming strategy you will find that symlinks can be installed so
that any path on your admin host can be appropriately replicated on your
Disaster Recovery Host.

Finally, to recover the database:

1. Use the information on database ownership, default locations and access
   type from iidatabase in the Basics File to create the database on the DRH.
2. Checkpoint this database, turning journaling on (this creates the default
   journal and dump directories).
3. Save the new database's config files (both the dump and database copies).
   These will be VERY useful if something really screws up.
4. Copy the config file transferred from the production database into the
   database and dump areas of the new database. You should now be able to
   issue the command 'infodb dbname' on the Disaster Recovery Host and see an
   infodb listing that is taken purely from the production host, i.e.
   production host path names and the checkpoint, journal and dump history
   from the production database.
5. Now copy in the appropriate journals for your recovery from the storage
   area on the Disaster Recovery Host; ditto the dumps and saved config files.
6. If using a disk checkpoint, simply copy the checkpoint tar files to the
   appropriate directory. If using a tape, simply mount and position the tape.
7. Check that everything is in the correct spot, i.e. use the infodb listing
   of paths to confirm you have your symlinks right! Remember we are using
   symlinks to fool the config file. If these aren't right, the recovery will
   spew bile at you!
8. Roll forward from the checkpoint exactly as you would normally.

This method will work on 6.4 and OI. It is actually better on OI, which has a
fixed alterdb command that can be useful. I say that because of the
shortcoming in this method: remember that we are using the config file from
another installation. It contains many pointers to the log file of its own
installation. These have no relevance on this installation and MAY cause the
installation to believe it has a corrupted log file. That isn't fun. To
guarantee that this won't occur:

  On OI:  alterdb -disable_journaling dbname, followed by ckpdb +j dbname.
  On 6.4: you have to scrub the log file with an iistartup -init.
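To make the sequence concrete, here is a rough, hypothetical sketch of steps
1 to 8 as shell commands for a single database. The database name, owner and
staging directory are illustrative assumptions, the standard Ingres directory
layout under the II_* areas is assumed (set in the environment, e.g. via
ingprenv), and disk rather than tape checkpoints are assumed:

    #!/bin/sh
    # Hypothetical recovery of database "merlin" on the DRH.  Run as the
    # production database's owner.  RECV is an illustrative staging area
    # holding the files transferred from the production host.
    DB=merlin
    RECV=/bucket/drp/recovery/avalon/$DB

    # 1-2. Create the database and checkpoint it with journaling on.
    createdb $DB
    ckpdb +j $DB

    # 3. Save the newly created config files (database and dump copies).
    cp $II_DATABASE/ingres/data/default/$DB/aaaaaaaa.cnf /tmp/$DB.cnf.data
    cp $II_DUMP/ingres/dmp/default/$DB/aaaaaaaa.cnf      /tmp/$DB.cnf.dmp

    # 4. Drop in the config file copied across from the production host.
    cp $RECV/aaaaaaaa.cnf $II_DATABASE/ingres/data/default/$DB/
    cp $RECV/aaaaaaaa.cnf $II_DUMP/ingres/dmp/default/$DB/
    infodb $DB    # should now show the production host's paths and history

    # 5-6. Journals, dumps, saved config files and checkpoint files into place.
    cp $RECV/jnl/* $II_JOURNAL/ingres/jnl/default/$DB/
    cp $RECV/dmp/* $II_DUMP/ingres/dmp/default/$DB/
    cp $RECV/ckp/* $II_CHECKPOINT/ingres/ckp/default/$DB/

    # 7. Check the symlinks against the paths infodb reports before going on.

    # 8. Roll forward as normal.
    rollforwarddb $DB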
Martin Bowes

PS. Of course, recovering a database is one thing. Now we have to recover the
user details, allowing for some users having access to multiple hosts and
being in wildly different UNIX groups. You have to be particularly careful
here, as you must construct a methodology that will allow you to restore the
users back from the Disaster Recovery Host to the original hosts without
scrambling their files. They get really pissed off if you don't get this
right!

PPS. Don't forget printers as well.

PPPS. The obvious expansion from this method is clustering. I'm currently
investigating this option and I must say it looks very promising.


> I am very interested in rolling forward a database at a remote site to
> recover as much of the database as possible.
>
> Can you tell me in more detail how you did it? Do you do a checkpoint
> once a day with journaling turned on, then copy the journal files to
> the remote node during the day, so that you can roll the database
> forward on the other machine if anything happens?
>
> I have been thinking about doing this recently, but have not figured
> out a really good way to do it. Would you mind sending me a mail with
> some information to help? I have talked to CA about the replicator
> server. We run OpenIngres 1.2/01 on OpenVMS, but CA said that the
> replicator server's use of event logs will generate a lot of locking
> issues. It did not sound like a good idea.
>
> Thanks . . . .
>
> tyyang

Hi tyyang,

Recovering a database to a remote host from a checkpoint taken on your
current host is actually quite straightforward. However, it is totally
unsupported by Ingres.

The main problem to overcome is that, to use a checkpoint/dump and journals
to recover your database, you need a configuration file associated with that
database that correctly states the checkpoint requirements. Remember that a
configuration file contains many hardwired paths, e.g. the checkpoint,
journal, dump and database area(s) for the database. Furthermore, it also
contains pointers to your current host's log file that will have no meaning
to your remote host. This last point is worth remembering, as the attempt to
use this configuration file may screw the log on your remote host! But we'll
go into that later.

Note that you CANNOT edit a configuration file. Hence it is necessary to fool
it using symbolic links! I can do this on UNIX, but as you are using VMS I'm
not sure what the equivalent is. Something like a logical name comes to mind,
but I'm no expert in VMS.

Using a symbolic link is not too difficult. However, experience has shown
that you are better off putting your symbolic links at the highest level in
the path, not the lowest. This can become critical when attempting to recover
several databases from several hosts onto a single host.

A real example to clarify the matter. I wish to recover a database from my
host avalon to my host bucket. The critical paths on avalon are all prefixed
/avalon/prod. The equivalent paths on bucket are all prefixed /bucket/drp. As
a complication, there are /bucket/prod directories used by a separate
installation that I don't wish to get confused with. Hence, on bucket I make
the directory /avalon. In this directory I place a symbolic link
prod -> /bucket/drp. Now any reference to, say,
/avalon/prod/ingres/ingres/data/default/merlin will actually traverse the
path /bucket/drp/ingres/ingres/data/default/merlin. Having set up the
symbolic links once, they can be kept in place. After all, why reinvent the
wheel?
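In shell terms, the arrangement just described amounts to something like the
following, run once on bucket (the paths are the ones from the example above;
adjust to your own naming):

    # On bucket: recreate avalon's path prefix once, then leave it in place.
    mkdir /avalon
    ln -s /bucket/drp /avalon/prod

    # Sanity check: a production-host path should now resolve on bucket.
    ls -ld /avalon/prod/ingres/ingres/data/default/merlin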
Now, the fun bit! On your remote host:

1. Create a database named after the database you wish to recover. It must be
   owned by the same username.
2. Spread the database to all extended locations that the lost database used.
3. Checkpoint this database, turning journaling on. By turning journaling on,
   all the journal and dump locations are created for you.

NOTE: To do this you must make sure that, at any checkpoint on your lost
hosts, you prepare a file (we call it a Basics File) of sufficient detail to
be able to do this. Having created the Basics File, copy it to your remote
host. When you need it, it may as well be there waiting for you! I suggest
the following should be included in your Basics File:

1. From iidbdb, a snapshot of the tables:

       select * from iiuser;
       select * from iilocations;
       select * from iidbaccess;
       select * from iiusergroup;
       select * from iiextend;
       select * from iidatabase;
       select * from iirole;

2. From each of your important databases:

       select * from iifile_info;

   and the copy.in prepared by copydb -c dbname.

3. As you're running OI, I also suggest putting your config.dat into the
   Basics File.

4. There may be many other site-dependent files that you will need on your
   remote host to get your recovered database back to the point where your
   users can really use it.

You now have an empty database of the same name/owner as the lost database.
We now just have to get the data back into it. To do this:

1. Take a copy of the configuration files of the newly created database.
   These can be very handy. Remember there are two: one in the database area
   and one in the dump area of the database.
2. Take the configuration file of the lost database and copy it into the
   database area and dump area of the newly created database.

If you now perform an infodb -udbowner dbname, you should see that the
database is described as if it were from the lost host! Magic.

We now simply have to copy all the journals, dumps and saved configuration
files from the lost host into the appropriate areas of the new database. If
you are using a disk checkpoint, then copy this in too.

NOTE: You are correct in surmising that the journals get copied to my remote
host. Ditto the configuration file, and the dump area, comprising both the
dump files themselves and the saved configuration files.

Now CHECK what we have got. You should use the hardcoded paths in the infodb
output (i.e. the lost host's paths, which should be symlinked to their new
host equivalents) to check that all data locations exist, and that only the
default location has any data files in it (these are your system catalogs).
Furthermore, check that all the journals required for the recovery from the
checkpoint exist and are uncompressed. Ditto the dumps and saved
configuration files.
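As a purely illustrative aid, a few shell checks along those lines, assuming
the avalon/bucket symlink example above and a database named merlin (take the
real path prefix and locations from your own infodb listing; extended data
locations need the same check):

    #!/bin/sh
    # Hypothetical pre-rollforward checks.  PRODROOT is the production path
    # prefix as reported by infodb, reached via the symlink set up earlier.
    DB=merlin
    PRODROOT=/avalon/prod/ingres/ingres

    # Every area infodb names must resolve on this host.
    for d in data/default/$DB jnl/default/$DB dmp/default/$DB ckp/default/$DB
    do
        if [ -d $PRODROOT/$d ]
        then
            echo "OK:      $PRODROOT/$d"
        else
            echo "MISSING: $PRODROOT/$d"
        fi
    done

    # Compressed journals or dumps cannot be rolled forward; uncompress first.
    ls $PRODROOT/jnl/default/$DB $PRODROOT/dmp/default/$DB | grep '\.Z$' \
        && echo "WARNING: compressed files found"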
Now you can do your rollforwarddb. Remember I said that the configuration
file contains pointers to the log files that may cause grief? This is where
you are going to find out whether they will or not. I have found that nine
times out of ten there is no problem. However, every now and then the log
file corrupts hopelessly, at which point you will have to do an
rcpconfig -force_init. HENCE, DURING THIS RECOVERY MAKE DAMN SURE NO ONE ELSE
IS USING THE INSTALLATION. SET SERVER CLOSED is a damn useful option!
Normally all is okay, and your database will now simply roll forward and
recover quite happily.

This simply recovers your database. It ain't necessarily usable! What about
your user access rights, application code, any specific files your users
require to access the system, any daemon processes that need to be running,
and so on? A lot of this is site dependent and I'll let you fill in the gaps.

BUT, the critical thing is to test it! Get used to the procedure so that it
becomes second nature to you. Try a real test, i.e. recover a production
database to some host and then ask your users to test that database and see
if they can access it and use it as if it were on the real host.

The GOLDEN RULE in Disaster Recovery Planning is: if you need to access
information from your lost host, you FAIL!

Martin Bowes


Hi tyyang,

Hope you have success! One thing I forgot to mention was that the process
that transfers your journals, dumps, and current and saved configuration
files from your production host(s) to your backup host should be run from
cron (or whatever the VMS equivalent is) at a frequency that gives acceptable
data loss. I run my transfer script at 15 minute intervals. A further
complication is that your Archiver process must be firing up more regularly
than this interval. I set my Archiver to wake up at 1% of the log file. This
normally gives me pretty good journalling.

The frequency you choose is a site-dependent issue. My people were happy with
a 15 minute data loss. I could go more frequent, as the transfer process
normally takes very little time to run. Furthermore, the script should
execute on your backup host and fire off remote jobs on your production host
to accomplish its work. I also have a script that monitors the progress of
this transfer script, and it will send me warnings if the journals or current
configuration file for any database are over two hours old.

I suggest that you transfer the data to equivalent areas on your backup host.
This is for two reasons. One is that these areas will probably be set up with
sufficient disk space to handle the job. The second is that it makes sense to
store like with like, as you'll probably remember where to find it and it is
easier to explain to others. i.e.

    Journals                        --> II_JOURNAL/recovery/hostname/dbname
    Dumps/saved configuration files --> II_DUMP/recovery/hostname/dbname
    Current config file             --> II_DATABASE/recovery/hostname/dbname

e.g. on my Disaster Recovery Host (DRH), named bucket, I store data from my
three production hosts (avalon, electra and cascade). Hence the journal area
on bucket has the directories:

    II_JOURNAL/recovery/avalon
    II_JOURNAL/recovery/electra
    II_JOURNAL/recovery/cascade

Under each of these is a separate directory for each of my production
databases.

Martin Bowes

PS. One other thing. It is essential that your Disaster Recovery Plan (DRP)
be documented, and that the document be stored offsite. It should be written
in such a manner that any Ingres DBA could walk into the smouldering remains
of your office, sit down and commence the recovery of your databases. I don't
know if I've achieved that last bit yet, but I am working on it!


Hi tyyang,

> 1. Where did you set up your archiver to wake up every 15 mins? Is it in
>    the configuration files?

You set the archiver wakeup by percentage of the log file, not by time
interval. This is a configurable parameter in your logging system. It
defaults to 5%. To see its current value, run 'showrcp' if you use 6.4, or go
into CBF's Logging Parameters and check the appropriate value. If you wish to
change it, then you really have to shut down the installation and then
reconfigure your logging system. You can get a feel for what timeframe X% of
the log file translates to by monitoring the II_ACP.LOG file.

> 2. How did you decide when to copy which journal file to the remote machine
>    every 15 mins? Can it happen that the journal file is being written by
>    the archiver at that point? Or do you just copy all the journal files
>    every 15 mins, no matter whether they have been copied before?

My process checks its current set of journals/configuration files first,
determines what the last successful transfer time was, and then accesses the
production hosts looking for anything that has changed since then. Since I
wrote this script a few years ago, some kind soul has beaten me around the
ears and told me of a wonderful utility called 'rdist' which would do the job
a lot more easily! I haven't had a chance to recode it yet. I don't know of a
VMS equivalent to rdist.

Journal files are not flock()ed by the archiver and may be examined at will.
If the archiver happens to be busy writing to one end of a journal, this is
not a problem, as whatever you miss this time will be picked up next time. If
the machine goes down in the meantime, at least the journal you've got is
still usable, and heaps better than no journal at all.
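For what it's worth, a minimal sketch of that kind of periodic pull on UNIX
follows. The host, database and path names are the illustrative ones from
this thread; the recovery directories are assumed to exist already, II_* are
assumed to be set in the environment, and the ingres account on the backup
host is assumed to be able to rcp from the production host. It is a sketch
only, not a substitute for a change-aware script or an rdist-based setup:

    #!/bin/sh
    # drp_transfer.sh -- hypothetical sketch: pull journals, dumps and the
    # current config file for one database from the production host into the
    # recovery areas on the backup host.  Run from cron, e.g.:
    #   0,15,30,45 * * * * /usr/local/ingres/bin/drp_transfer.sh
    PRODHOST=avalon
    DB=merlin
    PRODROOT=/avalon/prod/ingres/ingres   # paths as named on the production host

    rcp -p "$PRODHOST:$PRODROOT/jnl/default/$DB/*" \
        $II_JOURNAL/recovery/$PRODHOST/$DB/
    rcp -p "$PRODHOST:$PRODROOT/dmp/default/$DB/*" \
        $II_DUMP/recovery/$PRODHOST/$DB/
    rcp -p "$PRODHOST:$PRODROOT/data/default/$DB/aaaaaaaa.cnf" \
        $II_DATABASE/recovery/$PRODHOST/$DB/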
> 3. Did you do an on-line checkpoint ($ ckpdb dbname) or an off-line
>    checkpoint ($ ckpdb -l dbname) every night? Turning journaling on for a
>    database is done with $ ckpdb +j dbname; it will do the off-line
>    checkpoint and start the journal file. After that, do I need to do
>    $ ckpdb +j dbname every night, or just $ ckpdb dbname?

All my checkpoints are online. The only time I do an offline checkpoint is
when a dire emergency forces me to! Having turned journaling on with
ckpdb +j, there is no need to do it again; all subsequent checkpoints may be
done with just ckpdb dbname. After all, turning journalling off/on requires
an offline checkpoint, and the loss of access to the database may piss your
users off!

One further thing you could consider for your checkpoints is checkpointing
directly to your backup host from your production host. This can be very
useful and can save considerable time in a recovery, both by giving instant
access to the checkpoint file(s) on your backup host and by removing the
chance of not correctly positioning a tape during the recovery phase. To do
this, I export a disk from my backup host and mount it on my production
hosts. I then replace the checkpoint directory of a production database with
a symbolic link to an appropriately named directory on this disk.

I still use tape checkpoints of all my production databases; furthermore,
these tape checkpoints occur at a much higher frequency than my disk
checkpoints. But the disk checkpoints are always my first resort when
attempting to recover a database. The extra time required to process more
journals from an older checkpoint is not enough to worry me.

Martin Bowes
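As a closing aside, a sketch of how that disk-checkpoint arrangement might
look on UNIX. The export name, mount point and database are illustrative
assumptions, and the exact NFS mount syntax varies by operating system:

    # On the backup host (bucket): export a filesystem, e.g. /bucket/ckpdisk.
    # On the production host (avalon), as root:
    mkdir /drpckp
    mount bucket:/bucket/ckpdisk /drpckp    # NFS mount; options vary by OS

    # Swap the database's checkpoint directory for a symlink onto that disk.
    cd /avalon/prod/ingres/ingres/ckp/default
    mv merlin merlin.local                  # keep the old directory around
    mkdir -p /drpckp/avalon/merlin
    ln -s /drpckp/avalon/merlin merlin

    # Subsequent online checkpoints (run as the DBA) now land on the backup
    # host's disk.
    ckpdb merlin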