CS Mutex bug

csy@cypress.com (Steve Yang / IS Software Engineer) writes:

>Dear Net-friends:

>	Our engineering database (Ingres 6.4/05) has run into this problem
>	a couple of times. What happens is that all the processes
>	accessing the db seem to be frozen, and if you try to
>	'quel' the db, it will sit there forever.

>	I checked the db server using 'iimonitor', and 'show sessions',
>	here is what I see:

>------------message from iimonitor--------------------------------------------	
>Session 00535AC0 (summary   ) cs_state: CS_MUTEX ((x) 0030810C) cs_mask:
>Session 005A4880 (cjk       ) cs_state: CS_MUTEX ((x) 0030810C) cs_mask:
>Session 005AEC80 (cjk       ) cs_state: CS_MUTEX ((x) 0030810C) cs_mask:
>Session 006E6880 (ams       ) cs_state: CS_MUTEX ((x) 0030810C) cs_mask:
>Session 00738040 (ams       ) cs_state: CS_MUTEX ((x) 0030810C) cs_mask:
>Session 007407C0 (cjk       ) cs_state: CS_MUTEX ((x) 0030810C) cs_mask:
>Session 00776D80 (ams       ) cs_state: CS_MUTEX ((x) 0030810C) cs_mask:
>Session 00796040 (cjk       ) cs_state: CS_MUTEX ((x) 0030810C) cs_mask:

>------------message from iimonitor--------------------------------------------	

>	I was not able to remove these sessions using 'remove sid' from
>	iimonitor.  Every time this happens, we have to restart Ingres.
>	I am wondering whether anyone has encountered the same problem
>	before, and whether there is a better solution ?

In releases prior to 6.4/05/02 this problem was caused when a session
was removed using the "remove session ..." command within iimonitor.
According to TS, this is a bug that has been fixed in 6.4/05/02.  The
only workaround I know of is the dreaded "kill -9" thing on the iidbms
server.
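
For anyone wanting to spot this condition quickly, here is a minimal sketch
in Python. It assumes the 'show sessions' output has been saved as plain text
and that each line looks like the sample quoted above; the regular expression
is inferred from that sample only, not from any Ingres documentation, and the
function name mutex_waiters is made up for illustration. It groups sessions
in CS_MUTEX state by the mutex address they are waiting on, so a pile-up on a
single address (0030810C in the sample) stands out.

```python
import re
from collections import defaultdict

# Pattern inferred from the sample iimonitor lines quoted in this thread, e.g.
#   Session 00535AC0 (summary   ) cs_state: CS_MUTEX ((x) 0030810C) cs_mask:
# It is an assumption that other releases print the same layout.
SESSION_RE = re.compile(
    r"Session\s+(?P<sid>[0-9A-Fa-f]+)\s+\((?P<user>[^)]*)\)\s+"
    r"cs_state:\s+CS_MUTEX\s+\(\((?P<mode>\w)\)\s+(?P<addr>[0-9A-Fa-f]+)\)"
)


def mutex_waiters(text):
    """Group sessions in CS_MUTEX state by the mutex address they wait on."""
    waiters = defaultdict(list)
    for line in text.splitlines():
        match = SESSION_RE.search(line)  # search() tolerates a leading ">" quote mark
        if match:
            waiters[match.group("addr")].append(
                (match.group("sid"), match.group("user").strip())
            )
    return dict(waiters)


# Three lines copied from the output quoted above.
SAMPLE = """\
>Session 00535AC0 (summary   ) cs_state: CS_MUTEX ((x) 0030810C) cs_mask:
>Session 005A4880 (cjk       ) cs_state: CS_MUTEX ((x) 0030810C) cs_mask:
>Session 006E6880 (ams       ) cs_state: CS_MUTEX ((x) 0030810C) cs_mask:
"""

if __name__ == "__main__":
    for addr, sessions in mutex_waiters(SAMPLE).items():
        print(f"mutex {addr}: {len(sessions)} waiting session(s)")
        for sid, user in sessions:
            print(f"  session {sid} user {user}")
```

If every session in the listing waits on the same address, that matches the
hang described in the original post; it does not by itself say which of the
causes discussed in this thread is responsible.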

==============================================================================
Nabil Courdy               |
York & Associates, Inc.    |
Minneapolis, MN            |  ncourdy@winternet.com
(612)831-0077              |
(612)831-0887 (FAX)        |
==============================================================================



>(James Bullock) wrote:

>2) 6.4/04 starting under HP-UX 8.02 
 
>Without any useful explanation, we made most of this go away by: 
>1) Bumping up the number of semaphores in the OS. 
>2) Creating twice the number of locks that could ever be needed. For
>example for 300 sessions at 400 locks each (max), we have 240000 locks, not
>the 120000 it would imply. 

Yup. Been there, done that, with 6.4/04 under HP-UX 9.something. I spent
many evenings on the phone with Jerry Lohr from Tech support trying to
find out why it was happening - and then trying to make it come back again
to see if we could figure out what we did to get rid of it. The
database/server we had the problems with had a *lot* of escalating locks,
and quite a few rules/procedures, if that data point helps at all. Still,
the main thing we did was thump the number of locks up to something fairly
silly.
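
The arithmetic in the quoted example can be sketched as follows. This is only
an illustration in Python: the function name lock_budget and the headroom
parameter are made up here, and the factor of two is simply the value James
reported using, not a documented recommendation.

```python
def lock_budget(sessions, max_locks_per_session, headroom=2):
    """Return (implied, configured) lock counts.

    implied    = sessions * max_locks_per_session
    configured = implied * headroom, where headroom=2 mirrors the
                 "twice the number" workaround quoted in this thread.
    """
    implied = sessions * max_locks_per_session
    return implied, implied * headroom


if __name__ == "__main__":
    implied, configured = lock_budget(300, 400)
    print(implied, configured)  # 120000 240000
```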


There was much dark muttering from the tech-meister about some
undocumented features in the memory handling when locks escalate...

-- 



rhook@gil.ipswichcity.qld.gov.au 
Robert Hook
Brisbane, Qld, Australia



>Steve Yang / IS Software Engineer (csy@cypress.com) wrote:
>: Dear Net-friends:
>
>: 	Our engineering database (Ingres 6.4/05) has run into this problem
>: 	a couple of times. What happens is that all the processes
>: 	accessing the db seem to be frozen, and if you try to
>: 	'quel' the db, it will sit there forever.
>
>: 	I checked the db server using 'iimonitor', and 'show sessions',
>: 	here is what I see:
>
>
>: 	I was not able to remove these sessions using 'remove sid' from
>: 	iimonitor.  Every time this happens, we have to restart Ingres.
>: 	I am wondering whether anyone has encountered the same problem
>: 	before, and whether there is a better solution ?
>
>: 	thank you,
>: 	- steve
>
>
This condition will occur when some internal resource is not available and
doesn't become available within a reasonable amount of time.  I've heard that
there is a bug somewhere that can cause it.  I've also seen this symptom when
the archiver has died and the installation has reached LOG-FULL-LIMIT (not
FORCE-ABORT-LIMIT).  Any new connection first tries to register with the
logging system, and can't because the logging system is stuck.  Look at your
logging and locking system resources to be sure that there are plenty
available.

-- 
Steve Caswell           |   (404) 448-7727    |  "The opinions expressed are my
Principal Consultant    |   sfc@tpghq.com     |   own.  They may not be perfect,
The Palmer Group        |   uunet!tpghq!sfc   |   but they're all I've got."

On 4 Aug 1995 in article , 'csy@cypress.com (Steve Yang / IS Software Engineer)' wrote:


 
>Hi, 
>	After I posted the question regarding the CS_MUTEX problem,
>	I got a few responses. Thanks to all the people who provided
>	suggestions and information.
> 
>	I also contacted tech support from CA. 
> 
>	Here are the updates on progress:
>
>	1. I forgot to mention that we are operating in the following env:
> 
>		Sun 690 with 4 CPUs 
>		SunOS 4.1.3 
>		Ingres 6.4/05 (su4.u42/00) 
>		4 shared-cache servers running 
> 
>		When the CS_MUTEX happens, the load is not high, also 
>		not too many users. log file is ok, enough disk space. 
>		no msg in errlog.log. each dbms has about 8 - 10 sessions. 
>		Trying to connect to the db via 'quel' or 'ipm' all fails. 
> 
> 
>	2. Yesterday, I encountered the same problem again on our 
>	engineering db. But, I was lucky to get hold of CA and had them 
>	take a closer look at the problem when it happened. 
> 
>	The lockstat, logstat and various other information were taken; all
>	seemed fine. CA thought that the problem seemed to be due to iidbdb,
>	because batch jobs that were connected to the db before the hang
>	were still running fine; it was just the new sessions that couldn't
>	get connected anymore.
> 
>	Later on, CA told me that this might be a bug (bug 33328), although
>	they need more information to confirm it. So, they sent me a
>	procedure describing how to collect data during the hang for
>	analysis.
> 
>	I will post the result once we find out. 
> 
>	thanks for all your help, 
>	- steve 
> 
> 
>+---------------------------------------------------------------+ 
>| Steve Yang			| 2401 E. 86th Street		| 
>| IS Software Engineer		| Bloomington, MN 55425		| 
>| Cypress Semiconductor, Inc.	| Voice: (612) 851-5203		| 
>| csy@cypress.com		| Fax:   (612) 851-5199		| 
>+---------------------------------------------------------------+ 
 
Does this mean that CA has sent you a procedure for dumping some/all of the
internal state of the server in the event it gets corked-up? 
 
If so, is this procedure already part of the standard systems
administration documentation, or going to become part of it, or will it be
made available via other support mechanisms?
 
-- 
James Bullock - Rare Bird Enterprises 
620 Park Avenue # 171 
Rochester, NY. 
716-242-4824 
716-244-9072 (Fax) 
 
"Onward, into the fog . . ."

jbullock@nyc.pipeline.com (James Bullock) wrote:

>  ... snip ...

>2) 6.4/04 starting under HP-UX 8.02 
> 
>The MUTEX threads only happened once in a while under heavy load, and under
>the same conditions, we would occasionally get a report of "insufficient
>locks", usually when there was much escalation to table level.   ... Report
>had to be erroneous because locks_available >> sessions * maxlocks + internal
>lock use as reported in the I & O guide. 
> 
>Without any useful explanation, we made most of this go away by: 
>1) Bumping up the number of semaphores in the OS. 
>2) Creating twice the number of locks that could ever be needed. For
>example for 300 sessions at 400 locks each (max), we have 240000 locks, not
>the 120000 it would imply. 
> 
>James Bullock - Rare Bird Enterprises 

James et al.

	Just as an aside, your problem, as it occurred under heavy loads, may
be similar to one I had. I do not have my tech. support log with me or
I  could provide more detailed information but ...

	I was receiving the "out of locks" message when under heavy loads
also. I did the same things that you suggest above: I bumped up the kernel
parameters (I am running on a Sequent platform, Dynix/ptx UNIX) and,
more importantly, the number of locks available to the locking system.
No matter how high I set these parameters, however, my DBMS servers
kept locking up and refusing connections. 

	The problem, I discovered, was that INGRES will allow you to set the
number of available locks to whatever you want, BUT the actual number
that it can use is ~65000. I was able to prove this by watching in IPM:
as my system approached 65000 locks, the servers would start to hang
and refuse connections.
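
	To illustrate what such a ceiling means for the sizing discussed
earlier in this thread, here is a rough sketch in Python. The figure of 65000
is only the approximate value observed above, not a documented limit, and the
names OBSERVED_LOCK_CEILING, usable_locks and sessions_covered are made up
for this example. The 240000 and 400 come from James's quoted configuration.

```python
# Approximate ceiling observed in IPM as described above.
# This is an observation from one installation, not a documented limit.
OBSERVED_LOCK_CEILING = 65000


def usable_locks(configured, ceiling=OBSERVED_LOCK_CEILING):
    """Locks the installation could actually use if the ceiling applies."""
    return min(configured, ceiling)


def sessions_covered(configured, max_locks_per_session,
                     ceiling=OBSERVED_LOCK_CEILING):
    """How many sessions could each take their maximum number of locks."""
    return usable_locks(configured, ceiling) // max_locks_per_session


if __name__ == "__main__":
    print(usable_locks(240000))           # 65000
    print(sessions_covered(240000, 400))  # 162
```

Under that assumption, configuring 240000 locks would buy nothing beyond
roughly 65000, which would cover about 162 sessions at 400 locks each rather
than the 300 intended.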

	The solution that was required for my installation was to separate the
developers from the production users by installing a separate INGRES
installation on the same machine. This ended the resource problems.

Sean O'Hara	INGRES DBA		NCH Promotional Services	
Saint John, NB CANADA