FlexLM file descriptors



boise57
03-04-2008, 04:05 PM
I'm running lmgrd (v11.4.100.0 build 52167) on Solaris 9 in the triad configuration. From time to time, lmgrd seems to run out of file descriptors and loses the ability to keep the quorum. At one point, I was able to trace the problem to a NAT box that did not have the vendor daemon port open. (So all the license requests were able to connect, but connections to the vendor daemon would not, leaving the first connection in CLOSE_WAIT state and eventually exhausting the file descriptors.) That issue has been fixed. However, I still get the same kind of error, and it is not clear what is causing it this time.

13:41:05 (MLM) Lost communications with lmgrd. Exiting.
13:41:05 (MLM) EXITING DUE TO SIGNAL 39
....
13:41:58 (FluentLm) Lost quorum, exiting
13:41:59 (IFUL) Lost quorum, exiting
13:41:59 (ESRI) Lost quorum, exiting
13:41:59 (maplelmg) Lost quorum, exiting
13:41:59 (ansyslmd) Lost quorum, exiting
13:41:59 (FEMLAB) Lost quorum, exiting
13:41:59 (INTEL) Lost quorum, exiting

Another issue, probably unrelated, is that I keep seeing the following in the log files:

0:00:09 (MLM) ERROR: Non-activation-capable daemon activation invoked with non-client-request event type

This thread from last year seems to reference it as well.
http://community.macrovision.com/showthread.php?t=168887

Oh yeah, I have these in the license server start-up scripts:

ulimit -n 1024
ulimit -H -n 1024

and I can't seem to go beyond a count of 1024.

Any help with the file descriptor problem would be appreciated.

gu2008
03-04-2008, 06:01 PM
I have exactly the same problem

boise57
03-14-2008, 03:08 PM
I was hoping that someone from Macrovision might respond, but ...
Just for the sake of completeness, let me elaborate on what is happening
with lmgrd.

I have three machines dedicated to the FlexLM license services. I
downloaded the latest version of lmgrd and lmutil for SPARC Solaris from
http://www.globes.com/support/fnp_utilities_download.htm#unixdownload

I have combined all of the license.dat files for various licenses as
instructed in the lmgrd user guide. I keep all the various vendor
daemons also in the local disks on each machine.

In my FlexLM start-up shell script, I have "ulimit -n 1024" and "ulimit -H
-n 1024" to increase the number of file descriptors. That is the max I
can go to.
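
For reference, the limits that the start-up script actually ends up with can be printed from the same shell. This is only a small sketch, assuming a POSIX-style shell such as ksh or bash; the function names show_fd_limits and raise_soft_fd_limit are made up for illustration. On Solaris the ceiling for the hard limit is normally governed by the kernel tunable rlim_fd_max in /etc/system, which may be why 1024 cannot be exceeded here, but that is an assumption about this setup and not something confirmed in this thread.

```shell
# Print the soft and hard open-file limits of the current shell.
# (Hypothetical helper name; uses only the standard ulimit builtin.)
show_fd_limits() {
    echo "soft=$(ulimit -S -n) hard=$(ulimit -H -n)"
}

# Try to raise the soft limit to the requested value. A value above the
# hard limit is refused unless the caller is allowed to raise the hard limit.
raise_soft_fd_limit() {
    ulimit -S -n "$1" 2>/dev/null
}
```

Calling show_fd_limits right before lmgrd is launched shows whether the two ulimit lines took effect in that shell.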

Also per FlexLM guide, I have the following in one of my system
init scripts:

# By default on Solaris, upon stopping a license server,
# 1 to 5 minutes are required for the port to free up so it
# will restart, which can result in checkout failures.
# The command below resets this default to 2.4 seconds

/usr/sbin/ndd -set /dev/tcp tcp_time_wait_interval 2400

Upon FlexLM license server start up, I see this in the flexlm log files:

(MLM) TCP_NODELAY NOT enabled

I do not see equivalent lines for other vendor daemons. In other
words, the other vendor daemons either don't do the "TCP_NODELAY"
(ie: disable the Nagle algorithm) or when they do, it isn't
being prevented.

And, every night at 0:00:xx I see this:

0:00:39 (MLM) ERROR: Non-activation-capable daemon activation invoked with non-client-request event type


Now, I don't know if any of the above are related to the issue
that's making the license server hang every once in a while.
They may be related, or they may not.

When the license service stops, these are the symptoms:

netstat -a shows lots of connections in CLOSE_WAIT status like this:

host1.1700 vpn1-9-13.1185 17640 0 50253 0 CLOSE_WAIT
host1.1700 doppler.37346 5840 0 49085 0 CLOSE_WAIT
host1.1700 prospector.59784 5840 0 49085 0 CLOSE_WAIT
host1.1700 host2.33178 49640 0 49409 0 CLOSE_WAIT
host1.1700 vpn23-3.1893 17640 0 50253 0 CLOSE_WAIT
host1.1700 taurasi.2102 65535 0 49493 0 CLOSE_WAIT
...

(Note that host1.1700 is the lmgrd port, not the vendor daemon
port for MLM.)
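
To see which clients account for the stuck sockets, output like the above can be tallied per remote host. CLOSE_WAIT means the remote end has already closed its side and the local process has not yet closed its own socket, so each such line is a descriptor still held on the server. The following is a minimal sketch, assuming the column layout shown above (local address first, remote address second, state last) and an awk that accepts -v (nawk or /usr/xpg4/bin/awk on Solaris); the function name count_close_wait is hypothetical.

```shell
# Count connections in CLOSE_WAIT on a given local port, grouped by remote
# host. Reads netstat-style lines on stdin and prints "<count> <host>",
# highest count first.
count_close_wait() {
    port="$1"
    awk -v port="$port" '
        # Keep lines whose last field is CLOSE_WAIT and whose local
        # address ends in ".<port>".
        $NF == "CLOSE_WAIT" && $1 ~ ("\\." port "$") {
            remote = $2
            sub(/\.[0-9]+$/, "", remote)   # strip the remote port number
            count[remote]++
        }
        END { for (h in count) print count[h], h }
    ' | sort -rn
}
```

Usage would be something like `netstat -a | count_close_wait 1700`, run at intervals to see whether one client or many are leaving connections behind.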

As a result of the accumulation of the above CLOSE_WAIT connections, lmgrd
eventually runs out of file descriptors. When that happens, it can't
(a) check out any more licenses, (b) talk to the vendor daemons, or
(c) talk to the other peers, so it loses quorum.
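
The descriptor usage of the lmgrd process can also be watched directly, so the exhaustion is visible before quorum is lost. A rough sketch, assuming a /proc filesystem that exposes /proc/<pid>/fd (as Solaris and Linux do) and a shell with POSIX arithmetic (ksh or /usr/xpg4/bin/sh on Solaris); fd_count and fd_near_limit are hypothetical helper names, and the 80 percent default threshold is an arbitrary choice.

```shell
# Print the number of open file descriptors held by the given PID.
# Prints 0 if the process does not exist or cannot be inspected.
fd_count() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l | tr -d ' '
}

# Succeed (exit status 0) when "used" is at or above "pct" percent of
# "limit". Arguments: used limit [pct], with pct defaulting to 80.
fd_near_limit() {
    used="$1"
    limit="$2"
    pct="${3:-80}"
    [ $((used * 100)) -ge $((limit * pct)) ]
}
```

For example, a cron job could run `fd_near_limit "$(fd_count "$LMGRD_PID")" 1024 && echo "lmgrd is close to its descriptor limit"`, where LMGRD_PID is a placeholder for however the PID is obtained on the server.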

13:41:05 (MLM) Lost communications with lmgrd. Exiting.
13:41:05 (MLM) EXITING DUE TO SIGNAL 39
13:41:05 (MLM) IN: "MATLAB" wbender1@phatmos (SHUTDOWN)
13:41:05 (MLM) IN: "MATLAB" bolatto@fornax (SHUTDOWN)
...

I can't figure out why the above connections go to CLOSE_WAIT
status. Is it due to MLM (vendor daemon) closing the connection
improperly? Is it a bug in the lmgrd where it should be closing
the connection after waiting for 2-times the roundtrip-time of the
previous packet(s)?

[[ In order to further debug this issue, is there a way for lmgrd to log
(i) the __date__ and time in each log entry
(ii) the IP address of the client ]]

I suspect that this really is an lmgrd bug that's being triggered by MLM.
I want to emphasize that running out of file descriptors is only a
symptom. The root cause is that some of the TCP connections (on the server
side) end up in CLOSE_WAIT status instead of simply closing properly,
eating up the file descriptors.

I've been asked to see this page:
http://www.unixguide.net/sun/sunobscure.shtml#II.B2
But I have tweaked all the parameters mentioned there and the CLOSE_WAIT
processes are still accumulating.

And it looks like exploiting this lmgrd bug isn't all that hard. With
the right condition(s) (all too easy to think of), a single machine can
make lmgrd run out of file descriptors and thereby cause a denial of
service.

raffie
03-15-2008, 10:21 PM
By any chance are you using version 10.8.0.1 of the MLM vendor daemon?
There is a bug in that version of FLEXlm that left some ports open. If this is the problem then you should just need to upgrade to a newer version of MLM.

boise57
03-17-2008, 07:09 AM
I'm using these versions:

lmgrd FLEXnet Licensing v11.4.100.0 build 52167 sun64_u8 (liblmgr.a)
MLM FLEXnet Licensing v11.4.0.0 build 31341 (liblmgr.a)

(ie: MLM that came with Matlab 2007b)

nick_hong
12-19-2008, 12:54 PM
Before starting the license daemon, please check whether any machine in the same subnet is already using the lmgrd port number.
Just running 'lmstat -a | grep <port_number_lmgrd>' on each machine in the subnet will help.
- Nick.

flexnetrw
06-08-2009, 10:39 AM
We are experiencing the same problem with an older version. Were you able to resolve the issue with the CLOSE_WAIT states?

patriet
07-01-2009, 03:01 PM
Does anyone have a solution to this problem? I am also getting the error that states it can't open /usr/tmp/.flexlm/lmgrdl.3214 errno: 24. I have also used ulimit to increase the open files limit to 1024, but my license server still crashes every couple of days.
The post about the vendor daemon MLM caught my eye because my Cadence license daemon is also at version 10.8.0.1. I will look into updating that, but I am also wondering if anyone has found what exactly the problem is and how to solve it. Thanks!

baljit77
06-25-2010, 08:05 AM
If a FlexLM client checks out 5 licenses, does that mean 5 file descriptors are used, or only one?

galen144
07-15-2010, 02:56 PM
We run 21 different vendor daemons in our Solaris 10 license server quorum environment. I have never been a fan of catenating license files together. We keep each license file separate and each vendor daemon has its own lmgrd running, therefore each has a unique port. File descriptors are set to 256 (nofiles 256) on each server. Currently, all daemons have been running since May 9th. Have you tried not catenating your license files together?