Huge variance in JNDI lookup times

We're experiencing huge variance in response times in a legacy J2EE web application running on Weblogic 10.3. The system consists of two Weblogic server instances (frontend and backend) which run on the same physical server machine, and an Oracle database running on a separate host. An external measurement tool alerts us every time a login to the system takes more than four seconds, and lately these warnings have been frequent. Looking at the log written by the servlet handling the login requests reveals that the time is spent in an EJB call from the frontend to the backend.

Example of the measured times:

time    ms   
8:40:43 25
8:42:14 26
8:44:04 26
8:44:25 26
8:44:47 26
8:46:06 26
8:46:41 7744
8:47:00 27
8:47:37 27
8:49:00 26
8:49:37 26
8:50:03 8213
8:50:57 27
8:51:04 26
8:51:06 25
8:57:26 2545
8:58:13 26
9:00:06 5195

As can be seen, most of the requests (70%, taken from a larger sample) complete in a timely fashion, but a significant portion of them take a very long time to complete.

The steps performed during the measured time are the following (a rough sketch of the frontend side follows the list):

  • JNDI lookup of the session bean offering the authentication interface (frontend)
  • Calling the authentication method of the session bean (frontend->backend)
  • Reserving a JDBC connection from the connection pool (backend)
  • Making a query to the user database (table size is very moderate and the table should be properly indexed) (backend)
  • Reading the result set, creating the POJO user object (backend)
  • Returning the POJO user object (backend->frontend)
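
For reference, steps 1 and 2 as seen from the frontend look roughly like the sketch below. The AuthService / AuthServiceHome / User types are placeholders standing in for our real interfaces (the real code does the create/invoke calls via reflection), and only the JNDI name "foo.bar.Bar" is taken from the actual logs:

import java.rmi.RemoteException;
import java.util.Hashtable;
import javax.ejb.CreateException;
import javax.ejb.EJBHome;
import javax.ejb.EJBObject;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.rmi.PortableRemoteObject;

// Placeholder types standing in for the application's real interfaces.
class User implements java.io.Serializable { /* POJO user object */ }

interface AuthService extends EJBObject {
    User authenticate(String username, String password) throws RemoteException;
}

interface AuthServiceHome extends EJBHome {
    AuthService create() throws CreateException, RemoteException;
}

// Rough sketch of steps 1 and 2 on the frontend (EJB 2.x client style).
public class LoginSketch {

    public User login(String url, String username, String password) throws Exception {
        Hashtable env = new Hashtable();
        env.put(Context.PROVIDER_URL, url); // t3 URL of the backend server
        env.put(Context.INITIAL_CONTEXT_FACTORY, "weblogic.jndi.WLInitialContextFactory");
        Context ctx = new InitialContext(env);
        try {
            // Step 1: JNDI lookup of the session bean's home interface (frontend)
            Object ref = ctx.lookup("foo.bar.Bar");
            AuthServiceHome home =
                    (AuthServiceHome) PortableRemoteObject.narrow(ref, AuthServiceHome.class);

            // Step 2: remote call to the backend's authentication method
            AuthService service = home.create();
            return service.authenticate(username, password);
        } finally {
            ctx.close();
        }
    }
}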

The load on the server machine is very small (99% idle) and the number of users is very moderate. The amount of free memory reported by Weblogic varies between 60% and 90% on both servers. Garbage collection is logged; major collections are rare and complete in 2-3 seconds when they do occur. Further, the major GC occurrences do not seem to coincide with the long response times. Long response times occur during both busy and non-busy hours. The JDBC connection pool max size is currently set to 80, which is more than the number of concurrent users.

Update:

Got permission to reboot the system with some more performance logging added. The log clearly shows that the JNDI lookup is the part where the time is spent:

03:01:23.977 PERFORMANCE: looking up foo.bar.Bar from JNDI took 6 ms
03:14:47.179 PERFORMANCE: looking up foo.bar.Bar from JNDI took 2332 ms
03:15:55.040 PERFORMANCE: looking up foo.bar.Bar from JNDI took 1585 ms
03:29:25.548 PERFORMANCE: looking up foo.bar.Bar from JNDI took 7 ms
03:31:09.010 PERFORMANCE: looking up foo.bar.Bar from JNDI took 6 ms
03:44:25.587 PERFORMANCE: looking up foo.bar.Bar from JNDI took 6 ms
03:46:00.289 PERFORMANCE: looking up foo.bar.Bar from JNDI took 7 ms
03:59:28.028 PERFORMANCE: looking up foo.bar.Bar from JNDI took 2052 ms

Looking at the GC logs of the frontend and backend shows that GC is not done when the slow JNDI lookups occur.

The context is obtained the following way when a session is created:

Hashtable ht = new Hashtable();
ht.put(Context.PROVIDER_URL, url); // t3 URL of the backend, e.g. "t3://backend-host:7001" (placeholder)
ht.put(Context.INITIAL_CONTEXT_FACTORY, "weblogic.jndi.WLInitialContextFactory");
jndiContext = new InitialContext(ht); // javax.naming.InitialContext

where url is a t3 URL pointing to the DNS name and port of the backend server. This should be OK, right?

The first thing that comes to mind is to cache the references obtained from JNDI; at least that was the preferred approach 10 years ago... But shouldn't Weblogic's InitialContext implementation already do this caching, or does it really fetch the reference from the backend server on each call?

What could be causing the frequent slow JNDI lookups? Is there a workaround for this (e.g. would caching the references help)?
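
For concreteness, the kind of caching I have in mind would be roughly the following. This is only a sketch: the locator class is made up, the cached reference is kept as a plain Object, and the real code would narrow it to the actual home interface:

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;

// Minimal sketch of caching the looked-up reference instead of repeating the
// lookup for every login. Class name and Object-typed cache are illustrative only.
public class AuthHomeCache {

    private static volatile Object cachedHome;

    public static Object getHome(String url) throws NamingException {
        Object home = cachedHome;
        if (home == null) {
            synchronized (AuthHomeCache.class) {
                if (cachedHome == null) {
                    Hashtable env = new Hashtable();
                    env.put(Context.PROVIDER_URL, url);
                    env.put(Context.INITIAL_CONTEXT_FACTORY,
                            "weblogic.jndi.WLInitialContextFactory");
                    Context ctx = new InitialContext(env);
                    try {
                        cachedHome = ctx.lookup("foo.bar.Bar"); // JNDI name from the log above
                    } finally {
                        ctx.close();
                    }
                }
                home = cachedHome;
            }
        }
        return home;
    }
}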

Praetorian answered 30/8, 2011 at 9:6 Comment(7)
Have you tried to put log messages between the steps described above, to determine which of them is taking most of the extra time? – Inculpable
Did that. It's the JNDI lookup. Please see the edited question above. – Praetorian
One more thing: it may be useful to eliminate the simple, obvious stuff, so: have you checked the hard disk on that machine? It may be something as simple as faulty hardware. – Eyeless
The Hashtable.put code to fetch a context is absolutely standard. Do you have separate times for the reflection calls to "create" and "invoke" - before and after? – Stag
Reflection calls complete in a matter of a few milliseconds 100% of the time, so no problem there. The problem is the JNDI lookups, which normally complete in 5-7 milliseconds but quite regularly take a very long time. Moreover, the long times tend to get longer the longer the server has been running. – Praetorian
Is the backend server part of a clustered environment, so that certain requests possibly go to a different server and are hence slower? – Stag
In this case there is no clustering involved, i.e. all requests go to the same server. – Praetorian

So what could be causing this rather erratic behavior?

Anything we say is likely to be a guess. Rather than playing that game, here are some suggestions for investigating the problem:

  • Try using a profiler to see where the time is being spent.
  • Try using a network tool (like Wireshark) to see if there is anomalous network traffic.
  • Add some logging / tracing at key points to see where the time is going (a minimal timing sketch follows this list).
  • Look for Thread.sleep(...) calls. (Oops ... that's a guess.)
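
For the logging / tracing suggestion, even a crude wrapper around the call you suspect is enough to confirm where the time goes. The helper below is only a sketch; the 100 ms threshold and the System.out output are arbitrary choices and should be swapped for your app's logger:

import javax.naming.Context;
import javax.naming.NamingException;

// Crude timing wrapper around a JNDI lookup.
public final class TimedLookup {

    public static Object lookup(Context ctx, String name) throws NamingException {
        long start = System.currentTimeMillis();
        try {
            return ctx.lookup(name);
        } finally {
            long elapsed = System.currentTimeMillis() - start;
            if (elapsed > 100) { // arbitrary threshold
                System.out.println("PERFORMANCE: looking up " + name
                        + " from JNDI took " + elapsed + " ms");
            }
        }
    }
}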
Hydrogenolysis answered 30/8, 2011 at 9:33 Comment(4)
+1: The profiler is the most likely to pick up where the time is being spent. I would try creating a load test which loads up the system to reproduce the problem. If you can find that a specific action causes the problem, that can be informative. – Aruabea
Thanks for the suggestions! The problem is that this issue occurs on the production server and so far we haven't been able to reproduce it on a test server. We will add more logging in the next patch and probably use a load generator and a profiler in the test environment to drill down into the issue. – Praetorian
You can do most of the above even on a production server. Yes, it may impact performance somewhat for the duration of your investigation, but the effect is not likely to be worse than the problem you're already suffering from. (Just make sure that you can quickly back out of any temporary investigative changes.) – Hydrogenolysis
Unless you can reproduce the problem on test, you have to profile production. It shouldn't impact your times that much. I suggest you use a commercial profiler like YourKit (on eval if you don't have the cash) rather than VisualVM, which is free but can have noisy results. – Aruabea

As a first step, I would try to isolate which of the steps you listed is causing the problem by logging the amount of time each of them takes. That way you can eliminate the parts that are irrelevant and focus on the right area, and maybe post here again when you have figured that out so that people can give specific advice.

Eyeless answered 30/8, 2011 at 9:31 Comment(0)

As StephenC says, some of this is guesswork without adequate logging statements in between. You have clearly listed each element in the transaction, but I assume you don't have a timestamped logger.debug statement that you can switch on.

Some questions to look at:

How many beans are in the pool for each of the frontend and backend beans? This is controlled by the max-beans-in-free-pool element of weblogic-ejb-jar.xml.

If you have more requests to the backend EJB than beans available, then there will be a waiting pileup.

Similarly on the JDBC front, you can use the Weblogic console to monitor any contention when getting connections - are you hitting high counts and waits in the JDBC Monitoring tab? That should be the next thing to check.
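
If watching the console in production is awkward, the same counters can also be read over JMX. The following is only a rough sketch: it assumes the standard Weblogic runtime MBean server URL and the documented JDBCDataSourceRuntime attribute names, it needs weblogic.jar on the classpath, and the host, port and credentials are placeholders:

import java.util.Hashtable;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;
import javax.naming.Context;

// Rough sketch: read JDBC data source wait statistics over JMX instead of the console.
public class JdbcPoolStats {

    public static void main(String[] args) throws Exception {
        JMXServiceURL url = new JMXServiceURL("t3", "backend-host", 7001,
                "/jndi/weblogic.management.mbeanservers.runtime"); // placeholder host/port
        Hashtable<String, String> env = new Hashtable<String, String>();
        env.put(Context.SECURITY_PRINCIPAL, "weblogic");   // placeholder credentials
        env.put(Context.SECURITY_CREDENTIALS, "password");
        env.put(JMXConnectorFactory.PROTOCOL_PROVIDER_PACKAGES, "weblogic.management.remote");

        JMXConnector connector = JMXConnectorFactory.connect(url, env);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            for (ObjectName ds : conn.queryNames(
                    new ObjectName("com.bea:Type=JDBCDataSourceRuntime,*"), null)) {
                System.out.println(ds.getKeyProperty("Name")
                        + " waitingHighCount=" + conn.getAttribute(ds, "WaitingForConnectionHighCount")
                        + " activeHighCount=" + conn.getAttribute(ds, "ActiveConnectionsHighCount"));
            }
        } finally {
            connector.close();
        }
    }
}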

Stag answered 30/8, 2011 at 12:27 Comment(3)
Thanks, I checked max-beans-in-free-pool. It's 100, so it shouldn't be a problem. Please see my edited question with some more information. There was a debug switch after all... – Praetorian
Right. Any link with the GC pattern then? With reflection I wonder if many classes are loaded in young space and a server GC coincides with the response time spikes. Check in the console. – Stag
Checked the GC logs and there seems to be no GC activity during the slow JNDI lookups. Rebooting did help a bit though: the occasional long lookups are still there, but they are not quite as long or as frequent any more. – Praetorian
