We're experiencing huge variance in response times in a legacy J2EE web application running on WebLogic 10.3. The system consists of two WebLogic server instances (frontend and backend), which run on the same physical machine, and an Oracle database running on a separate host. An external measurement tool alerts us every time a login to the system takes more than four seconds, and lately these warnings have been frequent. The log written by the servlet handling the login requests reveals that the time is spent in an EJB call from the frontend to the backend.
Example of the measured times:
Time       Duration (ms)
8:40:43 25
8:42:14 26
8:44:04 26
8:44:25 26
8:44:47 26
8:46:06 26
8:46:41 7744
8:47:00 27
8:47:37 27
8:49:00 26
8:49:37 26
8:50:03 8213
8:50:57 27
8:51:04 26
8:51:06 25
8:57:26 2545
8:58:13 26
9:00:06 5195
As can be seen, most of the requests (70%, taken from a larger sample) complete in a timely fashion, but a significant portion of them take a very long time to complete.
The steps performed during the measured time are the following (a rough code sketch of the frontend side follows the list):
- JNDI lookup of the session bean offering the authentication interface (frontend)
- Calling the authentication method of the session bean (frontend->backend)
- Reserving a JDBC connection from the connection pool (backend)
- Making a query to the user database (table size is very moderate and the table should be properly indexed) (backend)
- Reading the result set, creating the POJO user object (backend)
- Returning the POJO user object (backend->frontend)
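For illustration, the frontend side of that path looks roughly like this (a sketch only; Bar stands for the authentication interface, which turns out to be foo.bar.Bar in the logs below, the method and variable names are placeholders, and the EJB home/narrow/create boilerplate is omitted):

Context ctx = getJndiContext();                            // created once per session, see below
Bar authService = (Bar) ctx.lookup("foo.bar.Bar");         // step 1: JNDI lookup (frontend)
User user = authService.authenticate(username, password);  // step 2: remote EJB call (frontend -> backend)
// steps 3-6 (JDBC connection, query, POJO creation, return) run inside the backend bean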
The load on the server machine is very small (99% idle) and the number of users is very moderate. The amount of free memory reported by WebLogic varies between 60% and 90% on both servers. Garbage collection is logged; major collections are rare and complete in 2-3 seconds when they do occur. Furthermore, the major GC occurrences do not seem to coincide with the long response times, and the long response times occur during both busy and quiet hours. The JDBC connection pool max size is currently set to 80, which is more than the number of concurrent users.
Update:
Got permission to reboot the system with some more performance logging added. The log clearly shows that the JNDI lookup is the part where the time is spent:
03:01:23.977 PERFORMANCE: looking up foo.bar.Bar from JNDI took 6 ms
03:14:47.179 PERFORMANCE: looking up foo.bar.Bar from JNDI took 2332 ms
03:15:55.040 PERFORMANCE: looking up foo.bar.Bar from JNDI took 1585 ms
03:29:25.548 PERFORMANCE: looking up foo.bar.Bar from JNDI took 7 ms
03:31:09.010 PERFORMANCE: looking up foo.bar.Bar from JNDI took 6 ms
03:44:25.587 PERFORMANCE: looking up foo.bar.Bar from JNDI took 6 ms
03:46:00.289 PERFORMANCE: looking up foo.bar.Bar from JNDI took 7 ms
03:59:28.028 PERFORMANCE: looking up foo.bar.Bar from JNDI took 2052 ms
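The measurement is simply a timestamp taken around the lookup call, roughly like this (the log call is whatever the application already uses):

long start = System.currentTimeMillis();
Object ref = jndiContext.lookup("foo.bar.Bar");
long elapsed = System.currentTimeMillis() - start;
log.info("PERFORMANCE: looking up foo.bar.Bar from JNDI took " + elapsed + " ms");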
Looking at the GC logs of the frontend and backend shows that no garbage collection is running when the slow JNDI lookups occur.
The context is obtained in the following way when a session is created:
// environment for javax.naming.InitialContext
Hashtable<String, String> ht = new Hashtable<String, String>();
ht.put(Context.PROVIDER_URL, url);  // t3 URL of the backend server
ht.put(Context.INITIAL_CONTEXT_FACTORY, "weblogic.jndi.WLInitialContextFactory");
jndiContext = new InitialContext(ht);  // throws NamingException
where url is a t3:// URL pointing to the DNS name and port of the backend server. This should be OK, right?
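For reference, the URL has this form (the host name and port here are just placeholders):

String url = "t3://backend.example.com:7001";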
The first thing that comes to mind is to cache the references obtained from JNDI; at least that was the preferred way to go 10 years ago... But shouldn't WebLogic's InitialContext implementation already do this caching, or does it really fetch the reference from the backend server on each call?
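If caching is the answer, the idea would be roughly the following (a simplified sketch; Bar is the remote interface from the log above, and re-lookup on failure as well as finer-grained synchronization are glossed over):

private static volatile Bar cachedAuthService;

static Bar getAuthService(Context jndiContext) throws NamingException {
    Bar service = cachedAuthService;
    if (service == null) {
        // pay the JNDI lookup cost only once per JVM instead of on every login
        service = (Bar) jndiContext.lookup("foo.bar.Bar");
        cachedAuthService = service;
    }
    return service;
}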
What could be causing the frequent slow JNDI lookups? Is there a workaround for this (e.g. would caching the references help)?