My blog has moved!

You should be automatically redirected in 3 seconds. If not, visit
http://marc.blog.foreach.com/
and update your bookmarks.

Thursday, 26 February 2009

OC4J blocked threads ... part 2

And thus it became christmass and the new year 2009 was coming. When an upgrade was planned to the Oracle RAC so we could benefit from a new and faster storage.

Quite some preparation work had been done by the developers and the database administrators to ensure a (temporary) replication could be setup between the old RAC and the new RAC, so all data would be transferred more or less "automatically" into the new RAC.

And then the big day came. We stopped the OC4J containers, we changed the datasources while the database administrators synced the RACs a last time and when ready we started the OC4J containers back to run on the new RAC.

Uh-oh...blocked threads

We were happy the upgrade went smooth and the websites were running more crispier than ever until ... the containers started to lock up for some non-apparent reason. Nothing we hadn't done before, so I took a thread dump of the OC4J to see what was going on.
Thread t@788: (state = BLOCKED)
- oracle.jdbc.pool.OracleImplicitConnectionCache.retrieveCacheConnection(java.lang.String, java.lang.String, java.util.Properties) @bci=0, line=651 (Compiled frame; information may be imprecise)
- oracle.jdbc.pool.OracleImplicitConnectionCache.getCacheConnection(java.lang.String, java.lang.String, java.util.Properties) @bci=4, line=549 (Compiled frame)
- oracle.jdbc.pool.OracleImplicitConnectionCache.getConnection(java.lang.String, java.lang.String, java.util.Properties) @bci=65, line=430 (Interpreted frame)
- oracle.jdbc.pool.OracleDataSource.getConnection(java.lang.String, java.lang.String, java.util.Properties) @bci=31, line=432 (Interpreted frame)
- oracle.jdbc.pool.OracleDataSource.getConnection(java.lang.String, java.lang.String) @bci=16, line=203 (Interpreted frame)
- oracle.jdbc.pool.OracleDataSource.getConnection() @bci=33, line=179 (Compiled frame)
This thread was actually deadlocked by
Thread t@29: (state = BLOCKED)
- oracle.jdbc.driver.PhysicalConnection.closeLogicalConnection() @bci=0, line=1581 (Compiled frame; information may be imprecise)
- oracle.jdbc.driver.LogicalConnection.cleanupAndClose(boolean) @bci=31, line=267 (Interpreted frame)
- oracle.jdbc.pool.OracleImplicitConnectionCache.closeCheckedOutConnection(oracle.jdbc.pool.OraclePooledConnection, boolean) @bci=45, line=1503 (Interpreted frame)
- oracle.jdbc.pool.OracleImplicitConnectionCacheThread.runAbandonedTimeout(long) @bci=155, line=258 (Interpreted frame)
- oracle.jdbc.pool.OracleImplicitConnectionCacheThread.run() @bci=93, line=81 (Interpreted frame)
Also we were seeing a lot of connections to the OC4J TCP port which were on TIME_WAIT.
netstat -an | grep 12501 | wc -l
---> 444
A ticket was opened at Oracle in which we explained the situation. The first deadlock was caused by the abandoned-connection-timeout property which we had put to 60. Oracle advised us to remove the entry or put it on -1. We had this in our datasource just for security reason, if ever a query took more than 60 seconds we would break it off rather than keep it running for possibly much longer.

I found it a strange solution since we had not changed anything to the application level, never did we upgrade JDBC drivers or anything else ... but ok ... after setting the abandoned-connection-timeout property to -1 or just removing it, we started getting another behaviour. Now we would get Connection Timeout exceptions from the JDBC driver ... well ... kinda logica if you don't kill anything anymore, I guess the pool got used up for some reason. The stacktrace would look somewhat like this
08/12/23 03:32:26 java.sql.SQLException: Io exception: Io exception: Connection timed out
08/12/23 03:32:26 at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:77)
08/12/23 03:32:26 at oracle.jdbc.driver.DatabaseError.newSQLException(DatabaseError.java:132)
08/12/23 03:32:26 at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:200)
08/12/23 03:32:26 at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:264)
08/12/23 03:32:26 at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:575)
08/12/23 03:32:26 at oracle.jdbc.driver.ClobAccessor.getString(ClobAccessor.java:296)
08/12/23 03:32:26 at oracle.jdbc.driver.T4CClobAccessor.getString(T4CClobAccessor.java:77)
08/12/23 03:32:26 at oracle.jdbc.driver.OracleResultSetImpl.getString(OracleResultSetImpl.java:2138)
Ok we do heavily use Clobs, as sometimes was not adviced by some people who also run sister-sites. But avoiding a feature in Oracle because you know it might give you problems - we thought that was too far fetched. We saw this exception when using one of the latest JDBC-drivers available. As Oracle told us to "try this one out".

Meanwhile we started writing restart scripts, using a bit of python, when the socket would not answer after a decent period of time, we'd issue an OC4J restart so at least we could serve client request more or less decently. And since christmas and new year was coming closer, everything went just a tad slower.

What followed was a whole list of snoops and database traces. If you care to follow this SR just let me know your email-address as a comment. Meanwhile I wrote a wrapper for our DAO and noticed that queries that were using CLOBs were sometimes not finishing. This means that some of those queries stayed "running" and thus eventually we ran out of connections in the connection pool.

We were told to fiddle around with the
sqlnet.expire_time. After experimenting a bit we found out that this value was set to 10 minutes. Setting the sqlnet.expire_time to 4 minutes helped us quite a bit, as it took much longer for JDBC to fill all te connections in the connection pool. At least our website was more stable (and needing less restarts).

Quite some days related we encountered the following snoop (IP addresses obfuscated)
61035 173.904541111.11.111.11222.22.22.222 TCP ingreslock > 46083 [PSH, ACK] Seq=115508 Ack=30515 Win=49680 [TCP CHECKSUM INCORRECT] Len=911
61073 174.318971111.11.111.11222.22.22.222 TCP [TCP Retransmission] ingreslock > 46083 [PSH, ACK] Seq=115508 Ack=30515 Win=49680 [TCP CHECKSUM INCORRECT] Len=911
61274 175.158981111.11.111.11222.22.22.222 TCP [TCP Retransmission] ingreslock > 46083 [PSH, ACK] Seq=115508 Ack=30515 Win=49680 [TCP CHECKSUM INCORRECT] Len=911
61998 176.839470111.11.111.11222.22.22.222 TCP [TCP Retransmission] ingreslock > 46083 [PSH, ACK] Seq=115508 Ack=30515 Win=49680 [TCP CHECKSUM INCORRECT] Len=911
63332 180.198899111.11.111.11222.22.22.222 TCP [TCP Retransmission] ingreslock > 46083 [PSH, ACK] Seq=115508 Ack=30515 Win=49680 [TCP CHECKSUM INCORRECT] Len=911
This was said to be a retransmission problem by Oracle. Liking this explanation more we opened a ticket at Sun to see if they had noticed any issues like this. Sun coordinated with Oracle to look at the snoops and proposed us the following kernel parameter:
set ip:dohwcksum=0
And with this kernel parameter ... all problems went away. Now I really am not a Solaris expert to know what this kernel parameter does. I read it's a hardware checksum. I personally think it's weird to remove a "check" which is supposed to say if things are correct or not ... but hey ... Sun and Oracle told us to do it, and the problems are gone.

And after 3 months the end of this SR reached a Soft Close state and our website ran smoothly again. Never think a database upgrade might not impact your site obviously.