Quite some preparation work had been done by the developers and the database administrators to ensure a (temporary) replication could be setup between the old RAC and the new RAC, so all data would be transferred more or less "automatically" into the new RAC.
And then the big day came. We stopped the OC4J containers, we changed the datasources while the database administrators synced the RACs a last time and when ready we started the OC4J containers back to run on the new RAC.
Uh-oh...blocked threads
We were happy the upgrade went smooth and the websites were running more crispier than ever until ... the containers started to lock up for some non-apparent reason. Nothing we hadn't done before, so I took a thread dump of the OC4J to see what was going on.
Thread t@788: (state = BLOCKED)This thread was actually deadlocked by
- oracle.jdbc.pool.OracleImplicitConnectionCache.retrieveCacheConnection(java.lang.String, java.lang.String, java.util.Properties) @bci=0, line=651 (Compiled frame; information may be imprecise)
- oracle.jdbc.pool.OracleImplicitConnectionCache.getCacheConnection(java.lang.String, java.lang.String, java.util.Properties) @bci=4, line=549 (Compiled frame)
- oracle.jdbc.pool.OracleImplicitConnectionCache.getConnection(java.lang.String, java.lang.String, java.util.Properties) @bci=65, line=430 (Interpreted frame)
- oracle.jdbc.pool.OracleDataSource.getConnection(java.lang.String, java.lang.String, java.util.Properties) @bci=31, line=432 (Interpreted frame)
- oracle.jdbc.pool.OracleDataSource.getConnection(java.lang.String, java.lang.String) @bci=16, line=203 (Interpreted frame)
- oracle.jdbc.pool.OracleDataSource.getConnection() @bci=33, line=179 (Compiled frame)
Thread t@29: (state = BLOCKED)
- oracle.jdbc.driver.PhysicalConnection.closeLogicalConnection() @bci=0, line=1581 (Compiled frame; information may be imprecise)
- oracle.jdbc.driver.LogicalConnection.cleanupAndClose(boolean) @bci=31, line=267 (Interpreted frame)
- oracle.jdbc.pool.OracleImplicitConnectionCache.closeCheckedOutConnection(oracle.jdbc.pool.OraclePooledConnection, boolean) @bci=45, line=1503 (Interpreted frame)
- oracle.jdbc.pool.OracleImplicitConnectionCacheThread.runAbandonedTimeout(long) @bci=155, line=258 (Interpreted frame)
- oracle.jdbc.pool.OracleImplicitConnectionCacheThread.run() @bci=93, line=81 (Interpreted frame)
netstat -an | grep 12501 | wc -lA ticket was opened at Oracle in which we explained the situation. The first deadlock was caused by the abandoned-connection-timeout property which we had put to 60. Oracle advised us to remove the entry or put it on -1. We had this in our datasource just for security reason, if ever a query took more than 60 seconds we would break it off rather than keep it running for possibly much longer.
---> 444
I found it a strange solution since we had not changed anything to the application level, never did we upgrade JDBC drivers or anything else ... but ok ... after setting the abandoned-connection-timeout property to -1 or just removing it, we started getting another behaviour. Now we would get Connection Timeout exceptions from the JDBC driver ... well ... kinda logica if you don't kill anything anymore, I guess the pool got used up for some reason. The stacktrace would look somewhat like this
08/12/23 03:32:26 java.sql.SQLException: Io exception: Io exception: Connection timed outOk we do heavily use Clobs, as sometimes was not adviced by some people who also run sister-sites. But avoiding a feature in Oracle because you know it might give you problems - we thought that was too far fetched. We saw this exception when using one of the latest JDBC-drivers available. As Oracle told us to "try this one out".
08/12/23 03:32:26 at oracle.jdbc.driver.SQLStateMapping.newSQLException(SQLStateMapping.java:77)
08/12/23 03:32:26 at oracle.jdbc.driver.DatabaseError.newSQLException(DatabaseError.java:132)
08/12/23 03:32:26 at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:200)
08/12/23 03:32:26 at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:264)
08/12/23 03:32:26 at oracle.jdbc.driver.DatabaseError.throwSqlException(DatabaseError.java:575)
08/12/23 03:32:26 at oracle.jdbc.driver.ClobAccessor.getString(ClobAccessor.java:296)
08/12/23 03:32:26 at oracle.jdbc.driver.T4CClobAccessor.getString(T4CClobAccessor.java:77)
08/12/23 03:32:26 at oracle.jdbc.driver.OracleResultSetImpl.getString(OracleResultSetImpl.java:2138)
Meanwhile we started writing restart scripts, using a bit of python, when the socket would not answer after a decent period of time, we'd issue an OC4J restart so at least we could serve client request more or less decently. And since christmas and new year was coming closer, everything went just a tad slower.
What followed was a whole list of snoops and database traces. If you care to follow this SR just let me know your email-address as a comment. Meanwhile I wrote a wrapper for our DAO and noticed that queries that were using CLOBs were sometimes not finishing. This means that some of those queries stayed "running" and thus eventually we ran out of connections in the connection pool.
We were told to fiddle around with the
Quite some days related we encountered the following snoop (IP addresses obfuscated)
61035 173.904541111.11.111.11222.22.22.222 TCP ingreslock > 46083 [PSH, ACK] Seq=115508 Ack=30515 Win=49680 [TCP CHECKSUM INCORRECT] Len=911This was said to be a retransmission problem by Oracle. Liking this explanation more we opened a ticket at Sun to see if they had noticed any issues like this. Sun coordinated with Oracle to look at the snoops and proposed us the following kernel parameter:
61073 174.318971111.11.111.11222.22.22.222 TCP [TCP Retransmission] ingreslock > 46083 [PSH, ACK] Seq=115508 Ack=30515 Win=49680 [TCP CHECKSUM INCORRECT] Len=911
61274 175.158981111.11.111.11222.22.22.222 TCP [TCP Retransmission] ingreslock > 46083 [PSH, ACK] Seq=115508 Ack=30515 Win=49680 [TCP CHECKSUM INCORRECT] Len=911
61998 176.839470111.11.111.11222.22.22.222 TCP [TCP Retransmission] ingreslock > 46083 [PSH, ACK] Seq=115508 Ack=30515 Win=49680 [TCP CHECKSUM INCORRECT] Len=911
63332 180.198899111.11.111.11222.22.22.222 TCP [TCP Retransmission] ingreslock > 46083 [PSH, ACK] Seq=115508 Ack=30515 Win=49680 [TCP CHECKSUM INCORRECT] Len=911
set ip:dohwcksum=0And with this kernel parameter ... all problems went away. Now I really am not a Solaris expert to know what this kernel parameter does. I read it's a hardware checksum. I personally think it's weird to remove a "check" which is supposed to say if things are correct or not ... but hey ... Sun and Oracle told us to do it, and the problems are gone.
And after 3 months the end of this SR reached a Soft Close state and our website ran smoothly again. Never think a database upgrade might not impact your site obviously.
