Visitor :

May 4, 2009

Deadlock using commons-pool 1.4 and commons-dbcp 1.2.2


I have found critical situation about deadlock.
Http thread was full. Most of apache workers was in sending reply.

Busy http thread grew up until suspending its function. When the apache could not respond any request, it removed from L4 (it have L7 function) because of no response.

Every web servers (600 ea) always does not happen, but some web application servers(10ea) fell into the trouble per 2-3 days.

Apache Status :

Server
Application Version : 2.0.59, Mod_jk 1.2.18, Tomcat 5.5.17, JDK 6.0 update 6 Linux Kernel 2.6.9-67.0.22.ELsmp #1 SMP,
OS : CentOS
gcc version
: gcc version 3.4.6 20060404 (Red Hat 3.4.6-9)
glibc
버전 glibc-2.3.4-2.39


I was wonderying why happened.

1) I captured apache server status.
Apache Server Status

Server uptime: 4 days 20 hours 41 minutes 2 seconds
Total accesses: 43491259 - Total Traffic: 10.8 GB
CPU Usage: u33.75 s6.42 cu0 cs0 - .00956% CPU load
104 requests/sec - 27.0 kB/second - 266 B/request
109 requests currently being processed, 4 idle workers
WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW_WWW.WWWWWWWWWWWW CWWWWWWWWWWWWWWWWWWWWW_WWWWCWWWWWWWWWRWWWWWW__CCR.C............. ................................................................ ................................................................ 

Scoreboard Key:
"_" Waiting for Connection, "S" Starting up, "R" Reading Request,
"W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup,
"C" Closing connection, "L" Logging, "G" Gracefully finishing,
"I" Idle cleanup of worker, "." Open slot with no current process


2) lsof
The number of CLOSE_WAIT status tcp sockets was 140. The number was close equal to the number of apache workers was in sending reply.
java 16907 www 52u IPv6 1014864218 TCP localhost.localdomain:8009->localhost.localdomain:35205 (CLOSE_WAIT)
java 16907 www 53u IPv6 1014869414 TCP localhost.localdomain:8009->localhost.localdomain:35414 (CLOSE_WAIT)
java 16907 www 54u IPv6 1014861929 TCP localhost.localdomain:8009->localhost.localdomain:35171 (CLOSE_WAIT)
java 16907 www 55u IPv6 1014863474 TCP localhost.localdomain:8009->localhost.localdomain:35200 (CLOSE_WAIT)
java 16907 www 56u IPv6 1014781598 TCP localhost.localdomain:8009->localhost.localdomain:60006 (CLOSE_WAIT)
java 16907 www 57u IPv6 1014860437 TCP localhost.localdomain:8009->localhost.localdomain:35091 (CLOSE_WAIT)
java 16907 www 58u IPv6 1014859292 TCP localhost.localdomain:8009->localhost.localdomain:35062 (CLOSE_WAIT)
java 16907 www 59u IPv6 1014853698 TCP localhost.localdomain:8009->localhost.localdomain:34860 (CLOSE_WAIT)
java 16907 www 60u IPv6 1014864865 TCP localhost.localdomain:8009->localhost.localdomain:35353 (CLOSE_WAIT)
java 16907 www 61u IPv6 1014868132 TCP localhost.localdomain:8009->localhost.localdomain:35356 (CLOSE_WAIT)
java 16907 www 62u IPv6 1014864283 TCP localhost.localdomain:8009->localhost.localdomain:35222 (CLOSE_WAIT)
..


3) Thread dump
I found dead lock situation in dumped thread in java log.

i. java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.commons.pool.impl.GenericObjectPool.borrowObject(GenericObjectPool.java:916)
- waiting to lock <0x8fedb478> (a org.apache.commons.pool.impl.GenericObjectPool)

1. "TP-Processor2" daemon prio=10 tid=0x0821f400 nid=0x4286 waiting for monitor entry [0x88ed3000..0x88ed50a0]

2. "TP-Processor2" daemon prio=10 tid=0x0821f400 nid=0x4286 waiting for monitor entry [0x88ed3000..0x88ed50a0]



Found one Java-level deadlock:

=============================

"TP-Processor200":

waiting to lock monitor 0x080dfd24 (object 0x8fedb478, a org.apache.commons.pool.impl.GenericObjectPool),

which is held by "Timer-0"

"Timer-0":

waiting to lock monitor 0x080dfcc0 (object 0x8fee6f70, a org.apache.commons.dbcp.PoolableConnection),

which is held by "TP-Processor28"

"TP-Processor28":

waiting to lock monitor 0x080dfd24 (object 0x8fedb478, a org.apache.commons.pool.impl.GenericObjectPool),

which is held by "Timer-0"

"TP-Processor200": (Select query) return queryForList("xxx.selectXXXCount",param);

"TP-Processor28": (Select query)

queryForObject("selectmusic",params);



This deadlock reported in "Dead lock using the evictor"(https://issues.apache.org/jira/browse/DBCP-270)
In this jira, there are patch on that.


Commons Dbcp

Dead lock using the evictor

Created: 26/Jun/08 10:18 AM Updated: 27/Jun/08 03:41 AM
Component/s: None Affects Version/s: 1.2.2 Fix Version/s: 1.3

Patch attached, dead lock reported. the abandonedtrace does synchronized(this) when it really only needs to do synchronized(this.trace)

Dead lock reported when using the evictor

=============================
"Timer-3":
waiting to lock monitor 0x0000000053b40548 (object 0x00002aaabf3210f0, a
org.apache.tomcat.dbcp.dbcp.PoolableConnection),
which is held by "TP-Processor27"
"TP-Processor27":
waiting to lock monitor 0x0000000053b404d0 (object 0x00002aaab9fa8b08, a
org.apache.tomcat.dbcp.pool.impl.GenericObjectPool),
which is held by "Timer-3"

Java stack information for the threads listed above:

patch source

Index: src/java/org/apache/commons/dbcp/AbandonedTrace.java
===================================================================
--- src/java/org/apache/commons/dbcp/AbandonedTrace.java (revision 671937)
+++ src/java/org/apache/commons/dbcp/AbandonedTrace.java (working copy)
@@ -172,7 +172,7 @@
* @param trace AbandonedTrace object to add
*/
protected void addTrace(AbandonedTrace trace) {
- synchronized (this) {
+ synchronized (this.trace) {
this.trace.add(trace);
}
setLastUsed();
@@ -182,8 +182,8 @@
* Clear the list of objects being traced by this
* object.
*/
- protected synchronized void clearTrace() {
- if (this.trace != null) {
+ protected void clearTrace() {
+ synchronized(this.trace) {
this.trace.clear();
}
}
@@ -194,8 +194,10 @@
* @return List of objects
*/
protected List getTrace() {
- return trace;
+ synchronized (this.trace) {
+ return new ArrayList(trace);
}
+ }
/**
* If logAbandoned=true, print a stack trace of the code that
@@ -206,7 +208,7 @@
System.out.println(format.format(new Date(createdTime)));
createdBy.printStackTrace(System.out);
}
- synchronized(this) {
+ synchronized(this.trace) {
Iterator it = this.trace.iterator();
while (it.hasNext()) {
AbandonedTrace at = (AbandonedTrace)it.next();
@@ -220,8 +222,8 @@
*
* @param trace AbandonedTrace object to remvoe
*/
- protected synchronized void removeTrace(AbandonedTrace trace) {
- if (this.trace != null) {
+ protected void removeTrace(AbandonedTrace trace) {
+ synchronized(this.trace) {
this.trace.remove(trace);
}
}


Because Commons-dbcp 1.3 is not year released in formal. (now : 4. 2009) and I do not know what willing happen if the snapshop patch version, I used patched commons-dbcp 1.2.2 library.

After patching that version, there is no trouble or problem like that.


1 comment: