
Re: [xmlblaster] socket reconnect flood



Hi Marcel!

My reconnect flooding problem is getting weirder: it also occurs when there is no apparent problem with the truststore file.
Here is part of a log file (about 200 megs of this per 5 hours are generated when the bug appears):
(Unfortunately I forgot to switch on FINER logging for the java.util.logging framework...)

2006-11-06 20:20:07,847  INFO [XmlBlaster.PingTimer] (SocketConnection.java:180) - SSL client socket enabled for socket://******:7608, keyStore=/home/disp/disp/conf/nova_disp/truststore
2006-11-06 20:20:07,846  INFO [XmlBlaster.PingTimer] (SocketConnection.java:180) - SSL client socket enabled for socket://******:7608, keyStore=/home/disp/disp/conf/nova_disp/truststore
2006-11-06 20:20:07,849  WARN [XmlBlaster.PingTimer] (Timeout.java:189) - No connection established, socket://******:7608 still seems to be down after 5239 connection retries.
2006-11-06 20:20:07,858  INFO [XmlBlaster.PingTimer] (SocketConnection.java:180) - SSL client socket enabled for socket://******:7608, keyStore=/home/disp/disp/conf/nova_disp/truststore
2006-11-06 20:20:08,042  INFO [XmlBlaster.PingTimer] (SocketConnection.java:180) - SSL client socket enabled for socket://******:7608, keyStore=/home/disp/disp/conf/nova_disp/truststore
2006-11-06 20:20:08,043  WARN [XmlBlaster.PingTimer] (Timeout.java:189) - No connection established, socket://******:7608 still seems to be down after 2965 connection retries.

The really interesting things to note are:
- The first two messages are in reverse time order. This can only happen if "XmlBlaster.PingTimer" is in fact not ONE thread, but several threads with the same name, all trying to connect at once!
- No mention of the truststore not being found. This is still the original, unpatched code running.
- Look at the messages from Timeout.java:189. First there were 5239 connection retries, then 2965. This pattern is consistent throughout all 200 megs of logs. Either someone's randomizing numbers :), or there really are separate objects (and threads) trying to connect, each keeping its own count.
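
One way to check the "several threads with the same name" theory directly would be to count the live threads per name inside the affected JVM. The following is only a minimal sketch: the class ThreadNameCensus and its methods are made-up names, not part of xmlBlaster, and it assumes the check can be run inside the client process (it uses only the standard Thread.getAllStackTraces() call and a recent Java version).

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical diagnostic helper, not part of xmlBlaster. */
final class ThreadNameCensus {

    /** Returns, for each thread name, how many live threads in this JVM carry it. */
    static Map<String, Integer> countByName() {
        Map<String, Integer> counts = new TreeMap<>();
        // getAllStackTraces() returns a snapshot keyed by every live thread.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            counts.merge(t.getName(), 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the number of live threads with exactly the given name. */
    static int count(String name) {
        return countByName().getOrDefault(name, 0);
    }
}
```

If ThreadNameCensus.count("XmlBlaster.PingTimer") returns more than 1 while the flood is happening, that would support the theory; a thread dump of the process should show the same thing.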

It seems that somewhere in the code there may be a race condition or similar which, once triggered by something, allows a ravaging horde of connect threads to spawn and kill everything in their way. :)
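
I have not found where the real problem is, so the following is only a generic illustration of the kind of guard that keeps concurrent callers from starting overlapping reconnect attempts. Everything in it (the class SingleReconnectGuard, the method tryReconnect, the retry counter) is a made-up name for this sketch and does not reflect how xmlBlaster is actually implemented.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: lets at most one reconnect attempt run at a time. */
final class SingleReconnectGuard {
    private final AtomicBoolean inProgress = new AtomicBoolean(false);
    private final AtomicInteger retries = new AtomicInteger();

    /**
     * Runs the attempt only if no other attempt is currently in flight.
     * Returns false (without running anything) when another caller holds the guard.
     */
    boolean tryReconnect(Runnable attempt) {
        // compareAndSet is atomic, so only one caller can flip false -> true.
        if (!inProgress.compareAndSet(false, true)) {
            return false;
        }
        try {
            retries.incrementAndGet(); // one shared counter instead of one per object
            attempt.run();
            return true;
        } finally {
            inProgress.set(false); // release the guard even if the attempt throws
        }
    }

    /** Number of attempts that were actually started. */
    int retries() {
        return retries.get();
    }
}
```

With a single shared guard like this, the retry count in the log could only go up, so out-of-order counts such as 5239 followed by 2965 would not appear.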

Thanks for the help,
Balázs Póka