The Programming Programmer: June 2009

We are having a strange (as of now) problem in our performance test environment.

A few facts:

1. This is the first time we have opted for virtual machines over physical ones.
2. There are 6 Web servers in the DMZ, 3 APP servers (all WCF services, with a firewall between Web and APP) and 6 DB servers (again with a firewall between APP and DB).
3. As usual, there is a VIP in front of the APP and DB servers for load balancing.

When we ran the tests, everything seemed to go haywire (with response times over a minute). So we wanted to narrow down the problem and performed the steps below:

1. One-on-one testing (1 Web, 1 APP and 1 DB), all in the virtual environment. This didn't yield much improvement.
2. 1 Web (VM) against 1 APP (physical): this got us close to our benchmarks.
3. 1 Web (physical) against 1 APP (VM): again a bad result.
And many more combinations like this.

So we concluded:
1. The APP (VM) is the problem, as the physical server yielded good performance.
2. It cannot be the code base; after all, it is the same code that performs well on the physical server.

We looked at the event logs (Web servers and the PIX, a Cisco firewall) and found that the errors below occurred most often (the first 3 are related to WCF):

1. EndpointNotFound exception
2. Timeout exception
3. Protocol exception
4. TCP 10048 error (from PIX).

The above errors are thrown randomly, and which ones we see depends on the configuration we are testing against. That is:

With VIP - we get only the EndpointNotFound exception
Without VIP - all the others

Before I give my suggestions, a disclaimer: this is my project in L & P, and the suggestions are yet to be applied.

1. EndpointNotFound exception
Simply due to the caching nature of WCF plus unplanned testing.

First, .NET caches previously used TCP sockets.
Second, testers simply disable certain APP servers and continue testing, assuming that it will work out of the box.
So what happens if a cached socket points to a disabled APP server?
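
To make the question concrete, here is a toy simulation in Python. It is not WCF's actual pooling logic; the AppServer and CachingClient classes and the retry flag are invented for illustration. It only shows how a client that reuses a cached connection keeps failing against a disabled server until the stale entry is dropped.

```python
class AppServer:
    """Stand-in for one APP server (hypothetical, for illustration only)."""

    def __init__(self, name):
        self.name = name
        self.enabled = True

    def handle(self, request):
        if not self.enabled:
            # Loosely mirrors what the caller sees when the endpoint is gone
            raise ConnectionError(f"{self.name} is not reachable")
        return f"{self.name} handled {request}"


class CachingClient:
    """Keeps reusing the server it last connected to, like a cached socket."""

    def __init__(self, servers):
        self.servers = servers
        self.cached = None

    def _connect(self):
        # Pick the first enabled server
        for server in self.servers:
            if server.enabled:
                return server
        raise ConnectionError("no server available")

    def call(self, request, retry=False):
        if self.cached is None:
            self.cached = self._connect()
        try:
            return self.cached.handle(request)
        except ConnectionError:
            if not retry:
                raise
            # Drop the stale entry and connect again
            self.cached = self._connect()
            return self.cached.handle(request)


servers = [AppServer("app1"), AppServer("app2")]
client = CachingClient(servers)

first = client.call("req-1")   # served by app1, which is now cached
servers[0].enabled = False     # a tester disables app1 in the middle of a run

try:
    client.call("req-2")
    stale_failed = False
except ConnectionError:
    stale_failed = True        # the cached connection still points at app1

recovered = client.call("req-3", retry=True)  # stale entry dropped, app2 used
```

In this sketch the failure goes away only when the client discards the cached connection, which suggests that disabling servers mid-test without recycling the clients could plausibly produce the errors we saw.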

2. Timeout exception
Nothing much to discuss here; it is due to the default value (10) set for 'maxConcurrentSessions'. We increased it to 1000 (though this didn't solve our problem).
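
A rough back-of-the-envelope sketch of why a low session limit can turn into timeouts. All the load numbers here (500 simultaneous clients, 2 seconds per session, a 60 second timeout) are assumptions for illustration, not measurements from our environment, and the model ignores everything except queueing behind the limit.

```python
def queue_wait_seconds(position, max_sessions, hold_seconds):
    """Wait before the request at `position` (0-based) gets a session slot.

    Simplification: all requests arrive at once and every session is held
    for exactly `hold_seconds`.
    """
    return (position // max_sessions) * hold_seconds


def timed_out(clients, max_sessions, hold_seconds, timeout_seconds):
    """Count requests whose queueing delay alone exceeds the timeout."""
    return sum(
        1
        for position in range(clients)
        if queue_wait_seconds(position, max_sessions, hold_seconds) > timeout_seconds
    )


# Hypothetical load: 500 clients, 2 s per session, 60 s client timeout
with_default_limit = timed_out(500, 10, 2, 60)    # session limit of 10
with_raised_limit = timed_out(500, 1000, 2, 60)   # session limit of 1000

print(with_default_limit, with_raised_limit)
```

Under these assumptions the limit of 10 leaves 190 of the 500 requests waiting longer than the timeout, while a limit of 1000 leaves none. Since raising the limit did not fix our problem, the throttle is probably only one contributor.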

3. Protocol exception
No clue.

4. TCP 10048 error (from PIX).
TCP port exhaustion: all of the roughly 4000 temporary ports are getting used. You can find plenty of articles on this.
We plan to reduce the TIME_WAIT of TCP connections from 4 minutes (the default) to 2 seconds. It is a quick win.
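
The arithmetic behind this can be sketched in a few lines of Python. It is a simplification: it assumes every connection is opened from the same machine, uses one temporary port, and then sits in TIME_WAIT for the full period before the port can be reused. The function name is made up for this example.

```python
def max_new_connections_per_second(temporary_ports, time_wait_seconds):
    """Upper bound on the sustained rate of new outbound connections.

    Each closed connection holds its port for `time_wait_seconds`, so at
    most `temporary_ports` connections can be opened in that window.
    """
    return temporary_ports / time_wait_seconds


default_rate = max_new_connections_per_second(4000, 4 * 60)  # 4 minute TIME_WAIT
tuned_rate = max_new_connections_per_second(4000, 2)         # 2 second TIME_WAIT

print(round(default_rate, 1), tuned_rate)
```

With the defaults this works out to fewer than 17 new connections per second before the ports run out, which a load test can easily exceed when connections are not reused. With a 2 second TIME_WAIT the same bound is 2000 per second.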

One thing I don't understand is: why does port exhaustion happen only on the VMs?
I will update this post once I have a definitive answer.


Monday, 22 June 2009

Load and Performance - Post 2

Load & Performance - Post 1