Robin Minto

Software development, security and miscellany

Microsoft Threat Management Gateway web farm publishing issue – “The remote server has been paused or is in the process of being started”

We’ve recently uncovered an issue with the way that I had configured web farm publishing in Microsoft Threat Management Gateway (TMG). When I say “we”, I include Microsoft Support, who really got to the bottom of the problem. Of course, they’re in a privileged position as they have access to the source code of the product.

Perhaps I would have resolved it eventually. I’m thankful for MS support though. I didn’t find anything on the web to help me with this problem so, on the off chance it can help someone else, I thought I’d write it up.

The Symptoms

We’ve been switching our web publishing from Windows NLB to TMG web farms for balancing load to our IIS servers and we began seeing an intermittent issue. One minute we were successfully serving pages but the next minute, clients would receive an HTTP 500 error “The remote server has been paused or is in the process of being started” and a Failed Connection Attempt with HTTP Status Code 70 would appear in the TMG logs.

08102011_211144

The issue would last for 30 to 60 seconds and then publishing would resume successfully. This would normally indicate that TMG has detected, using connectivity verifiers for the farm, that no servers are available to respond to requests. However, the servers appeared to be fine from the perspective of our monitoring system (behind the firewall) and for clients connecting in a different way (either over a VPN or via a TMG single-server publishing rule).

The (Wrong) Setup

Let’s say we have a pair of web servers, Web1 and Web2, protected from the Internet by TMG.

08102011_223957

Each web server has a number of web sites in IIS, each bound to port 80 and a different host header. All of the host headers for a single web server map to the same internal IP address like this:

Host name | IP address
prod.admin.web1 | 172.16.0.1
prod.cms.web1 | 172.16.0.1
prod.static.web1 | 172.16.0.1
prod.admin.web2 | 172.16.0.2
prod.cms.web2 | 172.16.0.2
prod.static.web2 | 172.16.0.2

In reality, you should fully qualify the host name (e.g. prod.admin.web1.examplecorp.local) but I haven’t for this example.

I’ll assume that you know how to publish a web farm using TMG. We have a server farm configured for each web site with each web server configured like this (N.B. this is wrong as we’ll see later):

08102011_220001

The benefit of this approach is that because we’ve specified the host header (prod.admin.web1) rather than just the server name (web1), we don’t have to specify the host header in the connectivity verifier:

08102011_215828

This setup appears to work but under load, and as additional web sites and farm objects are added, our symptoms start to appear.

The Problem

So what was happening? TMG maintains open connections to the web servers and uses them to carry the reverse-proxied requests from clients on the Internet. Despite the fact that all of the host headers for a given server resolve to the same IP address, TMG compares the farm object entries by host name, so they appear to be different servers. This means that TMG is opening and closing connections more often than it should.
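To make that concrete, here is a toy model in Python. It is not TMG's real code, and the function name and the `DNS` dictionary (including the assumption that `web1` resolves to 172.16.0.1) are made up for illustration. It only shows how keying cached connections by name rather than by address multiplies the number of distinct entries for one physical server.

```python
# Toy model only: NOT how TMG is implemented, just an illustration of
# a connection cache keyed by name versus one keyed by address.

# Hypothetical name -> IP mapping based on the example table.
DNS = {
    "prod.admin.web1": "172.16.0.1",
    "prod.cms.web1": "172.16.0.1",
    "prod.static.web1": "172.16.0.1",
    "web1": "172.16.0.1",  # assumed server host name
}


def connections_needed(farm_server_names, key_by_ip=False):
    """Count the distinct cache keys produced by the names in the farm objects."""
    keys = set()
    for name in farm_server_names:
        keys.add(DNS[name] if key_by_ip else name)
    return len(keys)


# Three farm objects, each naming the same physical server differently.
by_host_header = ["prod.admin.web1", "prod.cms.web1", "prod.static.web1"]
# The same three farm objects, all using the server's own host name.
by_server_name = ["web1", "web1", "web1"]

print(connections_needed(by_host_header))                  # 3
print(connections_needed(by_host_header, key_by_ip=True))  # 1
print(connections_needed(by_server_name))                  # 1
```

With host header names, one server looks like three; with a consistent server name it looks like one, which is the behaviour the fix relies on.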

The Solution

The solution is to specify the servers in the server farm object using the server host name and not the host header name. You have to do this for all farm objects that are using the same servers.

08102011_212545

You then have to specify the host header in the connectivity verifier:

08102011_212849
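The verifier itself is configured in the TMG console, but it may help to see what a check of this shape looks like on the wire. The sketch below assumes a plain HTTP GET probe; `build_probe` is a hypothetical helper, not part of any TMG API, and nothing is actually sent.

```python
def build_probe(host_header, path="/"):
    """Return the raw HTTP/1.1 request text for a simple health probe."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host_header}\r\n"
        "Connection: close\r\n"
        "\r\n"
    )


# The connection would go to the server by its own name (web1, port 80),
# while the Host header selects the IIS site being checked.
target = ("web1", 80)
probe = build_probe("prod.admin.web1")
print(probe)
```

The point is the split: the farm object says where to connect (`web1`), and the verifier says which site to ask for (`prod.admin.web1`).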

You could also use the IP address of the server. This is the configuration that Jason Jones recommends but I prefer the clarity of host name over IP address. I’m trusting that DNS will work as it should and won’t add much overhead. If you need support with TMG, Jason is excellent by the way.

Conclusion

Specifying the servers by host header name seemed logical to me. It was explicit and didn’t require that element of configuration to be hidden away in the connectivity verifier.

I switched from host header to IP address as part of testing but it didn’t fix our problem, because I only used IP addresses for a single farm object and not all of them.

Although TMG could identify open server connections based on IP address, it doesn’t. It uses host name. This has to be taken into account when configuring farm objects. In summary, if you’re using multiple server farm objects for the same servers, make sure you specify the server name consistently. Use IP address or an identical host name.
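A check along these lines could catch the inconsistency before it reaches production. This is a minimal sketch: the farm definitions, the `RESOLVES_TO` lookup table and the function name are all hypothetical, and a real check would need to read the farm objects from TMG and resolve names through DNS.

```python
from collections import defaultdict

# Hypothetical static lookup standing in for DNS.
RESOLVES_TO = {
    "web1": "172.16.0.1",
    "web2": "172.16.0.2",
    "prod.admin.web1": "172.16.0.1",
    "prod.admin.web2": "172.16.0.2",
    "prod.cms.web1": "172.16.0.1",
    "prod.cms.web2": "172.16.0.2",
}


def inconsistent_servers(farms, resolves_to):
    """Return {ip: sorted names} for each IP that is referenced under more than one name."""
    names_by_ip = defaultdict(set)
    for servers in farms.values():
        for name in servers:
            # An entry that is already an IP address maps to itself.
            names_by_ip[resolves_to.get(name, name)].add(name)
    return {ip: sorted(names) for ip, names in names_by_ip.items() if len(names) > 1}


wrong = {
    "admin": ["prod.admin.web1", "prod.admin.web2"],
    "cms": ["prod.cms.web1", "prod.cms.web2"],
}
half_fixed = {  # only one farm switched to IP addresses
    "admin": ["172.16.0.1", "172.16.0.2"],
    "cms": ["prod.cms.web1", "prod.cms.web2"],
}
fixed = {
    "admin": ["web1", "web2"],
    "cms": ["web1", "web2"],
}

print(inconsistent_servers(wrong, RESOLVES_TO))
print(inconsistent_servers(half_fixed, RESOLVES_TO))
print(inconsistent_servers(fixed, RESOLVES_TO))  # {}
```

The `half_fixed` case mirrors my failed test: changing one farm object is not enough, because the same server is still referenced under two different names.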

How many hops? Internet from far, far away.

IMG_20110917_123232

I was lucky enough to spend a week in the Seychelles this month. If your geography is anything like mine you won’t know that the Seychelles is “an island country spanning an archipelago of 115 islands in the Indian Ocean, some 1,500 kilometres (932 mi) east of mainland Africa, northeast of the island of Madagascar” – thank you Wikipedia.

In our connected world, the Internet even reaches into the Indian Ocean so I wasn’t deprived of email, BBC News, Facebook etc. but I did begin to wonder how such a small and remote island nation is wired up. It turns out that I was online via Cable and Wireless and a local ISP that they own called Atlas.

Being a geek I thought I’d run some tests, starting with SpeedTest. The result was download and upload speeds around 0.6Mbps and a latency of around 700ms. In the UK, the average broadband speed at the end of 2010 was 6.2Mbps and I would expect latency to be less than 50ms. Of course, I went to the Seychelles for the sunshine and not the Internet but I was interested in how things were working.

So, I ran a “trace route” to see how my packets would traverse the globe back to the BBC in Blighty. I thought the Beeb was an appropriate destination.

01102011_004019

Then I added some geographic information using MaxMind's GeoIP service and ip2location.com.

The result was this:

Hop | IP Address | Host Name | Region Name | Country Name | ISP | Latitude | Longitude
1 | 10.10.10.1 | | | | | |
2 | 41.194.0.82 | | Pretoria | South Africa | Intelsat GlobalConnex Solutions | -25.7069 | 28.2294
3 | 41.223.219.21 | | | Seychelles | Atlas Seychelles | -4.5833 | 55.6667
4 | 41.223.219.13 | | | Seychelles | Atlas Seychelles | -4.5833 | 55.6667
5 | 41.223.219.5 | | | Seychelles | Atlas Seychelles | -4.5833 | 55.6667
6 | 203.99.139.250 | | | Malaysia | Measat Satellite Systems Sdn Bhd, Cyberjaya, Malay | 2.5 | 112.5
7 | 121.123.132.1 | | Selangor | Malaysia | Maxis Communications Bhd | 3.35 | 101.25
8 | 4.71.134.25 | so-4-0-0.edge2.losangeles1.level3.net | | United States | Level 3 Communications | 38.9048 | -77.0354
9 | 4.69.144.62 | vlan60.csw1.losangeles1.level3.net | | United States | Level 3 Communications | 38.9048 | -77.0354
10 | 4.69.137.37 | ae-73-73.ebr3.losangeles1.level3.net | | United States | Level 3 Communications | 38.9048 | -77.0354
11 | 4.69.132.9 | ae-3-3.ebr1.sanjose1.level3.net | | United States | Level 3 Communications | 38.9048 | -77.0354
12 | 4.69.135.186 | ae-2-2.ebr2.newyork1.level3.net | | United States | Level 3 Communications | 38.9048 | -77.0354
13 | 4.69.148.34 | ae-62-62.csw1.newyork1.level3.net | | United States | Level 3 Communications | 38.9048 | -77.0354
14 | 4.69.134.65 | ae-61-61.ebr1.newyork1.level3.net | | United States | Level 3 Communications | 38.9048 | -77.0354
15 | 4.69.137.73 | ae-43-43.ebr2.london1.level3.net | | United States | Level 3 Communications | 38.9048 | -77.0354
16 | 4.69.153.138 | ae-58-223.csw2.london1.level3.net | | United States | Level 3 Communications | 38.9048 | -77.0354
17 | 4.69.139.100 | ae-24-52.car3.london1.level3.net | | United States | Level 3 Communications | 38.9048 | -77.0354
18 | 195.50.90.190 | | London | United Kingdom | Level 3 Communications | 51.5002 | -0.1262
19 | 212.58.238.169 | | Tadworth, Surrey | United Kingdom | BBC | 51.2833 | -0.2333
20 | 212.58.239.58 | | London | United Kingdom | BBC | 51.5002 | -0.1262
21 | 212.58.251.44 | | Tadworth, Surrey | United Kingdom | BBC | 51.2833 | -0.2333
22 | 212.58.244.69 | www.bbc.co.uk | Tadworth, Surrey | United Kingdom | BBC | 51.2833 | -0.2333

What does that tell me? Well, I ignored the first hop on the local network and the second looks wrong as we jump to South Africa and back again. The first stop on our journey is Malaysia, 3958 miles away.
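A rough way to sanity-check a distance like this is the haversine great-circle formula applied to the coordinates in the table. This is only a sketch: the Earth radius constant and the choice of which hops to compare (hop 5 for the Seychelles, hop 6 for Malaysia) are assumptions, and the geolocation data is approximate at best.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius; an assumption of this sketch


def great_circle_miles(lat1, lon1, lat2, lon2):
    """Haversine distance in miles between two latitude/longitude points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


seychelles = (-4.5833, 55.6667)  # hops 3 to 5 in the table
malaysia = (2.5, 112.5)          # hop 6 in the table

distance = great_circle_miles(*seychelles, *malaysia)
print(round(distance))
```

For these two points the result comes out close to the 3958 miles quoted above.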


We then travel to the west coast of America, Los Angeles. Another 8288 miles.


We wander around California. The latitude/longitude information isn’t that accurate, so I’m basing this on the host names: we hop around LA, then to San Jose, and then on to New York. Only 2462 miles.


We rattle around New York and then around London, 3470 miles across the pond.


Our final destination is Tadworth in Surrey, just outside of London.

That’s just over eighteen thousand miles (and back) in less than a second – not bad, I say.
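For anyone who wants to check the arithmetic, the four legs quoted above add up like this. The leg labels are my own shorthand; the mileages are the ones given in the text.

```python
# Leg distances in miles, as quoted in the text above.
legs_miles = {
    "Seychelles to Malaysia": 3958,
    "Malaysia to Los Angeles": 8288,
    "California to New York": 2462,
    "New York to London": 3470,
}

one_way = sum(legs_miles.values())
round_trip = 2 * one_way

print(one_way)     # 18178
print(round_trip)  # 36356
```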

p.s. don’t worry, I spent most of the time by the pool and not in front of a computer.