Part 3 – Building Better Monitoring for DotNetNuke Servers

20 December, 2012 | Cloud, Content Management, DotNetNuke, Hosting, Security, Software, Stability, Technology

This is Part 3 of a 4-part series on the construction and implementation of our new server monitoring service … included with all our Business and Enterprise server plans.

– Tony Valenti

Let the Sun Shine in … The Trials and Tribulations of SolarWinds

Once we finalized our SolarWinds purchase, we still had our work cut out for us. We decided that our approach was going to be to build a DotNetNuke module that would serve as middleware … sort of a module/API to interface between the applications. We wanted to be able to set up a list of servers in DotNetNuke and have those servers monitored by SolarWinds. We also wanted to create a list of alert templates (at different thresholds) so that all server customers (and ourselves) could choose what they wanted to be alerted to in a consistent manner.
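
To make the idea of shared alert templates a little more concrete, here is a minimal sketch in Python. It is only an illustration of the concept, not the actual module: the class names, metric names, hostname, and threshold values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AlertTemplate:
    """A reusable alert definition (hypothetical shape, for illustration only)."""
    name: str
    metric: str
    threshold: float  # the alert fires when the reading is >= this value


@dataclass
class MonitoredServer:
    """A server plus the alert templates chosen for it."""
    hostname: str
    templates: list = field(default_factory=list)

    def triggered(self, readings):
        """Return the names of the chosen templates that fire for these readings."""
        return [
            t.name
            for t in self.templates
            if readings.get(t.metric, 0.0) >= t.threshold
        ]


# Hypothetical templates at different thresholds, shared by every server.
CPU_WARNING = AlertTemplate("CPU warning", "cpu_percent", 80.0)
CPU_CRITICAL = AlertTemplate("CPU critical", "cpu_percent", 95.0)
DISK_WARNING = AlertTemplate("Disk warning", "disk_percent", 90.0)

server = MonitoredServer(
    "web01.example.com", [CPU_WARNING, CPU_CRITICAL, DISK_WARNING]
)
fired = server.triggered({"cpu_percent": 87.0, "disk_percent": 40.0})
print(fired)
```

Because every server picks from the same list of templates, the meaning of an alert such as "CPU warning" stays the same no matter whose server raised it.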

First we started with building an API for SolarWinds. Building an API to add automation services for someone else’s app isn’t easy. You have to spend a lot of time, trial, and error figuring out the right methods you should be calling in order to accomplish your task. Eventually we figured out how to do things like create servers and discover all of the components that we wanted to monitor.

Once we got our API built, we created a DotNetNuke front-end module. You could add a server into the DotNetNuke module, choose what alerts you wanted, and it would sync into our custom-built API. Now it was time for an internal BETA! We decided that the best way to test this would be on a few of our own servers so we loaded up 200 of our own servers into it. And it worked … flawlessly … almost.

A Hiccup
Remember in a previous post from this series when I was describing OpManager and how it would randomly stop collecting data from certain servers? Well, Network Performance Monitor was randomly having the same problem (fortunately, it was nice enough to let us know about it). A server would be monitoring fine for weeks and then, all of a sudden, no more data! What was going on?

Well, this became a little bit frustrating because once we stopped collecting SNMP data from an IP, we couldn’t figure out how to make it start again. After a bit of head-banging, we realized that every server that was having this issue had multiple IP addresses on it, and it seemed like only one of those IPs would let us collect data from it. The worst part, though, was that it was inconsistent. It wasn’t the first IP, the second IP, the last IP, the biggest IP, or the smallest IP … it was pretty much just a random IP address that would respond – and then it would sporadically change. What did we do wrong?

A quick Google search showed us that a few other people had similar problems with SNMP but their forum posts didn’t really help me. They would post a message like “Hey, my SNMP isn’t responding any more. What’s wrong?” Then 10 minutes later they would post again saying, “Never mind – I told it to monitor on a different IP address and it started working fine.” Well, this might be an OK solution if you have one or two servers that have multiple IPs, but this was not going to work for us. We have thousands of servers with multiple IP addresses.

At this point, I started getting very curious about what was happening. I loaded up a copy of Wireshark and started sniffing my SNMP packets. Here’s an example of what I would see:
Tony-PC (192.168.1.1) sending an SNMP Request to 10.10.100.107
Tony-PC (192.168.1.1) received an SNMP Response from 10.10.99.110

Now, if you look at that log, you’ll see something is not right – I queried 10.10.100.107 but I got a response from 10.10.99.110. Both of these IP addresses belonged to the same server, however, so it was now clear what was happening – SNMP was answering from a different IP address than the one I had queried.
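
The check I was doing by eye can be sketched in a few lines of Python. This is a rough illustration rather than anything we actually ran: the function name is made up, and the second row of sample data is hypothetical (only the first pair comes from the capture above). The reason a mismatch matters is that a poller will typically only accept a reply that comes from the address it queried, so a reply from a different address looks like no reply at all. That is an assumption about how pollers usually behave, not something we confirmed for every tool.

```python
from collections import defaultdict


def find_mismatched_replies(exchanges):
    """Map each queried IP to the other IPs that answered on its behalf.

    `exchanges` is an iterable of (queried_ip, replied_from_ip) pairs,
    for example copied out of a packet capture by hand.
    """
    mismatches = defaultdict(set)
    for queried, replied_from in exchanges:
        if queried != replied_from:
            mismatches[queried].add(replied_from)
    # Sort the sources so the report is stable from run to run.
    return {ip: sorted(sources) for ip, sources in mismatches.items()}


observed = [
    ("10.10.100.107", "10.10.99.110"),  # the pair from the capture above
    ("10.10.99.110", "10.10.99.110"),   # hypothetical: a healthy exchange
]
report = find_mismatched_replies(observed)
print(report)
```

Any IP that shows up as a key in the report is one where the reply came back from somewhere other than where the request was sent.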

At this point I knew it was time to get Microsoft involved.

Bringing in the folks from Seattle
Our experience with Microsoft support has always been great, and this was no exception. However, because this bug was such a corner-case issue for Microsoft, it took a long time for them to understand, diagnose, and resolve the issue. Keep in mind that this bug only affects systems that have multiple IP addresses (globally, this is a very small percentage) and are using SNMP (even fewer). We’re really talking about a small percent of a very small percent of servers. I opened the ticket with Microsoft in August. Four months and seventy-nine emails later, after working with case managers, product managers, and security managers, and filling out what felt like 100 pages of paperwork … we had a fix in hand!

We’re still working with Microsoft to get the fix approved for public release via Windows Update. Our small contribution to the advancement of Windows Server.

Finally we were ready to roll this out! The new monitoring service and alerts were working great for us and we knew it would be a great improvement for our customers. However, at the 11th hour, one of our team members brought up one more idea that we simply had to implement.

You see, in our testing, we had just built a monitoring module. This was great, but this meant that you would go to the monitoring website, create a new user and password, and log in with that. I don’t know if you’re like me, but I have too many usernames and passwords and I could really do without adding another. We needed a membership provider.

In the next, and last, part of this series, we will talk about the new membership provider, how it was implemented prior to launch, and how the new service is being received by our DotNetNuke server customers.

– Read part 2 again. –          – Go on to part 4 now. –