I recently came across a LinkedIn post that referred to the narrower field of Application Performance Management and couldn’t resist responding: http://www.virtualizationpractice.com/blog/?p=12982&utm_source=rss&utm_medium=rss&utm_campaign=why-is-application-performance-management-so-screwed-up

Let's face it: after three decades, application, system, and network monitoring have progressed very little. Sure, there are lots of new toys, but as the article admits, most are at the starting line. However, like most discussions on this topic, the focus was tool- and tactics-centric, and as a result too far in the weeds for the strategic issue that surrounds the entire industry.

The issue is systemic. No single vendor can fix the problem any more than a firefighter with a hose and a shovel can contain a blaze in high winds. The industry standards and culture are the issue, not the tactical approach. The tactics are not bad, only ineffective without strategic support.

Looking at the sister field of Network management, which has been around a lot longer and has similar issues, the core bits can be seen clearly. Boiling away the derivative issues, the base problem is cultural – monitoring is viewed as synonymous with security.
DARPA set up the Internet and, as such, treated everything as a nail for its military hammer. Yet if you look at similar systems in nature, such as the human body, monitoring is approached as a secure registry system of entities. Security is not synonymous with monitoring. Policing is left to filtering (e.g. the gut) and scans (e.g. the immune system). But given the military lens and the rapid pace of development, this subtle distinction went unnoticed, and we have what we have.

Above this “monitoring is security” culture is a larger strategic issue. As a culture, we focus on the hammer we know and treat the issue as a mental problem when in fact it is an emotional and cultural one. That is why we send kids to school to learn mentally for 20+ years and are surprised when they cannot cope emotionally. Yet you cannot think a feeling or feel a thought. A blind person cannot understand blue any more than an educated psychopath can understand love. We hardly know ourselves emotionally and know nothing about the emotional (i.e. cultural) currents of a business, much less an industry.

Anyway, fast forward 30 years and we have come to expect, unrealistically, that a few vendors can magically keep track of all the developments without a confined, solid set of standards. Although IBM, CA, HP, BMC and the like will profess these godlike qualities (only exacerbated by the bout of positive thinking), who are they *&^% kidding?

Realistically, without shifting towards a registry approach like DHCP, infusing agents into the OS the way antivirus software does, or magically causing a dramatic slowdown in the rate of innovation, the issue will boil down to a mapping problem. This would be doable if vendors had a standard through which they could plug in recommended ranges and values, and the business community came to expect such guidance. Instead, this is delegated to central monitoring architectures, clouds, etc.
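To make the registry idea concrete, here is a minimal sketch of what "vendors plug in recommended ranges" could look like. Everything in it is hypothetical: the `Registry` and `MetricRange` names, the metric name, and the values are assumptions for illustration, not an existing standard or product.

```python
from dataclasses import dataclass, field


@dataclass
class MetricRange:
    """Vendor-recommended operating range for one metric (hypothetical)."""
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high


@dataclass
class Registry:
    """Devices announce themselves here, loosely as a DHCP client asks for a lease."""
    entries: dict = field(default_factory=dict)

    def register(self, device_id: str, ranges: dict) -> None:
        # The device (or its vendor agent) supplies its own recommended ranges.
        self.entries[device_id] = ranges

    def check(self, device_id: str, metric: str, value: float) -> str:
        """Classify a reading as 'ok', 'out_of_range', or 'unknown'."""
        ranges = self.entries.get(device_id)
        if ranges is None or metric not in ranges:
            return "unknown"  # nothing was registered, so there is nothing to map against
        return "ok" if ranges[metric].contains(value) else "out_of_range"


registry = Registry()
registry.register("router-01", {"cpu_percent": MetricRange(0, 80)})
```

With this in place, `registry.check("router-01", "cpu_percent", 95)` yields `"out_of_range"`, while an unregistered device yields `"unknown"`. The point is that the monitoring side never has to guess the range; it was supplied at registration.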

As a result, these central software vendors will buy into the Kool-Aid that somehow they can magically keep up with the world. They listen to edicts from on high and work their 70-hour weeks, believing that positive thinking alone can boil the ocean. Sadly, it will take a continual invasion of a million Babylonian programmers such as these to stay on top of this Spartan nut. The wildfire is never contained by standards or industry culture. So it is the whims of the wind, the larger industry ecosystem, and the economy that determine their fate and how much of the monitoring is contained at any moment.

There is a silver lining. A few vendors are taking up the registration approach, and this, combined with the current Babylonian monitoring approach, might get close to the magical 80/20 break. Further, if you impose the needed “boundaries” and “registries” via workflows and knowledge management, this sort of utopia is possible. I have seen it in a few places, but they are generally midsize.

2 Responses to “Why is Application, System, and Network Monitoring so screwed up?”

  1. Kevin Conklin says:

    Wow Dan! Well written, and deep thought there on the cultural and approach issues. I agree that it is our ‘business’ driven tendency that looks for the silver bullet approach when much of the issue is our wrongheadedness about how to solve the problem to begin with.

    One area that I find particularly interesting, though, is when new technology breakthroughs enable changes in cultural approach. For instance, out of the investments that arose from DARPA in response to 9/11 security issues and the quest for facial and speech recognition software, there was a lot of progress made in high-speed pattern recognition algorithms. A few years ago, people started to apply these algorithms, along with third-generation machine learning advances in artificial intelligence, to the quest to build a self-learning system that could learn all the interactions inherent in a complex application environment.

    This enables a change in the current management paradigm, which says you filter data (time series and notifications) looking for exceptions (thresholds or failures) and apply rules to them to point to problem solutions. This is a flawed approach that no longer scales with today’s complex IT environments and has led to a resurgence in long outages (witness BlackBerry, Bank of America, etc.).
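    For readers less familiar with it, the filter-then-rules paradigm described above can be sketched in a few lines. The thresholds, rule texts, and sample fields below are hypothetical values chosen for illustration, not taken from any particular tool.

```python
# Hypothetical static thresholds and the hand-written rules attached to them.
THRESHOLDS = {"cpu_percent": 90.0, "disk_percent": 95.0}
RULES = {
    "cpu_percent": "check for runaway processes",
    "disk_percent": "rotate or purge logs",
}


def find_exceptions(samples, thresholds=THRESHOLDS):
    """Step 1: filter the data, keeping only samples above a known threshold."""
    return [
        s for s in samples
        if s["metric"] in thresholds and s["value"] > thresholds[s["metric"]]
    ]


def apply_rules(exceptions, rules=RULES):
    """Step 2: map each exception to a canned suggestion."""
    return [
        (e["host"], rules.get(e["metric"], "no rule defined"))
        for e in exceptions
    ]


samples = [
    {"host": "app-01", "metric": "cpu_percent", "value": 97.0},
    {"host": "app-02", "metric": "cpu_percent", "value": 35.0},
    {"host": "db-01", "metric": "queue_depth", "value": 5000.0},
]
suggestions = apply_rules(find_exceptions(samples))
```

    Note that the `queue_depth` sample is silently ignored because nobody wrote a threshold for it. Every new metric needs a human to add both a threshold and a rule, which is the scaling limit being described.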

    We now have software products (like ours, prelert.com) that can look at all the data produced by the largest IT environments in real time, analyze it for the telltale patterns that indicate problems, and uncover the root cause of those problems as they happen.

    This turns the management paradigm on its head and frees up the app and infrastructure experts to focus on projects and operations instead of building rules and dashboards for management tools that, as you put it so well, ‘think that security is found in monitoring’.

  2. Very good point. Pattern recognition software can work around some of the “mapping” problem (a problem that derives from the stance that monitoring is security). There are a few players in that space across many industries (e.g. http://en.wikipedia.org/wiki/Prediction_market). By itself it can get closer, but it will never completely close the final “mapping” gap.

    Let me explain what I mean by “mapping” so we are not talking past one another. Machines can only process logical information, or information that a human has mapped into logical form. For example, machines are very good at getting the IP address, DNS name, and other statistics (SNMP) about a device. However, business impact, profit center, and even physical location can only be accessed by a machine if a human has translated them into logical form.

    Sure, you can put GPS units in each machine to provide physical location, but you still need someone to point out where the data centers are, their role, etc. The problem is that what is desired isn’t the physical location but the meaning of the physical location, which only a human can map. Commonly this info is gleaned from device naming conventions or, with more sophistication, from inventory systems such as Granite or Maximo. However, again, this works because a human decided on the naming conventions or entered the data into Granite, etc. These are examples of workflow and knowledge management stepping in where the industry standards are lacking. If the industry better policed the possible values (i.e. a robust SNMP MIB-II) and device/software vendors provided the values, that would be the ideal and would provide fire lines to contain the problem, but the standards and culture don’t support it.
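    As a rough illustration of gleaning meaning from a naming convention, here is a small sketch. The `<site>-<role>-<number>` convention, the site codes, and their meanings are all made up for this example; in practice a human defines both the convention and the lookup table, which is exactly the point.

```python
import re

# Hypothetical convention: three-letter site code, role, two or more digits.
NAME_PATTERN = re.compile(r"^(?P<site>[a-z]{3})-(?P<role>[a-z]+)-(?P<index>\d+)$")

# Human-maintained map from site code to what that location *means*.
SITE_MEANING = {
    "nyc": "primary data center",
    "dfw": "disaster recovery site",
}


def describe(hostname):
    """Return site, role, and meaning for a hostname, or None if it breaks convention."""
    match = NAME_PATTERN.match(hostname.lower())
    if match is None:
        return None  # the machine has nothing to go on
    site = match.group("site")
    return {
        "site": site,
        "role": match.group("role"),
        "meaning": SITE_MEANING.get(site, "unmapped"),
    }
```

    Here `describe("nyc-core-01")` resolves to the primary data center, `describe("lon-edge-02")` parses but comes back `"unmapped"` because no one entered `lon` in the table, and a bare IP address such as `"10.0.0.5"` returns `None`.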

    Anyway, pattern recognition works by abstraction and by viewing larger patterns, which are often nested. These structures and patterns are used in addition to the elements as a pseudo “identity”. This, in conjunction with industry port standards as well as vendor port conventions and traffic patterns, gives a larger taxonomy/identity inventory to leverage. As long as there is a map somewhere in the larger structure, it can be leveraged. These products use many of the techniques seen in the fingerprinting technology of the open source tool nmap. It should be noted that nmap is primarily a security tool, which goes to my point that there is no distinguishing security from monitoring under the current stance of the industry culture and standards. The bottom line is that if you want to inventory your monitoring space, you do so via scans, treating the network as “hostile” and resistant to your attempt, instead of via a simple registry akin to DHCP (obtaining an IP dynamically).

    The bottom line is that if the mapping doesn’t exist due to human influences, even this approach will fall short. However, it is a better breed over the long haul and works around many of the industry standard gaps by creating its own taxonomy of vendors. So instead of dealing with 1,000 individual Cisco routers, the vendor only needs to pattern 100 router types. It is still a daunting task to stay on top of, but much better than the alternative. Unfortunately, the business pattern of small, short-lived, innovative vendors being gobbled up by large Borgs (IBM, HP, CA, etc.) might not allow enough time to create a comprehensive fix. But then experience and tactics always win over academia and theory in this space. So it might work. We just need to see things in practice.
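    The “1,000 routers, 100 types” reduction is essentially a group-by on some type signature. The sketch below assumes a hypothetical inventory where each device already carries `vendor` and `model` fields; it does not show how any real product derives that signature.

```python
from collections import defaultdict


def group_by_type(devices):
    """Collapse individual devices into (vendor, model) types."""
    groups = defaultdict(list)
    for device in devices:
        signature = (device["vendor"], device["model"])
        groups[signature].append(device["name"])
    return dict(groups)


# Hypothetical inventory: four devices, but only two distinct types to pattern.
devices = [
    {"name": "rtr-001", "vendor": "Cisco", "model": "A"},
    {"name": "rtr-002", "vendor": "Cisco", "model": "A"},
    {"name": "rtr-003", "vendor": "Cisco", "model": "B"},
    {"name": "rtr-004", "vendor": "Cisco", "model": "A"},
]
types = group_by_type(devices)
```

    The pattern only has to be built once per key in `types`, not once per device. It still depends on the signature being right, which is where the mapping gap comes back in.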

    Actually, ping me directly. My contact info is available on my profile or at http://www.nmsguru.com. I’d like to get a preview of how this works in practice. I have 50 Fortune 500 companies, government agencies, and military outfits under my belt (AT&T, Cisco, USCENTCOM, Army, USDA, etc.) and can provide input in exchange.
