Situational Awareness in the Cloud (OpenStack to AWS)

Posted in State Of NMS on November 12th, 2015

Industry Fear and Stagnation

Something is wrong with our industry. In an age of OpenStack and AWS, where IT can diagnose and heal itself, the top-down and passive approach to Situational Awareness is stale and ineffective. This approach uses two fatal strategies:

Using humans to continually create and adapt rules for an ever changing technology and client environment
Grafting business approaches, such as ITIL and DevOps, onto the already cluttered cocktail of formal and informal workflows

The sad truth is that, after a quarter century, I can find no fundamental difference between today’s products and those sold when I graduated from college. ‘Working harder, rather than smarter’ has failed to make these strategies succeed for the Situational Awareness industry.

The common flaw in these two strategies is a lack of delegation based on what machines do well and what humans do well. As a result, tool architectures work independently of organizational structures, business processes, and knowledge management. The result is brittle solutions that require significantly more human and machine thought and effort to maintain an acceptable level of Situational Awareness. Businesses are left to choose between partial awareness and excessive expense.

Despite this reality, every year brands tout yet another paradigm shift whose secret sauce leverages synergies to the next level to increase shareholder value. They refresh slogans, acronyms, and jargon, but not the approach.

Eventually customer frustration forces the industry to reboot. Before the 1980s it was IBM, then best of breed in the 90s, and since the dot-com crash it has been large monolithic vendors such as IBM, CA, and HP. Yet with each reboot, a fresh batch of book-learned, but experience-poor, college kids reinvent the same solutions and approaches without the counsel of experience or instinct. It’s Groundhog Day, or rather in deference to the industry’s collective amnesia, it’s Edge of Tomorrow whose tagline applies – ‘live, die, repeat.’

Industry Immaturity

Luckily a ‘secret sauce’ does exist outside the marketing spin. It is hidden in the subtext of the two movies mentioned. Unlike the Situational Awareness industry, the main characters in these movies eventually do more than ‘live, die, and repeat.’ They learn from their experiences. Two Einstein clichés apply:

Insanity: doing the same thing over and over again and expecting different results.
We can’t solve problems by using the same kind of thinking that we used when we created them.

But the protagonists do not learn these lessons cognitively. After trying to bang away at the external problem, they finally discover something is wrong with themselves. So they delve much deeper into the transformative emotional work and intuition development to fix the symptoms that Einstein described. This betrays the secret sauce that the industry is missing – maturity.

If we take the hero’s journey in these tales as a model, we see the protagonists:

Learn what they really need and want
Find out and accept what reality can provide
Change themselves and their attitudes to adapt to the situation

In the case of the industry’s approach to Situational Awareness, this translates to a better integration of:

Cognitive, top-down human-designed business needs that specify “WHAT” is required.
Dynamic, bottom-up machine intuition and social enablement that autonomously decide “HOW” to meet the top-down human requirements.

The industry’s mistake has been to focus primarily on the book-learned, top-down approach at the expense of experience-earned, bottom-up intuition. Further, this gap has been filled by cognitively dictating from the top HOW things should be, based on static rules and outdated empirical data. Ironically, the HOW-focus crowds out clarifying WHAT is required to meet business needs, thus fulfilling the old adage, ‘if everything is important, nothing is.’

This mistake is reflected in the persistent problems of the Situational Awareness industry. So what suppressed emotional truth is at the root of all of this? It comes down to a fundamental distrust of the machine in a culture that dreams up Jurassic World and Terminator. This belief system, like all initial belief systems, is inherited. In this case the belief was passed down from DARPA. Like any military organization, DARPA views the world through a lens of conflict: ‘us versus them’ and ‘kill or be killed.’ At that time this approach was appropriate for the scope and complexity of the Internet/ARPA-Net.

However, this distrust and the consequent failure to delegate to the machine are outdated in a cloud-centric world where virtual environments can automatically recover from failed drives and where SDN can autonomously detect and fix denial of service attacks. They are most outdated of all in an OpenStack world where the machine reports its intentions, whether that is to recover from an interface failure or to alleviate CPU load with additional processors.

It is time for the industry to delegate to the machine what the machine can do right and delegate to the human what the human can do right and develop the wisdom to know the difference. It is time for the industry to grow up.

The Zen of Situational Awareness in the Cloud

Reflection, humility, and a healthy dash of shame are the starting point of maturity. The industry must ask itself:

What are the problems?
What are our current approaches?
Which of our long held assumptions are wrong?
What is the new model and path forward?

For the most part, the first two questions are known. The industry knows what the problems are and it knows how it has tried to approach these problems. However, instead of letting go, dealing with the emotions, and finding alternatives, the industry has doubled down and worked harder, not smarter. The following table lists the most pressing problems and the stale approaches:

Problem Area	Current Stale Approaches
Application, Device, and Network Discovery	· Application, System, and Network Scans · Continual Audits · Centralized Top-Down Control within the end-to-end solution
Software and Hardware Diversification	· Human Constructed Rules, Configurations, and Scripts
Self Healing IT (OpenStack, SDN, Puppet/Chef)	· Application, System, and Network Scans · Continual Audits. · Centralized Top-Down Control of Business Processes, Knowledge Management, and Strict Task Adherence to Organizational Structure · Human Constructed Rules, Configurations, and Scripts · Human Constructed Events: Groups, Filters, and Correlations
Machine Stated Intention	· Ignored and discarded
Human Social Media	· Ignored and discarded
Requirements Evolution	· Human Constructed Rules, Configurations, and Scripts
Attaching “meaning”, such as business impact, to Events	· Human Constructed Rules, Configurations, and Scripts
Root Cause Analysis	· Human Constructed Events: Groups, Filters, and Correlations
Grouping Event Relationships in Time and Space and Prioritizing Relative Importance	· Human Constructed Rules, Configurations, and Scripts · Human Constructed Events: Groups, Filters, and Correlations
Low Quality CMDB, Notification Contact Lists, Etc	· Centralized Top-Down Control within the end-to-end solution · Continual Audits

Problem Area	Current Stale Approaches (Continued)
Business Flows	· Centralized Top-Down Control of Business Processes, Knowledge Management, and Strict Task Adherence to Organizational Structure · Centralized Top-Down Control within the end-to-end solution · Top-Down Standards: DevOps, ITIL, FCAPS, BSS, NGOSS, etc · Static Notification: Email, Ticketing, etc · Working in silos or Conference Calls · Human Constructed Runbooks
Shared Knowledge Management	· Centralized Top-Down Control of Business Processes, Knowledge Management, and Strict Task Adherence to Organizational Structure · Centralized Top-Down Control within the end-to-end solution · Top-Down Standards: DevOps, ITIL, FCAPS, BSS, NGOSS, etc · Backup personnel · Ticket Searches · Human Constructed Runbooks

Behind all these approaches is a faulty belief in a security posture toward Situational Awareness. This is seen in the idea that polling the hell out of the cloud is a good way to discover what is there. It is also seen in the failure to solicit from the device or program what it wants to be monitored on, who is near it, and other pertinent information that only the monitored element can know. The current approach basically sees the cloud as a disconnected collection of things. Shifting this posture, though beneficial, requires massive emotional growth within an industry rooted in outdated militaristic thinking and a general distrust of programs, devices, networks, and the cloud.

So what would a cloud-trusting approach look like? Luckily getting to the solution mentally is trivial compared to the emotional journey. A cloud-trusting approach involves:

Holistic Situational Awareness with a symbiotic relationship between human and machine where human consciousness and intuition are recognized as the most precious resources in the end-to-end solution.

This is not what the industry does today. Instead of maximizing machine automation and analytics and leveraging machine enabled socialization, most solutions depend on human pre-built rules or real-time involvement for these tasks. This deviation is most often recognized by four common strategic blunders:

Incomplete Machine Enabled Socialization:
- Top-down management standardizes the approach at the expense of ownership and individual creativity.
- Bottom-up organization silos segregate departments, individuals, and tool components from the end-to-end monitoring solution.

Incomplete Machine Automation and Analytics:
- Top-down engineer adaptations require bottomless rule and script development to accommodate changing technologies, environments, and other requirements
- Bottom-up event and object silos ignore situations and see only individual events. This creates the false pursuit of a singular root cause in one isolated device at one specific moment in time which is attained primarily by filtering events rather than relating events.

All these conditions are compounded by the fact that large capital budgets are used to deploy the solutions while meager operation budgets barely cover the basics. There is no wiggle room for the massive human intervention required to duplicate the machine’s natural abilities. Conversely, if the problem is approached holistically and machine delegation is embraced, eight approaches can alleviate these concerns:

Human to Machine Declaratives – Humans declare what is required from the machine, rather than specify how the machine should do things and in what order.
Human Intuition – Use a machine to correlate related events first and then present the results to a human to determine the holistic situation.
Informal Process Social Enablement – Machine initiated interruptive human workflows such as situation rooms and other social media.
Machine Intuition – Machine experience-centric driven rule creation rather than machine pre-calculated rules or human-centric rule creation. This approach exploits:
- Encoded human meaning
- Machine intentions
- Human intentions (e.g., Twitter, wikis)
Emergent Situational Awareness – Situational Awareness designed as a distributed registry system into a cloud, rather than a security function through pounding the network from a central Situational Awareness server.
Self Aware (Product on Product) – A central vendor instance of the Situational Awareness solution performing Situational Awareness against all customer deployments. This makes the company’s experience and the customer’s experience one and the same and enables experiential learning as well as outside book learning.
Socialized Machine Learning – Machine sharing of: configurations, rules, analytics, reports, and automations among all deployments
Machine Learning – Passive machine training through observation of human tool usage for eventual automatic automation or human directed machine learning.
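The Emergent Situational Awareness item above can be made concrete with a small sketch. It assumes a registry where each monitored element announces itself, states what it wants monitored, and names who is near it, rather than being found by a central scan. The class names, field names, and sample hosts are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Registration:
    """What a monitored element declares about itself (all fields hypothetical)."""
    name: str
    wants_monitoring: list
    neighbors: list = field(default_factory=list)


class Registry:
    """In-memory stand-in for a distributed registry."""

    def __init__(self):
        self._entries = {}

    def register(self, registration):
        # The element announces itself; nothing is discovered by scanning.
        self._entries[registration.name] = registration

    def monitoring_plan(self):
        """Return, per element, the checks that element asked for."""
        return {name: list(e.wants_monitoring) for name, e in self._entries.items()}

    def neighbors_of(self, name):
        """Combine neighbors the element declared with elements that declared it."""
        declared = set(self._entries[name].neighbors) if name in self._entries else set()
        reverse = {e.name for e in self._entries.values() if name in e.neighbors}
        return declared | reverse


registry = Registry()
registry.register(Registration("HostA", ["cpu_load", "eth0_state"], ["SwitchA"]))
registry.register(Registration("ProgramA", ["heartbeat"], ["HostA"]))
print(registry.monitoring_plan())
print(sorted(registry.neighbors_of("HostA")))  # ['ProgramA', 'SwitchA']
```

The point of the design is that the topology emerges from what the elements report, so the central server never has to guess it.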

Reinventing a wheel this large is unrealistic because ‘you cannot know what you do not know.’ Experience is required and cannot be replaced by more book learning. Luckily a proxy approach can access the necessary experience.

Effective working models for Situational Awareness exist outside the industry. These have stood the test of time. As such, the experience-based lessons-learned are organically built into their architecture. These architectures can be mimicked.

The Human Mind’s Model for Situational Awareness

Introduction

In trying to apply these approaches the starting point is to ask, “who among the larger Situational Awareness vendors has done this nuts to bolts?” Unfortunately, since the industry is quite new and, worse, these concepts are still in development, the answer is no one. The next step is to look outside the industry for inspiration. Out of all the possibilities, cognitive science shows the most promise in providing a better model for Situational Awareness. The mind pairs a single-threaded consciousness with multithreaded, instinctual subconscious processes, which mirrors the relationship between human and machine.

The human mind’s approach to Situational Awareness can be broken into two rough categories:

Passive Monitoring. Through the passive collection of information and accessing memories, the mind creates nested object trees and event clusters.
Adaptive Monitoring. Using results from passive monitoring, the mind works holistically with the body and interacts with the outside world to perform diagnostic and corrective workflows.

The exploration of how the mind handles both passive and adaptive monitoring can shed light on new approaches to Situational Awareness with the right mix and coordination between machine and human.

Passive Monitoring

Passively the human mind monitors reality with the following stages:

Outside Events
Sensory Experience
Model Event Clusters
Instincts: Historical Event Lookup
Model Object Hierarchy
Cognition: Historical Object Lookup
Action: Blend Instinct and Cognition

An example best illustrates the human mind’s approach to Situational Awareness. Imagine you see a car back into a fire hydrant.

Outside events send light signals, molecule vibrations, and other raw environmental changes to your senses. Sensory experience is created from this environment soup. Subjectively you experience:

Sight: The car impacting, the hydrant snapping, and a geyser of water.
Sound: The thud of the hit, the sound of crumpling metal, and the rush of water
Smell: The scent of gas, oil, and your own sweat.
Touch: The breeze against your skin, the misty spray of water on your face, and the pressure of the ground against your feet.

However, this is all still below the level of consciousness, within the animal brain. As yet, neither the objects nor even the experiences have been named, and no relationships between the experienced events have been drawn.

On the right side of the brain, the mind models the event clusters. The subconscious creates nested structures and relationships between impressions. These are organized into clusters of events modeled in both time and space. The connections are relational and pattern-based rather than object and part-based. For example, the cloudy day and the crash of the car are connected to a depressed sense. The sense of forward flow connects the movement of the car to the snapping of the hydrant and to the flutter of leaves and birds.

Next, the right brain extends the associations into memories of structures and patterns. Instincts and gut senses are formed by comparing the event clusters to past patterns. From this, the intuitive gist of the crash is given along with emotional emphasis to bias a gut-based response.

In the left brain, the nexus points in the structures in space and time betray the presence of objects. The mind cuts out these nexus points and names them as objects. The objects are nested. For example, the following names and values break up the crash scene: red, loud, acrid, cold, wet, etc. Objects are aggregated and nested into a tree or cloud structure:

crash
- loud
- car
  - oil
    - acrid
  - fire hydrant
    - red
  - water
    - cold

Some of the items above may have not merely a one-to-many relationship but a many-to-many relationship. For example, links might exist between fire hydrant and water as well as between loud and car, fire hydrant, and water.
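One way to picture this is to store the nested listing as parent-to-child edges and then add the cross-links on top, which turns the tree into a graph. The sketch below is only an illustration: the tree mirrors the listing as printed, the cross-links are the ones named in the text, and the function and variable names are made up.

```python
from collections import defaultdict

# Parent -> children edges, mirroring the nested listing as printed.
TREE = {
    "crash": ["loud", "car"],
    "car": ["oil", "fire hydrant", "water"],
    "oil": ["acrid"],
    "fire hydrant": ["red"],
    "water": ["cold"],
}

# Extra links named in the text; these make the structure many-to-many.
CROSS_LINKS = [
    ("fire hydrant", "water"),
    ("loud", "car"),
    ("loud", "fire hydrant"),
    ("loud", "water"),
]


def build_graph(tree, cross_links):
    """Return an undirected adjacency map covering tree edges and cross-links."""
    graph = defaultdict(set)
    for parent, children in tree.items():
        for child in children:
            graph[parent].add(child)
            graph[child].add(parent)
    for a, b in cross_links:
        graph[a].add(b)
        graph[b].add(a)
    return graph


graph = build_graph(TREE, CROSS_LINKS)
print(sorted(graph["water"]))  # ['car', 'cold', 'fire hydrant', 'loud']
```

With the cross-links in place, "water" is reachable from "loud" directly rather than only by walking up and down the tree.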

The mind’s model of the outside world is completed by correlating past objects within memories. For example, the car might be recognized as a Ford Pinto and associated with a past newscast on another Pinto’s exploding gas tank. Further, specific patterns that are named are considered to be the same object. For example, the George experienced yesterday walking his dog is recognized as being the same person backing up the car. As was the case with the Ford Pinto, attached facts learned yesterday are applied to today’s George. The man’s face is associated with the name George. George has a limp. His dog is a lab. He’s sort of a jerk. Since George is considered the same object rather than, as in the Pinto’s case, merely two instances of the same type of object, the association is much stronger.

This creates a palette from which the conscious mind draws and forms its conclusions. An optical illusion can best demonstrate the sheer amount of processing the subconscious mind does before the conscious mind gets a chance to engage:

7H15 M35S4G3 S3RV35 7O PR0V3 H0W 0UR M1ND5 C4N D0 4M4Z1NG 7H1NG5! 1MPR3551V3 7H1NG5! 1N 7H3 B3G1NN1NG 1T WA5 H4RD BU7 N0W, 0N 7H15 LIN3 Y0UR M1ND 1S R34D1NG 1T 4U70M471C4LLY W17H 0U7 3V3N 7H1NK1NG 4B0U7 1T.

As you read, notice that your subconscious has already identified letters, grouped the letters into tokens, and used pattern recognition to associate the tokens to the closest existing words. This is done before the conscious mind engages. This is not always the case.

In the example below, by adding phonetic and scrambled letter replacements, the conscious mind engages before the subconscious is done with its work. This breaks the natural flow:

4 U th15 becmoes hrader 2 raed w1th 3 mehtodologeis used. Spceifiaclly sw4pping w0rd5 4 nmub3rs, scrmablnig l3ttrers, 4nd sw4pp1ng l3tt3rs f0ur nmuber5 craet3s a p4tt3rn wh3r3 teh subcnoscoius c4nn0t cmopl3t3 th3 w0rk b4 hnadnig 1t off 2 th3 cnoscoius.

In this example you can feel your conscious mind kicking in as the subconscious wasn’t able to complete its data preparations. But even in this case, with enough scrambled text, the subconscious mind will learn the patterns and your conscious mind will be engaged less and less.

This brings us to the final step in the human mind’s approach to Situational Awareness. The conscious mind uses the prepackaged event clusters and object trees to blend instinct and cognition. Depending on the situation, the emphasis could be instinctual or mental and conscious or subconscious. Specifically, in some cases the emotional emphasis takes partial to complete control. If the person enters a fight, flight, or freeze response, they might react without being conscious at all. For example, a car backfire could trigger a war vet with PTSD to dive beneath the dashboard before they register what they’ve done.

This example has shown how the human mind uses four approaches to attain Situational Awareness:

The instinctual animal base of the brain to model reality
The right holistic side of the brain to process the subjective and structure of the model
The left part-centric side of the brain to objectify, name, and analyze the model
The cognitive, self-aware frontal lobe of the brain to blend the results and take an action

When the human mind’s approach is compared to industry’s approach, it becomes clear that the industry emphasizes the conscious logical quadrant of the human mind. As such, the young book-learned, but experience-poor, engineers are perfectly suited to rebuild these same approaches. However, this ignores the three other quadrants and a billion years of evolutionary trial and error. It is time for the industry to mature, discard stale strategies, and model the human mind’s approach to Situational Awareness.

Adaptive Monitoring

If problems are detected, the results from passive monitoring may start a larger diagnostic and/or corrective process within the human mind. Four primary methods are:

The use of habits to guide interactions with the outside situation
Create a distributed Situational Awareness organism
Use outside sources of knowledge
Self monitoring during the situation.

Though there are others, these are the most impactful to overall Situational Awareness.

One of the lesser known methods of learning is the foundation of habits. For example, you might choose to go to the movies, but the mind also observes and learns that you must like going to the movies. Repetitive actions are reinforced through this form of feedback learning. This makes the human mind incredibly adaptive, even in hostile environments, and encourages common methods and procedures within the mind. Unlike passive monitoring, this prompts actions to be taken based on the successful conclusion of similar past situations.

Some methods of learning are not cognitive, but more raw, pre-verbal, identity-centric behavior. For example, when skin is transplanted from one location of the body to another, for some time afterwards when touched, the patient will react to the originating area, rather than to the new location of the skin. This indicates that the human mind, when dealing with the outside body, uses a trusted domain/registration system rather than a security-centric focus for organs, cells, and body components. This is even true with foreign bacteria, microbes, and parasites that the body uses in particular states to metabolize or otherwise function. For example, mitochondria are bacterial symbionts that long ago were engulfed by cells and employed to create ATP (adenosine triphosphate) to power the cell. For all intents and purposes they are part of the body, but they reproduce and evolve separately. Unlike many areas of technology, the body performs rigorous checks on anything that crosses the skin, lungs, and gut and enters the body. Once inside, the human mind assumes these components of the body are to be trusted. Security protocols for breach detection using the lymph and other immune response systems are separate from the body’s Situational Awareness. This is why the gatekeepers of the body (skin, lungs, and gut) are considered 80% of the immune system.

Another learning method, observational learning, also encourages common methods and procedures but across multiple minds. This method of learning occurs from watching, retaining, and replicating a behavior observed from another person. For example, when spectators watch a participant swing a golf club, the same neurons required to swing the club fire in the spectator’s brain, and both subconscious movement and conscious visual memories are stored.

The final method of learning that fits adaptive monitoring is self awareness. This is monitoring and reacting to changes in the health and wellbeing of ourselves as well as the situation around us.

Summary

The combination of the human mind’s passive monitoring and adaptive monitoring creates a more cooperative and adaptive Situational Awareness approach than is present in the industry today. In particular, if the subconscious is largely mapped to machines and the conscious mind mapped to humans, much of the Situational Awareness approach transfers over without much strategic modification.

Mimicking the Human Mind

Introduction

There are many ways to apply the human mind’s approach to Situational Awareness. This section explores one of them. This approach can be broken into two layers along the lines of the analysis done in the previous section:

Passive Monitoring. Through the passive collection of information and accessing memories, the mind creates nested object trees and event clusters.
Adaptive Monitoring. Using results from passive monitoring, the mind works holistically with the body and interacts with the outside world to perform diagnostic and corrective workflows.

These two areas map against the earlier proposed approach changes in the industry as follows:

Passive Monitoring
- Human to Machine Declaratives – Humans declare what is required from the machine, rather than specify how the machine should do things and in what order.
- Human Intuition – Use a machine to correlate related events first and then present the results to a human to determine the holistic situation.
- Informal Process Social Enablement – Machine initiated interruptive human workflows such as situation rooms and other social media.
- Machine Intuition – Machine experience-centric driven rule creation rather than machine pre-calculated rules or human-centric rule creation. This approach exploits:
  - Encoded human meaning
  - Machine intentions
  - Human intentions (e.g., Twitter, wikis)

Adaptive Monitoring
- Emergent Situational Awareness – Situational Awareness designed as a distributed registry system into a cloud, rather than a security function through pounding the network from a central Situational Awareness server.
- Self Aware (Product on Product) – A central vendor instance of the Situational Awareness solution performing Situational Awareness against all customer deployments. This makes the company’s experience and the customer’s experience one and the same and enables experiential learning as well as outside book learning.
- Socialized Machine Learning – Machine sharing of: configurations, rules, analytics, reports, and automations among all deployments
- Machine Learning – Passive machine training through observation of human tool usage for eventual automatic automation or human directed machine learning

Passive Monitoring

Overview

Introduction

Most of the existing Situational Awareness approaches can be summed up as a subset of the human mind’s approach. The difference is that, in the case of the industry, the approach is incomplete and depends too little on machines and empirical data and too much on humans and preconceived rules and approaches.

In order to understand the problems and possible solutions, a working model for the industry is needed. The human mind’s processing, described previously, can be converted into the superset of Situational Awareness (shown below). This model has nine functional areas:

Data Collection
Eventology
Short-Term Event State
Objectology
Short-Term Object State
Short-Term Business, Physical, and Other States
Long-Term External Data Storage/CMDB
Analytics, Turing Machines, and Finite Automata
Presentation

Data Collection

The initial step in Situational Awareness is to convert environmental changes into data structures. This is the same role that the senses perform for the human mind. The initial data structure created by data collection is simple. It might be a sentence for syslog or a set of fields for either SNMP or socket streams. This step also does the first tokenization via lexical analysis, breaking the text of the stream into words and chunks without attached meaning.
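As a minimal sketch of that first lexical pass, the snippet below splits a raw line into positioned tokens and attaches no meaning to any of them. The sample line is a made-up syslog-style message, and whitespace splitting is an assumption chosen for brevity.

```python
def tokenize(line):
    """Lexical pass only: split on whitespace and keep each token's position."""
    return list(enumerate(line.split()))


sample = "Nov 12 10:15:02 HostA kernel: eth0 link down"  # hypothetical syslog line
tokens = tokenize(sample)
print(tokens[3])  # (3, 'HostA')
```

At this stage "HostA" is just the fourth chunk of text; deciding that it names a device is left to the later stages.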

Eventology

Once the information has been modeled internally in raw form, the next step is to perform work akin to that of the human subconscious in the right brain – immediate, holistic pattern recognition. This is a two-step process: grammar parsing and then pattern recognition.

Grammar parsing is the starting point for digesting an event. Messages usually consist of a template with plug-in objects. These objects might be:

The time at which the problem was detected
The device name or IP address of the device having issues
A particular resource on the device such as CPU number or the interface name
The current state of that resource

In this way each object has both an identity and a role. The identity is simply the text of the object while the role is defined by the location and context in the grammar of the message. For example, the identity might be an IP address but the role within the message is as a proxy server. The message template in this example represents the root object with each of the embedded objects representing the children of the root object. In eventology the static nature of the template from message to message versus the more dynamic embedded objects enables the grammar parsing to parse the message into tokens without any outside context except historical message examples.
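The idea of separating the static template from the dynamic objects using nothing but historical messages can be sketched as follows. This is a deliberately naive illustration, not a production parser: it assumes every message in a group has the same number of whitespace-separated tokens, and the sample messages and the `<*>` slot marker are invented.

```python
def infer_template(messages):
    """Treat token positions that never change as template text; others become slots."""
    token_rows = [m.split() for m in messages]
    width = len(token_rows[0])
    if any(len(row) != width for row in token_rows):
        raise ValueError("this sketch only handles messages with equal token counts")
    template = []
    for column in zip(*token_rows):
        template.append(column[0] if len(set(column)) == 1 else "<*>")
    return template


def extract_objects(template, message):
    """Return the tokens of `message` that fall in the variable slots."""
    return [tok for slot, tok in zip(template, message.split()) if slot == "<*>"]


history = [
    "Interface eth0 on HostA is down",
    "Interface eth1 on HostB is up",
    "Interface eth0 on HostC is down",
]
template = infer_template(history)
print(" ".join(template))  # Interface <*> on <*> is <*>
print(extract_objects(template, "Interface eth2 on HostD is down"))  # ['eth2', 'HostD', 'down']
```

The slot position supplies the role (second slot is the device) while the slot content supplies the identity, which matches the identity and role split described above.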

Pattern recognition follows grammar parsing and can be run as a continual background process. This is akin to the human mind presenting an intuitive sense of an environment. Statistical correlations are calculated between current messages as well as past messages. In addition, object and other topologies, such as business, physical, and other states, can be used as parameters in statistical correlations, looking for patterns and past relations. Collectively these form one or more event states.
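A very simple stand-in for such a statistical correlation is to ask how often two kinds of events show up in the same time window. The sketch below scores each pair by the Jaccard overlap of the windows in which they appear. The window size, event names, and timestamps are assumptions for illustration, and a real solution would likely use richer statistics.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_scores(events, window_seconds=60):
    """Score pairs of event types by Jaccard overlap of the time windows they appear in."""
    windows_by_type = defaultdict(set)
    for timestamp, event_type in events:
        windows_by_type[event_type].add(timestamp // window_seconds)
    scores = {}
    for a, b in combinations(sorted(windows_by_type), 2):
        shared = windows_by_type[a] & windows_by_type[b]
        union = windows_by_type[a] | windows_by_type[b]
        scores[(a, b)] = len(shared) / len(union)
    return scores


# (timestamp in seconds, event type); all values are made up.
events = [
    (5, "power_loss"), (20, "link_flap"),
    (65, "power_loss"), (70, "link_flap"),
    (130, "disk_full"),
    (190, "link_flap"),
]
scores = cooccurrence_scores(events)
print(scores[("link_flap", "power_loss")])  # 2 of 3 windows shared
```

Pairs with a high score are candidates for the same situation, while a score of zero suggests the events are unrelated in time.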

Most event states will likely be nested. For example, the story of a data center outage might contain three situations involving: the initial power outage, followed by overheating of some systems, followed by eventual restoration. Within each situation other situations, such as a flapping interface due to a brownout, are contained. These in turn might contain individual event states, which came from multiple sources. For example, a router might send out a trap which matches what a third party polling program sends.

Short-Term Event State

This represents the event-centric short-term memory of the solution. Unlike ordinary event lists, the events are organized into a hierarchy of stories, situations, and events. As events age or become unactionable, they get saved into long-term historical event storage.
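A rough sketch of this short-term memory might use one nested node type for stories, situations, and events, plus an aging pass that moves stale subtrees into long-term storage. The class name, the age threshold, and the use of plain lists for both stores are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class EventNode:
    """A story, situation, or event; children hold the lower levels."""
    label: str
    last_seen: float
    children: list = field(default_factory=list)

    def newest(self):
        """Most recent timestamp anywhere in this subtree."""
        return max([self.last_seen] + [c.newest() for c in self.children])


def age_out(short_term, long_term, now, max_age):
    """Move top-level nodes whose whole subtree is older than max_age to long_term."""
    keep = []
    for node in short_term:
        if now - node.newest() > max_age:
            long_term.append(node)
        else:
            keep.append(node)
    short_term[:] = keep


story = EventNode("data center outage", 100.0, [
    EventNode("power outage", 100.0, [EventNode("flapping interface", 130.0)]),
    EventNode("overheating", 200.0),
    EventNode("restoration", 400.0),
])
stale = EventNode("old disk warning", 10.0)
short_term, long_term = [story, stale], []
age_out(short_term, long_term, now=500.0, max_age=300.0)
print([n.label for n in short_term], [n.label for n in long_term])
```

A story stays in short-term memory as long as any event inside it is recent, so the outage is kept whole while the unrelated old warning is archived.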

Objectology

Along with processing events, objects identified within the events can be processed and used within statistical algorithms to assist the correlation of related events, situations, and stories. Unlike in the human mind’s approach, objects are easier to identify once the grammar parsing within eventology is completed. Like eventology, objectology notes the objects within the message but, unlike eventology, it goes a step further in naming the objects and tracking these names across events, roles, and time. Also, implied relationships between objects and sub-objects are recorded and create hierarchies. Two examples include:

Card0.HostA.LabNetworkA.
ProgramA.HostA.DevelopmentNetworkB.

These become parameters into correlation algorithms. Just as the human mind dealing with Joe in the present assumes many of the attributes discovered in Joe in the past, the algorithms assume strong correlation for object instances across events in time.
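The implied hierarchy in such names can be recovered with very little code. The sketch below assumes the dotted names read from the most specific object on the left to the outermost container on the right, much like a DNS name; that reading, and the function names, are assumptions rather than something the examples state outright.

```python
def parse_object_path(path):
    """Split 'Card0.HostA.LabNetworkA.' into names ordered from child to outermost parent."""
    return [part for part in path.split(".") if part]


def build_hierarchy(paths):
    """Map each parent name to the set of direct children implied by the paths."""
    children = {}
    for path in paths:
        parts = parse_object_path(path)
        for child, parent in zip(parts, parts[1:]):
            children.setdefault(parent, set()).add(child)
    return children


paths = ["Card0.HostA.LabNetworkA.", "ProgramA.HostA.DevelopmentNetworkB."]
hierarchy = build_hierarchy(paths)
print(sorted(hierarchy["HostA"]))  # ['Card0', 'ProgramA']
```

Because the sketch keys on the bare name, HostA seen under two networks is treated as one object, which is the strong cross-event assumption described above. A stricter version could key on the full path instead.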

Short-Term Object State

This represents the object centric short term memory of the solution. This acts as a cache. As referencing events age or become unactionable, the object trees get saved into long term historical object storage. This object storage might be an external CMDB.

Short-Term Business, Physical, and Other States

This represents other CMDB information that can be used to enrich the event stream. In most cases these will be populated from outside the solution and this area is used as a short-term cache. Any data normalization or filtering will occur between the long term external data storage and this state.

Long-Term External Data Storage/CMDB

This operates as the traditional CMDB. This is equivalent to data sources outside the mind, from hearsay to legal documents. As in that situation, the data cannot always be trusted. However, when combined with the other processing, discrepancies can often be detected.

Analytics, Turing Machines, and Finite Automata

This is the thinking engine of the solution. Depending on the algorithms employed, statistical analytics, finite automata or other evolutions of the Turing Machine will be used.

Presentation

The final state is presentation. This represents all consumers of the Situational Awareness produced by the solution whether machine or human.

Summary

Together these various states can produce Situational Awareness by refining the relationships among events and objects, both with each other and over time. The objects provide references that statistical algorithms can test against the historical database, both to predict future patterns and to recognize patterns in the present.

Improvement 1: Managing Scope with Human to Machine Declaratives

With the ushering in of DevOps came declarative languages such as Puppet and Chef. Unlike the vast majority of current programming languages, declarative languages specify WHAT needs to be done but delegate the HOW to the machine via its agent. This is a necessity as the industry rapidly evolves toward the Internet of Things. Specifying HOW generically for Linux and Windows as well as drones and firewalls is impossible. However, by trusting the machine and delegating the HOW to the local software of the device, scaling becomes trivial.
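The split between WHAT and HOW can be sketched in a few lines. This is not how Puppet or Chef are implemented; the `desired` structure, the agent functions, and the `apply` helper are hypothetical, and the agents only return the command they would run rather than executing anything.

```python
# Desired state: WHAT should be true, with no platform-specific steps.
desired = {"service": "syslog-forwarder", "state": "running"}


def linux_agent(spec):
    # Hypothetical HOW for a Linux host.
    return f"systemctl start {spec['service']}"


def windows_agent(spec):
    # Hypothetical HOW for a Windows host.
    return f"Start-Service -Name {spec['service']}"


AGENTS = {"linux": linux_agent, "windows": windows_agent}


def apply(spec, platform):
    """Hand the same declaration to whichever local agent knows the HOW."""
    if spec["state"] != "running":
        raise ValueError("this sketch only handles state == 'running'")
    return AGENTS[platform](spec)
```

Adding a drone or a firewall means registering one more agent; the declaration itself does not change.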

This approach becomes the primary means to reduce variance and customization from customer to customer. Specifically, most organizations want the same twenty or so “things.” Where they really differ is in HOW they implement WHAT they want. Exposing the solution through APIs and published database schemas leads to significant divergence that cannot be managed. This path is what destroyed MySpace in the wake of Facebook and enabled Google to eclipse other search engines. Too much flexibility in areas that do not add value is detrimental.

The following is an example of a three level hierarchy of WHAT customers generally ask for in a Situational Awareness solution.

AREA	APPLICATION	NUANCE
Data Collection	SNMP	Cisco
	Socket	Fixed length fields
	Log	Single line, delimited
	Vendor	Solarwinds
Escalation	Stale/Hot Event
	Stale/Hot Incident
Notification	Ticketing	Remedy
		JIRA
		ServiceNow
	Paging
	SMS
	Email
Resource Management	OnCall Calendar
Change Management	Event Suppression
Problem Management	Event Parsing Problem
	Event Enrichment Problem
	Event Action Problem
	Other Problem
Situation Recognition	Workflow
	Labeler
Event Manipulation	Enrichment	CMDB
		Program
Visualization (Additional)	Maps	Logical
		Physical
		Geographical
	Reports
	Charts
Retention Policies	Event
	Alert
	Situation
Logfile Rollover
Synthetic Event/Aggregation	XinY
	Thresholds
Authentication/Authorization	LDAP
Solution Self Maintenance	Pruning	Logfile Rollover
		Stale event clearing
	Monitoring	Logfiles, Processes, Resources

By taking a “WHAT” approach to product development, unnecessary variance is reduced in code branches, script versions, and plug-in flavors. Doing so puts the focus back onto “WHAT” the customer wants to do, rather than “HOW” the vendor or customer chooses to do it uniquely for their environment. This increases the need for salesmanship over engineering, as the nuance preferences of the customer have to be overcome. But that is by definition a salesmanship problem, not an engineering one, and so the problem is in the correct space to be solved.

Improvement 2: Exploit Human Intuition

Imagine you see a situation that contains the following events:

Failed power supply
Four crashed routers
Network convergence issues
Dead applications and transaction latency

Immediately, as a human, the failed power supply becomes the likely culprit as all the other issues could be caused by a loss of power. This is based on a lot of experience outside the logic of the network.

While the machine can easily correlate these events using time, location, and past relationships before presenting them to the human, only the human can integrate the facts with intuition gained from living in a much more diverse world. Most solutions do not recognize this difference. Instead, they try in one stroke, through static human-created rules, to fuse the human’s ability to find meaning with the machine’s ability to correlate statistically in real time. However, by not letting the machine organize the events in real time, the solution can never leverage the natural human ability to see the needle in the haystack based on intuitive insight and past experience.
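The machine's half of this bargain, organizing events by time and location before a human looks at them, can be sketched very simply. The tuple layout and the `window` parameter are assumptions; a real solution would also use past relationships, which this sketch leaves out.

```python
def group_into_situations(events, window):
    """Group events that share a location and occur close together in time.

    `events` is a list of (timestamp, location, text) tuples. An event joins
    the open group for its location if it arrives within `window` seconds of
    that group's most recent event; otherwise it starts a new group.
    """
    situations = []
    latest = {}   # location -> index of the open situation in `situations`
    for ts, location, text in sorted(events):
        idx = latest.get(location)
        if idx is not None and ts - situations[idx][-1][0] <= window:
            situations[idx].append((ts, location, text))
        else:
            situations.append([(ts, location, text)])
            latest[location] = len(situations) - 1
    return situations
```

Fed the four events from the example above, all at one hypothetical location within a minute of each other, the function returns them as a single situation with the failed power supply first, leaving the judgment about cause to the human.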

Further, graphical reports and charting can exploit human intuition. Alignment of colored graphs and deciphering patterns in plotted data are all natural to humans. For the most part the current Situational Awareness solutions ignore this human ability and, instead, try to leverage the logical, calculating abilities of humans. Ironically, the human’s calculating abilities are dwarfed by that of the computer.

Improvement 3: Machine Social Enablement

Within companies there are two primary sets of workflows. The first is based on organizational structure and the formal ways of doing things. The second is based on informal work flows and how things actually get done.

Across organizations, I have never come across two identical organizational structures. However, most organizations, when you get down to the day-to-day, are doing the same things. Social Enablement is recognizing that ‘how things get done’ is consistent across organizations. As such, enabling those informal workflows is more important than encoding the latest standard with its associated jargon. It becomes a salesmanship task to align this to the latest and greatest business processes and quality improvement trends.

This approach recognizes the difference in abilities between machines and humans. Machines are horrible at adapting to social changes. Humans, however, do it all the time. In the 1970s the concept of a telephone on your hip made no sense. Yet we have evolved socially past the cell phone and into email, instant messaging, texting, Skype, and dozens of other social outlets. These have expanded to include groups of people communicating, as seen on LinkedIn and Facebook. This was all possible because of the ability humans have to adapt to social change. As a result, when it comes to workflows, it is more important to design the right tools and interfaces and expect the organizations to adapt to them. Getting the organizations to adopt them is a salesmanship problem, rather than an engineering one. This is as it should be.

One example of an informal process is the situation room. As problem situations arise, this software will grab via instant messenger available people in the right areas and put them into a common chat room. Another example is if the software cannot determine how to parse an event at a rudimentary level, this event can be presented via email or other notification scheme to the administrators of the solution for review. Automating these sorts of informal, but critical, social connections, whether with humans, machines, or a mixture of both, can greatly enhance the effectiveness of the solution.
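The situation room selection step might be sketched as below. The `people` structure and the idea of matching on named areas are assumptions, and the actual instant messenger invitation is left out entirely.

```python
def open_situation_room(situation_areas, people):
    """Pick available people whose areas overlap the situation's areas.

    `people` maps a name to a dict with an 'areas' set and an 'available'
    flag. Returns the invited names, sorted for stable output.
    """
    invited = [
        name
        for name, info in people.items()
        if info["available"] and info["areas"] & situation_areas
    ]
    return sorted(invited)
```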

Improvement 4: Machine Intuition

The final improvement on passive monitoring is in the area of machine intuition. Specifically, this is allowing machine experience to drive rule creation with little to no human intervention. The machine has two huge hurdles in attempting to do this:

Lack of senses outside the medium of the network – such as business unit, device owner, and physical location.
No subjective sense to create a sense of meaning – such as business impact, event severity, and political sensitivity.

Currently the CMDB, along with human written rules, has been used to overcome these issues. However, both these mechanisms require massive human investment and do not scale or adapt to changing reality in real-time. When the machine is allowed to operate more autonomously, it can prepare the event stream to enhance the human’s ability to use their stronger intuition and meaning. There are two huge components of this:

Statistical analysis and correlation
Deriving meaning by correlating human and machine stated intentions

Meaning as well as cause and effect can be approximated by time and space correlations. Such correlations are often accurate, especially when involving more than a few objects and events based on historical correlations.
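One simple way to build such historical correlations is to count how often pairs of event types appear close together in time. This is only a sketch of the idea under assumed inputs; it ignores space (location) and uses a hypothetical `window` in seconds.

```python
from collections import Counter


def cooccurrence(history, window):
    """Count how often pairs of event types occur within `window` seconds.

    `history` is a list of (timestamp, event_type) tuples.
    """
    ordered = sorted(history)
    pairs = Counter()
    for i, (ts_a, type_a) in enumerate(ordered):
        for ts_b, type_b in ordered[i + 1:]:
            if ts_b - ts_a > window:
                break  # later events are even further away in time
            if type_a != type_b:
                pairs[tuple(sorted((type_a, type_b)))] += 1
    return pairs
```

Pairs with high counts relative to how often each type occurs alone are candidates for being treated as related when they next appear together.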

The second method is to borrow intrinsic human or machine knowledge already contained within existing events such as:

Human meaning attached
Machine stated intentions
Human stated intentions

The most common ‘attached human meaning’ example is device naming conventions which often encode the physical location and business purpose of devices. Another example is port numbers, some of which are industry standards, such as 1521 for an Oracle database or 22 for SSH. Though these require human intervention, they are much more static and far fewer than specific rules on interpreting a single type of syslog event or SNMP trap.
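A sketch of borrowing this attached meaning follows. The two port numbers come from the text above; the `<site>-<purpose>-<number>` hostname convention is purely hypothetical, since every organization defines its own.

```python
# Ports named in the text; a real table would be longer.
WELL_KNOWN_PORTS = {22: "SSH", 1521: "Oracle database"}


def parse_hostname(hostname):
    """Decode a hypothetical '<site>-<purpose>-<number>' naming convention."""
    parts = hostname.split("-")
    if len(parts) != 3 or not parts[2].isdigit():
        return None  # name does not follow the assumed convention
    site, purpose, number = parts
    return {"site": site, "purpose": purpose, "number": int(number)}


def describe_port(port):
    """Translate a port number into its conventional service, if known."""
    return WELL_KNOWN_PORTS.get(port, "unknown")
```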

Some feeds into a Situational Awareness solution are statements of human or machine intent. For example, OpenStack and Puppet can perform healing mechanisms which can trigger events. Also, a department’s Twitter feed or the troubleshooting situation rooms described earlier can be correlated with events happening on the network. Thus, if OpenStack states it intends to swap out a failed hard drive and a little later a running application that accesses that hard drive locks up, the relationship between the two is obvious when presented to a human.
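Pairing stated intentions with later events could be sketched as follows. The tuple layout, the object names, and the `window` are assumptions; real intent feeds would need parsing first.

```python
def explain_by_intent(intents, events, window):
    """Attach to each event any earlier stated intent about the same object.

    `intents` and `events` are lists of (timestamp, object_name, text).
    An intent matches when it names the same object and precedes the event
    by no more than `window` seconds.
    """
    explained = []
    for ev_ts, ev_obj, ev_text in events:
        matches = [
            text
            for ts, obj, text in intents
            if obj == ev_obj and 0 <= ev_ts - ts <= window
        ]
        explained.append((ev_text, matches))
    return explained
```

The output simply places the intent next to the event; deciding that the drive swap explains the lockup is still left to the human.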

Adaptive Monitoring

Overview

Situational Awareness is not a passive affair. This is especially true with self-healing cloud environments such as OpenStack. Machines can automate responses after encountering repetitive issues. That said, the bulk of the current Situational Awareness industry uses passive monitoring exclusively. This is usually a flat list of filtered events coupled with selective auto ticketing. Any deep diagnostics or problem solving is handed off to humans. This obviously is out of sync with cloud-centric approaches, which not only detect and diagnose problems, but can often perform self-healing.

As such, adaptive monitoring depends on looking at the cloud first when it comes to command and control and then at the individual network, devices, and applications.

There are four areas of improvement which can apply this philosophy and close the current gap between cloud concepts and Situational Awareness solutions:

Emergent Situational Awareness
Self Aware (Product on Product)
Socialized Machine Learning
Machine Learning

Improvement 5: Emergent Situational Awareness

One of the greatest flaws of the current approach to Situational Awareness is the lack of a holistic view. In particular, the Internet was put together by the military and as such saw most things through a security lens. Through this lens when a monitoring or managing application discovers the network, it is based on the idea of ‘us and them.’ This part-wise, left brain approach is diametrically opposed to what occurs in nature. Instead, nature works more like DHCP and other registration systems. Further, after the initial security handshake, a holistic approach respects the perspective of the element. For example, after a device or VM successfully registers, the element could inform the Situational Awareness software:

How it wants to be monitored
What resources and applications the element holds
The element’s overall business purpose (as recorded by a human)

Some of these rules would be provided by the vendor, while others would be provided by the administrators of the applications and systems. Instead, the current approach is to have one group responsible for knowing everything about everything and to poll the living hell out of the network and devices, under the assumption that they not only cannot be trusted but are also ignorant of their own needs and desires. Clearly, this approach:

Is not sustainable
Creates a huge customer and provider disconnect
Results in the need for massive human intervention, using stale data and static rules

Distributing the element procurement and instrumentation responsibility to the customer and adding a registry function mimics most complex systems in nature, including how the human mind registers the parts of the human body. In these complex systems traditional security still exists, but the first and primary line of defense is policing the borders. Once there is an infiltration into the trusted system, it becomes a much harder problem to solve. As such, handling complex systems with only this secondary defense is idiotic.

In traditional IT this sort of approach is impossible without fundamental changes to routing protocols across multiple vendors. However, in the case of OpenStack and other cloud technologies, the distributed command and control can be utilized to create a registration basis for element procurement, security, and Situational Awareness.
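The registration idea in this section, a security handshake followed by the element describing itself, might be sketched like this. The class names, fields, and token check are hypothetical and do not correspond to any OpenStack API.

```python
from dataclasses import dataclass, field


@dataclass
class Registration:
    element: str
    monitoring: dict            # how it wants to be monitored
    resources: list             # what resources and applications it holds
    business_purpose: str       # recorded by a human


@dataclass
class Registry:
    trusted_tokens: set
    elements: dict = field(default_factory=dict)

    def register(self, token, registration):
        """Admit an element only after the initial security handshake."""
        if token not in self.trusted_tokens:
            return False  # border policing: unknown elements are refused
        self.elements[registration.element] = registration
        return True
```

Once an element is in the registry, the monitoring side reads its stated preferences instead of guessing them through blanket polling.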

Improvement 6: Self Aware (Product on Product)

Once the cloud is viewed as the body by the Situational Awareness solution, the next step is to expand that concept to the deployments of the Situational Awareness solution itself. That is, at the corporate headquarters, use an instance of the Situation Management solution to monitor the health and welfare of all other Situation Management deployments. This has the added side effect of allowing the solution provider to be a solution consumer and to gain the experience-learned as well as the book-learned perspective of the customer.

Basic system and application metrics along with connectivity and logs can be fed from the customer Situation Management solutions back into the corporate instance of the Situation Management solution. As a result the company can proactively detect and prevent issues before the customer is aware of them as well as map out the impact of upgrades and patches. In short, the company becomes much more aware in real time of the customer’s experience.
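One small piece of this, noticing that a customer deployment has gone quiet, can be sketched as below. The heartbeat mapping and the `max_silence` threshold are assumptions chosen for illustration.

```python
def stale_deployments(last_heartbeat, now, max_silence):
    """Return customer deployments that have been silent for too long.

    `last_heartbeat` maps a deployment name to the timestamp of the last
    metrics report received by the corporate instance.
    """
    return sorted(
        name
        for name, ts in last_heartbeat.items()
        if now - ts > max_silence
    )
```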

Improvement 7: Socialized Machine Learning

Many Situational Awareness solutions allow human social networking to share programming patterns for event tokenization, analytics, reports, and automations among customer deployments. A lot of this work could be automated behind the scenes through current cloud technologies. In particular, indexing and lint cleaning behind the scenes, combined with a better interface for letting humans nudge the machine toward the right choice, would make sharing among customer deployments more useful.

Specifically, the configurations and code would be sanitized, tagged based on key words and patterns, and entered into a shared knowledge database for access by other customers. Some changes could be delivered as automated updates from a central corporate librarian. Though this likely would require a human to remove duplication and reduce variances, by using the WHAT versus HOW approach described earlier as well as automated code correlation, the machine and standards could simplify the problem to something manageable.
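The sanitize and tag steps could look roughly like this. The regular expressions cover only IPv4 addresses and a few obvious credential keys, and the keyword-to-tag table is hypothetical; a real librarian would need far more thorough rules.

```python
import re

IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
SECRET = re.compile(r"(?i)\b(password|secret|token)\s*=\s*\S+")

# Hypothetical keyword-to-tag table.
TAGS = {"snmp": "data-collection", "syslog": "data-collection", "jira": "ticketing"}


def sanitize(text):
    """Mask IPv4 addresses and obvious credentials before sharing."""
    text = IPV4.sub("<ip>", text)
    return SECRET.sub(lambda m: f"{m.group(1)}=<redacted>", text)


def tag(text):
    """Return the sorted set of tags whose keyword appears in the text."""
    lowered = text.lower()
    return sorted({t for keyword, t in TAGS.items() if keyword in lowered})
```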

Improvement 8: Machine Learning

The final holistic method is the passive and active training of the machine by watching human operators accessing tools and diagnosing problems. As operators ping devices, create trace routes, and perform shell commands on consoles to the devices, the machine can watch and learn. At a minimum it can present command histories to past events along with their outcome. At best it can learn and build internal state machines (finite automata) and eventually automate some to all of the diagnostics and resolution steps performed.
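A minimal sketch of the watch-and-learn idea: record the order in which operators run commands for each event type, then suggest the most common next step. This is a simple transition-count table rather than a full finite automaton, and every name here is an assumption.

```python
from collections import Counter, defaultdict


class CommandLearner:
    """Learn which command operators tend to run next for an event type."""

    def __init__(self):
        # (event_type, previous_command) -> Counter of next commands
        self.transitions = defaultdict(Counter)

    def observe(self, event_type, commands):
        """Record one operator session as a chain of state transitions."""
        previous = "START"
        for command in commands:
            self.transitions[(event_type, previous)][command] += 1
            previous = command

    def suggest(self, event_type, previous="START"):
        """Return the most frequently observed next command, or None."""
        counts = self.transitions.get((event_type, previous))
        if not counts:
            return None
        return counts.most_common(1)[0][0]
```

At the minimum level described above, `suggest` only presents what operators did before; automating the steps would additionally require recording outcomes, which this sketch omits.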

Example: IBM’s Tivoli Netcool vs. Moogsoft’s incident.Moog

Introduction

Two examples demonstrate the difference between a legacy approach and an approach incorporating better machine delegation.

Both solutions were developed by Phil Tee and Mike Silvey, albeit separated by two decades.

The legacy approach is IBM Tivoli Netcool, which remains, after two decades, the standard implementation in the majority of Fortune 500 companies. The modern solution is Moogsoft’s incident.Moog. Both technologies follow very similar approaches, but incident.Moog adds two critical enhancements:

Statistical correlation
Social enablement

Statistical correlation

The use of real-time statistical correlation adds several benefits to Moog, some of them unexpected. First, it typically reduces the number of things to look at by one to two orders of magnitude. Next, it correlates and groups events, rather than filtering and discarding them, providing a better holistic picture. Finally, it puts human intuition and insight to work: when the subject matter expert views the organized events and applies his own knowledge of the environment and expertise in the field, the answer is usually apparent.

For example, take the case of a situation containing the following events:

Failed circuit
BGP attempts to reroute
Network buffer overflows
Application connection failures

The SME will immediately recognize that the failed circuit is likely the cause of the situation.

Grouping events in this way also allows more advanced presentations that tell the story of the situation and how it unfolded by plotting the events in time. This enables human intuition and insight in ways not seen in the more traditional IBM Tivoli Netcool approach, which centers on lists of events and depends on filtering out and discarding events rather than on inclusion and the connections between events.

Perhaps the greatest benefit is drastically reducing the need for human intervention. In the absence of statistical correlation, extensive human-created rules are needed to reduce and correlate events to something manageable, and further rules are needed to assign severities or otherwise prioritize the importance and urgency of events.

Social Enablement

The traditional method of social enablement is to codify aspects of ITIL, COBIT, and other IT business process standards. However, in practice, trying to project business processes top-down is slow and difficult to implement across multiple groups. In the case of notification, the standard method is ticketing, which often adds another layer of obfuscation and delay to the process. In the case of incident.Moog, chat rooms are spawned and SMEs for the impacted products can be pulled in to work through issues. Similar machine enablements reduce the bureaucratic layers and facilitate the actual workflows used to solve problems.

By providing a historical plot of the events in a situation as a two-dimensional graph, the natural human ability to process graphical information is exploited. Instead of drilling through an organized event tree or, worse, scanning a list of events, the human can use the chronologically plotted situation to see cause and effect and identify the root cause or causes much more quickly than with the traditional IBM Tivoli Netcool approach.
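Even a text-only rendering shows why a chronological plot helps. The sketch below is not how incident.Moog draws its graphs; it is a hypothetical illustration that offsets each event horizontally in proportion to when it occurred.

```python
def render_timeline(events, width=40):
    """Render (timestamp, text) events as rows on a shared time axis."""
    if not events:
        return ""
    ordered = sorted(events)
    start, end = ordered[0][0], ordered[-1][0]
    span = (end - start) or 1   # avoid dividing by zero for a single instant
    rows = []
    for ts, text in ordered:
        column = round((ts - start) / span * (width - 1))
        rows.append(" " * column + "* " + text)
    return "\n".join(rows)
```

With the circuit failure situation described earlier, the failed circuit lands at the far left and the application failures at the far right, so the eye reads cause to effect from left to right.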

Summary

Though it seems small on the surface, leveraging both the machine’s ability to perform statistical correlation and its ability to enable social interactions creates a significantly superior solution, as it maximizes what the machine and the human each do well and increases their interdependence and cooperation.

Conclusion

Sometimes the difference between success and failure is attitude, and sometimes the attitude is subconscious. The attitude might not even belong to a single individual, but rather be embedded in the posture of an entire industry.

At the dawn of the Internet, DARPA saw Situational Awareness as a security exercise. Who could blame them? They were, and still are, a primary research arm of the United States military and the Internet was nothing more than a handful of computers wired together. In short, DARPA didn’t recognize the concept of cloud and really didn’t need to.

However, since the Internet was handed over to civilians, the Situational Awareness industry has had a choice. Instead of choosing to move forward into cloud-centric thinking, the industry has chosen to keep its security-centric roots. This is diametrically opposed to every successful complex system found in nature. By continuing to see the network as a collection of unrelated parts, every device is on the frontline and there is no well defined boundary to defend. In short, there is no recognition of cloud.

In the time of DARPA that approach made sense as connections were limited to a handful of computers in the U.S. military and academia. However, with the advent of OpenStack, AWS, and other cloud solutions, that posture needs to change. The Situational Awareness industry is severely lagging behind the rest of IT and staying in the old paradigm. The industry’s mindset is stuck in the universe of Terminator and Jurassic World.

The industry needs to look forward and through the lens of the universe described more recently in Tomorrowland. In this movie, it is the broadcast of predictions of doom and gloom that promise an early demise to mankind. Specifically, these feelings lead to decisions which result in the very outcome that humanity fears. These same fears set human against machine.

Unlike the earlier mentioned movies, Tomorrowland has a heroine with the emotional intelligence to look inward when consistent problems present themselves. She realizes that it isn’t a vast conspiracy all focused on her but rather her choices and perceptions that are the root of her problems. Through humility and the release of outdated wartime fears, the heroine finds the solution in optimism and cooperation with the machine.

I believe the situation for the Situational Awareness industry is no different. It is the adherence to a wartime posture and a failure to be introspective that is blocking the industry from moving forward. Specifically, it is the emotional stance of an outdated security-centric posture that prevents us from walking with OpenStack, AWS, and other cloud solutions into a more optimistic and, oddly, more realistic future.

It is time for the Situational Awareness industry to grow up.

Tags: cloud, nms, openstack, situational awareness

Categories: State Of NMS
