[PEAK] Re: Proposal for another Wiki "tutorial"

Wed Jul 14 16:59:20 EDT 2004

At 06:42 PM 7/14/04 +0100, Paul Moore wrote:
>Paul Moore <pf_moore at yahoo.co.uk> writes:
>
> > I'll start an initial Wiki page in the next day or two. Have a look
> > and see what you think.
>
>OK, I've put up some initial stuff. All talk and no code so far, but
>maybe it gives a flavour of where I intend to go. Comments gratefully
>accepted (here, on the Wiki, or via email).

A few design comments...  hope you don't mind.

If you are going to develop a framework like this with PEAK, the place to 
start is with "enduring abstractions".  That is, things that will always be 
a part of the application.  Or, to put it another way, what is the 
application's "domain"?

In the case of your monitoring system, the application domain is the status 
of systems.  Or perhaps more precisely, the services provided by those 
systems.  Thus, systems and services are your application domain.  Services 
have a status.  Services are offered by systems, which may be in groups 
such as clusters or networks.

Although the mechanics of what you *do* with them may change over time, 
your application(s) are going to always be dealing with services, systems, 
groups, and statuses, so there are your enduring abstractions.

So, what do you *do* with these things?  The kinds of functionality you can 
have are:

* Report on the current status of a service, system, or group, either in 
summary or detail
* Report on historical statuses (e.g. average uptime %)
* React to a change in status, e.g. send an e-mail

So, now that we know both the "nouns" and the "verbs" of our enduring 
application domain, we can create interfaces for these things.  For 
example, we could define the interface for services so that they expose an 
'events.IValue' for their current status, making it easy for reactive 
systems to "listen" to that event.
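
As a rough illustration of the "listen to the current status" idea, here
is a minimal plain-Python sketch.  It does not use PEAK's actual
'events.IValue' API; 'ObservableValue', 'Service', and 'add_listener' are
hypothetical names invented for this example, and modern Python 3 is
assumed.

```python
from typing import Callable, List


class ObservableValue:
    """Hypothetical listenable value (NOT PEAK's events.IValue)."""

    def __init__(self, initial):
        self._value = initial
        self._listeners: List[Callable[[object, object], None]] = []

    def add_listener(self, callback):
        """Register callback(old, new) to be called on every change."""
        self._listeners.append(callback)

    def get(self):
        return self._value

    def set(self, new):
        old = self._value
        if new == old:
            return  # no change, so nothing to announce
        self._value = new
        for callback in list(self._listeners):
            callback(old, new)


class Service:
    """A problem-domain object exposing its current status."""

    def __init__(self, name, status="unknown"):
        self.name = name
        self.status = ObservableValue(status)


# Usage: a reactive component just attaches itself to the status.
changes = []
web = Service("web")
web.status.add_listener(lambda old, new: changes.append((old, new)))
web.status.set("up")
web.status.set("up")    # same value again: listeners are not called
web.status.set("down")
```

The point of the design is that the service knows nothing about who is
listening; e-mail senders, history recorders, and so on all attach the
same way.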

To go much further, though, we really have to flesh out what a "status" 
is.  We could look at it as a simple "upness" or "downness" status, and 
that might be useful for some things.  More relevantly, we could view a 
status as a metric, or collection of metrics, that apply to a particular 
service (or aggregation thereof), at a particular point in time.  (This 
last item is important for historical analysis, reacting to events, and 
perhaps even for service monitors to decide whether they should "ping" a 
service again.)
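
One way to picture "a collection of metrics for a service at a point in
time" is the sketch below.  The names ('Status', 'taken_at', 'needs_ping')
and the five-minute staleness limit are assumptions made up for
illustration, not part of PEAK.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Dict


@dataclass(frozen=True)
class Status:
    """Metrics observed for one service at one moment."""
    service: str
    taken_at: datetime
    metrics: Dict[str, object] = field(default_factory=dict)

    def is_up(self):
        # Simple "upness" is just one metric among others.
        return bool(self.metrics.get("up", False))

    def age(self, now):
        return now - self.taken_at


def needs_ping(status, now, max_age=timedelta(minutes=5)):
    """A monitor can use the timestamp to decide whether to re-check."""
    return status.age(now) > max_age
```

Because the timestamp travels with the metrics, the same object serves
historical reports, reactions, and the monitor's own scheduling.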

You'll notice here that I have not addressed controllers or repositories or 
anything like that.  Those are what we call "solution-domain" components, 
as opposed to "problem-domain" components.  In general, solution-domain 
components are much less enduring and reusable than problem-domain 
components.  Also, if we design our problem-domain components well, often 
the solution-domain components evaporate into little more than glorified 
startup scripts!

Consider this: if you had objects to represent services, hosts, and so 
forth, that offered current status info and could trigger callbacks to 
systems that recorded history or took action, and they automatically 
handled the monitoring, what would be left to write?  Two things:

* "plug-ins" to perform specific tasks like monitoring events of a 
particular type and sending notifications

* reporting scripts to walk the domain objects and generate output of an 
appropriate format

These are specific day-to-day programming tasks to be accomplished with 
your framework's components, rather than being part of the framework 
themselves.
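
To show how small such a reporting script can be once the domain objects
exist, here is a sketch that walks a group, its systems, and their
services.  The class and attribute names are hypothetical, and how the
objects get loaded (a DM, a config file) is deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Service:
    name: str
    status: str = "unknown"


@dataclass
class System:
    name: str
    services: List[Service] = field(default_factory=list)


@dataclass
class Group:
    name: str
    systems: List[System] = field(default_factory=list)


def report_lines(group):
    """Walk the domain objects and produce an indented text report."""
    lines = [group.name]
    for system in group.systems:
        lines.append("  " + system.name)
        for service in system.services:
            lines.append("    %s: %s" % (service.name, service.status))
    return lines
```

A different output format (HTML, CSV) would be another small function
over the same objects, which is why it belongs in a script rather than in
the framework.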

Also, the idea of a "repository" isn't really that useful either.  This is 
just an issue of storage, and that can and should be abstracted away.  In 
the case of reactive plugins, they won't care because they will just get 
attached to the right domain objects.  In the case of your scripts, they 
can simply reference the specific DM they want to load from, or use ZConfig 
to load a specified configuration file containing all the 
service/system/whatever objects.

So, now we begin to see that we actually have/want some sort of monitoring 
"server", in the sense that we don't want every little script doing its own 
status testing.  This is sort of like your "controller" concept, only much 
simpler.  All the "server" really is, is a script that loads up the domain 
objects, attaches reaction and reporting plugins to the domain objects, and 
runs the system's event loop.
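
The "server" described above might look roughly like the following.  This
is a plain-Python sketch under stated assumptions: the fixed-count loop is
only a stand-in for a real event loop, and 'Service', 'run_server', and
'record_change' are hypothetical names, not PEAK components.

```python
class Service:
    """Domain object that checks itself and notifies listeners."""

    def __init__(self, name, probe):
        self.name = name
        self.probe = probe      # callable returning "up" or "down"
        self.status = None      # nothing known until the first check
        self.listeners = []

    def check(self):
        new = self.probe()
        if new != self.status:
            old, self.status = self.status, new
            for listener in self.listeners:
                listener(self, old, new)


def run_server(services, plugins, ticks):
    """Attach the plugins to the domain objects, then run the loop."""
    for service in services:
        service.listeners.extend(plugins)
    for _ in range(ticks):      # stand-in for a real event loop
        for service in services:
            service.check()


# A reaction plugin is just a callable; this one records history.
log = []


def record_change(service, old, new):
    log.append((service.name, old, new))


readings = iter(["up", "up", "down"])
run_server([Service("web", lambda: next(readings))], [record_change], ticks=3)
```

Only the two status changes reach the plugin; the repeated "up" reading
is absorbed by the domain object.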

Now, it may be that I have just designed something more like John Landahl 
wants than what you want.  :)  But, my main intent is to show how in 
designing with PEAK, you can start with what you "really want", focusing on 
the essentials of the problem rather than on accidents of 
implementation.  In essence, "repository" and "controller" are just 
computer words that aren't part of what you're really trying to do.  PEAK 
shoves these considerations off to the side using generic abstractions like 
DMs, commands, event loops, and executable configuration.  Thus, the bulk 
of the code that *you* write is about the problem domain, i.e. services and 
statuses, reports and reactions.

Does that make sense to you?

Anyway, the most interesting part, I think, of designing a framework like 
this, is the status metrics, because they represent a point of change over 
time.  Metrics may be discrete (e.g. boolean or enumerations) or numeric 
with units.  And they need names.  Some metrics may be derived from other 
metrics.  With an appropriate design for this part of the system, it should 
be possible to make fairly generic reporting and reaction tools, as well as 
developing advanced metrics that summarize various aspects of system state, 
for example a color-coding scheme that takes various other 
measurements into consideration.
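
A sketch of named metrics, with a derived color-coding metric computed
from them, could look like this.  The 'Metric' class, the metric names
"up" and "response_time", and the 500 threshold are all arbitrary
assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    value: object
    unit: str = ""    # left empty for discrete metrics


def color_code(metrics):
    """Derived metric: summarize other metrics as a single color."""
    by_name = {m.name: m.value for m in metrics}
    if not by_name.get("up", False):
        return Metric("color", "red")
    if by_name.get("response_time", 0) > 500:   # arbitrary threshold
        return Metric("color", "yellow")
    return Metric("color", "green")
```

Since the derived value is itself a 'Metric', generic reporting and
reaction tools can treat it exactly like a measured one.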

Actually, measurements should probably not just be point-in-time, but also 
support across-time measurements.  E.g. a metric for "% uptime over period".
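
As an example of an across-time measurement, "% uptime over period" can
be computed from a series of point-in-time readings.  The sketch below
assumes each reading stays valid until the next one; the function name
and the (timestamp, is_up) sample format are hypothetical.

```python
def uptime_percent(samples, period_end):
    """Time-weighted uptime from the first sample until period_end.

    samples: time-ordered list of (timestamp_seconds, is_up) pairs.
    Each reading is assumed to hold until the next reading.
    """
    if not samples:
        raise ValueError("need at least one sample")
    total = period_end - samples[0][0]
    if total <= 0:
        raise ValueError("period_end must be after the first sample")
    # Each sample's interval ends where the next one starts.
    boundaries = [t for t, _ in samples[1:]] + [period_end]
    up_time = 0.0
    for (start, is_up), end in zip(samples, boundaries):
        if is_up:
            up_time += end - start
    return 100.0 * up_time / total
```

For instance, readings of up at 0s, down at 60s, and up again at 90s,
over a period ending at 120s, give 90 seconds up out of 120.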

Hm.  Anyway, I'd better stop now, because at this point I'm halfway to making 
your framework into a generalized enterprise management reporting system 
that could just as easily report on people or departments and products as 
it could on systems...  :)