In my industry, there’s a lot of monitoring to be done. Servers, services, processes, ports, log files, applications: you name it, we have to monitor it.
When I first started at DI, there was an existing Nagios implementation. Unfortunately, it was horribly done. Some agents used SSL, some didn’t. Everything was spread across separate configuration files, it was slow, and if you had too many monitors, they didn’t go off in time.
In any case, we started investigating replacements for Nagios and came upon Hyperic. Hyperic is an open source monitoring system (although there is also an enterprise version). It is written in Java and uses individual agents that are installed on each of the systems you wish to monitor. The agents have a very low overhead, although that depends somewhat on whether they are running scripts and what those scripts are doing.
Background
To understand this review, it helps to have some familiarity with the Hyperic inventory model. In a nutshell, a ‘platform’ is an OS, and a ‘server’ is a daemon or something similar that provides the next level down - a ‘service’. For instance, Apache would be a ‘server’ and it can provide many ‘services’. Some new terms to learn but nothing too hard to remember here. A service can also be a ‘platform service’, which basically skips the ‘server’ part and hangs directly off the platform. Any of these items is referred to as a ‘resource’. With that now under your belt, we can continue.
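If it helps to see that hierarchy written down, here is a minimal Python sketch of the model as described above. The class names, field names and the sample host are made up for illustration; this is not Hyperic's actual schema or API.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Union


@dataclass
class Service:
    name: str


@dataclass
class Server:
    # A daemon such as Apache, which provides one or more services.
    name: str
    services: List[Service] = field(default_factory=list)


@dataclass
class Platform:
    # An OS instance. Platform services attach directly, skipping the server level.
    name: str
    servers: List[Server] = field(default_factory=list)
    platform_services: List[Service] = field(default_factory=list)


# Anything in the tree counts as a "resource".
Resource = Union[Platform, Server, Service]


def iter_resources(platform: Platform) -> Iterator[Resource]:
    """Walk a platform and yield every resource beneath it."""
    yield platform
    yield from platform.platform_services
    for server in platform.servers:
        yield server
        yield from server.services


# Hypothetical example host.
web01 = Platform(
    name="web01 (Linux)",
    servers=[Server("apache", [Service("vhost example.com"), Service("vhost intranet")])],
    platform_services=[Service("/opt partition")],
)
resource_names = [r.name for r in iter_resources(web01)]
print(resource_names)
```

One platform, one server, two services and one platform service give five resources in total.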
One of the huge downfalls of a lot of monitoring systems is that they don’t bother with metric data. They’re simply concerned with ‘on/off’ type situations, or maybe a couple of performance checks and timeouts. Hyperic, however, goes the next step (and then some) and collects metrics about all the resources you have. For example, a process resource has metrics such as CPU usage, memory usage, state and many more. Hyperic collects all of these and graphs them in a very professional, clean looking interface. It is these metrics that Hyperic bases its alerting on. You can take these metric graphs and align them in custom views so that you can correlate events between different resources, rather than having to switch between different screens. This lets you really see what is going on across your enterprise.
Basically, any data that you can present in a simple key = value format, it can graph and keep track of. A key note here is that since Hyperic is written in Java, one of its strongest features is the ability to quickly monitor Java apps through JMX (think SNMP for Java), but we’ll get to that.
Now I’ll break the review up into a couple of sections to make it a little more organized. Some areas I have a lot more experience with than others.
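As a rough illustration of the key = value idea, here is a small Python sketch that emits and parses metrics in that shape. The one-pair-per-line layout and the metric names are assumptions made for the example; the exact format Hyperic expects from a script is not covered in this review.

```python
from typing import Dict


def format_metrics(metrics: Dict[str, float]) -> str:
    """Render metrics as one key=value pair per line (assumed layout)."""
    return "\n".join(f"{key}={value}" for key, value in metrics.items())


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse key=value lines, skipping anything that is not a numeric metric."""
    parsed: Dict[str, float] = {}
    for line in text.splitlines():
        key, sep, raw = line.partition("=")
        if not sep:
            continue  # no '=' on this line
        try:
            parsed[key.strip()] = float(raw.strip())
        except ValueError:
            continue  # value is not a number
    return parsed


# Hypothetical metric names.
sample = format_metrics({"QueueDepth": 12, "ActiveSessions": 3})
print(sample)
print(parse_metrics(sample))
```

Anything a script can print in this shape is a candidate for graphing, which is why this style of collection is so flexible.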
Server Setup and Installation
This part, thankfully, is quite simple. Hyperic supplies packages for all the operating systems that it supports. Unpack the tar/gzip archive, run one binary and you’re into the installation. It consists of a few questions to get yourself configured and bada-bing, bada-bam, you’re online.
There are some issues with things like overly long JDBC strings (for Oracle 10g RAC backends) and you have to do some things manually, like the HA configuration, but overall the process is quite simple and painless.
Agent Installation and Configuration
The agent installation is as simple as the server install. You’re asked a few questions at install time, like server destination, port, login credentials and the like. It’ll perform a test to see if it can connect to the server and then it’s online.
It’d be nice if they packaged a chkconfig-compatible rc script, but they do provide a generic Linux rc script that you can use.
Inventory
Here’s where I start to get a little annoyed with Hyperic and the methodologies that it uses. When an agent first starts up, it goes through all the plugins that it has available locally, and each of those plugins has an inventory method. Once the agent is done with the inventory search, it sends the results up to the main Hyperic server, allowing you to simply click and say ‘Yeah, start monitoring all these things that you found’. While this might seem cool at first, it quickly becomes painful, because it means that in order to have resources monitored by default, you have to write a plugin for each.
So let’s say you want all your systems to monitor the crond process automatically. Write a plugin. xinetd? Write a plugin. syslogd? Write a plugin. You get the picture. Hyperic offers some simple ways of creating these monitors after an agent has inventoried and registered itself, but this starts the trend that I see as Hyperic’s major downfall - it is painful for the *enterprise*. What I mean is that while Hyperic’s interface is simple and easy to navigate, it doesn’t offer the automation features that you need when it comes to managing several hundred alerts and metric monitors across thousands of systems. That’s not a blanket statement, and Hyperic does offer ways around this. I’ll go into detail on those in a little bit, but this inventory part is just the beginning.
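To make the inventory flow above a bit more concrete, here is a toy Python model of it. Everything in it (the ProcessPlugin class, the discover method, the report layout) is hypothetical and only mirrors the behaviour described in this section; real Hyperic plugins are XML descriptors, scripts or Java classes with their own interfaces.

```python
from typing import Dict, List


class ProcessPlugin:
    """Hypothetical plugin that reports a single process if it is running."""

    def __init__(self, process_name: str) -> None:
        self.name = f"{process_name}-plugin"
        self.process_name = process_name

    def discover(self, running_processes: List[str]) -> List[str]:
        # The plugin only knows how to look for its own process.
        if self.process_name in running_processes:
            return [self.process_name]
        return []


def run_inventory(plugins: List[ProcessPlugin], running_processes: List[str]) -> Dict[str, List[str]]:
    """Ask every installed plugin what it can find and build the report sent to the server."""
    report: Dict[str, List[str]] = {}
    for plugin in plugins:
        found = plugin.discover(running_processes)
        if found:
            report[plugin.name] = found
    return report


running = ["crond", "xinetd", "syslogd", "httpd"]
installed_plugins = [ProcessPlugin("crond"), ProcessPlugin("httpd")]
report = run_inventory(installed_plugins, running)
print(report)
```

In this toy run xinetd and syslogd are running but never show up in the report, because no plugin knows to look for them. That is the pain point: one plugin per thing you want picked up automatically.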
Metric Configuration
As I’ve mentioned, one of Hyperic’s great features is its ability to collect metrics. These metrics can come from JMX plugins, script output, compiled Java classes and a bunch of other sources. Each of these metrics can be compiled into different views with a nice clean interface that allows you to correlate information across the enterprise. So for example, you could see that when your Apache requests skyrocket, your application is taking this much memory, you’re making X calls to the database servers and they have Y amount of load. It can even correlate log and alerting events and display those alongside the metrics over time.
However, again, the configuration of the metrics themselves is clunky when dealing with many, many metrics across several hosts.
For example, let’s say that my agent registers itself and inventories all its partitions. Now I install 2000 agents across my enterprise and they all register and inventory. So I’ve got 2001 systems, each with an average of 8 partitions or so. That’s 16008 partition resources.
Now let’s say that I really don’t want to collect metrics that often on my / partition, because it isn’t used that much, so I go to change its default interval to 30 minutes. Well, here’s the kicker - you can do this, but you can only do it on a resource by resource basis. That’s right, you’d have to manually go to all those resources and change the default interval for that single partition.
Hyperic does offer a ‘default’ setting for metrics, but they took this idea and applied it in a much too abstract manner. You can set the default interval for the partition metric type - but that’s it. You can’t give it any parameters to distinguish between different partitions, and you can’t distinguish between operating systems either. So whatever interval I set as default for the partition type applies to ALL partitions across ALL operating systems - even Windows. It’s nice that the option is there, but it needs some work to make it actually useful. As I’ve mentioned, there are ways around this that I’ll get to in a bit.
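For what it’s worth, this is roughly the kind of criteria-based default I have in mind, sketched in Python. It is a wish-list illustration only: the IntervalRule class, its fields and the interval values are invented for the example and are not a Hyperic feature or API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class IntervalRule:
    """A default collection interval, optionally narrowed by OS and resource name."""
    resource_type: str
    interval_minutes: int
    os_name: Optional[str] = None   # None means "any OS"
    name: Optional[str] = None      # None means "any resource of this type"

    def matches(self, resource_type: str, os_name: str, name: str) -> bool:
        return (
            self.resource_type == resource_type
            and self.os_name in (None, os_name)
            and self.name in (None, name)
        )


def pick_interval(rules: List[IntervalRule], resource_type: str,
                  os_name: str, name: str, fallback: int = 1) -> int:
    """Return the interval of the first matching rule, so list specific rules first."""
    for rule in rules:
        if rule.matches(resource_type, os_name, name):
            return rule.interval_minutes
    return fallback


# Hypothetical rule set: most specific first, type-wide default last.
RULES = [
    IntervalRule("partition", 30, os_name="Linux", name="/"),
    IntervalRule("partition", 10, os_name="Windows"),
    IntervalRule("partition", 5),
]

print(pick_interval(RULES, "partition", "Linux", "/"))
print(pick_interval(RULES, "partition", "Linux", "/opt"))
print(pick_interval(RULES, "partition", "Windows", "C:\\"))
```

With something like this, the rarely used / partition gets its 30 minute interval once, for every Linux system, while everything else keeps the type-wide default.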
User Roles
I really can’t speak to these because we haven’t used them at my company. Hyperic supports user-based roles, and you can group resources together and apply certain roles to them for access permissions. That’s about all I know about this feature.
Groups
Hyperic offers two types of groups for resources - ‘Compatible’ groups and ‘Mixed’ groups. ‘Compatible’ groups contain resources that are all of the same type. So, for instance, I could group all my Linux systems, or all my /opt partitions, or all my CPUs, etc.
Mixed groups are the opposite - they allow you to group any resources together. The only real function of mixed groups is role-based access. Since we don’t use that, there isn’t much use for mixed groups at our company.
Compatible groups, however, attempt to let you manage resources in a more sane fashion, for instance when setting metric information and alerting. There are a few problems with this feature, though. The first is that the groups aren’t criteria based, so resources cannot automatically add themselves to the proper group (think ‘smart groups’). Whenever your agent inventories itself to the server, you still have to pick apart all the resources and shove them into the proper groups. While this may be alright for 30 systems, when you’re talking about 2000 systems it’s not as user friendly. The second problem is that, for some reason, compatible groups don’t offer the same level of detail that you get with individual resources. So while I can set very detailed alert specifications for an individual resource, I can’t do this with a compatible group. I’m not sure why this functionality is disabled, but it’s definitely something that I believe should be enhanced.
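Here is a small Python sketch of what I mean by ‘smart groups’: each group is defined by a criterion, and newly inventoried resources fall into the right groups on their own. The Resource tuple, the group names and the criteria are all made up for illustration; this is not how Hyperic implements groups.

```python
from collections import defaultdict
from typing import Callable, Dict, List, NamedTuple


class Resource(NamedTuple):
    host: str
    resource_type: str
    name: str


# Hypothetical group definitions. Each criterion pins a single resource type,
# so every group stays "compatible" in the sense used above.
GROUP_CRITERIA: Dict[str, Callable[[Resource], bool]] = {
    "all-opt-partitions": lambda r: r.resource_type == "partition" and r.name == "/opt",
    "all-cpus": lambda r: r.resource_type == "cpu",
}


def assign_groups(resources: List[Resource]) -> Dict[str, List[Resource]]:
    """Place each resource into every group whose criterion it satisfies."""
    groups: Dict[str, List[Resource]] = defaultdict(list)
    for resource in resources:
        for group_name, criterion in GROUP_CRITERIA.items():
            if criterion(resource):
                groups[group_name].append(resource)
    return dict(groups)


inventory = [
    Resource("web01", "partition", "/opt"),
    Resource("web01", "partition", "/var"),
    Resource("web01", "cpu", "cpu0"),
    Resource("db01", "partition", "/opt"),
]
groups = assign_groups(inventory)
for group_name, members in groups.items():
    print(group_name, [f"{m.host}:{m.name}" for m in members])
```

Run against 2000 freshly registered agents, the same few criteria would sort every resource without anyone clicking through the interface.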
On a side note, the new version of Hyperic, the 4.0 release, is supposed to have support for criteria-based groups. It’s due out for release in late summer.
Alerting
Besides just metric collection, Hyperic allows you to alert off the metrics that you gather from systems. You can alert based on several circumstances and conditions that must be met. You can also set up escalation schemes for your alerts. To top it all off, you can set alert notifications based on the number of times an alert happens in a certain time frame. This last part, however, is what the compatible groups fail to implement.
While this may not seem like a big issue, think about CPU alerts and memory alerts. CPU alerts are most often based on the CPU being over a certain threshold for a certain amount of time. After all, you don’t want to send alerts every time a CPU spikes for a second but you do want to know if the CPU is over a certain threshold for a few minutes at a time. Because compatible groups do not have this feature, that means grouping CPU and memory resources isn’t as convenient as other types of resources.
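The ‘over a threshold for a few minutes’ condition can be sketched in Python like this. It is a generic illustration of the idea, not Hyperic’s alert engine; the function name, the sample data and the 90% / 180 second values are assumptions chosen for the example.

```python
from typing import Iterable, List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)


def sustained_breaches(samples: Iterable[Sample], threshold: float,
                       min_duration: float) -> List[float]:
    """Return the start time of each period where the value stayed above
    the threshold for at least min_duration seconds."""
    alerts: List[float] = []
    breach_start = None
    alerted = False
    for timestamp, value in samples:
        if value > threshold:
            if breach_start is None:
                breach_start = timestamp  # a new breach begins
                alerted = False
            if not alerted and timestamp - breach_start >= min_duration:
                alerts.append(breach_start)  # alert once per sustained breach
                alerted = True
        else:
            breach_start = None  # back under the threshold, reset
    return alerts


# Hypothetical CPU readings taken once a minute.
cpu_samples = [
    (0, 50), (60, 95), (120, 40),                # one-minute spike, ignored
    (180, 92), (240, 96), (300, 93), (360, 97),  # sustained load
    (420, 30),
]
alerts = sustained_breaches(cpu_samples, threshold=90, min_duration=180)
print(alerts)
```

The short spike at 60 seconds never alerts, while the stretch starting at 180 seconds does once it has lasted three minutes.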
The next lacking part about alerting is something that we’ve already experienced - setting defaults. You can set the default alerts for a metric but, in a similar fashion to the metrics, you cannot set specifics for the defaults. This means that while you can set your default alert for the partition resource type, you can’t set an alert threshold for your /opt partitions that is different from the one for your /var partitions. Again, this is so limited that your default threshold applies to your Windows C:\ drives the same as it applies to your Linux / partitions. You can go into the individual resources and change the alert that was created from the default, but if you ever change the default definition, it erases all the modifications that you made to the individual alerts that were created off that default. It’d be nice if there was an option to only affect newly created alerts, but unfortunately at this time there isn’t any way around this.
Plugins
Hyperic offers some cool plugins out of the box, and you can find the list of supported products on their website. The plugins can be as complicated as you want (compiled Java programs) or as simple as you need (simple XML files). It also offers what’s known as a script plugin: a simple XML descriptor file along with a script in whatever language you prefer makes for an easy to implement custom plugin.
One benefit of these script type plugins is the integration with Nagios. In fact, Hyperic goes further in that it can quickly inventory Nagios plugins and bring them into your Hyperic deployment. This helps current Nagios shops quickly switch over to Hyperic.
As I mentioned in an earlier section, Hyperic agents only inventory what their plugins know about, by calling the inventory method of each plugin. While I can understand some of the benefits of this method, I wish it were combined with a way to set inventory defaults within the main interface that each agent would then pick up, rather than having to write a plugin for each.
As far as documentation goes, the Hyperic site has some plugin documentation but the best documentation that you’ll find is in the existing plugins and examples. You can readily plug through them, but they take some explaining to really understand what is going on.
Development
I wasn’t sure what to call this section but I felt the new HQU architecture available in Hyperic goes beyond just ‘plugins’. To quote the Hyperic site:
HQU is a plugin framework for Hyperic HQ which allows custom UI to be inserted into, and interact with various aspects of Hyperic HQ. All HQU plugins have the ability to interact with the entire HQ backend, and come with an API which allows for fast development.
HQU, in all seriousness, is a very cool utility. It allows you to customize Hyperic and its interface in any way you can think of - if you can program it. The HQU architecture is implemented in Groovy, a ‘scripting’ type language for the Java platform that was developed to give Java some of the high-level features that languages like Python and Ruby have. My main beef with this choice is that most system administrators, the people who are going to be using Hyperic the most, generally aren’t Java developers. We prefer languages like Ruby, Python, PHP and Perl. Hopefully they intend on coming out with some bindings for these languages.
If you really want to see the power of HQU, take a look at Jon Travis’s HyperCAST of integrating Hyperic with Jira.
They also offer a ‘Groovy console’, where you can quickly input Groovy code and execute it. Again, the main problem with this is the learning curve involved with Groovy and the lack of documentation about the HQU API and HQ backend API. There are some examples, but a lot more of them are needed to lessen the curve here. That, or just hire some Java developers.
Community
The community for Hyperic consists of a very active forum and what they’ve termed Hyperforge. Hyperforge consists of community contributed plugins that you can download and use in your own installations. It also contains much of the documentation for plugin development and the APIs provided by Hyperic.
As I mentioned, the forums are quite active, and the developers and support people actively patrol them to gather suggestions and offer help.
Another cool community item is the Hyperic HyperCASTs, which are basically webcasts that you can dial into and watch. Each cast has a theme and different developers are involved with each one. My favorite is the aforementioned cast of Jon Travis’s implementation of an extra interface in Hyperic to integrate with Jira. Just makes me wish I knew Groovy better.
Support
Now because our company has the enterprise version of Hyperic, we have access to Hyperic’s support center, which is basically a Jira implementation that customers use to submit support tickets. The support team over at Hyperic is quick on the turnaround, and while they might not always have the answer, they do escalate tickets rapidly. We’ve also had a lot of phone time with them, some of which they initiated just to better understand our needs. They’ve also addressed a lot of our concerns with professional services and some custom plugins. Overall, I’ve got to rank their support team as one of the best I’ve worked with.
If I had to complain about anything, it’d be the Jira implementation. Each company only has one login, so if you have multiple people supporting Hyperic and opening tickets, you often have to put a distribution list as the account’s email contact. It gets somewhat annoying to have to read everyone else’s ticket information. Rumor has it, though, that this will be going away and that individual user accounts tied to a company account are coming.
Conclusion
I’ve been a big supporter of Hyperic and its product offerings and I still am. I fought the battle at my company to purchase the product and I’ve met and talked with many of the people over at Hyperic. Even their CEO gets involved and takes a personal interest in the customers.
Hyperic does have some pain points for the enterprise in terms of managing the interface, and their documentation is somewhat lacking for all the new development and plugin features. People evaluating the product have to be aware of these shortcomings if they plan on integrating it into a large scale environment. However, if you’re willing to pay for it, Hyperic offers some quality professional services and their support services are top notch. Alternatively, spend the time to learn Groovy or get your own Java developers involved and you could probably get exactly what you need out of the product. If your company isn’t about to dish out the cash for the enterprise version and some professional services, and you don’t have the time or resources to learn Groovy, you might find the product a bit painful to use.
If you’re looking at a smaller deployment of Hyperic, chances are that you can get away without many changes to the interface or the need for enterprise type functions. If that’s the case, I’d highly recommend the product over anything currently out there.
There are a lot of other features I didn’t cover, mostly because I don’t have too much experience with them (log watching, the alert center and others), but I don’t think they change my end thoughts all that much. It’s a good product that just requires some polishing to become great for the enterprise.
The team over there is constantly working to improve the product and is open to suggestions from the community, so head over to Hyperic’s site and check it out.
August 4th, 2008 at 7:40 am
I personally found the Hyperic agent anything but lightweight. It was particularly memory intensive, using something in the region of 60 MB when first spawned.
The Hyperic graphs are very lacking and you would still need to run something such as Cacti alongside.
We settled on a Zabbix setup as its graphing is strong and the agent REALLY is lightweight (around 700k in memory and 0% CPU usage). It doesn’t have quite as many bells and whistles as Hyperic in terms of alerts/reports/escalations, but that is in development.