Why Bother?
To get right to the point, Wavefront is amazing, and you need it. You need it because it will let you see right into the heart of your system, however big and complicated that might be. You need it because you want to alert off meaningful telemetry generated by your whole estate, not off a shell script that exits 1, 2, or 3. You need it because, well, scaling Graphite.
Wavefront is a service into which you shovel time-series data. From collectd, statsd, JMX, Dropwizard, echo in a stupid shell script, anything. As fast as you like. At whatever resolution you like. Then, using an API, CLI, or a very nice UI, you can perform arbitrary mathematical operations on any number of those series. It scales seamlessly, it works all the time, the support is great, it’s feature-complete, and it’s well documented. It’s everything you always want, but never get.
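To give a sense of how low the barrier to entry is, here’s a minimal sketch of pushing a single point at a proxy using the plain-text Wavefront line format (metric name, value, optional epoch timestamp, then a source tag). The host name and metric path are placeholders; 2878 is the conventional proxy listening port, which you’ll see again later.
$ echo "lab.garage.temperature 21.4 $(date +%s) source=shark" | nc my-proxy 2878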
My current client uses it in production on an Ubuntu estate, but I have an all-SunOS (Solaris and SmartOS) lab, and I thought it would be interesting to instrument that. I can imagine lots of exciting possibilities wiring DTrace, kstats, and all manner of other stuff in, and I’m planning to write up my progress as I go.
Note: this article was updated in late November 2016. We were a closed-beta customer of Wavefront, and many things have been improved over the time we’ve used it.
Setting up a Proxy
Wavefront is presented to you as an endpoint and a web UI. As an administrative user you generate an access token in the UI, then configure a proxy which listens for incoming metrics, bundles them up, and uses the token to pass them securely to the endpoint. Anything can write to the proxy, so it’s up to you to limit access. In EC2 we do this with security groups and IAM roles, but my lab has a private network, so I can put the proxy there, and anything inside can send metrics.
Make a Zone
I’m going to build a dedicated Solaris 11.3 zone to host the proxy. I have a golden zone which I clone from, so creation only takes a couple of seconds. Here’s my zone config.
# zonecfg -z shark-wavefront export
create -b
set brand=solaris
set zonepath=/zones/shark-wavefront
set autoboot=true
set autoshutdown=shutdown
set limitpriv=default,dtrace_proc,dtrace_user
set ip-type=exclusive
add anet
set linkname=net0
set lower-link=net0
set allowed-address=192.168.1.30/24
set configure-allowed-address=true
set defrouter=192.168.1.1
set link-protection=mac-nospoof
set mac-address=random
set maxbw=10M
end
add capped-memory
set physical=1G
end
add rctl
set name=zone.max-swap
add value (priv=privileged,limit=1073741824,action=deny)
end
add rctl
set name=zone.max-locked-memory
add value (priv=privileged,limit=209715200,action=deny)
end
add rctl
set name=zone.cpu-cap
add value (priv=privileged,limit=50,action=deny)
end
add dataset
set name=space/zone/shark-wavefront
end
To make this into a real, running thing, I only need to create a dataset to delegate, and clone my golden zone.
# zfs create space/zone/shark-wavefront
# zoneadm -z shark-wavefront clone shark-gold
The config shows you that I capped the zone’s memory usage at 1G – plenty for my low-traffic proxy – and limited the CPU usage to the equivalent of half a core. I also pinned the zone’s IP address and default router from the global zone. I usually do this, because it stops anyone or anything in the zone deliberately or accidentally changing the address and making the proxy disappear. I also capped the VNIC bandwidth at 10 megabit/s, which is pretty much my upstream capacity. There might not be a great deal of value in that but, hey, I can, so why not?
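Once the zone is up, it’s easy to confirm from the global zone that the resource controls really took effect; prctl reports the live values of the rctls set in the config.
# prctl -n zone.cpu-cap -i zone shark-wavefront
# prctl -n zone.max-swap -i zone shark-wavefront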
If I were building a heavy-duty production proxy with hundreds of nodes writing to it (which I have done, many times), I’d set all these thresholds considerably higher, and build multiple, load-balanced zones on separate hosts.
The delegated dataset will be used for Wavefront’s logging and buffering. If the proxy can’t talk to your Wavefront cluster, it will buffer incoming metrics on disk until the endpoint comes back, when it will flush them all out. (We’ve seen some massive spikes in our outgoing metric rate after network issues, and the cluster absorbs them without flinching.)
With this in mind I put a quota on the dataset to stop a broken connection flooding the disk and affecting all the other zones on the box. Actually, this probably isn’t necessary any more, as new proxy versions seem to have acquired the ability to limit the size of the buffer. But I still think it’s smart to quota all your non-global datasets so one tenant can’t DoS the others. And again, why not, when ZFS makes it as simple as
# zfs set quota=300M space/zone/shark-wavefront
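and checking how much of that quota the buffer is actually consuming later on is just as easy.
# zfs get quota,used space/zone/shark-wavefront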
Building the Wavefront Proxy
Unsurprisingly, Wavefront don’t supply packages for anything Solarish. But, they do make the source code available, so we can build one ourselves.
Compilation isn’t hard, but that didn’t stop me making a script to make it even easier. That script works on Solaris 11 and SmartOS, and spits out a SYSV or pkgin package. (Well, it does that unless you don’t have fpm installed, in which case it gives you a tarball.) It also has the ability (assuming you have the privileges) to satisfy build dependencies: namely Java 8, Maven and Git.
If you can’t be bothered with all of that, here’s a ready-made package.
Installing the Wavefront Proxy
The package bundles an SMF method and manifest, but you will have to create the user and make a couple of directories on that dataset we delegated earlier. From inside the zone the delegated dataset looks like a ZFS pool, and we can treat it as if it were one.
# useradd -u 104 -g 12 -s /bin/false -c 'Wavefront Proxy' -d /var/tmp wavefront
# zfs create -o mountpoint=/var/wavefront/buffer shark-wavefront/buffer
# zfs create -o mountpoint=/var/log/wavefront shark-wavefront/log
# zfs create -o mountpoint=/config shark-wavefront/config
# chown wavefront /var/wavefront /var/log/wavefront
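For completeness, installing the SYSV datastream package is a single pkgadd, and once the config file described below is in place you can bring the proxy up through its bundled SMF service.
# pkgadd -d SDEFwfproxy.pkg all
# svcadm enable wavefront/proxy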
Hopefully, of course, you’d do this properly, and automate it all with the config-management software of your choice.
I have a Puppet module which you are welcome to use and extend, but it’s not exactly state-of-the-art. To use it, you must convert the datastream package build_wf_proxy.sh creates into directory format.
$ pkgtrans SDEFwfproxy.pkg . SDEFwfproxy
As we’ve already seen, the proxy is a Java application, so you’ll need a JVM. The Puppet stuff takes care of this of course, but if you’re doing things by hand, remember to:
# pkg install java/jre8
Briefly returning to storage, if your delegated dataset didn’t inherit the compression=on property, it’s definitely worth setting it now. Looking at my existing proxy
$ zfs get -Hovalue compressratio shark-wavefront/buffer
11.87x
I find that turning on compression gets me twelve times the buffering period for free! The Wavefront UI will tell you how long an outage the buffering will cover on each configured proxy. I habitually turn compression on, unless I know a dataset will only contain incompressible data. I haven’t properly benchmarked, but it seems to me that in most workloads performance improves on compressed datasets.
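Turning compression on is a one-liner; here it’s applied to the buffer dataset, though setting it further up the tree and letting children inherit it works just as well.
# zfs set compression=on shark-wavefront/buffer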
Configuring the Proxy
So, assuming you’ve created the user, made the directories and installed the package, you’re almost ready to go. Depending on how busy you expect your proxy to be you might want to change the amount of memory allocated to the JVM. You can do that through SMF properties.
$ svcprop -p options wavefront/proxy
options/config_file astring /config/wavefront/wavefront.conf
options/heap_max astring 500m
options/heap_min astring 300m
You can see it sets a very small Java heap size, which so far seems to be fine for my modest lab requirements. Your mileage may vary, but it’s pretty easy to change.
# svccfg -s wavefront/proxy setprop options/heap_max=2048m
# svcadm restart wavefront/proxy
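Depending on how the method script reads its properties, the instance may also need a refresh so the new value lands in the running snapshot before the restart; svcprop will confirm the change took.
# svcadm refresh wavefront/proxy
$ svcprop -p options/heap_max wavefront/proxy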
The proxies report back a lot of internal metrics, which make it very easy to monitor them. Relevant to the heap size discussion are JVM statistics, which let you see memory usage inside the JVM. This is one of a number of charts on my Wavefront “internal metrics” dashboard.
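For instance, charting the proxy’s heap usage is just a ts() query over those internal metrics. The metric path below is from memory and has changed between proxy versions, so treat it as illustrative rather than definitive.
ts(~agent.jvm.memory.heapUsed, source=shark-wavefront)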
The proxy, obviously, needs a config file. It needs to know where to talk to, how to authenticate, how to identify itself, and what ports to listen on. I keep my application files in /config, on my delegated dataset. I started this habit years ago, before I learnt config management. The idea is that you can easily rebuild a vanilla zone, re-import the dataset, and the applications will work. If you want to use /etc or something, the config-file path is also an SMF option.
server=https://metrics.wavefront.com/api/
hostname=shark-wavefront
token=REDACTED
pushListenerPorts=2878
pushFlushMaxPoints=40000
pushFlushInterval=1000
pushBlockedSamples=5
pushLogLevel=SUMMARY
pushValidationLevel=NUMERIC_ONLY
customSourceTags=fqdn, hostname
idFile=/var/wavefront/.wavefront_id
retryThreads=4
I don’t currently do any metric whitelisting, blacklisting or pre-processing on this proxy. I use it almost entirely for experimenting and playing, so I want everything to go through, right or wrong.
In my client’s production environment we use metric whitelisting on all the proxies. By defining a single whitelist regular expression, we only accept metrics whose paths fit our agreed standards. This preserves the universal namespace which our tooling (and people) expect to see. When you have multiple proxy clusters, I think it’s also worth having them point-tag everything to help you identify where things came from.
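As a sketch of the idea, a single line in wavefront.conf is all it takes. The option name is as I remember it from the proxy configuration reference, and the prefixes are invented, so adjust both for your own setup.
whitelistRegex=^(prod|dev|lab)\..*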
I also only use the native Wavefront listener, disabling the OpenTSDB and Graphite ports. My metrics all go in via a customized Diamond setup which speaks native Wavefront. If you want to use, say, collectd, you’ll have to use the Graphite listener. (Unless someone has written a Wavefront plugin, which they might have by the time you read this.)
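Should you need it, enabling the Graphite listener is a couple more lines in wavefront.conf: a port to listen on, and which dot-separated field of the incoming metric name to treat as the host. The property names here are from my recollection of the proxy docs, so verify them against your version.
graphitePorts=2003
graphiteFormat=2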
Dealing with Proxy Logs
Even in SUMMARY mode, the proxy server is chatty. So, we need to have logadm keep it in check.
# echo '/var/log/wavefront/wavefront-proxy.log -N -A 30d -s 10m -z 1 -a \
"/usr/sbin/svcadm restart wavefront/proxy"' \
>/etc/logadm.d/wavefront-proxy.logadm.conf
# svcadm refresh logadm-upgrade
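If you want to be sure logadm is happy with the new entry before its next scheduled run, it will validate the fragment for you.
# logadm -V -f /etc/logadm.d/wavefront-proxy.logadm.conf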
My lab setup has a centralized logging system built with rsyslog and Graylog2. I also use Fluentd for things which don’t fit naturally with syslog, which is clearly the case here.
The logs are multi-line, with the first line having the timestamp, and the second the message. The message is not always consistent. So far I’ve found the following block of config satisfies my needs, but it may need refining.
Note: for formatting, I have folded the long regex with backslashes, but really it has to be one long line.
<source>
@type tail
path /var/log/wavefront/wavefront-proxy.log
pos_file /var/run/td-agent/foo-bar.log.pos
tag wavefront_proxy
format multiline
format_firstline /^(?<time>.*) (?<class>com.wavefront.agent[^ ]+) (?<method>.*)$/
format1 /^(?<level>\w+): \[(?<port>\d+)\] \((?<type>\w+)\): \
points attempted: (?<attempted>\d+); blocked: \
(?<blocked>\d+)$|^(?<level>\w+): (?<message>.*)$/
time_format %b %d, %Y %l:%M:%S %p
</source>
<match wavefront_proxy.**>
type copy
<store>
type gelf
host graylog.localnet
port 12201
flush_interval 5s
</store>
</match>
When I first set this up I parsed out the attempted and blocked counts, so I could alert on blocked messages. I set up a stream in Graylog which tailed the proxy log and had an output which used the metrics plugin to report back to Wavefront. Whenever any invalid metrics were blocked, the stream noticed, and sent a “blocked” point to Wavefront, triggering an alert. So, I got Wavefront to monitor things which don’t get to Wavefront!
I don’t need to do that now, because the proxy reports back counts of points received, sent, and blocked, so I can tell Wavefront
rate(ts(~agent.points.2878.blocked))
and see a chart of blocked points, like the ones I generated when I was working on the Ruby CLI and kept sending duff points.
Wiring your logging into Wavefront is still a great idea though. At my client site every Fluentd log stream generates metrics, so we can have Wavefront alert off abnormal numbers of errors, or unexpectedly high or low log throughput. One of the first things we did was to plot the number of auth.error messages going through syslog to show people trying to brute-force their way into our perimeter SSH boxes. We now use it for far more things than that.
But, I digress. Again, there’s very rough-and-ready Puppet code to configure all of this logging stuff, with a separate manifest for Wavefront. You’re welcome to any of it.
Locking Down the Zone
Back to the “building a zone” part of the exercise, which I completed by making the zone immutable.
# zonecfg -z shark-wavefront
zonecfg:shark-wavefront> set file-mac-profile=fixed-configuration
zonecfg:shark-wavefront> commit
zonecfg:shark-wavefront> ^D
# zoneadm -z shark-wavefront reboot
When the zone comes back up /var is writeable, but everything else is read-only. Again, this is something I’ve adopted as standard practice. If you have a thing that you don’t expect to change, make it so that it can’t change.
Next time I’ll talk a little bit about how I started getting useful metrics out of Solaris and into the proxy.