Previously on Twin Peaks
After some amount of work, I have very useful Solaris-specific Wavefront dashboards. Most of the metrics come from kstats, so they’re low overhead, and give deep, accurate introspection. For instance, here’s network traffic out of all zones on a single host, courtesy of my SunOS Network Collector.
But, kstats consider the kernel’s view of the system as a whole, and sometimes it’s handy to have a finer-grained view than that.
So, in this episode, I’m going to write something about how I use Solaris’s /proc filesystem to produce process-specific metrics.
Linux /proc is a mess. Who knows what’s in there. Solaris /proc isn’t. It’s very clean, consistent, and, of course, well documented. man -s4 proc will tell you all you need to know.
What the /proc?
You likely know that /proc contains one directory for each process in the system. The name of the directory is the PID of the process. Let’s have a look at one.
$ ls -l /proc/$$
-rw------- 1 rob sysadmin 5877760 Mar 3 14:15 as
-r-------- 1 rob sysadmin 336 Mar 3 14:15 auxv
dr-x------ 2 rob sysadmin 32 Mar 3 14:15 contracts
-r-------- 1 rob sysadmin 32 Mar 3 14:15 cred
--w------- 1 rob sysadmin 0 Mar 3 14:15 ctl
lr-x------ 1 rob sysadmin 0 Mar 3 14:15 cwd ->
dr-x------ 2 rob sysadmin 272 Mar 3 14:15 fd
-r-------- 1 rob sysadmin 0 Mar 3 14:15 ldt
-r--r--r-- 1 rob sysadmin 192 Mar 3 14:15 lpsinfo
-r-------- 1 rob sysadmin 1328 Mar 3 14:15 lstatus
-r--r--r-- 1 rob sysadmin 1072 Mar 3 14:15 lusage
dr-xr-xr-x 3 rob sysadmin 64 Mar 3 14:15 lwp
-r-------- 1 rob sysadmin 2600 Mar 3 14:15 map
dr-x------ 2 rob sysadmin 544 Mar 3 14:15 object
-r-------- 1 rob sysadmin 4176 Mar 3 14:15 pagedata
dr-x------ 2 rob sysadmin 816 Mar 3 14:15 path
-r-------- 1 rob sysadmin 72 Mar 3 14:15 priv
-r-------- 1 rob sysadmin 0 Mar 3 14:15 prune
-r--r--r-- 1 rob sysadmin 440 Mar 3 14:15 psinfo
-r-------- 1 rob sysadmin 2600 Mar 3 14:15 rmap
lr-x------ 1 rob sysadmin 0 Mar 3 14:15 root ->
-r-------- 1 rob sysadmin 2304 Mar 3 14:15 sigact
-r-------- 1 rob sysadmin 1680 Mar 3 14:15 status
-r--r--r-- 1 rob sysadmin 504 Mar 3 14:15 usage
-r-------- 1 rob sysadmin 0 Mar 3 14:15 watch
-r-------- 1 rob sysadmin 40824 Mar 3 14:15 xmap
Unlike Linux, Solaris’s /proc directory doesn’t present all that much from a simple ls. Beyond the user and group the process runs as, there’s not a lot of information at all. The timestamps on the files are the time the process was launched. Also unlike Linux, none of those files can be usefully accessed with simple tools like cat. Instead of just catting things, you’re expected to use tools like pmap or pstack, which are consumers of these binary structures. This is fine, and if we were writing instrumentation in C, as nature intended, it would be trivially easy to read and use them.
But, I’m writing a collector for Diamond, which means Python. And, as this article will surely show, I’m not much of a Python programmer.
One of the things I like about /proc is the careful use of file permissions. Look at the example above and consider:
$ ls -ld /proc/$$
dr-x--x--x 5 rob sysadmin 928 Mar 3 14:15 /proc/2208
Only a process owner can list the directories, and see other sensitive information. But the “other” execute bit is set on the directory, and some files are world-readable. So if we’re even a little bit clever, our unprivileged Diamond process should be able to see all system processes.
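As a quick check of that claim, here’s a throwaway sketch (mine, not the collector’s) which walks /proc as an unprivileged user and reports any psinfo or usage file it can’t open:

import os

# Every psinfo and usage file is mode 444, so these opens should all succeed,
# even for processes owned by other users.
for pid in sorted(os.listdir('/proc'), key=int):
    for f in ('psinfo', 'usage'):
        p = os.path.join('/proc', pid, f)
        try:
            open(p, 'rb').close()
        except IOError as e:
            print('cannot read %s: %s' % (p, e))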
First, I needed to decide what information I wanted. /proc gives way more detail, particularly on LWPs, than I’m interested in. What I want, at least for now, is a prstat-style thing which reports the CPU and memory consumption of running processes. I’d like to be able to aggregate and filter that on process name, PID, and the zone in which the process runs.
After a bit of thought I decided to put the process name in the metric path, then to tag each point with the PID of the process and its zone. Tagging by PID forces Wavefront to separate multiple processes with the same executable name, but it’s trivial to wrap them in a sum(), or group them by metric path if need be. (There’s an argument for dumping the whole lot on the same path and having the exec name also be a tag, but this seemed more natural to me, and I was worried about having too many tags on the same path.) When I first started writing my Solaris collectors, I put the zone name in the metric path, but it never quite seemed right having to force global into things where it didn’t really belong. So, zone name is always a tag now. More tags. Tags are good. It’s possible that the PID tag could end up being of too high cardinality, but I don’t see much alternative. We’ll see how it goes.
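For illustration only (the names here are invented), a point sent under that scheme ends up looking roughly like this in Wavefront’s native format:

process.java.pr_rssize 734003200 source=myhost pid="4059" zone="www"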
The proc(4) man page gives you the layout of the /proc data structures, and they’re also in /usr/include/sys/procfs.h, with a little bit more annotation. From that file, here’s psinfo:
#define PRARGSZ 80 /* number of chars of arguments */
typedef struct psinfo {
int pr_flag; /* process flags (DEPRECATED; do not use) */
int pr_nlwp; /* number of active lwps in the process */
pid_t pr_pid; /* unique process id */
pid_t pr_ppid; /* process id of parent */
pid_t pr_pgid; /* pid of process group leader */
pid_t pr_sid; /* session id */
uid_t pr_uid; /* real user id */
uid_t pr_euid; /* effective user id */
gid_t pr_gid; /* real group id */
gid_t pr_egid; /* effective group id */
uintptr_t pr_addr; /* address of process */
size_t pr_size; /* size of process image in Kbytes */
size_t pr_rssize; /* resident set size in Kbytes */
size_t pr_pad1;
dev_t pr_ttydev; /* controlling tty device (or PRNODEV) */
/* The following percent numbers are 16-bit binary */
/* fractions [0 .. 1] with the binary point to the */
/* right of the high-order bit (1.0 == 0x8000) */
ushort_t pr_pctcpu; /* % of recent cpu time used by all lwps */
ushort_t pr_pctmem; /* % of system memory used by process */
timestruc_t pr_start; /* process start time, from the epoch */
timestruc_t pr_time; /* usr+sys cpu time for this process */
timestruc_t pr_ctime; /* usr+sys cpu time for reaped children */
char pr_fname[PRFNSZ]; /* name of execed file */
char pr_psargs[PRARGSZ]; /* initial characters of arg list */
int pr_wstat; /* if zombie, the wait() status */
int pr_argc; /* initial argument count */
uintptr_t pr_argv; /* address of initial argument vector */
uintptr_t pr_envp; /* address of initial environment vector */
char pr_dmodel; /* data model of the process */
char pr_pad2[3];
taskid_t pr_taskid; /* task id */
projid_t pr_projid; /* project id */
int pr_nzomb; /* number of zombie lwps in the process */
poolid_t pr_poolid; /* pool id */
zoneid_t pr_zoneid; /* zone id */
id_t pr_contract; /* process contract */
int pr_filler[1]; /* reserved for future use */
lwpsinfo_t pr_lwp; /* information for representative lwp */
} psinfo_t;
and here’s usage:
typedef struct prusage {
id_t pr_lwpid; /* lwp id. 0: process or defunct */
int pr_count; /* number of contributing lwps */
timestruc_t pr_tstamp; /* current time stamp */
timestruc_t pr_create; /* process/lwp creation time stamp */
timestruc_t pr_term; /* process/lwp termination time stamp */
timestruc_t pr_rtime; /* total lwp real (elapsed) time */
timestruc_t pr_utime; /* user level cpu time */
timestruc_t pr_stime; /* system call cpu time */
timestruc_t pr_ttime; /* other system trap cpu time */
timestruc_t pr_tftime; /* text page fault sleep time */
timestruc_t pr_dftime; /* data page fault sleep time */
timestruc_t pr_kftime; /* kernel page fault sleep time */
timestruc_t pr_ltime; /* user lock wait sleep time */
timestruc_t pr_slptime; /* all other sleep time */
timestruc_t pr_wtime; /* wait-cpu (latency) time */
timestruc_t pr_stoptime; /* stopped time */
timestruc_t filltime[6]; /* filler for future expansion */
ulong_t pr_minf; /* minor page faults */
ulong_t pr_majf; /* major page faults */
ulong_t pr_nswap; /* swaps */
ulong_t pr_inblk; /* input blocks */
ulong_t pr_oublk; /* output blocks */
ulong_t pr_msnd; /* messages sent */
ulong_t pr_mrcv; /* messages received */
ulong_t pr_sigs; /* signals received */
ulong_t pr_vctx; /* voluntary context switches */
ulong_t pr_ictx; /* involuntary context switches */
ulong_t pr_sysc; /* system calls */
ulong_t pr_ioch; /* chars read and written */
ulong_t filler[10]; /* filler for future expansion */
} prusage_t;
Between them, those two structures reveal everything I want. And, they’re both universally accessible. So how the heck do I read a binary struct in Python?
Ruddy Python
I’m not the biggest Python fan. I certainly don’t think there’s anything bad about it, but it’s never “clicked” with me in the way that, say, Ruby did. Writing Ruby, for me, is fun. Writing Python is work. Rather dry work. But, I plough on, hopeful that one day I’ll “get it” and enjoy Python too.
It’s straightforward to read a binary file into a variable, and once that’s done, you can use Python’s struct module to unpack the structure within. But, you need to know how to describe that structure.
In C we define data types as, say, char or int, and the compiler knows the sizes of those types. Python’s struct module requires you to define the incoming structures with characters, which are helpfully listed in a table.
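For instance, an int maps to 'i' and a uint_t to 'I'. Here’s a quick, self-contained round trip (not from the collector) just to show the mechanics:

import struct

# '=iI' describes a signed int followed by an unsigned int, in native
# byte order with no padding. Pack test values, then unpack them again.
raw = struct.pack('=iI', -42, 1234)
print(struct.unpack('=iI', raw))   # (-42, 1234)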
With basic types like ulong_t or int, it’s obvious what letter to use, but I had no idea what the underlying types of things like poolid_t were, so I spent a fair amount of time hunting through /usr/include/sys/types.h or running bits of C like this:
#include <stdio.h>
#include <sys/types.h>

int main() {
    printf("%zu\n", sizeof(dev_t));
    return 0;
}
timestruc_t took a bit of tracking down, but it looks like this:
typedef struct timespec { /* definition per POSIX.4 */
time_t tv_sec; /* seconds */
long tv_nsec; /* and nanoseconds */
} timespec_t;
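In other words, a timestruc_t is just a seconds count followed by a nanoseconds count. Here’s a little sketch of mine (not the collector’s code) that collapses one into a single nanosecond value, assuming the four-bytes-each 32-bit layout described by '=lL':

import struct

def ts_to_ns(raw):
    # raw is an 8-byte timestruc_t: tv_sec then tv_nsec
    (secs, nsecs) = struct.unpack('=lL', raw)
    return secs * 10**9 + nsecs

# round-trip a known value: 3s + 500,000,000ns -> 3,500,000,000ns
print(ts_to_ns(struct.pack('=lL', 3, 500 * 10**6)))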
So after a little while, I had a format string identifying the whole of the usage struct. psinfo was more challenging, as it includes a complex structure called lwpsinfo_t. I started trying to decode this, but eventually realised I didn’t want any of the information in it, so I could discard it by telling my initial read() operation to only read the file up to the point just before lwpsinfo_t began.
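If you don’t fancy counting bytes by hand, struct.calcsize() will tell you how many bytes a format string describes, which is a handy way of working out how much of psinfo to read() before pr_lwp starts. A trivial sketch, with a made-up format string:

import struct

# How many bytes does this (hypothetical) format describe?
# Four ints and two 8-byte timestruc_ts: 4*4 + 8 + 8 = 32
print(struct.calcsize('=iiii8s8s'))   # 32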
I thought it would be nice to make a Python dict of each of the /proc files I was interested in, so I ended up writing a “key” dict to describe them:
proc_parser = {
    'usage': {
        'fmt': '=ii8s8s8s8s8s8s8s8s8s8s8s8s8s8s13L',
        'keys': ('pr_lwpid', 'pr_count', 'pr_tstamp', 'pr_create', 'pr_term',
                 'pr_rtime', 'pr_utime', 'pr_stime', 'pr_ttime', 'pr_tftime',
                 'pr_dftime', 'pr_kftime', 'pr_ltime', 'pr_slptime',
                 'pr_wtime', 'pr_stoptime', 'pr_minf', 'pr_majf', 'pr_nswap',
                 'pr_inblk', 'pr_oublk', 'pr_msnd', 'pr_mrcv', 'pr_sigs',
                 'pr_vctx', 'pr_ictx', 'pr_sysc', 'pr_ioch'),
        'size': 172,
        'ts_t': ('pr_tstamp', 'pr_create', 'pr_term', 'pr_rtime', 'pr_utime',
                 'pr_stime', 'pr_ttime', 'pr_tftime', 'pr_dftime',
                 'pr_kftime', 'pr_ltime', 'pr_slptime', 'pr_wtime',
                 'pr_stoptime')
    },
    'psinfo': {
        'fmt': '=iiiiiiIIIIlLLLlHH8s8s8s16s80siills3siiiiiii',
        'keys': ('pr_flag', 'pr_nlwp', 'pr_pid', 'pr_ppid', 'pr_pgid',
                 'pr_sid', 'pr_uid', 'pr_euid', 'pr_gid', 'pr_egid',
                 'pr_addr', 'pr_size', 'pr_rssize', 'pr_pad1', 'pr_ttydev',
                 'pr_pctcpu', 'pr_pctmem', 'pr_start', 'pr_time', 'pr_ctime',
                 'pr_fname', 'pr_psargs', 'pr_wstat', 'pr_argc', 'pr_argv',
                 'pr_envp', 'pr_dmodel', 'pr_pad2', 'pr_taskid', 'pr_projid',
                 'pr_nzomb', 'pr_poolid', 'pr_zoneid', 'pr_contract'),
        'size': 232,
        'ts_t': ('pr_start', 'pr_time', 'pr_ctime'),
    },
}
The ts_t list is of all the timestruc_t fields. For my own convenience I decided to turn all those into simple Python floats.
Then a simple method which uses that information to return a dict which lets me easily access, say, pr_zoneid. (I’ve removed all the error handling for brevity and clarity.)
def proc_info(p_file, pid):
    parser = proc_parser[p_file]
    p_path = path.join('/proc', str(pid), p_file)

    # Read only as much of the file as the format string describes
    raw = file(p_path, 'rb').read(parser['size'])
    ret = dict(zip(parser['keys'], struct.unpack(parser['fmt'], raw)))

    # Collapse every timestruc_t field into a single nanosecond value
    for k in parser['ts_t']:
        (s, n) = struct.unpack('lL', ret[k])
        ret[k] = (s * 1e9) + n

    return ret
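Using it looks something like this. (A sketch of mine, Python 2 flavoured like the collector itself; the field names are real, the loop is just for show.)

import os

# Print the exec name, zone ID and RSS (in KB) of every process we can see
for pid in os.listdir('/proc'):
    ps = proc_info('psinfo', pid)
    name = ps['pr_fname'].rstrip('\0')   # fixed-width C string: strip the NUL padding
    print name, ps['pr_zoneid'], ps['pr_rssize']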
I put all of this into a library, and went on to write the collector.
Collector
The collector loops over every process in /proc, reading both psinfo and usage and turning them into a single dict. The person configuring Diamond is able to select any number of keys from that dict and have them made into metrics under the process namespace. Simple!
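Stripped of the Diamond plumbing, the heart of it is a loop along these lines. (A sketch: publish() and zone_name() are stand-ins for Diamond’s publish call and the zone lookup described below, and the selected keys are made up.)

import os

wanted = ('pr_rssize', 'pr_utime', 'pr_stime')   # whatever the config asks for

def collect():
    for pid in os.listdir('/proc'):
        info = proc_info('psinfo', pid)
        info.update(proc_info('usage', pid))
        name = info['pr_fname'].rstrip('\0')
        tags = {'pid': pid, 'zone': zone_name(info['pr_zoneid'])}
        for key in wanted:
            publish('process.%s.%s' % (name, key), info[key], tags)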
Except, of course, nothing’s ever simple. A likely use for the new proc collector is to show the busiest processes on a host, and just how busy they are. Refer back to the usage structure, and you can see that processes do keep track of that, using all those timestruc_t pr_*times. But they’re cumulative, so the collector has to do some sums. Diamond gives you a class variable, last_value, to memoize things between runs, and using that and our simple-to-work-with nanosecond times, it’s easy to work out the time each process spends on-CPU, literally, and as a percentage of available time. Here’s the system + kernel CPU usage for the Java processes on one of my hosts.
If you hover over that chart, you’ll see the zone tags. These are attached to every metric produced by the proc collector, using the pr_zoneid value from psinfo. This is the first field you see when you do zoneadm list -cv; it’s numeric, and it’s not that meaningful. (If you reboot a zone it will likely get a different ID.) At the beginning of each collector run I shell out to zoneadm and generate a map of zone ID to name. When it’s time to send the point, I look up the pr_zoneid value in the map, and tag with the name.
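Roughly, those two pieces come out like this. (Another sketch, not the collector’s code: I’m assuming zoneadm list -p prints colon-separated lines starting id:name, and a plain last dict stands in for Diamond’s last_value memo.)

import subprocess

def zone_map():
    # Map numeric zone IDs to zone names, e.g. {0: 'global', 14: 'www'}
    zmap = {}
    for line in subprocess.check_output(['zoneadm', 'list', '-p']).splitlines():
        fields = line.split(':')
        zmap[int(fields[0])] = fields[1]
    return zmap

def cpu_pct(usage, last):
    # usage and last are proc_info('usage', pid) dicts from consecutive runs.
    # Every field is already in nanoseconds, thanks to the ts_t conversion.
    interval = usage['pr_tstamp'] - last['pr_tstamp']
    on_cpu = ((usage['pr_utime'] + usage['pr_stime']) -
              (last['pr_utime'] + last['pr_stime']))
    return 100.0 * on_cpu / interval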
Now on to memory, where I chose to follow prstat(1), and offer the RSS (resident set size) and SIZE columns. The latter is defined in the man pages as “The total virtual memory size of the process, including all mapped files and devices”. We can get the values from the psinfo struct’s pr_rssize and pr_size fields.
The prstat man page tells us that memory sizes are displayed in Kb. I prefer to send all my metrics as bytes, and let Wavefront handle the prefixes. I’ve taken a hardline “always bytes” policy across all my collectors, even if the standard tooling uses a different unit. But, you know, that K can mean different things to different people. Checking the source, it seems that Sun chose to use “proper” K and M, not this Ki and Mi nonsense. So, we multiply the raw figure by 1024.
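So the final step is nothing more exciting than this (a sketch, reusing the psinfo dict from earlier):

# pr_rssize and pr_size come back in 1024-based kilobytes; send bytes
rss_bytes = psinfo['pr_rssize'] * 1024
size_bytes = psinfo['pr_size'] * 1024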
This seemed to work fine. I wrote and tested it on SmartOS, but I also run a couple of Solaris machines. When I dropped the code on to those, some of the processes showed zero memory usage. At first it seemed arbitrary: of two Java processes, one reported its memory correctly, the other didn’t. I couldn’t work it out and I started digging: I DTraced prstat to see exactly what it did and, so far as I could tell, it was doing the same as my code. I read through the prstat source. (The Illumos source is mostly very easy to follow, and the block comments are superb.) The more I looked, the more baffled I was. Everything was correct, I was certain, but for the unavoidable fact it didn’t work.
Eventually I gave up, and asked for help on the illumos-discuss mailing list. In next to no time, an Illumos kernel dev had pointed me at the code which zeros out the pr_size field if a 32-bit process tries to examine a 64-bit one. And sure enough, on SmartOS:
$ file -b `which python2.7`
ELF 64-bit LSB executable AMD64 Version 1, dynamically linked, not stripped, no
debugging information available
and on Solaris:
$ file -b `which python`
ELF 32-bit LSB executable 80386 Version 1 [SSE FXSR FPU], dynamically linked, not stripped
Mystery solved. Self kicked.
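One cheap safeguard is to check the data model of the interpreter itself at collector start-up. (A sketch of mine; the warning text is made up.)

import struct
import sys

# A 32-bit reader can't see the sizes of 64-bit processes, so warn early
if struct.calcsize('P') * 8 == 32:
    sys.stderr.write('32-bit Python: pr_size and pr_rssize will read as zero '
                     'for 64-bit processes\n')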
At this point I decided to write a script to make an “Omnibus” style package to deploy Diamond. This builds and bundles together a 64-bit Python, my own Wavefront-enabled fork of Diamond, all Diamond’s dependencies, and my SunOS collectors.
The Tenant Collector
So far I’d developed everything with a view to running in a global zone (easier done on Solaris than on SmartOS) and collecting metrics on the system and on individual zones all from one place. Doing as much as possible from the global is a lesson I learnt early in my zoning days (2007, I think). Back then I wrote acres of Korn shell to dynamically probe whatever appeared under /zones, and run zlogin loops over the output of zoneadm list, running Nagios-style check scripts. That looping survives in these collectors, though the zlogin stuff is much less necessary on SmartOS due to zone-aware extensions to the svc commands, and zone-level kstats. Solaris needs to catch up in these areas, assuming it continues to exist at all.
But I have stuff, like the site you’re reading, which runs in the Joyent Public Cloud. I’m in zones there, and you have a different view of the system from inside a zone. Some metrics are invisible, others are meaningless. So I got copying and pasting, and put together a collector tuned to run in a resource-capped SmartOS zone. The README explains the available metrics, so if you’re interested, read that.
Telegraf and the Future
Diamond is fine. It’s a reasonably active open-source project, and it feels a bit of a step up from CollectD. But now we have Telegraf, and that is far more modern than Diamond.
When I first tried Telegraf, it wouldn’t build on Solaris. Now it will, but, as was the case with Diamond, none of the OS-related plugins work at all. So I’ve started porting my Diamond collectors to Telegraf. There are kstat bindings for Go, and I’m using those. Once I have everything ported over, I might look at trying to write a libscf binding (though I suspect I’m over-reaching myself there), so I can monitor SMF without having to shell out. I also haven’t really explored DTrace metric collection in the way I wanted to, and there are Go libusdt bindings just waiting for me. It’s a little slow going though, as I’m having to learn Go and its toolchain as I go.
Having a single binary is a big advantage when deploying to a SmartOS global zone. Getting a Python environment built and running was messy, and feels like a big fat hack.