Categories
Monitoring Network Projects & Hacking

wraprancid and RANCID 3.x

Jethro R Binks’ excellent wraprancid script allows you to bring in configurations (and pretty much anything else that can be text) without having to get involved in writing a new ?rancid/?login combination for your device. That avoids some pretty hairy Perl and Tcl code, so it’s definitely a Good Thing! It’s also useful for devices that don’t even have a command line, but might allow you to fetch their config from a web page, or over TFTP.

The trouble is, RANCID changed the way it deals with device types between RANCID 2.x and RANCID 3. It changed in a good way, so that the patches to rancid-fe that tools like wraprancid required are no longer necessary. What was previously hard-coded in the source of rancid-fe is now a proper configuration file (rancid.types.base), with a second config file (rancid.types.conf) for you to add your own types to. Here’s how to get wraprancid working with RANCID 3.x:

First, I’m assuming you have a working wrapplugin script. Here’s one I use to fetch the config from Asterisk servers.

#!/opt/perl/bin/perl -w
#
#######################################################
# Modules
#######################################################

# Load any modules needed
use strict;
use Getopt::Std;
use Net::SSH::Perl;

#######################################################
# Variables
#######################################################

# Initialize variables used in this script

my $debug = 0;

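my %options = ();
getopts('df:', \%options);
my $file = $options{'f'};
my $fh;
my $host = $ARGV[0];

# Both the output file (-f) and the target host are required.
die "Usage: $0 [-d] -f outputfile host\n" unless defined $file && defined $host;

$debug = $options{'d'} ? 1 : 0;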

print STDERR "to host: $host\n" if $debug;

my $ssh = Net::SSH::Perl->new($host, protocol => '2,1', debug => $debug );

print STDERR "made ssh obj\n" if $debug;
$ssh->login("root");

print STDERR "login\n" if $debug;
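# The command sent here is a placeholder: the server's authorized_keys
# entry forces its own command (see the Bonus Tip below).
my ($stdout, $stderr, $exit) = $ssh->cmd("true");
$stdout = '' unless defined $stdout;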
print STDERR "got output\n" if $debug;

# Open the output file.
open($fh, ">", $file) or die "Cannot open output file $file: $!\n";
print $fh "#RANCID-CONTENT-TYPE: wrapper.asterisk\n#\n";

print $fh $stdout;
print STDERR "wrote output of ". length($stdout)." bytes\n" if $debug;

#######
# End #
#######
close($fh);
print STDERR "done\n" if $debug;

That lives in ~rancid/bin/asterisk.wrapplugin, just as it did in version 2.

Then, in ~rancid/etc/rancid.types.conf, we’ll define a new device type called wrapper-asterisk:

wrapper-asterisk;script;wraprancid -s asterisk.wrapplugin
wrapper-asterisk;login;clogin

(I don’t think the login script matters, as it’s never used, but it must be specified to keep RANCID happy.)

And finally in the router.db, you can put your actual device:

asterisk-sipgateway;wrapper-asterisk;up;

That’s it. You can repeat for whichever other scripts you need to do this for.
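If you end up with several plugins, the type definitions get repetitive. Here is a minimal sketch that prints the two rancid.types.conf lines for every plugin in a directory. It assumes each plugin is named <name>.wrapplugin and should map to a type called wrapper-<name>, as in the Asterisk example; the function name gen_wrapper_types is my own invention, not part of RANCID or wraprancid.

```shell
#!/bin/sh
# Print rancid.types.conf entries for every *.wrapplugin in a directory.
# Assumption: <name>.wrapplugin should become device type wrapper-<name>.
gen_wrapper_types() {
    plugin_dir=$1
    for plugin in "$plugin_dir"/*.wrapplugin; do
        # Skip the unexpanded glob when the directory holds no plugins.
        [ -e "$plugin" ] || continue
        file=${plugin##*/}          # e.g. asterisk.wrapplugin
        name=${file%.wrapplugin}    # e.g. asterisk
        printf 'wrapper-%s;script;wraprancid -s %s\n' "$name" "$file"
        printf 'wrapper-%s;login;clogin\n' "$name"
    done
}
```

Running gen_wrapper_types ~rancid/bin and reviewing the output before pasting it into rancid.types.conf is probably safer than appending blindly.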

Bonus Tip

The Asterisk end of the script above works like this: we use SSH public key authentication to connect to the server, and then in ~root/.ssh/authorized_keys, there is a line like this:

command="/usr/sbin/asterisk -V; echo 'extensions.conf'; cat /etc/asterisk/extensions.conf; echo 'sip.conf'; cat /etc/asterisk/sip.conf; echo 'iax.conf';cat  /etc/asterisk/iax.conf",from="myrancidhost" ssh-dss AAAAB3NzaC174ENozlUVBe5hH32Wy/duAJt1b4nWbVPoW1GP/koSZNv3888s3fx23nEpLMJxispulA== rancid@myrancidhost

That way, a user authenticating with that particular key doesn’t get a shell; they just get the output from a series of cat commands, and are then disconnected. They must also be connecting from the RANCID server.
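The forced command gets unwieldy as the list of files grows, so one option is to move it into a small script on the Asterisk box and point command="..." at that instead. This is only a sketch: the helper name dump_labelled and the script path mentioned in the comments are hypothetical, and the output format simply mirrors the echo/cat pairs above.

```shell
#!/bin/sh
# Print each file preceded by a line holding its base name, mirroring the
# echo/cat pairs in the authorized_keys command.
dump_labelled() {
    for path in "$@"; do
        printf '%s\n' "${path##*/}"
        cat "$path"
    done
}

# In the real script (hypothetical path /usr/local/sbin/rancid-dump) you
# would finish with something like:
#   /usr/sbin/asterisk -V
#   dump_labelled /etc/asterisk/extensions.conf /etc/asterisk/sip.conf /etc/asterisk/iax.conf
```

The authorized_keys entry would then carry command="/usr/local/sbin/rancid-dump" (again, a made-up path) alongside the same from= restriction.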

So, now we have Asterisk in the same version control system as our network gear. You can use a similar setup for things like BSD ipfw-based firewalls, or Quagga routers.

Categories
Monitoring Network Tech Uncategorized

RANCID, ssh, Cisco MDS and “too many authentication failures”

I just ran into this, and it took a little while to figure out, so here’s my quick note. If you have a Cisco MDS being backed up by RANCID, then you can get the following odd message, even if it’s the first time you tried to log in with this user:

Received disconnect from 10.0.7.5: 2: Too many authentication failures for confbackup

What is happening is that the ssh client first offers each of the public keys it has configured, and only then falls back to the password-based auth that you thought it was doing all along. Each rejected key counts as a failed attempt, so with a few keys, that’s enough to annoy the MDS into closing the connection.
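If you want to confirm that this is what is going on, you can count how many keys the client offers before it gets anywhere near the password. A rough sketch, assuming an OpenSSH client whose -v output logs one "Offering ... public key" line per attempt (the exact wording differs between versions); count_offered_keys is just a name I made up.

```shell
#!/bin/sh
# Count the public keys an OpenSSH client offered, reading `ssh -v` debug
# output on stdin. Assumption: one "Offering ... public key" line is logged
# per key tried; the wording varies between OpenSSH versions.
count_offered_keys() {
    grep -c -E 'Offering .*public key'
}
```

Something like ssh -v -o BatchMode=yes confbackup@10.0.7.5 2>&1 | count_offered_keys should then print the number of keys tried; BatchMode stops the client prompting for a password, so it just fails once the keys run out.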

The solution is to disable public-key auth for this connection. To do that with RANCID requires a little bit of extra work. First, create a shell script (I call mine /opt/rancid/local/ssh-no-pubkey):

#!/bin/sh

# "$@" passes each argument through intact, even if it contains spaces
exec ssh -o PubkeyAuthentication=no "$@"

Then for the devices that are suffering, tell RANCID to use this new SSH command instead of just ‘ssh’. In .cloginrc:

add sshcmd mds01 {/opt/rancid/local/ssh-no-pubkey}

Now RANCID can log in and back up the config fine.

Additional tip – the ‘cisco’ device type seems to work better than the (theoretically correct) ‘cisco-nx’ device type for MDS switches.

Categories
Monitoring Network Tech

RANCID on Speed

I like RANCID a lot, and this is the first time I’ve found a presentation from someone else about the kind of things I like to do with RANCID.

http://www.denog.de/meetings/denog2/pdf/010-Stoegbauer-RANCID_on_Speed.pdf

RANCID is pretty handy by itself – allowing you to actually know that the config for customer X hasn’t been changed in months, or verify that changes happening actually have corresponding change control tickets, as well as simply having a backup of everything and an automatic inventory (need all the serial numbers of all the WS-X6748-GE-TX cards in your network? It’s just a grep away). Since it all goes into version control (Subversion or CVS), you can do all this for last week, or last year, too. Useful for when your maintenance contract still lists the original serial number for that module that got RMAed 6 months ago, instead of the new one.
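To make the "just a grep away" claim concrete, here is a sketch that lists serial numbers for one part number across a directory of saved configs. It assumes the inventory comments look like "!Slot 1: part WS-X6748-GE-TX, serial SAL00000001"; check what your device type actually writes and adjust the pattern. The function name and the serial in that example are made up for illustration.

```shell
#!/bin/sh
# List "file: serial" for every module of a given part number.
# Assumption: inventory lines look like
#   !Slot 1: part WS-X6748-GE-TX, serial SAL00000001
module_serials() {
    part=$1
    config_dir=$2
    # /dev/null forces grep to print file names even for a single file;
    # stderr is dropped to hide complaints about subdirectories such as CVS.
    grep -F "part $part," "$config_dir"/* /dev/null 2>/dev/null |
        sed -n 's/^\([^:]*\):.*serial \([^ ,]*\).*$/\1: \2/p'
}
```

Pointed at a group's configs directory, module_serials WS-X6748-GE-TX path/to/configs prints one line per matching module. Check out an older revision first and you get the same answer for last year.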

Internally, we have tools based around the Net::Netblock and Cisco::Reconfig Perl modules, plus a bunch of hacks, that generate things like hourly-updated, always-accurate maps of what VLANs are in use where, what IP ranges are in use where, by VRF, which devices have an interface in that subnet on that VLAN and so on, all from the configs collected by RANCID.

If you have a network of more than a few devices, and especially if you need to suddenly start answering “compliance” kinds of questions (where are your backups? can you prove they are regular? can you show the last change on that device? can you regularly verify that all devices have telnet disabled?) then you really should spend that afternoon setting it up. You’ll feel better for it.
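The telnet question is a good example of how cheap these checks become once the configs are sitting on disk. A minimal sketch, assuming IOS-style configs where telnet is permitted by a "transport input" line naming telnet or all. It won't catch vty lines that rely on a platform default, and the function name is my own.

```shell
#!/bin/sh
# Print the names of saved configs that still permit telnet on a vty line.
# Assumption: IOS-style syntax, e.g. "transport input telnet",
# "transport input all", or a list that includes telnet.
telnet_enabled_configs() {
    config_dir=$1
    # stderr is dropped to hide complaints about subdirectories.
    grep -l -E '^ *transport input .*(telnet|all)' "$config_dir"/* 2>/dev/null
}
```

Run from cron against the configs directory, an empty result is your regular evidence; anything it prints is a device to go and fix.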

Categories
Monitoring Network Tech

Why do you make me have to hurt you, baby?

Everyone knows that #monitoringsucks, but does it have to suck this much?

An organisation I deal with uses CA’s Nimsoft monitoring system. It has a neat architecture, with a hierarchy of hubs, each collecting data and funnelling it back up to the central site. A central management console lets you configure any of the hubs. The hubs include an SSL VPN, so monitoring traffic can traverse NATs and live in conflicting address spaces inside customer networks. There are a bunch of application-specific plugins for enterprisey things like SQL and ERP apps. Config and new plugins are pushed out from the centre to hubs when necessary, and alerting/reporting goes back up the same pipe, so you don’t need to punch a firewall full of holes. Pretty cool, right?

Here’s why it sucks though:

  • It’s incredibly slow. Like, “go and do something else while waiting for a window to open” slow. I don’t know why.

  • Each probe plugin provides its own UI, and uses its own individual config file. So you might have a central management console, but you are centrally managing dozens of individual probes with separate islands of config. Because of that:

    • some probes have templates, some don’t
    • some probes allow arranging targets into groups, or folders, some don’t
    • any probe that uses SNMP or other credentials has its own record of the credentials – if you have a Cisco switch and want to monitor general health, interface stats and some other special OID, that’s three different probes to configure, per device. Four, if you want to receive traps back from it.
    • sometimes, the same credentials work in one probe, but fail in another, on the same system!
    • basic UI features, like not having the tab-order through dialogs be completely random, are missing

The “separate config files” thing is supposed to make it easy to roll out a “standard” config across a series of customers, which it might, but it makes dealing with the tool after deployment really painful.

Previously, I’ve used Cacti to do this particular piece of monitoring. Cacti has two plugins, Autom8 and Thold, that allow me to:

  • Add a new device, and apply a Host Template to it – that pulls in the relevant SNMP variables for this device
  • Wait for Autom8 to add all the graphs for me
  • Apply threshold templates that give standard alerting for all those new graphs
  • That’s it

With a CLI to bulk-add devices, I can have a DC full of switches under monitoring, with graphs and alarms in about 20 minutes. Even without the (standard, documented) CLI, it takes a minute to add a new device. The only thing I can’t do is distribute the polling to hubs, for customer networks or for general performance and efficiency. A simple one-line cron job will tell Cacti to rescan for new interfaces on a device as often as I like.

The only other part I need to do manually, currently, is periodically re-apply the Thold templates to pick up new interfaces for checking error-counters. Autom8 doesn’t talk to Thold, unfortunately.

How do large companies manage to make simple tasks so complicated? It’s like nobody actually tried to use this for a normal installation, while imagining they had a real job they were supposed to be doing as well. Configuring monitoring shouldn’t be a career choice, for things like switch error rates, should it?

Nimsoft isn’t alone – last time I looked for a distributed, SME-scale monitoring tool that understood that two devices might have the same IP address in different networks, and had a central management console, the competition was mostly worse. Usually it had better UI, often much better, but didn’t really have central management, just a central reporting console – you go to each remote hub to actually do the configuration.

How can this stuff be so hard?