Archive

Archive for June, 2012

Ping Is Too Pessimistic

June 23rd, 2012

There are many billions of packets flying across the Internet every single second. The fact that a packet can get from one host in one part of the world to another host in another part of the world in a matter of milliseconds is absolutely amazing. The ping utility has long been used as a measure of checking if this awesomeness works… but unfortunately, it is very pessimistic.

To think that I can send 56 bytes of non-sense data from my machine to any Internet connected machine on the planet and have ping tell me “0% packets lost” seems rather depressing. Instead, ping should be happily exclaim “100% packets found!”

I wrote a little patch that makes the ping utility a happier utility program and the user benefits from seeing just how awesome the Internet is (unless of course, some packets were not found). ping should be a “glass is half-full” kinda program if you ask me.

eric@lolbuntu:/tmp/iputils-20071127.new$ sudo ./ping -c 3 www.google.com
PING www.l.google.com (74.125.45.147) 56(84) bytes of data.
64 bytes from yx-in-f147.1e100.net (74.125.45.147): icmp_seq=1 ttl=51 time=51.5 ms
64 bytes from yx-in-f147.1e100.net (74.125.45.147): icmp_seq=2 ttl=51 time=50.0 ms
64 bytes from yx-in-f147.1e100.net (74.125.45.147): icmp_seq=3 ttl=51 time=49.7 ms

--- www.l.google.com ping statistics ---
3 packets transmitted, 3 received, 100% packets found, time 2002ms
rtt min/avg/max/mdev = 49.792/50.433/51.504/0.762 ms

Here’s the patch:

--- iputils-20071127/ping_common.c	2007-12-09 20:56:22.000000000 -0700
+++ iputils-20071127.new/ping_common.c	2012-06-23 00:16:47.838210690 -0600
@@ -795,9 +795,9 @@
 	if (nerrors)
 		printf(", +%ld errors", nerrors);
 	if (ntransmitted) {
-		printf(", %d%% packet loss",
-		       (int) ((((long long)(ntransmitted - nreceived)) * 100) /
-			      ntransmitted));
+		printf(", %d%% packets found",
+		       (int) (100 - ((((long long)(ntransmitted - nreceived)) * 100) /
+			      ntransmitted)));
 		printf(", time %ldms", 1000*tv.tv_sec+tv.tv_usec/1000);
 	}
 	putchar('\n');

General

A Proper Post-Mortem

June 17th, 2012

There are three companies that I really enjoy doing business with. They are USAA, Amazon (on the consumer side), and Internap (on the tech side).

Let’s skip USAA, because it isn’t tech oriented and let’s look at Internap first. When there is a failure in Internap’s service, 9 times out of 10 they tell me before I realize it. Most of the time, these failures are transient and I would have never even known there was a problem had Internap not sent me an email giving me the info.

Here’s an email I got from Internap on May 30th, 2012:

At approximately 12:18 EDT on May 30, 2012 we were notified that the BGP session for our Verio provider in the ACS PNAP (Atlanta, GA) was in an active (down) state. The session recovered at 12:22 EDT and has been stable since that time.

During this time period, you may have noticed some sub-optimal routing and slight latency or packet loss as traffic destined for the Verio network was re-routed through our other providers in the PNAP. Once the session recovered, you may have noticed sub-optimal routing and slight latency again, as traffic was re-routed back onto Verio.

This type of outage is routine and isn’t a big deal. Internap lost an upstream provider at their PNAP. So what? I don’t really care. I pay them to deal with this and I experienced no downtime. But what happens when Internap themselves fucks up? We’ve had two major Internap outages. Each time, we’ve received a full RFO. One of them was an internal error from a sysadmin and another was a faulty Cisco command module. Most importantly, we received a full RFO (reason for outage) each time.

Now that we’ve moved to AWS, June 14th 2012′s outage RFO from Amazon makes me incredibly happy. From Amazon:

We would like to share some detail about the Amazon Elastic Compute Cloud (EC2) service event last night when power was lost to some EC2 instances and Amazon Elastic Block Store (EBS) volumes in a single Availability Zone in the US East Region.

This is the most beautiful thing I can imagine. They are not hiding the failures. They are admitting that it failed and they are giving both the reason why it failed and what they’re going to do to prevent future failures. The best part of this is that I didn’t have to wake up and deal with this all night. This is why IaaS is a good idea… as long as this communication continues.

General

Introduction To Syslog Log Levels/Priorities

June 7th, 2012

A very common question about syslog is how to decide the appropriate log priority (a.k.a. log level) for a specific log message. Deciding on the correct priority depends on a number of different factors.

Syslog allows you define a facility and a log level for each individual log message. The syslog “facility” is used to separate out log messages by application or by function. For example, email logs are normally logged on the LOG_MAIL facility. This groups all email related logs together. Your systems administrator will assign you a log facility to use. You should not use an arbitrary facility.

Log priorities are not as cut and dry. For custom applications, developers and systems administrators need to work together to define what constitutes a certain priority of a message. In the most generic terms possible, syslog levels are defined as:

       LOG_EMERG      system is unusable
       LOG_ALERT      action must be taken immediately
       LOG_CRIT       critical conditions
       LOG_ERR        error conditions
       LOG_WARNING    warning conditions
       LOG_NOTICE     normal, but significant, condition
       LOG_INFO       informational message
       LOG_DEBUG      debug-level message

Before digging into specifics on the priority definitions, let me address the developers directly. These priorities are defined in syslog.h as follows:

#define LOG_EMERG       0       /* system is unusable */
#define LOG_ALERT       1       /* action must be taken immediately */
#define LOG_CRIT        2       /* critical conditions */
#define LOG_ERR         3       /* error conditions */
#define LOG_WARNING     4       /* warning conditions */
#define LOG_NOTICE      5       /* normal but significant condition */
#define LOG_INFO        6       /* informational */
#define LOG_DEBUG       7       /* debug-level messages */

Normally, your application should call syslog() from some sort of internal logging function/method. This allows you to set an application-wide maximum log level. For example, on a production system, you do not want to waste CPU cycles generating debug error messages. Your application’s logging function should specify a maximum log level and filter the messages internally via some configuration variable.

For instance, in production, you may only care about messages of priority LOG_ERR and higher. So you would specify via some variable that your maximum logging level is LOG_ERR (3) and if a message comes into your logging function with a priority of 4 or greater, it is ignored.

The syslog server has the ability to filter messages of specific priorities as well. Systems administrators may choose to log only LOG_ERR or higher. So if the application is generating LOG_DEBUG error messages and the syslog server is only logging LOG_ERR or higher, this is just wasted processing time for both the application, the syslog server, and possibly even network I/O.

But the main question here is what is the most appropriate log level/priority for a particular message. Again, this is a hard question to answer because it can vary wildly by application, but in general, I would define them as such:

  1. LOG_EMERG – The application has completely crashed and is no longer functioning. Normally, this will generate a message on the console as well as all root terminals. This is the most serious error possible. This should not normally be used applications outside of the system level (filesystems, kernel, etc). This usually means the entire system has crashed.
  2. LOG_ALERT – The application is unstable and a crash is imminent. This will generate a message on the console and on root terminals. This should not normally be used applications outside of the system level (filesystems, kernel, etc).
  3. LOG_CRIT – A serious error occurred during application execution. Someone (systems administrators and/or developers) should be notified and should take action to correct the issue.
  4. LOG_ERR – An error occurred that should be logged, however it is not critical. The error may be transient by nature, but it should be logged to help debug future problems via error message trending. For example, if a connection to a remote server failed, but it will be retried automatically and is fairly self-healing, it is not critical. But if it fails every night at 2AM, you can look through the logs to find the trend.
  5. LOG_WARNING – The application encountered a situation that it was not expecting, but it can continue. The application should log the unexpected condition and continue on.
  6. LOG_NOTICE – The application has detected a situation that it was aware of, it can continue, but the condition is possibly incorrect.
  7. LOG_INFO – For completely informational purposes, the application is simply logging what it is doing. This is useful when trying to find out where an error message is occurring during code execution.
  8. LOG_DEBUG – Detailed error messages describing the exact state of internal variables that may be helpful when debugging problems.

So as an application developer, you may be asking yourself why you should not be using LOG_EMERG or LOG_ALERT. This is a valid question and this depends on you working with your systems administrator to determine if these log levels are appropriate. By default, almost every syslog implementation will log all LOG_EMERG and LOG_ALERT messages to the console which can make it difficult to actually work on a system to fix the problem if the log messages are flying by on the screen. Your systems administrator can set up filters on the syslog server to log them to the appropriate place, but before using those two priority levels, you should definitely consult with your systems administrator.

LOG_CRIT should be reserved for error messages that actually need to be visible to systems administrators and/or developers. If the error message you are logging will be ignored by everyone receiving the error, it should not be considered critical. Excuse the tautology, but “Critical errors are critical.” A critical error requires user intervention. If it does not require user intervention, it should be logged as LOG_ERR.

Priorities of LOG_WARNING and lower should be used at the developer’s discretion. It is common practice not to log any message with a priority lower than LOG_ERR on a production system.

General