apache logging awk user-agent access-log

Finding blank and spoofed user agents in access logs

I'm trying to find blank user agents and traces of spoofed user agents in my Apache access logs.

Here's a typical line from my access log (IP and domain redacted):

x.x.x.x - - [10/Nov/2012:16:48:38 -0500] "GET /YLHicons/reverbnation50.png HTTP/1.1" 304 - "http://www.example.com/newaddtwitter.php" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/534.7 ZemanaAID/FFFF0077"

For blank user agents I'm trying to do this:

awk -F\" '($6 ~ /^-?$/)' /www/logs/www.example.com-access.log | awk '{print $1}' | sort | uniq

To get information about the UAs I'm running this, which gives me the number of hits for each unique UA:

awk -F\" '{print $6}' /www/logs/www.example.com-access.log | sort | uniq -c | sort -fr

What can I do differently to make these commands more robust and better thought out, so that they give me the best information I can get to combat bots and other scum of the Internet?

Solution

I wouldn't use \" as a field separator. The combined log format is regular enough that if you split on whitespace, field 12 is the start of your user agent, with its opening quote still attached. If $12 is "" (two quote characters) or "-" (a quoted hyphen), the user agent is blank.
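
As a rough sketch of that test, here is the blank user agent check rewritten to split on whitespace. The sample log lines and the blank_ua_ips function name are made up for illustration. It also assumes the first eleven fields contain no embedded spaces (an unusual request line or referer would shift the field numbers).

```shell
# Sample lines in combined log format (invented for this example).
log=$(mktemp)
cat > "$log" <<'EOF'
192.0.2.1 - - [10/Nov/2012:16:48:38 -0500] "GET /a.png HTTP/1.1" 200 512 "http://www.example.com/" "Mozilla/5.0 (Windows NT 5.1)"
192.0.2.2 - - [10/Nov/2012:16:48:39 -0500] "GET /b.png HTTP/1.1" 200 512 "-" ""
192.0.2.3 - - [10/Nov/2012:16:48:40 -0500] "GET /c.png HTTP/1.1" 200 512 "-" "-"
192.0.2.2 - - [10/Nov/2012:16:48:41 -0500] "GET /d.png HTTP/1.1" 200 512 "-" ""
EOF

# Print each client IP once if its user agent field is "" or "-".
# With whitespace splitting, $12 still carries its surrounding quotes.
blank_ua_ips() {
  awk '$12 == "\"\"" || $12 == "\"-\"" { print $1 }' "$1" | sort -u
}

blank_ua_ips "$log"
```

With the sample above this lists 192.0.2.2 and 192.0.2.3, and leaves out the client that sent a real user agent.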

Remember that awk can accept standard input. So you can have "live" monitoring of your Apache log with:

$ tail -F /path/to/access.log | /path/to/awkscript

Just remember that when invoked this way, an awk script will never reach its END. But you can process lines as they are added to the log by Apache.
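
A minimal sketch of that per-line style, assuming an awk that supports fflush() with no arguments (gawk, mawk and the one-true-awk all do, as far as I know). The watch_blank_ua name is just illustrative. All the work happens in the main rule, so nothing depends on END being reached.

```shell
# Flag blank user agents as lines arrive; act in the main rule, not in END.
watch_blank_ua() {
  awk '$12 == "\"\"" || $12 == "\"-\"" {
    print "blank UA from " $1
    fflush()   # push output out now instead of waiting for a full buffer
  }'
}

# Live use would look like: tail -F /path/to/access.log | watch_blank_ua
# Finite input, for trying it out:
printf '%s\n' '192.0.2.9 - - [10/Nov/2012:16:48:38 -0500] "GET / HTTP/1.1" 200 512 "-" "-"' | watch_blank_ua
```

The fflush() call matters mostly when the output is itself piped somewhere, since awk may otherwise hold lines in its buffer for a long time on a quiet log.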

Something like this might help. Add to it as you see fit.

#!/usr/bin/awk -f

BEGIN {
  mailcmd="Mail -s \"Security report\" webmaster@example.com";
}

# Detect empty user-agent ($12 keeps its surrounding quotes)
$12 == "\"\"" || $12 == "\"-\"" {
  report=report "Empty user agent from " $1 "\n";
}

# Detect image hijacking ($11 is the referer, including its opening quote)
$7 ~ /\.(png|jpg)$/ && $11 !~ /^"http:\/\/www\.example\.com\// {
  report=report "Possible hijacked image from " $1 " (" $11 " -> " $7 ")\n";
}

# Detect too many requests per second from one host
thissecond != $4 {
  delete count;
  thissecond=$4;
}
{
  # Only the current host's counter changed, so only that one needs checking.
  if (++count[$1] > 100) {
    report=report "Too many requests from " $1 "\n";
    delete count[$1];  # Avoid too many reports
  }
}

# Send report, if there is one
length(report) {
  print report | mailcmd;    # Pipe output through a command.
  close(mailcmd);            # Closing the pipe sends the mail.
  report="";                 # Blank the report, ready for next.
}

Note that counting requests within a particular second is only marginally helpful; if you have a lot of traffic from China, or university/corporate networks behind firewalls, then many requests might appear to come from a single IP address. And the Mail command isn't a great way to handle notifications; I include it here only for demonstration purposes. YMMV, salt to taste.
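
If you want to experiment with the per-second counter before wiring up any notifications, something like the following might do. It is only a sketch: the rate_check function and its limit argument are my own illustrative names, and the sample lines are invented. It prints to standard output instead of sending mail.

```shell
# Report any IP that exceeds `limit` requests within one timestamp second.
# `limit` is an illustrative variable, not part of the script above.
rate_check() {
  awk -v limit="$1" '
    $4 != thissecond { split("", count); thissecond = $4 }  # new second: reset counters
    ++count[$1] > limit {
      print "Too many requests from " $1 " at " substr($4, 2)
      delete count[$1]    # avoid repeating the report within this second
    }'
}

# Four requests from one host and one from another, all in the same second.
{
  for i in 1 2 3 4; do
    echo '192.0.2.7 - - [10/Nov/2012:16:48:38 -0500] "GET / HTTP/1.1" 200 512 "-" "-"'
  done
  echo '192.0.2.8 - - [10/Nov/2012:16:48:38 -0500] "GET / HTTP/1.1" 200 512 "-" "-"'
} | rate_check 3
```

Passing the limit on the command line makes it easy to tune the threshold against an old log until the report volume looks sane.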