Filtering Googlebot IPs from a list of IP addresses

Aug 03, 2018

I recently needed a simple script that picks the genuine Googlebot IPs out of a list of IP addresses, so that I could extract actual Googlebot visits from an access log.

Thankfully, Google provides a method to make sure a visitor is actually Googlebot.
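Part of that method is checking that the name returned by the reverse DNS lookup belongs to googlebot.com or google.com. Here is a minimal sketch of that check on its own, with no DNS involved. The function name is_google_hostname is my own invention for illustration, and the sample hostnames are only examples.

```shell
#!/bin/bash
# Hypothetical helper, shown only to illustrate the domain check.
# Succeeds when a hostname (as returned by a PTR lookup) is a subdomain
# of googlebot.com or google.com.
is_google_hostname() {
    local name="${1%.}"   # drop the trailing dot that `host` prints, if any
    case "$name" in
        *.googlebot.com|*.google.com) return 0 ;;
        *) return 1 ;;
    esac
}

is_google_hostname "crawl-66-249-66-1.googlebot.com." && echo "accepted"
is_google_hostname "crawl.googlebot.com.example.net." || echo "rejected"
```

The patterns require a dot before the domain, so a lookalike such as fakegooglebot.com does not pass. A matching name is still only half of the verification, since anyone can set an arbitrary PTR record for their own IP; the forward lookup is what confirms it.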

If you have a similar need, here you go:

#!/bin/bash
#
# Performs reverse and forward DNS lookups to list Googlebot's IPs, given a list
# of IP addresses as a file. Useful for filtering access logs to find out actual
# Googlebot visits.
#
# An implementation of https://support.google.com/webmasters/answer/80553?hl=en

while IFS='' read -r IP_ADDRESS || [[ -n "$IP_ADDRESS" ]];
do
    IS_GBOT=0
    # Reverse lookup: the PTR name must end in .google.com. or .googlebot.com.
    REV_LOOKUP="$(host "$IP_ADDRESS")"
    echo "$REV_LOOKUP" | grep -qE '\.(google|googlebot)\.com\.$' && IS_GBOT=1

    if [[ "$IS_GBOT" -eq 1 ]]; then
        # Forward lookup: the PTR name (field 5) must resolve back to the same IP (field 4)
        FWD_LOOKUP="$(host "$(echo "$REV_LOOKUP" | cut -d " " -f 5)" | cut -d " " -f 4)"
        if [[ "$FWD_LOOKUP" = "$IP_ADDRESS" ]];
        then
            echo "$IP_ADDRESS"
        fi
    fi
done < "$1"

Save it as something like filter-googlebot-ips.sh, make it executable, and pass it a file containing the IP addresses to filter, one per line, as its only argument:

$ ./filter-googlebot-ips.sh access-log-ips.txt > googlebot-ips.txt

This performs reverse and forward DNS lookups for each IP address and prints the verified Googlebot IPs to STDOUT, which you can redirect to a file as in the example above.
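If you are starting from the access log itself, the steps around the script could look roughly like this. It is only a sketch: the helper names unique_ips and lines_for_ips and the file names are placeholders of mine, and it assumes the client IP is the first whitespace-separated field of each log line, as in the common and combined log formats.

```shell
#!/bin/bash
# Hypothetical helpers around filter-googlebot-ips.sh.
# Assumption: the client IP is the first whitespace-separated field.

# Print the unique client IPs found in a log file ($1).
unique_ips() {
    awk 'NF { print $1 }' "$1" | sort -u
}

# Print the lines of a log file ($2) whose client IP is listed,
# one per line, in an IP file ($1).
lines_for_ips() {
    awk 'NR == FNR { ips[$1]; next } $1 in ips' "$1" "$2"
}

# Usage (file names are placeholders):
#   unique_ips access.log > access-log-ips.txt
#   ./filter-googlebot-ips.sh access-log-ips.txt > googlebot-ips.txt
#   lines_for_ips googlebot-ips.txt access.log > googlebot-visits.log
```

Deduplicating the IPs first keeps the number of DNS lookups down, which matters because each address costs at least one lookup.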

Hope it helps someone out there! 🙌

PS: Here is a GitHub Gist if you prefer that.