I recently needed a simple script to pick out Googlebot's IPs from a list of IP addresses, so that I could extract actual Googlebot visits from an access log.
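Thankfully, Google provides a method to verify that a visitor is actually Googlebot: a reverse DNS lookup on the IP address, followed by a forward DNS lookup on the returned hostname.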
If you have a similar need, here you go:
#!/bin/bash
#
# Performs reverse and forward DNS lookups to list Googlebot's IPs, given a list
# of IP addresses as a file. Useful for filtering access logs to find out actual
# Googlebot visits.
#
# An implementation of https://support.google.com/webmasters/answer/80553?hl=en

while IFS='' read -r IP_ADDRESS || [[ -n "$IP_ADDRESS" ]]; do
  IS_GOOGLEBOT=0
  REVERSE_LOOKUP="$(host "$IP_ADDRESS")"
  # Dots are escaped and a leading dot is required, so look-alike hosts such as
  # "fakegoogle.com." do not pass the check.
  echo "$REVERSE_LOOKUP" | grep -E "\.(google|googlebot)\.com\.$" > /dev/null && IS_GOOGLEBOT=1
  if [[ "$IS_GOOGLEBOT" -eq 1 ]]; then
    # Field 5 of the reverse lookup is the hostname; field 4 of the forward
    # lookup is the IP address that hostname resolves to.
    FORWARD_LOOKUP="$(host "$(echo "$REVERSE_LOOKUP" | cut -d " " -f 5)" | cut -d " " -f 4)"
    if [[ "$FORWARD_LOOKUP" = "$IP_ADDRESS" ]]; then
      echo "$IP_ADDRESS"
    fi
  fi
done < "$1"
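The hostname check is the part that is easiest to get subtly wrong, so here is a minimal sketch of it in isolation. The function name is_google_hostname is hypothetical and not part of the script above. It only tests a hostname string and does no DNS lookups, so it can be tried without network access.

```shell
#!/bin/bash
# Hypothetical helper (an assumption, not part of the original script):
# succeeds only when the given hostname is google.com, googlebot.com,
# or a subdomain of either.
is_google_hostname() {
  local hostname="${1%.}"  # drop the trailing dot that `host` prints
  # The domain must sit at the start of the name or right after a dot,
  # and must run to the end of the name.
  local pattern='(^|\.)(google|googlebot)\.com$'
  [[ "$hostname" =~ $pattern ]]
}
```

For example, is_google_hostname "crawl-66-249-66-1.googlebot.com." succeeds, while is_google_hostname "fakegoogle.com." fails.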
You can save the script as something like filter-googlebot-ips.sh, make it executable, and pass it a file listing the IP addresses to filter (one per line) as an argument, like so:
$ ./filter-googlebot-ips.sh access-log-ips.txt > googlebot-ips.txt
This will perform reverse and forward DNS lookups for each of the IP addresses and print the verified Googlebot IPs to STDOUT, which you can redirect to a file as in the example above.
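If you are starting from a raw access log, the two steps around the script might look something like the sketch below. Both function names and the file name access.log are hypothetical, and the sketch assumes that the client IP is the first whitespace-separated field of every log line, as in the common and combined log formats. Adjust the field number if your log format differs.

```shell
#!/bin/bash
# Hypothetical helpers (assumptions, not part of the original script).
# Both assume the client IP is the first field of each log line.

# Print the unique client IPs found in an access log.
extract_ips() {
  awk '{ print $1 }' "$1" | sort -u
}

# Print only the log lines whose client IP appears in the given IP list.
filter_log_by_ips() {
  local log_file="$1" ip_file="$2"
  # While reading the first file (NR == FNR), remember each IP;
  # while reading the log, print the lines whose first field was remembered.
  awk 'NR == FNR { ips[$1]; next } ($1 in ips)' "$ip_file" "$log_file"
}

# Possible usage:
#   extract_ips access.log > access-log-ips.txt
#   ./filter-googlebot-ips.sh access-log-ips.txt > googlebot-ips.txt
#   filter_log_by_ips access.log googlebot-ips.txt > googlebot-visits.log
```

Deduplicating the IPs first keeps the number of DNS lookups down, since each unique address is only verified once.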
Hope it helps someone out there! 🙌
PS: Here is a GitHub Gist if you prefer that.