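Problem Description
While reviewing the server logs of this site (powered by WordPress), I noticed that Googlebot was sending a large number of requests to URLs of the form /wp-login.php?redirect_to=xxx, where xxx is the URL of an article page. Every one of these requests simply returns the short /wp-login.php login page, and the response is nearly identical no matter how many times it is requested. This is bad in two ways: it is unfriendly to search engines, because a large number of URLs map to the same content, and the requests do nothing for the site's search ranking while still consuming a noticeable amount of bandwidth. The screenshot below shows the problem:
In case the screenshot is hard to read, here are a few of the log entries: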
"GET /wp-login.php?redirect_to=https%3A%2F%2Fvimsky.com%2Farticle%2F8140.html HTTP/1.1" 200 4689 "vimsky.com" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
"GET /wp-login.php?redirect_to=https%3A%2F%2Fvimsky.com%2Farticle%2F8145.html HTTP/1.1" 200 4689 "vimsky.com" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
"GET /wp-login.php?redirect_to=https%3A%2F%2Fvimsky.com%2Farticle%2F8144.html HTTP/1.1" 200 4688 "vimsky.com" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
"GET /wp-login.php?redirect_to=https%3A%2F%2Fvimsky.com%2Farticle%2F8142.html HTTP/1.1" 200 9775 "vimsky.com" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
"GET /wp-login.php?redirect_to=https%3A%2F%2Fvimsky.com%2Farticle%2F8136.html HTTP/1.1" 200 6666 "vimsky.com" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
"GET /wp-login.php?redirect_to=https%3A%2F%2Fvimsky.com%2Farticle%2F8143.html HTTP/1.1" 200 9781 "vimsky.com" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
"GET /wp-login.php?redirect_to=https%3A%2F%2Fvimsky.com%2Farticle%2F8129.html HTTP/1.1" 200 4687 "vimsky.com" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
"GET /wp-login.php?redirect_to=https%3A%2F%2Fvimsky.com%2Farticle%2F8135.html HTTP/1.1" 200 4687 "vimsky.com" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
"GET /wp-login.php?redirect_to=https%3A%2F%2Fvimsky.com%2Farticle%2F8133.html HTTP/1.1" 200 6008 "vimsky.com" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
"GET /wp-login.php?redirect_to=https%3A%2F%2Fvimsky.com%2Farticle%2F8128.html HTTP/1.1" 200 6766 "vimsky.com" "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "-"
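Solution
When I first saw this, I thought someone was attacking the site and trying to brute-force the login account. After a closer look at the pool of requesting IPs, the use of the GET method rather than POST, and the request frequency, I concluded that these were ordinary Googlebot requests, and that the likely cause was that pages on vimsky itself contained links of this form. Following that lead and doing some digging, I found the root cause, shown in the figure below:
This site requires users to log in before posting a comment, so the comment area contains a link to the login page that carries a redirect_to parameter pointing back to the article. Googlebot discovers this link and tries to fetch it. The next question, then, is how to stop crawlers such as Googlebot or Baiduspider from fetching pages like this.
There are generally two ways to do this: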
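1. Add the rel="nofollow" attribute to the link
Adding the nofollow attribute to a link tells search engines not to follow it. In WordPress, the link behind the "you must be logged in to post a comment" message is generated in wp-includes/comment-template.php, at around line 2220. After the change, the code looks like this: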
/** This filter is documented in wp-includes/link-template.php */
'must_log_in' => '<p class="must-log-in">' . sprintf(
    // Append rel="nofollow" to the opening <a> tag of the (possibly translated) message.
    /* translators: %s: login URL */
    str_replace( "\">", "\" rel=\"nofollow\">", __( 'You must be <a href="%s">logged in</a> to post a comment.' ) ),
    wp_login_url( apply_filters( 'the_permalink', get_permalink( $post_id ) ) )
) . '</p>',
/** This filter is documented in wp-includes/link-template.php */
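To avoid breaking the Chinese translation of this string in the stock WordPress code (which lives in ./wp-content/languages/zh_CN.po), the change simply runs str_replace over the already-translated string, which appends rel="nofollow" to the link's opening tag.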
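The effect of that replacement can be illustrated outside WordPress. The sketch below mirrors the str_replace followed by sprintf in Python; the function names are hypothetical, and the template is the English message with its `<a href="%s">` placeholder.

```python
def add_nofollow(template):
    """Mirror of the PHP str_replace: turn '">' into '" rel="nofollow">'."""
    return template.replace('">', '" rel="nofollow">')


def render_must_log_in(template, login_url):
    """Mirror of the sprintf call: patch the template, then fill in the URL."""
    body = add_nofollow(template) % (login_url,)
    return '<p class="must-log-in">' + body + "</p>"


if __name__ == "__main__":
    message = 'You must be <a href="%s">logged in</a> to post a comment.'
    url = (
        "https://vimsky.com/wp-login.php"
        "?redirect_to=https%3A%2F%2Fvimsky.com%2Farticle%2F8140.html"
    )
    print(render_must_log_in(message, url))
```

Because the replacement runs on the template before the URL is substituted, it works for any translation that keeps the `<a href="%s">` form. A translation that wrote the tag differently (for example with single quotes) would pass through unchanged, and a message containing more than one `">` sequence would get the attribute added at each of them.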
2. Disallow wp-login URLs in the site's robots.txt
Add rules to robots.txt that block crawlers from the wp-login URLs:
User-agent: *
Disallow: /wp-admin
Disallow: /comments/feed
Disallow: /wp-login
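Before deploying, the rules can be sanity-checked locally with the robots.txt parser in Python's standard library. This is a minimal sketch: the helper name is_crawlable is made up, and Python's parser does plain prefix matching, which may differ in edge cases from how a particular search engine interprets the file.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /wp-admin
Disallow: /comments/feed
Disallow: /wp-login
"""


def is_crawlable(url, user_agent="Googlebot", robots_txt=ROBOTS_TXT):
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    login = (
        "https://vimsky.com/wp-login.php"
        "?redirect_to=https%3A%2F%2Fvimsky.com%2Farticle%2F8140.html"
    )
    print(is_crawlable(login))
    print(is_crawlable("https://vimsky.com/article/8140.html"))
```

Since `Disallow: /wp-login` is a path prefix, it covers /wp-login.php together with any query string, while ordinary article URLs remain fetchable.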
It is best to use both methods together, which more thoroughly keeps crawlers from requesting wp-login.php URLs.
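To confirm that the first change actually shows up in the rendered pages, the HTML of an article can be scanned for login links that still lack nofollow. The sketch below uses only the standard library; the class and function names are illustrative, and fetching the page HTML is left to the reader.

```python
from html.parser import HTMLParser


class LoginLinkAudit(HTMLParser):
    """Collect <a> tags that point at wp-login.php without rel="nofollow"."""

    def __init__(self):
        super().__init__()
        self.unprotected = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href") or ""
        # rel may hold several space-separated tokens, e.g. "nofollow noopener".
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "wp-login.php" in href and "nofollow" not in rel_tokens:
            self.unprotected.append(href)


def find_followable_login_links(html):
    """Return the href of every login link a crawler is still invited to follow."""
    audit = LoginLinkAudit()
    audit.feed(html)
    audit.close()
    return audit.unprotected
```

An empty result for a page means every wp-login.php link on it carries the attribute.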