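Problem description
Some other websites may be using cURL and a forged HTTP Referer to copy the content of my website.
Is there a way to detect that a request comes from cURL rather than from a real web browser?
Best approach
There is no perfect way to prevent automated crawling: anything a human can do, a bot can imitate. There are, however, many ways to make scraping harder, hard enough to deter most people, although they are of limited use against highly skilled geeks.
Here are several different types of anti-crawling techniques.
1. Sessions per IP
If a user creates 50 new sessions per minute, you can assume this user may be a crawler that does not handle cookies. Of course, curl manages cookies perfectly well, but if you combine this with a per-session visit counter (explained later), or if the crawler handles cookies badly, this method can be effective.
It is unlikely that 50 people sharing the same connection will be on your website at the same time. If that happens, treat it as crawling: you can lock the site's pages until a captcha is filled in.
Steps:
1) Create 2 tables: one to store banned IPs, one to store IPs and sessions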
create table if not exists sessions_per_ip (
ip int unsigned,
session_id varchar(32),
creation timestamp default current_timestamp,
primary key(ip, session_id)
);
create table if not exists banned_ips (
ip int unsigned,
creation timestamp default current_timestamp,
primary key(ip)
);
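2) At the beginning of your script, delete entries that are too old from both tables
3) Next, check whether the user's IP is banned (set a flag to true)
4) If not, count how many sessions this IP has
5) If it has too many sessions, insert it into the banned table and set a flag
6) Insert the IP into the sessions_per_ip table if it has not already been inserted
I wrote a code sample to show my idea more clearly.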
<?php
try
{
// Some configuration (small values for demo)
$max_sessions = 5; // 5 sessions/ip simultaneously allowed
$check_duration = 30; // 30 secs max lifetime of an ip on the sessions_per_ip table
$lock_duration = 60; // time to lock your website for this ip if max_sessions is reached
// Mysql connection
require_once("config.php");
$dbh = new PDO("mysql:host={$host};dbname={$base}", $user, $password);
$dbh->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
// Delete old entries in tables
$query = "delete from sessions_per_ip where timestampdiff(second, creation, now()) > {$check_duration}";
$dbh->exec($query);
$query = "delete from banned_ips where timestampdiff(second, creation, now()) > {$lock_duration}";
$dbh->exec($query);
// Get useful info attached to our user...
session_start();
$ip = ip2long($_SERVER['REMOTE_ADDR']);
$session_id = session_id();
// Check if IP is already banned
$banned = false;
$count = $dbh->query("select count(*) from banned_ips where ip = '{$ip}'")->fetchColumn();
if ($count > 0)
{
$banned = true;
}
else
{
// Count entries in our db for this ip
$query = "select count(*) from sessions_per_ip where ip = '{$ip}'";
$count = $dbh->query($query)->fetchColumn();
if ($count >= $max_sessions)
{
// Lock website for this ip
$query = "insert ignore into banned_ips ( ip ) values ( '{$ip}' )";
$dbh->exec($query);
$banned = true;
}
// Insert a new entry on our db if user's session is not already recorded
$query = "insert ignore into sessions_per_ip ( ip, session_id ) values ('{$ip}', '{$session_id}')";
$dbh->exec($query);
}
// At this point you have a $banned if your user is banned or not.
// The following code will allow us to test it...
// We do not display anything now because we'll play with sessions :
// to make the demo more readable I prefer going step by step like
// this.
ob_start();
// Displays your current sessions
echo "Your current sessions keys are : <br/>";
$query = "select session_id from sessions_per_ip where ip = '{$ip}'";
foreach ($dbh->query($query) as $row) {
echo "{$row['session_id']}<br/>";
}
// Display and handle a way to create new sessions
echo str_repeat('<br/>', 2);
echo '<a href="' . basename(__FILE__) . '?new=1">Create a new session / reload</a>';
if (isset($_GET['new']))
{
session_regenerate_id();
session_destroy();
header("Location: " . basename(__FILE__));
die();
}
// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
echo '<span style="color:red;">You are banned: wait 60secs to be unbanned... a captcha must be more friendly of course!</span>';
echo '<br/>';
echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
echo '<span style="color:blue;">You are not banned!</span>';
echo '<br/>';
echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
ob_end_flush();
}
catch (PDOException $e)
{
error_log($e->getMessage()); // log the error instead of silently discarding it
}
?>
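One detail of the sample above is worth isolating: the tables store the IP as `int unsigned`, and `ip2long()` only understands IPv4, returning `false` for anything else (an IPv6 address, for instance). The following is a minimal sketch of a guard around that conversion. The function name `ip_to_uint` is hypothetical and not part of the original sample, and the sketch assumes a 64-bit PHP build, where `ip2long()` always yields a non-negative integer.

```php
<?php
// Hypothetical helper (not part of the original sample): converts a dotted
// IPv4 string into the integer stored in the `ip int unsigned` column.
// Assumes a 64-bit PHP build. Returns null when ip2long() cannot parse the
// input, e.g. for an IPv6 address or a malformed string.
function ip_to_uint(string $address): ?int
{
    $long = ip2long($address);
    if ($long === false) {
        return null; // not a valid IPv4 address
    }
    return $long;
}
```

In the script, a `null` result could simply skip the per-IP checks, so that the value never reaches a query; how IPv6 visitors should be handled is left open here.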
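2. Visit counter
If your user uses the same cookie to crawl your pages, you can use their session to block them. The idea is simple: is it plausible for a user to visit 60 pages in 60 seconds?
Steps:
- Create an array in the user's session that will hold the time of each visit.
- Remove visits older than X seconds from this array
- Add a new entry for the current visit
- Count the entries in this array
- Ban the user if they have visited Y pages
Sample code: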
<?php
$visit_counter_pages = 5; // maximum number of pages to load
$visit_counter_secs = 10; // maximum amount of time before cleaning visits
session_start();
// initialize an array for our visit counter
if (array_key_exists('visit_counter', $_SESSION) == false)
{
$_SESSION['visit_counter'] = array();
}
// clean old visits
foreach ($_SESSION['visit_counter'] as $key => $time)
{
if ((time() - $time) > $visit_counter_secs) {
unset($_SESSION['visit_counter'][$key]);
}
}
// we add the current visit into our array
$_SESSION['visit_counter'][] = time();
// check if user has reached limit of visited pages
$banned = false;
if (count($_SESSION['visit_counter']) > $visit_counter_pages)
{
// puts ip of our user on the same "banned table" as earlier...
$banned = true;
}
// At this point you have a $banned if your user is banned or not.
// The following code will allow us to test it...
echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';
// Display counter
$count = count($_SESSION['visit_counter']);
echo "You visited {$count} pages.";
echo str_repeat('<br/>', 2);
echo <<< EOT
<a id="reload" href="#">Reload</a>
<script type="text/javascript">
$('#reload').click(function(e) {
e.preventDefault();
window.location.reload();
});
</script>
EOT;
echo str_repeat('<br/>', 2);
// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
echo '<span style="color:red;">You are banned! Wait for a short while (10 secs in this demo)...</span>';
echo '<br/>';
echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
echo '<span style="color:blue;">You are not banned!</span>';
echo '<br/>';
echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
?>
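The sliding-window logic in the sample above can also be separated from `$_SESSION` and `time()`, which makes it easier to test. This is only a sketch: the function names `record_visit` and `is_over_limit` are hypothetical, and the pruning rule copies the sample (a visit is dropped once it is more than the window length old).

```php
<?php
// Hypothetical helper names; the logic mirrors the session-based sample.

/**
 * Drops visits older than $windowSecs, records the current visit and
 * returns the updated list of timestamps.
 */
function record_visit(array $visits, int $now, int $windowSecs): array
{
    $recent = array_filter(
        $visits,
        function (int $time) use ($now, $windowSecs): bool {
            return ($now - $time) <= $windowSecs;
        }
    );
    $recent = array_values($recent); // reindex after filtering
    $recent[] = $now;                // add the current visit
    return $recent;
}

/** True when the number of recent visits exceeds the allowed page count. */
function is_over_limit(array $visits, int $maxPages): bool
{
    return count($visits) > $maxPages;
}
```

In a page script this could be wired up as `$_SESSION['visit_counter'] = record_visit($_SESSION['visit_counter'] ?? [], time(), $visit_counter_secs);` followed by `$banned = is_over_limit($_SESSION['visit_counter'], $visit_counter_pages);`.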
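3. Image download
Crawlers usually need to fetch a large amount of data in a short time, so they generally do not download the images on a page: images take too much bandwidth and slow the crawl down.
The idea is as follows (I think it is the most elegant and the easiest to implement):
Use mod_rewrite to hide a PHP script behind an image file name (.jpg / .png / …). This image should be present on every page you want to protect: it could be your site logo, but choose a small image (because this image must not be cached).
Steps:
1. Add these lines to your .htaccess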
RewriteEngine On
RewriteBase /tests/anticrawl/
RewriteRule ^logo\.jpg$ logo.php
2. Create your logo.php with the security check
<?php
// start session and reset counter
session_start();
$_SESSION['no_logo_count'] = 0;
// forces image to reload next time
header("Cache-Control: no-store, no-cache, must-revalidate");
// displays image
header("Content-Type: image/jpeg");
readfile("logo.jpg");
die();
3. Increment no_logo_count on every page that needs the extra security, and check whether it has reached the limit.
Sample code:
<?php
$no_logo_limit = 5; // number of allowed pages without logo
// start session and initialize
session_start();
if (array_key_exists('no_logo_count', $_SESSION) == false)
{
$_SESSION['no_logo_count'] = 0;
}
else
{
$_SESSION['no_logo_count']++;
}
// check if user has reached limit of "undownloaded image"
$banned = false;
if ($_SESSION['no_logo_count'] >= $no_logo_limit)
{
// puts ip of our user on the same "banned table" as earlier...
$banned = true;
}
// At this point you have a $banned if your user is banned or not.
// The following code will allow us to test it...
echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';
// Display counter
echo "You did not load the image {$_SESSION['no_logo_count']} times.";
echo str_repeat('<br/>', 2);
// Display "reload" link
echo <<< EOT
<a id="reload" href="#">Reload</a>
<script type="text/javascript">
$('#reload').click(function(e) {
e.preventDefault();
window.location.reload();
});
</script>
EOT;
echo str_repeat('<br/>', 2);
// Display "show image" link : note that we're using .jpg file
echo <<< EOT
<div id="image_container">
<a id="image_load" href="#">Load image</a>
</div>
<br/>
<script type="text/javascript">
// On your implementation, you'll of course use <img src="logo.jpg" />
$('#image_load').click(function(e) {
e.preventDefault();
$('#image_load').html('<img src="logo.jpg" />');
});
</script>
EOT;
// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
echo '<span style="color:red;">You are banned: click on "load image" and reload...</span>';
echo '<br/>';
echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
echo '<span style="color:blue;">You are not banned!</span>';
echo '<br/>';
echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
?>
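4. Cookie check
You can create a cookie on the JavaScript side to check whether your user executes JavaScript (a crawler using curl, for example, does not).
The idea is simple: it is much the same as the image check.
- Keep a counter in $_SESSION and increment it on each visit where the cookie is missing or wrong
- If the cookie (set in JavaScript) exists, reset the session value to 0
- If this value reaches a limit, ban the user
Code: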
<?php
$no_cookie_limit = 5; // number of allowed pages without cookie set check
// Start session and reset counter
session_start();
if (array_key_exists('cookie_check_count', $_SESSION) == false)
{
$_SESSION['cookie_check_count'] = 0;
}
// Initializes cookie (note: rename it to a more discrete name of course) or check cookie value
if ((array_key_exists('cookie_check', $_COOKIE) == false) || ($_COOKIE['cookie_check'] != 42))
{
// Cookie does not exist or is incorrect...
$_SESSION['cookie_check_count']++;
}
else
{
// Cookie is properly set so we reset counter
$_SESSION['cookie_check_count'] = 0;
}
// Check if user has reached limit of "cookie check"
$banned = false;
if ($_SESSION['cookie_check_count'] >= $no_cookie_limit)
{
// puts ip of our user on the same "banned table" as earlier...
$banned = true;
}
// At this point you have a $banned if your user is banned or not.
// The following code will allow us to test it...
echo '<script type="text/javascript" src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.2/jquery.min.js"></script>';
// Display counter
echo "Cookie check failed {$_SESSION['cookie_check_count']} times.";
echo str_repeat('<br/>', 2);
// Display "reload" link
echo <<< EOT
<br/>
<a id="reload" href="#">Reload</a>
<br/>
<script type="text/javascript">
$('#reload').click(function(e) {
e.preventDefault();
window.location.reload();
});
</script>
EOT;
// Display "set cookie" link
echo <<< EOT
<br/>
<a id="cookie_link" href="#">Set cookie</a>
<br/>
<script type="text/javascript">
// On your implementation, you'll of course put the cookie set on a $(document).ready()
$('#cookie_link').click(function(e) {
e.preventDefault();
var expires = new Date();
expires.setTime(new Date().getTime() + 3600000);
document.cookie="cookie_check=42;expires=" + expires.toUTCString();
});
</script>
EOT;
// Display "unset cookie" link
echo <<< EOT
<br/>
<a id="unset_cookie" href="#">Unset cookie</a>
<br/>
<script type="text/javascript">
// Demo only: clears the cookie so that the check fails again
$('#unset_cookie').click(function(e) {
e.preventDefault();
document.cookie="cookie_check=;expires=Thu, 01 Jan 1970 00:00:01 GMT";
});
</script>
EOT;
// Display if you're banned or not
echo str_repeat('<br/>', 2);
if ($banned)
{
echo '<span style="color:red;">You are banned: click on "Set cookie" and reload...</span>';
echo '<br/>';
echo '<img src="http://4.bp.blogspot.com/-PezlYVgEEvg/TadW2e4OyHI/AAAAAAAAAAg/QHZPVQcBNeg/s1600/feu-rouge.png" />';
}
else
{
echo '<span style="color:blue;">You are not banned!</span>';
echo '<br/>';
echo '<img src="http://identityspecialist.files.wordpress.com/2010/06/traffic_light_green.png" />';
}
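5. Protection against proxies
The web offers some information about the different kinds of proxies:
- A "normal" proxy reveals information about the user's connection (notably their IP).
- An anonymous proxy does not reveal the IP, but its headers disclose that a proxy is in use.
- A high-anonymous proxy reveals neither the user's IP nor any information that a browser would not send.
It is easy to find a proxy for connecting to any website, but very hard to find high-anonymous proxies.
Some $_SERVER variables may contain the following keys, specifically if your user is behind a proxy (exhaustive list taken from this question):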
- CLIENT_IP
- FORWARDED
- FORWARDED_FOR
- FORWARDED_FOR_IP
- HTTP_CLIENT_IP
- HTTP_FORWARDED
- HTTP_FORWARDED_FOR
- HTTP_FORWARDED_FOR_IP
- HTTP_PC_REMOTE_ADDR
- HTTP_PROXY_CONNECTION
- HTTP_VIA
- HTTP_X_FORWARDED
- HTTP_X_FORWARDED_FOR
- HTTP_X_FORWARDED_FOR_IP
- HTTP_X_IMFORWARDS
- HTTP_XROXY_CONNECTION
- VIA
- X_FORWARDED
- X_FORWARDED_FOR
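If you detect any of the keys above in the $_SERVER variable, you can apply a corresponding anti-proxy policy as part of your anti-crawl security.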
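The check described above can be sketched as a small function that reports which of the listed keys are present. The function name `find_proxy_keys` and the constant `PROXY_SERVER_KEYS` are hypothetical; the keys themselves are copied from the list. The function takes the array as a parameter instead of reading `$_SERVER` directly, so it can be exercised with sample data.

```php
<?php
// Keys copied from the list above. The constant and function names are
// hypothetical, chosen for this sketch only.
const PROXY_SERVER_KEYS = [
    'CLIENT_IP', 'FORWARDED', 'FORWARDED_FOR', 'FORWARDED_FOR_IP',
    'HTTP_CLIENT_IP', 'HTTP_FORWARDED', 'HTTP_FORWARDED_FOR',
    'HTTP_FORWARDED_FOR_IP', 'HTTP_PC_REMOTE_ADDR', 'HTTP_PROXY_CONNECTION',
    'HTTP_VIA', 'HTTP_X_FORWARDED', 'HTTP_X_FORWARDED_FOR',
    'HTTP_X_FORWARDED_FOR_IP', 'HTTP_X_IMFORWARDS', 'HTTP_XROXY_CONNECTION',
    'VIA', 'X_FORWARDED', 'X_FORWARDED_FOR',
];

/**
 * Returns the proxy-related keys found in a $_SERVER-like array,
 * in the order in which they appear in PROXY_SERVER_KEYS.
 */
function find_proxy_keys(array $server): array
{
    return array_values(array_intersect(PROXY_SERVER_KEYS, array_keys($server)));
}
```

A page would call `find_proxy_keys($_SERVER)` and treat a non-empty result as one signal among others, for example by lowering the thresholds used in the earlier checks. Two caveats: these headers are supplied by the client or by intermediaries, so they can be forged or omitted, and legitimate visitors behind corporate proxies also send them. If the site itself sits behind a reverse proxy or CDN, some of these keys will be present on every request.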
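Conclusion
In summary, there are many ways to detect crawler behaviour on your website. However, you need to know precisely how your site is used, so that your security policy does not hurt legitimate users.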