PHP Crawler::setForbiddenDirectories Method Code Examples

This article collects typical usage examples of the Crawler::setForbiddenDirectories method in PHP. If you are wondering how Crawler::setForbiddenDirectories is used in practice, or are looking for a working example of it, the selected code below may help. You can also look further into usage examples of the Crawler class that this method belongs to.


One code example of the Crawler::setForbiddenDirectories method is shown below. Examples are sorted by popularity by default.

Example 1: runCrawler

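/**
 * Runs the crawler with the current settings and returns the array of files.
 *
 * @param array $FILE        file entries collected so far, keyed by URL
 * @param array $FILES_CACHE cached file entries, passed on to handleURLCached()
 * @return array the $FILE array, extended with the entries found by the crawler
 * TODO update to new stuff
 */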
function runCrawler($FILE, $FILES_CACHE)
{
    global $SETTINGS, $LAYOUT;
    $urlToCrawl = isset($SETTINGS[PSNG_CRAWLER_URL]) && $SETTINGS[PSNG_CRAWLER_URL] != $SETTINGS[PSNG_WEBSITE] ? $SETTINGS[PSNG_CRAWLER_URL] : $SETTINGS[PSNG_WEBSITE];
    $url = parse_url($urlToCrawl);
    // parse_url() omits the 'path' key when the URL has no path component
    $path = isset($url['path']) ? $url['path'] : '';
    if (substr($urlToCrawl, -1) != '/' && $path == '') {
        $path .= '/';
        $urlToCrawl .= '/';
    }
    // check whether a scan has already been started
    debug($SETTINGS[PSNG_TIMEOUT], 'PSNG_TIMEOUT');
    if (isset($SETTINGS[PSNG_TIMEOUT_TODO])) {
        debug($SETTINGS[PSNG_TIMEOUT_TODO], 'PSNG_TIMEOUT_TODO');
    }
    # !!! 'repair' may not be correct mk/2005-11-08
    if ($SETTINGS[PSNG_TIMEOUT] != PSNG_TIMEOUT_NONE && isset($SETTINGS[PSNG_TIMEOUT_TODO])) {
        // check if we're running in TIMEOUT mode
        debug('', "Running crawler engine from last point");
        $crawler = new Crawler($urlToCrawl, $SETTINGS[PSNG_TIMEOUT_TIME_DEADLINE]);
        $crawler->setTodo($SETTINGS[PSNG_TIMEOUT_TODO]);
        $crawler->setDone($SETTINGS[PSNG_TIMEOUT_DONE]);
        $crawler->setFiles($SETTINGS[PSNG_TIMEOUT_FILE]);
    } else {
        // we are not in timeout mode, no rerun
        $crawler = new Crawler($urlToCrawl, $SETTINGS[PSNG_TIMEOUT_TIME_DEADLINE]);
        $crawler->setTodo(array($urlToCrawl));
    }
    $crawler->setForbiddenKeys($SETTINGS[PSNG_DISALLOW_KEY]);
    $crawler->setForbiddenDirectories($SETTINGS[PSNG_DISALLOW_DIR]);
    $crawler->setForbiddenFiles($SETTINGS[PSNG_DISALLOW_FILE]);
    //Set the directory to forbid the crawler to follow below it
    $crawler->setDirectory($path);
    $crawler->start();
    if (!$crawler->hasFinished()) {
        // store current data into session
        $SETTINGS[PSNG_TIMEOUT_TODO] = $crawler->getTodo();
        $SETTINGS[PSNG_TIMEOUT_DONE] = $crawler->getDone();
        $SETTINGS[PSNG_TIMEOUT_FILE] = $crawler->getFiles();
        $SETTINGS[PSNG_TIMEOUT_ACTION] = PSNG_TIMEOUT_ACTION_WEBSITE;
    } else {
        while ($crawler->hasNext()) {
            $fileinfo = $crawler->getNext();
            // returns an array
            // default every expected key to an empty string if it is missing
            foreach (array('http_status', 'file', 'lastmod', 'changefreq', 'priority') as $key) {
                if (!isset($fileinfo[$key])) {
                    $fileinfo[$key] = '';
                }
            }
            $http_status = $fileinfo['http_status'];
            // create and setup valid values
            $fileinfo = handleURL($fileinfo['file'], $fileinfo['lastmod'], $fileinfo['changefreq'], $fileinfo['priority']);
            $fileinfo = handleURLCached($FILES_CACHE, $fileinfo);
            // handle some website specific stuff
            if ($http_status == "404") {
                $fileinfo[PSNG_FILE_ENABLED] = '';
                $fileinfo[PSNG_HTML_STATUS] = 'class="notfound"';
            }
            // info($fileinfo, 'Fileinfo from crawler');
            // handle if the file exists on filesystem and on website
            if (array_key_exists($fileinfo[PSNG_FILE_URL], $FILE)) {
                $fileinfo = handleDoubleEntryFilesystemWebsite($FILE[$fileinfo[PSNG_FILE_URL]], $fileinfo);
            }
            // info($fileinfo, 'Fileinfo after handle double entry');
            if (isset($fileinfo[PSNG_HTML_SOURCE])) {
                if ($fileinfo[PSNG_HTML_SOURCE] == PSNG_HTML_SOURCE_FS) {
                    $fileinfo[PSNG_HTML_SOURCE] = PSNG_HTML_SOURCE_FS_WEBSITE;
                } else {
                    $fileinfo[PSNG_HTML_SOURCE] = PSNG_HTML_SOURCE_WEBSITE;
                }
            } else {
                $fileinfo[PSNG_HTML_SOURCE] = PSNG_HTML_SOURCE_WEBSITE;
            }
            $FILE[$fileinfo[PSNG_FILE_URL]] = $fileinfo;
        }
        $SETTINGS[PSNG_TIMEOUT_ACTION] = '';
    }
    return $FILE;
}
Developer: Tipkin-Commons, Project: tipkin, Lines of code: 93, Source file: parsing.inc.php
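
The example above only passes a list of directories to setForbiddenDirectories; the Crawler class itself is not shown, so how it applies that list is unknown. As a rough illustration of what such a filter could look like, here is a minimal sketch. The function name isInForbiddenDirectory and its prefix-matching rule are assumptions made for illustration and are not taken from the tipkin project.

```php
<?php
/**
 * Hypothetical helper, NOT part of the original Crawler class.
 * Returns true when the URL's path starts with one of the forbidden directories.
 */
function isInForbiddenDirectory($url, array $forbiddenDirs)
{
    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path) || $path === '') {
        $path = '/'; // no path component (or malformed URL): treat as site root
    }
    // Ensure the path ends with "/" so "/admin" is compared as "/admin/".
    $path = rtrim($path, '/') . '/';

    foreach ($forbiddenDirs as $dir) {
        // Normalize each entry to "/name/" so "/admin" does not match "/administrator".
        $dir = '/' . trim($dir, '/') . '/';
        if ($dir === '//') {
            continue; // skip empty entries
        }
        if (strpos($path, $dir) === 0) {
            return true;
        }
    }
    return false;
}

// Example usage with made-up URLs and directory names.
$forbidden = array('/admin/', 'tmp');
var_dump(isInForbiddenDirectory('http://example.com/admin/login.php', $forbidden));
var_dump(isInForbiddenDirectory('http://example.com/administrator/', $forbidden));
```

In this sketch the first call should report a match and the second should not, because entries are compared as whole directory names from the site root. A real crawler might instead match directories at any depth or use patterns; that choice depends on the Crawler implementation, which this page does not include.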


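Note: The Crawler::setForbiddenDirectories examples in this article were compiled by 純淨天空 from open-source code and documentation platforms such as GitHub and MSDocs. The snippets were selected from open-source projects, and copyright in the source code belongs to the original authors. For distribution and use, refer to the License of the corresponding project. Do not reproduce without permission.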