Java WebURL.setDocid Method Code Examples

This article collects typical usage examples of the edu.uci.ics.crawler4j.url.WebURL.setDocid method in Java. If you are wondering how WebURL.setDocid is used in practice, the selected examples below may help. You can also look at usage examples for the enclosing class, edu.uci.ics.crawler4j.url.WebURL.

Three code examples of WebURL.setDocid are shown below, sorted by popularity by default.

Example 1: entryToObject

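import com.sleepycat.bind.tuple.TupleInput; // Berkeley DB JE tuple input used by the binding
import edu.uci.ics.crawler4j.url.WebURL; // import the class this method depends on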
@Override
public WebURL entryToObject(TupleInput input) {
	WebURL webURL = new WebURL();
	webURL.setURL(input.readString());
	webURL.setDocid(input.readInt());
	webURL.setParentDocid(input.readInt());
	webURL.setParentUrl(input.readString());
	webURL.setDepth(input.readShort());
	webURL.setPriority(input.readByte());
	webURL.setAnchor(input.readString());
	return webURL;
}

Developer: sapienapps, Project: scrawler, Lines of code: 13, Source: WebURLTupleBinding.java
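Example 1 rebuilds a WebURL by reading fields in a fixed order, so the reads only make sense if they mirror the order used when the record was written. The following is a minimal sketch of that round trip using only the JDK. `UrlRecordCodec` and `UrlRecord` are hypothetical stand-ins invented for illustration; `DataOutputStream`/`DataInputStream` replace the tuple streams, and strings are assumed to be non-null. It is not the code of crawler4j or Berkeley DB.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical stand-in for a tuple binding; it does not use crawler4j or Berkeley DB.
class UrlRecordCodec {

    // Holds the same kinds of fields that Example 1 reads; names are illustrative.
    static class UrlRecord {
        String url;
        int docid;
        int parentDocid;
        String parentUrl;
        short depth;
        byte priority;
        String anchor;
    }

    // Writes the fields in a fixed order. Strings must be non-null in this sketch.
    static byte[] write(UrlRecord r) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(r.url);
            out.writeInt(r.docid);
            out.writeInt(r.parentDocid);
            out.writeUTF(r.parentUrl);
            out.writeShort(r.depth);
            out.writeByte(r.priority);
            out.writeUTF(r.anchor);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the fields back in exactly the order they were written.
    static UrlRecord read(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            UrlRecord r = new UrlRecord();
            r.url = in.readUTF();
            r.docid = in.readInt();
            r.parentDocid = in.readInt();
            r.parentUrl = in.readUTF();
            r.depth = in.readShort();
            r.priority = in.readByte();
            r.anchor = in.readUTF();
            return r;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Swapping any two reads of different types (for example, reading the depth before the parent URL) would misinterpret the bytes, which is why the read order in Example 1 has to match the corresponding write method.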

Example 2: addSeed

import edu.uci.ics.crawler4j.url.WebURL; // import the class this method depends on
/**
 * Adds a new seed URL. A seed URL is a URL that is fetched by the crawler
 * to extract new URLs in it and follow them for crawling. You can also
 * specify a document id to be assigned to this seed URL. This document id
 * needs to be unique. Note that if you add three seeds with document ids
 * 1, 2, and 7, then the next URL that is found during the crawl will get a
 * doc id of 8. You also need to add seeds in increasing order of document
 * ids.
 * <p/>
 * Specifying doc ids is mainly useful when you have had a previous crawl
 * and have stored the results and want to start a new crawl with seeds
 * which get the same document ids as the previous crawl.
 *
 * @param pageUrl the URL of the seed
 * @param docId   the document id that you want to be assigned to this seed URL.
 */
public void addSeed(String pageUrl, int docId) {
    String canonicalUrl = URLCanonicalizer.getCanonicalURL(pageUrl);
    if (canonicalUrl == null) {
        logger.error("Invalid seed URL: " + pageUrl);
        return;
    }
    if (docId < 0) {
        docId = docIdServer.getDocId(canonicalUrl);
        if (docId > 0) {
            // This URL is already seen.
            return;
        }
        docId = docIdServer.getNewDocID(canonicalUrl);
    } else {
        try {
            docIdServer.addUrlAndDocId(canonicalUrl, docId);
        } catch (Exception e) {
            logger.error("Could not add seed: " + e.getMessage());
        }
    }

    WebURL webUrl = new WebURL();
    webUrl.setURL(canonicalUrl);
    webUrl.setDocid(docId);
    webUrl.setDepth((short) 0);
    if (!robotstxtServer.allows(webUrl)) {
        logger.info("Robots.txt does not allow this seed: " + pageUrl);
    } else {
        frontier.schedule(webUrl);
    }
}

Developer: sapienapps, Project: scrawler, Lines of code: 48, Source: CrawlController.java
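The Javadoc in Example 2 describes two rules: explicitly assigned document ids must arrive in increasing order, and a URL discovered later receives the next id after the largest one assigned so far (seeds 1, 2, and 7 lead to 8). The sketch below models those rules with an in-memory map. `InMemoryDocIdServer` is a hypothetical class written for illustration; its method names echo the calls in Example 2, but its behavior (including throwing `IllegalArgumentException`) is an assumption based on the Javadoc, not crawler4j's actual implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory model of the doc id rules described in the Javadoc.
// It is not crawler4j's doc id server and persists nothing.
class InMemoryDocIdServer {
    private final Map<String, Integer> ids = new HashMap<>();
    private int lastDocId = 0;

    // Returns the id of a known URL, or -1 if the URL has not been seen.
    int getDocId(String url) {
        return ids.getOrDefault(url, -1);
    }

    // Returns the existing id for a known URL, otherwise assigns the next id.
    int getNewDocId(String url) {
        int existing = getDocId(url);
        if (existing > 0) {
            return existing;
        }
        lastDocId++;
        ids.put(url, lastDocId);
        return lastDocId;
    }

    // Registers a caller-chosen id; ids must be unique and strictly increasing.
    void addUrlAndDocId(String url, int docId) {
        if (docId <= lastDocId) {
            throw new IllegalArgumentException(
                    "Requested doc id " + docId + " is not larger than " + lastDocId);
        }
        if (ids.containsKey(url)) {
            throw new IllegalArgumentException("URL already has a doc id: " + url);
        }
        ids.put(url, docId);
        lastDocId = docId;
    }
}
```

In this model, registering seeds with ids 1, 2, and 7 and then calling `getNewDocId` for an unseen URL yields 8, while trying to register a seed with id 5 afterwards is rejected. That is the reason the Javadoc asks for seeds to be added in increasing order of document ids.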

Example 3: addSeed

import edu.uci.ics.crawler4j.url.WebURL; // import the class this method depends on
/**
 * Adds a new seed URL. A seed URL is a URL that is fetched by the crawler
 * to extract new URLs in it and follow them for crawling. You can also
 * specify a document id to be assigned to this seed URL. This document id
 * needs to be unique. Note that if you add three seeds with document ids
 * 1, 2, and 7, then the next URL that is found during the crawl will get a
 * doc id of 8. You also need to add seeds in increasing order of document
 * ids.
 * 
 * Specifying doc ids is mainly useful when you have had a previous crawl
 * and have stored the results and want to start a new crawl with seeds
 * which get the same document ids as the previous crawl.
 * 
 * @param pageUrl
 *            the URL of the seed
 * @param docId
 *            the document id that you want to be assigned to this seed URL.
 * 
 */
public void addSeed(String pageUrl, int docId) {
	String canonicalUrl = URLCanonicalizer.getCanonicalURL(pageUrl);
	if (canonicalUrl == null) {
		logger.error("Invalid seed URL: {}", pageUrl);
		return;
	}
	if (docId < 0) {
		docId = docIdServer.getDocId(canonicalUrl);
		if (docId > 0) {
			// This URL is already seen.
			return;
		}
		docId = docIdServer.getNewDocID(canonicalUrl);
	} else {
		try {
			docIdServer.addUrlAndDocId(canonicalUrl, docId);
		} catch (Exception e) {
			logger.error("Could not add seed: {}", e.getMessage());
		}
	}

	WebURL webUrl = new WebURL();
	webUrl.setURL(canonicalUrl);
	webUrl.setDocid(docId);
	webUrl.setDepth((short) 0);
	if (!robotstxtServer.allows(webUrl)) {
		logger.info("Robots.txt does not allow this seed: {}", pageUrl);
	} else {
		frontier.schedule(webUrl);
	}
}

Developer: Chaiavi, Project: Crawler4j, Lines of code: 51, Source: CrawlController.java

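Note: The edu.uci.ics.crawler4j.url.WebURL.setDocid examples in this article were compiled by 纯净天空 from open-source code and documentation platforms such as GitHub and MSDocs. The snippets were selected from open-source projects contributed by their respective developers. Copyright of the source code belongs to the original authors; for distribution and use, refer to each project's License. Do not reproduce without permission.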