Java Parser Class Code Examples

This article collects typical usage examples of the org.htmlparser.Parser class in Java. If you are wondering how the Parser class is used in practice, the examples selected here may help.


The Parser class belongs to the org.htmlparser package. The 15 code examples below are sorted by popularity by default.

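Example 1: parserUrl

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser; // import the required package/class
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

@Override
public NodeList parserUrl(Parser parser) {
	// Accept only tags whose text begins with "a href=",
	// i.e. lowercase anchors that list href as their first attribute.
	NodeFilter hrefNodeFilter = new NodeFilter() {
		@Override
		public boolean accept(Node node) {
			return node.getText().startsWith("a href=");
		}
	};
	try {
		return parser.extractAllNodesThatMatch(hrefNodeFilter);
	} catch (ParserException e) {
		e.printStackTrace();
	}
	// Reached only when parsing failed.
	return null;
}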
 
Developer ID: PerkinsZhu, Project: WebSprider, Lines of code: 20, Source: HtmlParser01.java
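
The filter in Example 1 accepts a tag only when its text starts with the exact lowercase string "a href=", so an anchor written in upper case, or one that lists another attribute before href, is skipped. As a rough illustration only, the sketch below applies a more tolerant check to the same kind of tag text, using a regular expression from the Java standard library. The class name AnchorTagText and the method isAnchorWithHref are hypothetical and belong neither to org.htmlparser nor to the WebSprider project. The pattern is a heuristic: an attribute value that itself contains " href=" would fool it.

```java
import java.util.regex.Pattern;

/** Hypothetical helper: decides from a tag's text whether it is an anchor carrying an href. */
final class AnchorTagText {
    // Tag name "a", then whitespace, then an href attribute that is either the
    // first attribute or preceded by whitespace somewhere later in the tag text.
    private static final Pattern ANCHOR_WITH_HREF = Pattern.compile(
            "^a\\s+(?:.*\\s)?href\\s*=.*",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static boolean isAnchorWithHref(String tagText) {
        return tagText != null && ANCHOR_WITH_HREF.matcher(tagText.trim()).matches();
    }

    public static void main(String[] args) {
        // The tag text is what sits between '<' and '>' in the page source.
        System.out.println(isAnchorWithHref("a href=\"/index.html\""));
        System.out.println(isAnchorWithHref("A class=\"nav\" HREF='/x'"));
        System.out.println(isAnchorWithHref("abbr title=\"x\""));
    }
}
```

Inside a NodeFilter, accept(Node) could then return AnchorTagText.isAnchorWithHref(node.getText()), assuming getText() yields the tag text as it does in Example 1.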

Example 2: getPlainText

import org.htmlparser.Parser; // import the required package/class
import org.htmlparser.beans.StringBean;
import org.htmlparser.util.ParserException;

public static String getPlainText(String htmlStr) {
    Parser parser = new Parser();
    String plainText = "";
    try {
        parser.setInputHTML(htmlStr);

        StringBean stringBean = new StringBean();
        // Do not include the links contained in the page
        stringBean.setLinks(false);
        // Replace non-breaking spaces with regular spaces
        stringBean.setReplaceNonBreakingSpaces(true);
        // Collapse runs of whitespace into a single space
        stringBean.setCollapse(true);

        parser.visitAllNodesWith(stringBean);
        plainText = stringBean.getStrings();

    } catch (ParserException e) {
        e.printStackTrace();
    }

    // Empty string if parsing failed
    return plainText;
}
 
Developer ID: sercxtyf, Project: onboard, Lines of code: 24, Source: HtmlTextParser.java
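
The comments in Example 2 name two text normalizations: replacing non-breaking spaces with regular spaces, and collapsing runs of whitespace into a single space. The sketch below illustrates those two steps on a plain string with the Java standard library. It is not StringBean's implementation, whose exact rules may differ. The class name PlainTextWhitespace and the method normalize are hypothetical, and trimming the ends is an extra choice made here.

```java
/** Hypothetical helper illustrating the two whitespace normalizations on a plain string. */
final class PlainTextWhitespace {
    static String normalize(String text) {
        // Step 1: turn each non-breaking space (U+00A0) into a regular space.
        String withRegularSpaces = text.replace('\u00A0', ' ');
        // Step 2: collapse every run of whitespace (spaces, tabs, newlines) into one space.
        // Trimming the ends is an assumption of this sketch.
        return withRegularSpaces.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        System.out.println(normalize("Hello\u00A0\u00A0world  \n again"));
    }
}
```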

Example 3: parseMessage

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser; // import the required package/class
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

/**
 * Parses the body of the message and returns a parsed representation.
 * See <a href="http://htmlparser.sourceforge.net/">http://htmlparser.sourceforge.net/</a> for details.
 *
 * @param url the URL that the message resulted from
 * @param message the Message to parse
 * @return a NodeList containing the Nodes making up the page, or null if the
 *         message is not HTML, has no content, or cannot be parsed
 */
public Object parseMessage(HttpUrl url, Message message) {
    String contentType = message.getHeader("Content-Type");
    if (contentType == null || !contentType.matches("text/html.*")) {
        return null;
    }
    byte[] content = message.getContent();
    if (content == null || content.length == 0) {
        return null;
    }
    // Note: new String(byte[]) decodes with the platform default charset.
    Parser parser = Parser.createParser(new String(content), null);
    try {
        // Accept every node, so the whole parsed document is returned.
        NodeList nodelist = parser.extractAllNodesThatMatch(new NodeFilter() {
            public boolean accept(Node node) {
                return true;
            }
        });
        return nodelist;
    } catch (ParserException pe) {
        _logger.severe(pe.toString());
        return null;
    }
}
 
Developer ID: Neraud, Project: PADListener, Lines of code: 30, Source: HTMLParser.java
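
Example 3 turns the response body into text with new String(content), which decodes using the platform default charset regardless of what the server declared. One possible alternative, sketched below with the Java standard library only, reads the charset parameter from the Content-Type header value and falls back to UTF-8 when none is usable. The class name ContentTypeCharset and its methods are hypothetical, and the UTF-8 fallback is an assumption of this sketch, not something taken from PADListener.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: picks a charset from a Content-Type header value. */
final class ContentTypeCharset {
    // Matches charset=NAME or charset="NAME", case-insensitively.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        Matcher matcher = CHARSET_PARAM.matcher(contentType);
        if (!matcher.find()) {
            return fallback;
        }
        try {
            return Charset.forName(matcher.group(1));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: both exceptions extend IllegalArgumentException.
            return fallback;
        }
    }

    static String decode(byte[] content, String contentType) {
        // UTF-8 as the fallback is an assumption of this sketch.
        return new String(content, charsetOf(contentType, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] body = "<p>caf\u00E9</p>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(decode(body, "text/html; charset=ISO-8859-1"));
    }
}
```

The decoded string could then be handed to Parser.createParser in place of new String(content).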

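Example 4: getGangliaAttribute

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser; // import the required package/class
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.OptionTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.util.SimpleNodeIterator;

public List<String> getGangliaAttribute(String clusterName)
		throws ParserException, MalformedURLException, IOException {
	String url = gangliaMetricUrl.replaceAll(clusterPattern, clusterName);
	Parser parser = new Parser(new URL(url).openConnection());
	// Match the <select> element whose id is "metrics-picker".
	NodeFilter nodeFilter = new AndFilter(new TagNameFilter("select"),
			new HasAttributeFilter("id", "metrics-picker"));
	NodeList nodeList = parser.extractAllNodesThatMatch(nodeFilter);
	SimpleNodeIterator iterator = nodeList.elements();
	List<String> metricList = new ArrayList<String>();
	while (iterator.hasMoreNodes()) {
		Node node = iterator.nextNode();
		if (node.getChildren() == null) {
			continue; // a <select> without children
		}
		SimpleNodeIterator childIterator = node.getChildren().elements();
		while (childIterator.hasMoreNodes()) {
			Node child = childIterator.nextNode();
			// Children can include text nodes (e.g. whitespace), so cast only real options.
			if (child instanceof OptionTag) {
				metricList.add(((OptionTag) child).getOptionText());
			}
		}
	}

	return metricList;
}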
 
Developer ID: Ctrip-DI, Project: Hue-Ctrip-DI, Lines of code: 23, Source: GangliaHttpParser.java

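Example 5: main

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser; // import the required package/class
import org.htmlparser.filters.AndFilter;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.tags.OptionTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.SimpleNodeIterator;

public static void main(String[] args) throws Exception {
	Parser parser = new Parser(new URL("http://10.8.75.3/ganglia/?r=hour&cs=&ce=&s=by+name&c=Zookeeper_Cluster&tab=m&vn=&hide-hf=false").openConnection());
	// Match the <select> element whose id is "metrics-picker".
	NodeFilter nodeFilter = new AndFilter(new TagNameFilter("select"),
			new HasAttributeFilter("id", "metrics-picker"));
	NodeList nodeList = parser.extractAllNodesThatMatch(nodeFilter);
	SimpleNodeIterator iterator = nodeList.elements();
	while (iterator.hasMoreNodes()) {
		Node node = iterator.nextNode();
		if (node.getChildren() == null) {
			continue; // a <select> without children
		}
		SimpleNodeIterator childIterator = node.getChildren().elements();
		while (childIterator.hasMoreNodes()) {
			Node child = childIterator.nextNode();
			// Children can include text nodes (e.g. whitespace), so cast only real options.
			if (child instanceof OptionTag) {
				System.out.println(((OptionTag) child).getOptionText());
			}
		}
	}
}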
 
Developer ID: Ctrip-DI, Project: Hue-Ctrip-DI, Lines of code: 18, Source: TestGangliaHttpParser.java

Example 6: splitHtml

import org.htmlparser.Parser; // import the required package/class
import org.htmlparser.nodes.TagNode;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

private List<String> splitHtml() {
	List<String> resultList = new ArrayList<String>();
	try {
		Parser parser = Parser.createParser(content, "UTF-8");
		NodeList nodeList = parser.parse(null); // null: no filter, parse everything
		resultList = recusiveSplitHtml(nodeList);
		// Build the last page: re-open every tag that starts before startPosition
		// and ends at or after it, then append the remaining content.
		StringBuilder lastPageContent = new StringBuilder();
		for (TagNode tagNode : tagNodeList) {
			if (tagNode.getStartPosition() < startPosition && tagNode.getEndTag().getEndPosition() >= startPosition) {
				lastPageContent.append("<");
				lastPageContent.append(tagNode.getText());
				lastPageContent.append(">");
			}
		}
		lastPageContent.append(content.substring(startPosition));
		// Re-parse the fragment and serialize it back to HTML.
		Parser lastPageContentParser = Parser.createParser(lastPageContent.toString(), "UTF-8");
		NodeList pageContentNodeList = lastPageContentParser.parse(null);
		resultList.add(pageContentNodeList.toHtml());
	} catch (ParserException e) {
		e.printStackTrace();
	}
	return resultList;
}
 
Developer ID: wangko27, Project: SelfSoftShop, Lines of code: 24, Source: Article.java

Example 7: html2text

import org.htmlparser.Parser; // import the required package/class
import org.htmlparser.beans.StringBean;
import org.htmlparser.util.ParserException;

/**
 * Converts an HTML document into plain text.
 * 
 * @param html HTML document
 * @return plain text or <code>null</code> if the conversion failed
 */
public static synchronized String html2text(String html) {
	// convert HTML document
	StringBean sb = new StringBean();
	sb.setLinks(false);  // no links
	sb.setReplaceNonBreakingSpaces(true);  // replace non-breaking spaces
	sb.setCollapse(true);  // collapse sequences of whitespace
	Parser parser = new Parser();
	try {
		parser.setInputHTML(html);
		parser.visitAllNodesWith(sb);
	} catch (ParserException e) {
		return null;
	}
	String docText = sb.getStrings();
	
	if (docText == null) docText = "";  // no content
	
	return docText;
}
 
Developer ID: claritylab, Project: lucida, Lines of code: 26, Source: HTMLConverter.java

Example 8: file2text

import org.htmlparser.Parser; // import the required package/class
import org.htmlparser.beans.StringBean;
import org.htmlparser.util.ParserException;

/**
 * Reads an HTML document from a file and converts it into plain text.
 * 
 * @param filename name of the file containing the HTML document
 * @return plain text or <code>null</code> if the reading or conversion failed
 */
public static synchronized String file2text(String filename) {
	// read from file and convert HTML document
	StringBean sb = new StringBean();
	sb.setLinks(false);  // no links
	sb.setReplaceNonBreakingSpaces(true);  // replace non-breaking spaces
	sb.setCollapse(true);  // collapse sequences of whitespace
	Parser parser = new Parser();
	try {
		parser.setResource(filename);
		parser.visitAllNodesWith(sb);
	} catch (ParserException e) {
		return null;
	}
	String docText = sb.getStrings();
	
	return docText;
}
 
Developer ID: claritylab, Project: lucida, Lines of code: 24, Source: HTMLConverter.java

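Example 9: run

import org.htmlparser.Node;
import org.htmlparser.Parser; // import the required package/class
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.SimpleNodeIterator;

@Override
public void run() {
	try {
		parser = new Parser(content);
		logger.info(currentThread().getName() + " started parsing the HTML of the POST response and storing it in HBase");
		NodeIterator rootList = parser.elements();
		// Skip the first top-level node and take the children of the second one.
		rootList.nextNode();
		NodeList nodeList = rootList.nextNode().getChildren();
		// System.out.println("===================="+nodeList.size());
		/*
		 * Check whether the HTML response has real content. This triggers on an
		 * error or once all data has been read; when it does, set endFlag so that
		 * no new threads are started and the current task ends.
		 */
		if (nodeList.size() <= 4) {
			program.endFlag = true;
		}
		/*
		 * Drop the first two children, then locate the tag records and parse them.
		 */
		nodeList.remove(0);
		nodeList.remove(0);
		SimpleNodeIterator childList = nodeList.elements();
		while (childList.hasMoreNodes()) {
			Node node = childList.nextNode();
			if (node.getChildren() != null) {
				toObject(node);
			}
		}
	} catch (Exception e) {
		logger.error(currentThread().getName() + " failed while parsing the HTML\n" + e.getMessage() + "\n");
	} finally {
		logger.info(currentThread().getName() + " finished parsing the HTML");
		store.close();
	}
}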
 
Developer ID: husky00, Project: worm, Lines of code: 36, Source: PostRequestHtmlParser.java

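Example 10: parsePageInfo

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser; // import the required package/class
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.nodes.TextNode;
import org.htmlparser.tags.Div;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

/**
 * Parses the number of pages for a residential community.
 *
 * @param url the URL of the listing page
 * @return the total page count, or 0 if it cannot be determined
 * @throws IOException
 * @throws ParserException
 */
private int parsePageInfo(final String url) throws IOException, ParserException {
    Parser parser = new Parser(CommonHttpURLConnection.getURLConnection(url));

    NodeFilter nodeFilter = new HasAttributeFilter("class", "pagenumber");
    NodeList nodeList = parser.extractAllNodesThatMatch(nodeFilter);
    for (Node node : nodeList.toNodeArray()) {
        if (!(node instanceof Div)) {
            continue;
        }
        for (Node innerNode : node.getChildren().elementAt(1).getChildren().toNodeArray()) {
            if (!(innerNode instanceof TextNode)) {
                continue;
            }
            // The page indicator has the form "current/total"; keep the part after the slash.
            String pageStr = innerNode.toPlainTextString();
            if (!pageStr.contains("/")) {
                continue;
            }
            pageStr = pageStr.substring(pageStr.indexOf("/") + 1);
            try {
                return Integer.parseInt(pageStr);
            } catch (NumberFormatException e) {
                // Not a number; try the next text node.
            }
        }
    }
    return 0;
}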
 
Developer ID: deanjin, Project: houseHunter, Lines of code: 36, Source: DepartmentParser.java
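
Example 10 extracts the total page count from a text of the form "current/total" by taking everything after the slash and passing it to Integer.parseInt, which fails if any surrounding whitespace or trailing label remains. The standard-library sketch below isolates that step and makes it somewhat more forgiving. The class name PageCountText and the method totalPages are hypothetical. Returning OptionalInt instead of 0, and accepting trailing text after the digits, are choices made for this sketch and not behavior of the houseHunter project.

```java
import java.util.OptionalInt;

/** Hypothetical helper: extracts the total from a "current/total" page indicator. */
final class PageCountText {
    static OptionalInt totalPages(String text) {
        if (text == null) {
            return OptionalInt.empty();
        }
        int slash = text.indexOf('/');
        if (slash < 0) {
            return OptionalInt.empty();
        }
        String afterSlash = text.substring(slash + 1).trim();
        // Keep only the leading run of digits so that a trailing label does not break parsing.
        int end = 0;
        while (end < afterSlash.length() && Character.isDigit(afterSlash.charAt(end))) {
            end++;
        }
        if (end == 0) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(afterSlash.substring(0, end)));
        } catch (NumberFormatException e) {
            // For example, more digits than fit in an int.
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(totalPages("1/15"));
        System.out.println(totalPages("no page indicator here"));
    }
}
```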

Example 11: run

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser; // import the required package/class
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.TableRow;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

/**
 * Crawls the most recent pre-sale permit information from the 透明網 site.
 *
 * @param url the URL of the page listing the pre-sale permits
 * @throws InterruptedException
 * @throws IOException
 * @throws ParserException
 */
public void run(String url) throws InterruptedException, IOException, ParserException {

    URLConnection urlConnection = CommonHttpURLConnection.getURLConnection(url);
    Parser parser = new Parser(urlConnection);
    NodeFilter nodeFilter = new HasAttributeFilter("class", "sale1");
    NodeList nodeList = parser.extractAllNodesThatMatch(nodeFilter);

    if (nodeList.size() > 0) {
        Node[] sellCreditNodeArray = nodeList.elementAt(0).getChildren().toNodeArray();
        // Start at child index 2 and handle only table rows.
        for (int i = 2; i < sellCreditNodeArray.length; i++) {
            if (sellCreditNodeArray[i] instanceof TableRow) {
                SellCreditInfo sellCreditInfo = parseSellParser(sellCreditNodeArray[i]);
                log.info("get sell credit info:{}", sellCreditInfo);
                // Stop once a pre-sale permit that was already crawled is reached
                HouseInfo houseInfo = dataOP.getHouseInfoByDepartmentNameAndSellCredit(sellCreditInfo);
                if (houseInfo != null) {
                    log.info("already parsing sell credit:{}", sellCreditInfo);
                    break;
                }
                dataOP.insertSellCreditInfo(sellCreditInfo);
                if (i == 2) continue;
                parseHouseInfo(sellCreditInfo);
            }
        }
    }
}
 
Developer ID: deanjin, Project: houseHunter, Lines of code: 34, Source: SellCreditParser.java

Example 12: parseDailyBriefInfo

import org.htmlparser.NodeFilter;
import org.htmlparser.Parser; // import the required package/class
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public List<DailyBriefInfo> parseDailyBriefInfo() throws IOException, ParserException {

    Parser parser = new Parser(CommonHttpURLConnection.getURLConnection("http://www.tmsf.com/index.jsp"));
    NodeFilter nodeFilter = new HasAttributeFilter("id", "myCont5");
    NodeList nodeList = parser.extractAllNodesThatMatch(nodeFilter);
    if (nodeList.size() == 0) {
        return Collections.emptyList();
    }

    List<DailyBriefInfo> dailyBriefInfoList = new ArrayList<>();

    // Hours since 1970-01-01 00:00:00 UTC
    int parseHour = (int) (Clock.systemUTC().millis() / (1000 * 3600));

    // Days since 1970-01-01 00:00:00 UTC
    int parseDay = parseHour / 24;

    NodeList infoNodeList = nodeList.elementAt(0).getChildren().elementAt(1)
            .getChildren().elementAt(1).getChildren();

    // The rows of interest are children 5, 7, ..., 13; within each row the
    // values are children 1, 3, 5 and 7.
    for (int i = 5; i <= 13; i = i + 2) {
        NodeList cells = infoNodeList.elementAt(i).getChildren();
        DailyBriefInfo dailyBriefInfo = new DailyBriefInfo(
                CharMatcher.WHITESPACE.trimFrom(cells.elementAt(1).toPlainTextString()),
                Integer.parseInt(CharMatcher.WHITESPACE.trimFrom(cells.elementAt(3).toPlainTextString())),
                Integer.parseInt(CharMatcher.WHITESPACE.trimFrom(cells.elementAt(5).toPlainTextString())),
                Integer.parseInt(CharMatcher.WHITESPACE.trimFrom(cells.elementAt(7).toPlainTextString())),
                parseDay, parseHour);

        dailyBriefInfoList.add(dailyBriefInfo);
        dataOP.insertBriefDealInfo(dailyBriefInfo);

        ESOP.writeToES("log/daily_brief_info_es", JSONObject.toJSONString(dailyBriefInfo));
    }

    return dailyBriefInfoList;

}
 
Developer ID: deanjin, Project: houseHunter, Lines of code: 37, Source: DailyDealParser.java
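
Example 12 derives the number of hours and days since 1970-01-01 00:00:00 UTC by dividing the current millisecond timestamp by hand. As a minimal sketch, the same two values can be computed with java.time, which avoids the manual constants. The class name EpochBuckets and its method names are hypothetical; the results are returned as long rather than cast to int as in the example.

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

/** Hypothetical helper: whole hours and days elapsed since the Unix epoch. */
final class EpochBuckets {
    static long hoursSinceEpoch(Instant instant) {
        // Whole hours between 1970-01-01T00:00:00Z and the given instant.
        return ChronoUnit.HOURS.between(Instant.EPOCH, instant);
    }

    static long daysSinceEpoch(Instant instant) {
        // Whole 24-hour days between 1970-01-01T00:00:00Z and the given instant.
        return ChronoUnit.DAYS.between(Instant.EPOCH, instant);
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        System.out.println("hours=" + hoursSinceEpoch(now) + ", days=" + daysSinceEpoch(now));
    }
}
```

Passing the instant in as a parameter, rather than reading the clock inside the method, keeps the calculation easy to check with fixed inputs.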

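Example 13: parsePageInfo

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser; // import the required package/class
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.tags.Span;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

/**
 * Crawls the number of pages for the current building.
 *
 * @param url the URL of the page to fetch
 * @param departmentInfo the department being crawled (used here only for logging)
 * @return the total page count, or 0 if it cannot be determined
 * @throws ParserException
 * @throws IOException
 */
public int parsePageInfo(String url, DepartmentInfo departmentInfo) throws ParserException, IOException {

    Parser parser = new Parser(CommonHttpURLConnection.getURLConnection(url));

    int page = 0;
    // Parse the page count
    NodeFilter nodeFilter = new HasAttributeFilter("class", "spagenext");
    NodeList nodeList = parser.extractAllNodesThatMatch(nodeFilter);
    if (nodeList.size() == 0) {
        return page;
    }

    for (Node pageNode : nodeList.elementAt(0).getChildren().toNodeArray()) {
        if (pageNode instanceof Span) {
            try {
                // The number sits between "/" and the literal label matched below.
                String tmp = pageNode.toPlainTextString();
                page = Integer.parseInt(tmp.substring(tmp.indexOf("/") + 1, tmp.indexOf("總數") - 1).trim());
                break;
            } catch (Exception e) {
                // Text is not in the expected format; try the next span.
            }
        }
    }

    log.info("get total page [{}] for department:[{}]", page, departmentInfo.toString());

    return page;
}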
 
Developer ID: deanjin, Project: houseHunter, Lines of code: 36, Source: HouseParser.java

Example 14: PostCleaner

import org.htmlparser.Parser; // import the required package/class
import org.htmlparser.util.ParserException;

public PostCleaner(String html, int minCodeChars, boolean excludeCode) {
  try {
    Parser htmlParser = Parser.createParser(html, "utf8");

    // Walk all nodes with the visitor, which accumulates the cleaned text.
    PostCleanerVisitor res = new PostCleanerVisitor(minCodeChars, excludeCode);
    htmlParser.visitAllNodesWith(res);
    mText = res.getText();
  } catch (ParserException e) {
    System.err.println(" Parser exception: " + e + " trying simple conversion");
    // Plan B: fall back to the simple conversion
    mText = PostCleanerVisitor.simpleProc(html);
  }
}
 
Developer ID: oaqa, Project: knn4qa, Lines of code: 14, Source: ConvertStackOverflow.java

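Example 15: extractKeyWordText

import org.htmlparser.Parser; // import the required package/class
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public static void extractKeyWordText(String url, String keyword) {
	try {
		// Create a parser, passing the page URL as the argument
		Parser parser = new Parser(url);
		// Set the page encoding (the original example requested a gb2312-encoded page)
		parser.setEncoding("utf-8");// gb2312
		// Parse all nodes; null means no NodeFilter is applied
		NodeList list = parser.parse(null);
		// Walk every node, starting from the initial node list
		processNodeList(list, keyword);
	} catch (ParserException e) {
		e.printStackTrace();
	}
}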
 
Developer ID: YufangWoo, Project: news-crawler, Lines of code: 15, Source: HtmlParserTest.java


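Note: The org.htmlparser.Parser examples in this article were compiled by 純淨天空 from open-source code and documentation platforms such as GitHub and MSDocs. The snippets were selected from open-source projects, and copyright in the source code belongs to the original authors. For distribution and use, refer to the license of the corresponding project. Do not reproduce without permission.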