Python Parser.parseFile Method Code Examples

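This article collects typical usage examples of the Parser.Parser.parseFile method in Python. If you are wondering how Parser.parseFile is used in practice, or are looking for working examples of it, the selected code examples below may help. You can also look further into usage examples of Parser.Parser, the class this method belongs to.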


Two code examples of the Parser.parseFile method are shown below, sorted by popularity by default.

Example 1: _decode

# Required import: from Parser import Parser [as alias]
# Or: from Parser.Parser import parseFile [as alias]
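def _decode(self, shelf):
    root = Registry("root")
    # The parser is constructed with the root registry
    parser = Parser(root)
    # Parse the file named by shelf.name
    parser.parseFile(shelf.name)
    # Store the registry on the shelf and mark the shelf as frozen
    shelf['inventory'] = root
    shelf._frozen = True
    return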
Developer: geodynamics, Project: pythia, Lines of code: 9, Source file: CodecConfigSheet.py

Example 2: SearchEngine

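# Required import: from Parser import Parser [as alias]
# Or: from Parser.Parser import parseFile [as alias]
# Standard-library imports used by the excerpt below
import os
import re
from math import log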
class SearchEngine(object):
    def __init__(self):
        """
        Constructor method.
        """
        self.invertedIndex = dict()
        self.documents = dict()
        stopWordsPath = "sw.txt"
        self.parser = Parser(stopWordsPath)

    def calculateWeights(self):
        """
        Calculate the idf and tf-idf weights using the self.invertedIndex dict.

        tf = term frequency of the word in the document.
        idf = log_2(N/n), where N is the number of documents in the
        collection, and n is the number of documents in which the word appears.
        tf-idf = tf * idf. It's the weight of a word in a document.

        This method expects the inverted index to contain the frequencies of
        the words, and does not check whether the current values are in fact
        frequencies. The caller must make sure the index really holds
        frequencies before calling this method, at the risk of getting wrong
        weights and losing the previous data.

        return: None
        """
        # N is the number of documents in the collection
        N = len(self.documents)
        for word in self.invertedIndex:
            idf, lst = self.invertedIndex[word]
            # n is the number of documents in which the word appeared
            n = len(lst)
            #print("word: {}  N: {}  n: {}".format(word, N, n))
            # float() avoids integer division of N / n under Python 2
            idf = log(float(N) / n, 2) # idf of the word
            # now calculate the weight for each pair document, frequency
            for iii, pair in enumerate(lst):
                docID, freq = pair
                weight = idf * freq
                lst[iii] = (docID, weight)
            self.invertedIndex[word] = (idf, lst)

    def calculateDocNorms(self):
        """
        Calculate the lengths/norms of the document vectors, using the weights
        in the inverted index, and place them in the self.documents dict, in
        the Document.norm field.

        It calculates the norm based on the current weights in the inverted
        index. This method does not attempt to check if the weights used are
        valid or not.

        return: None
        """
        # sum the squares of the weights of each component of each document
        # vector
        for word, pair in self.invertedIndex.items():
            idf, lst = pair
            for docId, weight in lst:
                doc = self.documents[docId]
                subTotal = doc.norm
                subTotal += weight ** 2
                doc = doc._replace(norm=subTotal)

                # place the result in the self.documents dict
                self.documents[docId] = doc

        # now that we have the sum of the squares, we take the square root
        # to get the norm
        for docId in list(self.documents):
            doc = self.documents[docId]
            doc = doc._replace(norm=doc.norm ** 0.5)
            self.documents[docId] = doc

    def createIndex(self, folderPath, regex=r"^cf\d{2}$", tfidf=True):
        """
        Creates the inverted index based on the files in folderPath that
        match the regex.

        The parseFile method is collection specific, and could be overridden.
        In this case it's using the CFC collection. If a file matches the
        regex but is either a folder or cannot be opened, it's ignored.

        param folderPath: string containing the path to the folder with the
        collection.
        param regex: string containing a regex to match the files in the folder
        that will be parsed. Defaults to a regex for the CFC collection.
        param tfidf: bool value, if it's True, calculate the idf for the words
        in the self.invertedIndex dict, and the tf-idf weights for the words
        in the documents. Defaults to True.
        return: None.
        """
        print("Creating index using the files in the folder: {}".format(folderPath))

        # regex to match the files of the collection.
        validFile = re.compile(regex)
        # list the files and folders in the folderPath variable, and the ones
        # that match the regex are parsed
        for fileName in os.listdir(folderPath):
            if validFile.match(fileName):
#.........part of the code is omitted here.........
Developer: jpaulofb, Project: cfc_search_engine_tri, Lines of code: 103, Source file: SearchEngine.py
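
The weighting described in the calculateWeights and calculateDocNorms docstrings can be tried out in isolation. The following is only a minimal Python 3 sketch, not the project's code: the function names tf_idf_weights and doc_norms, the simplified index layout (word mapped to a list of (docID, frequency) pairs) and the sample data are all assumptions made for illustration.

```python
from math import log2, sqrt


def tf_idf_weights(inverted_index, num_docs):
    """Turn raw term frequencies into tf-idf weights.

    inverted_index (assumed layout): {word: [(doc_id, frequency), ...]}
    num_docs: N, the number of documents in the collection.
    Returns {word: (idf, [(doc_id, weight), ...])}.
    """
    weighted = {}
    for word, postings in inverted_index.items():
        # n = number of documents in which the word appears
        idf = log2(num_docs / len(postings))
        weighted[word] = (idf, [(doc_id, tf * idf) for doc_id, tf in postings])
    return weighted


def doc_norms(weighted):
    """Return {doc_id: Euclidean norm of that document's weight vector}."""
    sums = {}
    for idf, postings in weighted.values():
        for doc_id, weight in postings:
            # accumulate the sum of squared weights per document
            sums[doc_id] = sums.get(doc_id, 0.0) + weight ** 2
    return {doc_id: sqrt(total) for doc_id, total in sums.items()}


# Hypothetical sample data: 4 documents, 3 words.
index = {
    "cystic": [("d1", 2), ("d2", 1)],
    "fibrosis": [("d1", 1)],
    "lung": [("d1", 1), ("d2", 3), ("d3", 1), ("d4", 2)],
}
weights = tf_idf_weights(index, num_docs=4)
norms = doc_norms(weights)
print(weights["cystic"])
print(norms)
```

A word that appears in every document, such as "lung" in the sample data, gets idf = log_2(4/4) = 0 and therefore contributes nothing to any document norm.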

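The createIndex excerpt stops right after the file name is matched against the regex, and the omitted part is not reconstructed here. As a rough, self-contained illustration of just the selection step its docstring describes (keep names that match the regex, ignore folders), a sketch might look like the following. The helper name matching_files and the temporary sample files are hypothetical, and the "cannot be opened" case mentioned in the docstring is not handled.

```python
import os
import re
import tempfile


def matching_files(folder_path, pattern=r"^cf\d{2}$"):
    """Return the paths of regular files in folder_path whose names match pattern."""
    valid_file = re.compile(pattern)
    paths = []
    for name in sorted(os.listdir(folder_path)):
        path = os.path.join(folder_path, name)
        # skip names that do not match, and skip folders even if they match
        if valid_file.match(name) and os.path.isfile(path):
            paths.append(path)
    return paths


# Build a throwaway folder with hypothetical file names to try it out.
with tempfile.TemporaryDirectory() as folder:
    for name in ("cf74", "cf75", "cfquery", "notes.txt"):
        with open(os.path.join(folder, name), "w") as handle:
            handle.write("")
    os.mkdir(os.path.join(folder, "cf99"))  # matches the regex but is a folder
    found = [os.path.basename(p) for p in matching_files(folder)]

print(found)
```

Each returned path could then be handed to a parser's parseFile method, as the original excerpt presumably does in its omitted part.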

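Note: The Parser.Parser.parseFile method examples in this article were compiled by 纯净天空 from open-source code and documentation platforms such as GitHub and MSDocs. The code snippets were selected from open-source projects contributed by their developers, and copyright of the source code belongs to the original authors. For distribution and use, refer to the License of the corresponding project. Do not reproduce without permission.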