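Apache log files can be huge and hard to read.
Here is one way to get a list of the most visited pages (or files) from an Apache log file.
In this example, we only need the URLs from GET requests. The implementation uses the powerful Counter class from Python's collections module.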
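As a quick illustration of how Counter behaves, here is a minimal sketch. The URLs in the list are made up for the example and do not come from any real log.

```python
import collections

# Hypothetical URLs, invented for illustration only
urls = [
    "/index.html",
    "/about.html",
    "/index.html",
    "/index.html",
    "/about.html",
    "/contact.html",
]

# Counter tallies how many times each distinct item occurs
counter = collections.Counter(urls)

# most_common(n) returns the n most frequent items as (item, count) pairs
top_two = counter.most_common(2)
print(top_two)
```

Each pair is (item, count), so the count is the second element of the tuple.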
import collections

clean_log = []

# The with statement closes the file automatically, even if an error occurs
with open("yourlogfile.log", "r") as logfile:
    for line in logfile:
        try:
            # Copy the URL to the list.
            # We take the part between GET and HTTP
            clean_log.append(line[line.index("GET") + 4:line.index("HTTP")])
        except ValueError:
            # str.index raises ValueError when GET or HTTP is missing;
            # such lines are skipped
            pass

counter = collections.Counter(clean_log)

# Print the top 50 most popular URLs, each preceded by its hit count
for count in counter.most_common(50):
    print(str(count[1]) + " " + str(count[0]))
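Searching for the plain substrings GET and HTTP can misfire when those letters appear elsewhere in a line, for example inside a URL or a referrer. One possible variant, sketched below, matches the quoted request line with a regular expression instead. It assumes the request is logged as "GET /path HTTP/x.y" inside double quotes, as in Apache's common and combined log formats. The function name top_urls, the pattern name REQUEST_RE, and the sample lines are hypothetical and exist only for this example.

```python
import collections
import re

# Assumption: the request line looks like "GET /path HTTP/1.1" in double quotes
REQUEST_RE = re.compile(r'"GET (\S+) HTTP/[\d.]+"')


def top_urls(lines, n=50):
    """Return the n most requested GET URLs as (url, hits) pairs."""
    counter = collections.Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            # group(1) is the URL captured between GET and HTTP
            counter[match.group(1)] += 1
    return counter.most_common(n)


# Made-up sample lines, for illustration only
sample = [
    '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '127.0.0.1 - - [10/Oct/2023:13:55:37 +0000] "GET /about.html HTTP/1.1" 200 512',
    '127.0.0.1 - - [10/Oct/2023:13:55:38 +0000] "GET /index.html HTTP/1.1" 200 2326',
    '127.0.0.1 - - [10/Oct/2023:13:55:39 +0000] "POST /login HTTP/1.1" 302 0',
]

for url, hits in top_urls(sample):
    print(hits, url)
```

Because top_urls accepts any iterable of lines, an open file object can be passed to it directly, so the log is counted line by line without building an intermediate list.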