
Notes on Pitfalls of Running an Asynchronous News Crawler on Windows 10

Author: Jarvan  |  Category: Python  |  Published: 2020-07-12 14:58

Source articles for the scripts:

Large-Scale Asynchronous News Crawler: Implementing a Powerful, Easy-to-Use URL Pool

Large-Scale Asynchronous News Crawler: Implementing a Synchronous Targeted News Crawler

0) The Python version I am using is 3.6.5. On Windows 10, leveldb cannot be installed directly with pip; it has to be compiled. I used a prebuilt leveldb.pyd found online, which needs to be placed in the site-packages directory of the Python installation; in my case the path is C:\Python36\Lib\site-packages.

Download link for a leveldb.pyd built for Python 3.6:

Link: https://pan.baidu.com/s/1xCRKYQl_rthQlTvH4BePVg
Extraction code: id68
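
After dropping leveldb.pyd into site-packages, a quick import test confirms that the module actually loads. This is only a sanity-check sketch, assuming the .pyd exposes the usual py-leveldb API (LevelDB / Put / Get); the database directory name here is arbitrary:

import leveldb

db = leveldb.LevelDB('./test_leveldb')   # creates the directory if it does not exist
db.Put(b'hello', b'world')
print(db.Get(b'hello'))                  # b'world' means the .pyd loaded correctly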

1) The lzma compression module used in the original code could not be found when installing with pip on Windows 10, so the similar zlib module is used as a replacement.
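
The swap works because the crawler only needs a lossless compress/decompress round trip for the stored HTML, which zlib provides just as lzma did. A minimal sketch of that round trip (the sample string is made up):

import zlib

html = '<html>sample news page</html>'.encode('utf8')
html_zlib = zlib.compress(html)             # this is what goes into the blob column
assert zlib.decompress(html_zlib) == html   # lossless round trip
print(len(html), '->', len(html_zlib))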

2) Because zlib replaces lzma, the original table-creation SQL also needs some adjustments:

CREATE TABLE `crawler_hub` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `url` varchar(64) NOT NULL,
  `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  UNIQUE KEY `url` (`url`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8

The crawler_hub table above is unchanged, but some of the names used when creating crawler_html have changed:

CREATE TABLE `crawler_html` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `urlhash` bigint(20) unsigned NOT NULL COMMENT 'farmhash',
  `url` varchar(512) NOT NULL,
  `html_zlib` longblob NOT NULL,
  `created_at` timestamp NULL DEFAULT CURRENT_TIMESTAMP,
  PRIMARY KEY (`id`),
  UNIQUE KEY `urlhash` (`urlhash`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
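
For context, the urlhash column is bigint(20) unsigned because farmhash.hash64 (from the pyfarmhash package, imported as farmhash in the main program below) maps a URL to an unsigned 64-bit integer that serves as a compact unique key. A tiny illustration with a made-up URL:

import farmhash

# prints an unsigned 64-bit integer, which fits MySQL bigint unsigned
print(farmhash.hash64('http://news.example.com/a/123.html'))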

3) The main program of the asynchronous news crawler also changes (the parts affected by replacing lzma with zlib):

#!/usr/bin/env python3
# Author: veelion

import urllib.parse as urlparse
import zlib
import farmhash
import traceback


from ezpymysql import Connection
from urlpool import UrlPool
import functions as fn
import config

class NewsCrawlerSync:
    def __init__(self, name):
        self.db = Connection(
            config.db_host,
            config.db_db,
            config.db_user,
            config.db_password
        )
        self.logger = fn.init_file_logger(name + '.log')
        self.urlpool = UrlPool(name)
        self.hub_hosts = None
        self.load_hubs()

    def load_hubs(self,):
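        # Load hub page URLs from the crawler_hub table and register them with the URL pool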
        sql = 'select url from crawler_hub'
        data = self.db.query(sql)
        self.hub_hosts = set()
        hubs = []
        for d in data:
            host = urlparse.urlparse(d['url']).netloc
            self.hub_hosts.add(host)
            hubs.append(d['url'])
        self.urlpool.set_hubs(hubs, 300)

    def save_to_db(self, url, html):
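        # Deduplicate by farmhash64 of the URL, zlib-compress the HTML and insert it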
        urlhash = farmhash.hash64(url)
        sql = 'select url from crawler_html where urlhash=%s'
        d = self.db.get(sql, urlhash)
        if d:
            if d['url'] != url:
                msg = 'farmhash collision: %s <=> %s' % (url, d['url'])
                self.logger.error(msg)
            return True
        if isinstance(html, str):
            html = html.encode('utf8')
        html_zlib = zlib.compress(html)
        sql = ('insert into crawler_html(urlhash, url, html_zlib) '
               'values(%s, %s, _binary %s)')
        good = False
        try:
            self.db.execute(sql, urlhash, url, html_zlib)
            good = True
        except Exception as e:
            if e.args[0] == 1062:
                # Duplicate entry
                good = True
                pass
            else:
                traceback.print_exc()
                raise e
        return good

    def filter_good(self, urls):
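        # Keep only links whose host belongs to one of the hub sites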
        goodlinks = []
        for url in urls:
            host = urlparse.urlparse(url).netloc
            if host in self.hub_hosts:
                goodlinks.append(url)
        return goodlinks

    def process(self, url, ishub):
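        # Download one URL: hub pages yield new links, news pages are saved to the DB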
        status, html, redirected_url = fn.downloader(url)
        self.urlpool.set_status(url, status)
        if redirected_url != url:
            self.urlpool.set_status(redirected_url, status)
        # Extract links from hub pages; news pages may also contain "related news" links, extract them as needed
        if status != 200:
            return
        if ishub:
            newlinks = fn.extract_links_re(redirected_url, html)
            goodlinks = self.filter_good(newlinks)
            print("%s/%s, goodlinks/newlinks" % (len(goodlinks), len(newlinks)))
            self.urlpool.addmany(goodlinks)
        else:
            self.save_to_db(redirected_url, html)

    def run(self,):
        while 1:
            urls = self.urlpool.pop(5)
            for url, ishub in urls.items():
                self.process(url, ishub)


if __name__ == '__main__':
    crawler = NewsCrawlerSync('yuanrenxue')
    crawler.run()

4) The database connection parameters used by the NewsCrawlerSync class in the main program above go into a separate config.py file that holds the values the class reads. For a local database, for example:

db_host = '127.0.0.1'
db_db = 'yuanrenxue'
db_user = 'root'
db_password = 'root'

5) When pymysql is used to connect to MySQL and the inserted row contains a blob column (the zlib-compressed page is stored as a blob), the insert raises Warning: (1300, "Invalid utf8mb4 character string: 'F9876A'"). The fix is simply to add the _binary prefix in front of the blob placeholder in the SQL statement:

sql = ('insert into crawler_html(urlhash, url, html_zlib) '
       'values(%s, %s, _binary %s)')
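
For reference, the same fix written against raw pymysql instead of the ezpymysql wrapper (a minimal sketch with made-up connection parameters and sample data; it assumes the crawler_html table from step 2 exists):

import zlib
import pymysql

conn = pymysql.connect(host='127.0.0.1', user='root', password='root',
                       database='yuanrenxue', charset='utf8')
html_zlib = zlib.compress(b'<html>sample page</html>')
sql = ('insert into crawler_html(urlhash, url, html_zlib) '
       'values(%s, %s, _binary %s)')   # _binary suppresses the 1300 utf8mb4 warning
with conn.cursor() as cur:
    cur.execute(sql, (12345, 'http://example.com/news/1.html', html_zlib))
conn.commit()
conn.close()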

Main reference for this post: Large-Scale Asynchronous News Crawler: Implementing a Synchronous Targeted News Crawler