Web Crawlers

The darc.crawl module provides two types of crawlers.

crawler() – crawler powered by requests
loader() – crawler powered by selenium

darc.crawl.crawler(url)

Single requests crawler for an entry link.

Parameters: url (str) – URL to be crawled by requests.

The function first parses the URL using parse_link() and checks whether the URL needs to be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, it crawls the URL with requests.
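The white list / black list check can be illustrated with a minimal sketch. This is not darc's implementation: the helper should_crawl(), the example patterns, and the precedence rule (white list wins, unmatched hosts are allowed) are all assumptions made for illustration, and the proxy lists are left out.

```python
import re
from urllib.parse import urlparse

# Hypothetical stand-ins for LINK_WHITE_LIST / LINK_BLACK_LIST;
# the patterns are illustrative, not darc's defaults.
LINK_WHITE_LIST = [re.compile(r".*\.onion")]
LINK_BLACK_LIST = [re.compile(r"(.*\.)?example\.com")]


def should_crawl(url, white_list=LINK_WHITE_LIST, black_list=LINK_BLACK_LIST):
    """Decide whether a URL should be crawled, based on its hostname.

    Assumed policy: a white list match always allows the URL, a black
    list match rejects it, and anything else is allowed.
    """
    host = urlparse(url).hostname or ""
    if any(pattern.fullmatch(host) for pattern in white_list):
        return True
    if any(pattern.fullmatch(host) for pattern in black_list):
        return False
    return True
```

Under these assumptions, should_crawl("https://www.example.com/page") is rejected while an unlisted host is accepted; a stricter deployment might instead reject everything that is not white-listed.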

If the URL belongs to a brand-new host, darc first tries to fetch and save the host's robots.txt and sitemaps (c.f. save_robots() and save_sitemap()), then extracts the links from the sitemaps (c.f. read_sitemap()) and saves them to the link database for future crawling (c.f. save_requests()). In addition, if the submission API is provided, submit_new_host() is called to submit the documents just fetched.
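The new-host step can be sketched with the standard library alone. The helpers below (first_visit() and sitemaps_from_robots()) are hypothetical and do not reproduce save_robots(), save_sitemap() or read_sitemap(); they only show how a crawler might detect a host it has not seen before and pull sitemap URLs out of an already fetched robots.txt. RobotFileParser.site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory record of hosts already seen; a real crawler
# would keep this in persistent storage.
_seen_hosts = set()


def first_visit(host):
    """Return True only the first time a given host is seen."""
    if host in _seen_hosts:
        return False
    _seen_hosts.add(host)
    return True


def sitemaps_from_robots(robots_text):
    """Return the sitemap URLs declared in the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    # site_maps() returns None when no Sitemap entries are present.
    return parser.site_maps() or []
```

A caller would typically run sitemaps_from_robots() only when first_visit() returns True, then queue the links found in each sitemap for later crawling.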
