Web Crawlers

The darc.crawl module provides two types of crawlers.

darc.crawl.crawler(link)[source]

Single requests crawler for an entry link.

Parameters

link (Link) – URL to be crawled by requests.

Return type

None

The function will first parse the URL using parse_link(), and check whether the URL should be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, it will then crawl the URL with requests.
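The filtering step can be sketched as below. The list values here are illustrative placeholders, not darc's actual configuration; in darc the white/black lists are regular expressions loaded from the environment, and this helper only approximates how such lists are typically consulted (blacklist first, then whitelist):

```python
import re

# Hypothetical stand-ins for LINK_WHITE_LIST / LINK_BLACK_LIST;
# darc loads the real patterns from its environment configuration.
LINK_WHITE_LIST = [re.compile(r'example\.com')]
LINK_BLACK_LIST = [re.compile(r'tracker\.example\.com')]

def should_skip(url: str) -> bool:
    """Return True if the URL should be skipped (i.e. not crawled)."""
    # Blacklisted URLs are always skipped.
    if any(pattern.search(url) for pattern in LINK_BLACK_LIST):
        return True
    # With a non-empty whitelist, everything not matching it is skipped.
    if LINK_WHITE_LIST and not any(pattern.search(url) for pattern in LINK_WHITE_LIST):
        return True
    return False

print(should_skip('https://example.com/page'))        # False -- crawl it
print(should_skip('https://tracker.example.com/ad'))  # True  -- blacklisted
```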

If the URL is from a brand new host, darc will first try to fetch and save the robots.txt and sitemaps of the host (c.f. save_robots() and save_sitemap()), then extract the links from the sitemaps (c.f. read_sitemap()) and save them into the link database for future crawling (c.f. save_requests()).
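The new-host bootstrap described above can be sketched as follows. This is an assumption-laden outline, not darc's real implementation: is_new_host is a hypothetical predicate standing in for the hostname-database check, and the other helpers are injected stand-ins named after the darc functions they mimic:

```python
def process_new_host(link, *, is_new_host, save_robots, save_sitemap,
                     read_sitemap, save_requests):
    """Bootstrap a newly seen host: fetch robots.txt and the sitemaps,
    then queue the sitemap links for future crawling."""
    if not is_new_host(link):
        return  # host already recorded, nothing to bootstrap
    save_robots(link)                         # fetch and save robots.txt
    for sitemap in save_sitemap(link):        # fetch and save each sitemap
        save_requests(read_sitemap(sitemap))  # queue extracted links
```

The dependency-injection style here is purely for illustration; it makes the control flow testable without a database.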

Note

A host is considered new if have_hostname() returns False, i.e. the hostname has not yet been recorded in the hostname database.

If darc.proxy.null.fetch_sitemap() and/or darc.proxy.i2p.fetch_hosts() fails when fetching such documents, the host will be removed from the hostname database through drop_hostname(), and considered new at the next encounter.

Also, if the submission API is provided, submit_new_host() will be called to submit the documents just fetched.

If a robots.txt is present and FORCE is False, darc will check whether it is allowed to crawl the URL.

Note

The root path (e.g. / in https://www.example.com/) will always be crawled, regardless of robots.txt.
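The robots.txt check with the root-path exception can be sketched with the standard library's urllib.robotparser. The function name and the exact order of the FORCE and root-path short-circuits are assumptions based on this description, not darc's code:

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

def can_crawl(url: str, robots_text: str, force: bool = False) -> bool:
    """Check robots.txt, always allowing the root path and honouring FORCE."""
    parts = urlsplit(url)
    if force or parts.path in ('', '/'):
        return True  # root path is always crawled, robots.txt notwithstanding
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch('*', url)

robots = "User-agent: *\nDisallow: /private/\n"
print(can_crawl('https://www.example.com/', robots))           # True (root path)
print(can_crawl('https://www.example.com/private/x', robots))  # False
```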

At this point, darc will call the customised hook function from darc.sites to crawl the URL and obtain the final response object. darc will then save the session cookies and header information using save_headers().

Note

If requests.exceptions.InvalidSchema is raised, the link will be saved by save_invalid(). Further processing will be skipped, and the link will be removed from the requests database through drop_requests().

If LinkNoReturn is raised, the link will be removed from the requests database through drop_requests().

If the content type of the response document is not ignored (c.f. MIME_WHITE_LIST and MIME_BLACK_LIST), submit_requests() will be called to submit the document just fetched.

If the response document is HTML (text/html or application/xhtml+xml), extract_links() will then be called to extract all possible links from the HTML document and save them into the link database (c.f. save_requests()).
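Link extraction of this kind can be sketched with the standard library's html.parser; darc's actual extract_links() may differ in which attributes it collects and how it normalises URLs, so treat this as an illustrative approximation:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect href/src targets from an HTML document, resolved
    against the page URL."""

    def __init__(self, base: str):
        super().__init__()
        self.base = base
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ('href', 'src') and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base, value))

extractor = LinkExtractor('https://www.example.com/index.html')
extractor.feed('<a href="/about">About</a><img src="logo.png">')
print(extractor.links)
# ['https://www.example.com/about', 'https://www.example.com/logo.png']
```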

If the response status code is between 400 and 600 (i.e. a client or server error), the URL will be saved back into the link database for a later retry (c.f. save_requests()); otherwise, the URL will be saved into the selenium link database to proceed to the next stage (c.f. save_selenium()).
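The routing decision reduces to a simple status-code check; the function name and return values below are illustrative only:

```python
def route_response(status_code: int) -> str:
    """Decide the next queue for a crawled URL based on its status code."""
    if 400 <= status_code < 600:
        return 'requests'  # client/server error: retry later (save_requests())
    return 'selenium'      # success: hand off to the loader (save_selenium())

print(route_response(200))  # selenium
print(route_response(503))  # requests
```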

darc.crawl.loader(link)[source]

Single selenium loader for an entry link.

Parameters

link (Link) – URL to be loaded by selenium.

Return type

None

The function will first parse the URL using parse_link() and start loading the URL using selenium with Google Chrome.

At this point, darc will call the customised hook function from darc.sites to load and return the original selenium.webdriver.chrome.webdriver.WebDriver object.

Note

If LinkNoReturn is raised, the link will be removed from the selenium database through drop_selenium().

If successful, the rendered source HTML document will be saved, and a full-page screenshot will be taken and saved.

Note

When taking the full-page screenshot, loader() will use document.body.scrollHeight to obtain the total height of the web page. If the page height is less than 1,000 pixels, darc will by default set the height to 1,000 pixels.

Later, darc will tell selenium to resize the window (in headless mode) to 1,024 pixels wide and 110% of the page height tall, and take a PNG screenshot.
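The window-size computation described in this note can be sketched as below; whether darc rounds or truncates the 110% value is an assumption here:

```python
DEFAULT_HEIGHT = 1_000  # minimum assumed page height, in pixels

def screenshot_size(scroll_height: int) -> tuple[int, int]:
    """Compute the headless window size (width, height) for the
    full-page screenshot, per the note above."""
    height = max(scroll_height, DEFAULT_HEIGHT)  # pad short pages to 1,000 px
    return 1_024, round(height * 1.1)            # fixed width, 110% height

print(screenshot_size(400))   # (1024, 1100) -- short page padded to 1,000 px
print(screenshot_size(3000))  # (1024, 3300)
```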

If the submission API is provided, submit_selenium() will be called to submit the document just loaded.

Later, extract_links() will be called to extract all possible links from the rendered HTML document and save them into the requests database (c.f. save_requests()).
