Web Crawlers
The darc.crawl module provides two types of crawlers.
darc.crawl.crawler(url)
Single requests crawler for an entry link.

Parameters
    url (str) – URL to be crawled by requests.

The function will first parse the URL using parse_link() and check whether the URL needs to be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, it will crawl the URL with requests.

If the URL is from a brand new host, darc will first try to fetch and save robots.txt and the sitemaps of the host (c.f. save_robots() and save_sitemap()), then extract the links from the sitemaps (c.f. read_sitemap()) and save them into the link database for future crawling (c.f. save_requests()). Also, if the submission API is provided, submit_new_host() will be called to submit the documents just fetched.

If robots.txt is present and FORCE is False, darc will check whether it is allowed to crawl the URL.

Note
    The root path (e.g. / in https://www.example.com/) will always be crawled, ignoring robots.txt.

At this point,
darc will call the customised hook function from darc.sites to crawl and get the final response object. darc will then save the session cookies and header information using save_headers().

Note
    If requests.exceptions.InvalidSchema is raised, the link will be saved by save_invalid(). Further processing is dropped.

If the content type of the response document is not ignored (c.f.
MIME_WHITE_LIST and MIME_BLACK_LIST), darc will save the document using save_html() or save_file() accordingly. If the submission API is provided, submit_requests() will be called to submit the document just fetched.

If the response document is HTML (text/html or application/xhtml+xml), extract_links() will then be called to extract all possible links from the HTML document and save them into the database (c.f. save_requests()).

If the response status code is between 400 and 600, the URL will be saved back to the link database (c.f. save_requests()). If not, the URL will be saved into the selenium link database to proceed to the next steps (c.f. save_selenium()).
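
The decision points described above can be sketched in a few small helpers. This is only an illustration built on the standard library, not darc's actual implementation: the names may_crawl, is_html and next_queue are hypothetical, the robots.txt handling uses urllib.robotparser as a stand-in, and "between 400 and 600" is assumed to mean the half-open range 400 to 599.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

# Content types treated as HTML in the description above.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def may_crawl(url, robots_lines, force=False, user_agent="*"):
    """Hypothetical helper: decide whether robots.txt permits crawling.

    The root path is always allowed, and ``force`` (standing in for the
    FORCE setting) skips the check entirely.
    """
    path = urlparse(url).path or "/"
    if force or path == "/":
        return True
    parser = RobotFileParser()
    parser.parse(robots_lines)  # lines of an already fetched robots.txt
    return parser.can_fetch(user_agent, url)


def is_html(content_type):
    """Hypothetical helper: check a Content-Type header value for HTML."""
    mime = content_type.split(";", 1)[0].strip().lower()
    return mime in HTML_TYPES


def next_queue(status_code):
    """Hypothetical helper: pick the link database for the next step.

    Assumption: 400 <= status < 600 means "retry with requests";
    everything else moves on to the selenium queue.
    """
    return "requests" if 400 <= status_code < 600 else "selenium"
```

For instance, with a robots.txt containing "Disallow: /private/", may_crawl() rejects a URL under /private/ but still accepts the root path, and next_queue(404) sends the URL back to the requests queue while next_queue(200) hands it to selenium.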

darc.crawl.loader(url)
Single selenium loader for an entry link.

Parameters
    url (str) – URL to be crawled by selenium.

The function will first parse the URL using parse_link() and start loading the URL using selenium with Google Chrome.

At this point, darc will call the customised hook function from darc.sites to load and return the original selenium.webdriver.Chrome object.

If successful, the rendered source HTML document will be saved using
save_html(), and a full-page screenshot will be taken and saved.

Note
    When taking a full-page screenshot, loader() will use document.body.scrollHeight to get the total height of the web page. If the page height is less than 1,000 pixels, darc will by default set the height to 1,000 pixels.

    Later, darc will tell selenium to resize the window (in headless mode) to 1,024 pixels in width and 110% of the page height in height, and take a PNG screenshot.

If the submission API is provided, submit_selenium() will be called to submit the document just loaded.

Later, extract_links() will be called to extract all possible links from the HTML document and save them into the requests database (c.f. save_requests()).
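
The screenshot sizing rule from the note above can be sketched as follows. This is a rough illustration rather than darc's own code: the function names and constants are hypothetical, the 1,000 pixel minimum is assumed to apply before the 110% scaling, and the result is assumed to be rounded down to whole pixels. The driver argument is expected to be a selenium WebDriver such as selenium.webdriver.Chrome.

```python
MIN_PAGE_HEIGHT = 1000  # minimum page height in pixels, per the note above
WINDOW_WIDTH = 1024     # fixed window width in pixels


def screenshot_window_size(scroll_height):
    """Hypothetical helper: compute (width, height) for the screenshot window.

    Assumption: the minimum height is applied first, then the height is
    scaled to 110% using integer arithmetic (rounded down).
    """
    height = max(int(scroll_height), MIN_PAGE_HEIGHT)
    return WINDOW_WIDTH, height * 11 // 10


def take_full_page_screenshot(driver, path):
    """Hypothetical helper: resize a headless browser window and save a PNG.

    ``driver`` is any selenium WebDriver instance that has already
    loaded the page.
    """
    scroll_height = driver.execute_script("return document.body.scrollHeight")
    width, height = screenshot_window_size(scroll_height)
    driver.set_window_size(width, height)
    return driver.save_screenshot(path)
```

Under these assumptions, a page reporting a scroll height of 300 pixels gets a 1,024 by 1,100 window, and a 2,500 pixel page gets a 1,024 by 2,750 window.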