Web Crawlers

The darc.crawl module provides two types of crawlers.
- darc.crawl.crawler(link)

  Single requests crawler for an entry link.

  The function will first parse the URL using parse_link(), and check whether the URL needs to be crawled (c.f. PROXY_WHITE_LIST, PROXY_BLACK_LIST, LINK_WHITE_LIST and LINK_BLACK_LIST); if so, it then crawls the URL with requests.
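  The white and black lists are pattern based; below is a minimal sketch of such a filter, assuming the lists hold compiled regular expressions (the patterns and the precedence order are illustrative, not darc's defaults):

  ```python
  import re

  # Hypothetical stand-ins for LINK_WHITE_LIST / LINK_BLACK_LIST.
  WHITE_LIST = [re.compile(r'\.onion$')]              # always crawl
  BLACK_LIST = [re.compile(r'(^|\.)example\.com$')]   # never crawl

  def should_crawl(host: str) -> bool:
      """Return True if the host may be crawled."""
      if any(p.search(host) for p in WHITE_LIST):
          return True
      if any(p.search(host) for p in BLACK_LIST):
          return False
      return True  # no rule matched; defer to the remaining checks
  ```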
  If the URL is from a brand new host, darc will first try to fetch and save the robots.txt and sitemaps of the host (c.f. save_robots() and save_sitemap()), then extract the links from the sitemaps (c.f. read_sitemap()) and save them into the link database for future crawling (c.f. save_requests()).

  Note
  A host is new if have_hostname() returns True.

  If darc.proxy.null.fetch_sitemap() and/or darc.proxy.i2p.fetch_hosts() fail when fetching such documents, the host will be removed from the hostname database through drop_hostname(), and will be considered new on the next encounter.

  Also, if the submission API is provided, submit_new_host() will be called to submit the documents just fetched.
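  For a sense of what reading a sitemap involves, here is a stdlib-only sketch that pulls the <loc> entries; it is not darc's read_sitemap() implementation:

  ```python
  from urllib.request import urlopen
  from xml.etree import ElementTree

  def sitemap_links(url: str) -> list[str]:
      """Fetch a sitemap and return the URLs in its <loc> entries."""
      ns = '{http://www.sitemaps.org/schemas/sitemap/0.9}'
      with urlopen(url) as response:
          tree = ElementTree.parse(response)
      return [loc.text for loc in tree.iter(ns + 'loc') if loc.text]

  links = sitemap_links('https://www.example.com/sitemap.xml')
  ```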
  If robots.txt is present and FORCE is False, darc will check whether it is allowed to crawl the URL.

  Note
  The root path (e.g. / in https://www.example.com/) will always be crawled, ignoring robots.txt.
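  The compliance check can be pictured with the standard library's robotparser; note that darc performs the check against its saved copy of robots.txt, and the wildcard user agent below is an assumption:

  ```python
  from urllib.robotparser import RobotFileParser

  robots = RobotFileParser()
  robots.set_url('https://www.example.com/robots.txt')
  robots.read()

  # '*' is a placeholder user agent, not necessarily the one darc sends.
  allowed = robots.can_fetch('*', 'https://www.example.com/some/path')
  print('crawl' if allowed else 'skip per robots.txt')
  ```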
  At this point, darc will call the customised hook function from darc.sites to crawl and get the final response object. darc will save the session cookies and header information using save_headers().

  Note
  If requests.exceptions.InvalidSchema is raised, the link will be saved by save_invalid(). Further processing is dropped, and the link will be removed from the requests database through drop_requests().
  If LinkNoReturn is raised, the link will be removed from the requests database through drop_requests().
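  The InvalidSchema path is easy to reproduce: requests only ships connection adapters for HTTP(S), so fetching any other scheme raises it:

  ```python
  import requests

  try:
      # A dummy magnet link; any non-HTTP(S) scheme behaves the same.
      requests.get('magnet:?xt=urn:btih:0000000000000000000000000000000000000000')
  except requests.exceptions.InvalidSchema as error:
      print(error)  # "No connection adapters were found for ..."
  ```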
  If the content type of the response document is not ignored (c.f. MIME_WHITE_LIST and MIME_BLACK_LIST), submit_requests() will be called to submit the document just fetched.

  If the response document is HTML (text/html or application/xhtml+xml), extract_links() will be called to extract all possible links from the HTML document and save them into the database (c.f. save_requests()).
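  In outline, link extraction boils down to collecting and resolving href attributes. A minimal sketch follows; darc's extract_links() is more involved, and the use of beautifulsoup4 here is an assumption:

  ```python
  from urllib.parse import urljoin
  from bs4 import BeautifulSoup  # pip install beautifulsoup4

  def collect_links(base_url: str, html: str) -> list[str]:
      """Collect absolute URLs from all <a href=...> tags."""
      soup = BeautifulSoup(html, 'html.parser')
      return [urljoin(base_url, a['href'])
              for a in soup.find_all('a', href=True)]
  ```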
  If the response status code is between 400 and 600, the URL will be saved back to the link database (c.f. save_requests()). Otherwise, the URL will be saved into the selenium link database to proceed to the next steps (c.f. save_selenium()).
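  The final dispatch, in outline; the stubs stand in for the darc database helpers of the same names, and reading "between 400 and 600" as 400 <= code < 600 is an assumption:

  ```python
  def save_requests(link: str) -> None:   # stub for darc's helper
      print('saved back for requests:', link)

  def save_selenium(link: str) -> None:   # stub for darc's helper
      print('queued for selenium:', link)

  def dispatch(link: str, status_code: int) -> None:
      if 400 <= status_code < 600:
          save_requests(link)   # client/server error: crawl again later
      else:
          save_selenium(link)   # looks good: hand over to the loader
  ```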
- darc.crawl.loader(link)

  Single selenium loader for an entry link.

  The function will first parse the URL using parse_link(), then start loading the URL using selenium with Google Chrome.
  At this point, darc will call the customised hook function from darc.sites to load and return the original selenium.webdriver.chrome.webdriver.WebDriver object.

  Note
  If LinkNoReturn is raised, the link will be removed from the selenium database through drop_selenium().

  If successful, the rendered source HTML document will be saved, and a full-page screenshot will be taken and saved.
  Note
  When taking the full-page screenshot, loader() will use document.body.scrollHeight to get the total height of the web page. If the page height is less than 1,000 pixels, darc will by default set the height to 1,000 pixels. Later, darc will tell selenium to resize the window (in headless mode) to 1,024 pixels in width and 110% of the page height, and take a PNG screenshot.
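  The sizing rule maps onto the selenium API roughly as follows; only the numbers come from the note above, while the driver setup is an assumption:

  ```python
  from selenium import webdriver

  options = webdriver.ChromeOptions()
  options.add_argument('--headless')
  driver = webdriver.Chrome(options=options)

  driver.get('https://www.example.com/')
  height = driver.execute_script('return document.body.scrollHeight')
  height = max(height, 1000)                       # enforce the 1,000 px floor
  driver.set_window_size(1024, int(height * 1.1))  # 110% of the page height
  driver.save_screenshot('screenshot.png')         # full-page PNG screenshot
  driver.quit()
  ```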
  If the submission API is provided, submit_selenium() will be called to submit the document just loaded.

  Later, extract_links() will be called to extract all possible links from the HTML document and save them into the requests database (c.f. save_requests()).

  See also