Sites Customisation

As websites may impose authentication requirements and similar restrictions on their content, the darc.sites module provides sites customisation hooks for both the requests and selenium crawling processes.

Important

To create a sites customisation, define your class by inheriting darc.sites.BaseSite and register it to the darc module through darc.sites.register().

To start with, you just need to define your sites customisation by inheriting BaseSite and overriding the corresponding crawler() and/or loader() methods.

To customise behaviours over requests, your sites customisation class should have a crawler() method, e.g. DefaultSite.crawler.

The function takes the requests.Session object with proxy settings and a Link object representing the link to be crawled, then returns a requests.Response object containing the final data of the crawling process.
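As a minimal sketch of the crawler() contract described above, the example below attaches a credential to the session before fetching the link. It is an illustration under stated assumptions, not darc's own implementation: the stub classes stand in for requests.Session, requests.Response and darc's Link so the snippet runs offline, SampleSite and the token value are hypothetical, and the Link object is assumed to expose a url attribute. In real use the class would inherit darc.sites.BaseSite.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Stand-in for darc's Link object; only a ``url`` field is assumed."""
    url: str


class StubResponse:
    """Stand-in for requests.Response so the sketch runs without network access."""

    def __init__(self, url, headers):
        self.url = url
        self.request_headers = dict(headers)


class StubSession:
    """Stand-in for requests.Session; mimics its ``headers`` mapping and ``get()``."""

    def __init__(self):
        self.headers = {}

    def get(self, url, **kwargs):
        return StubResponse(url, self.headers)


class SampleSite:
    """Hypothetical customisation; with darc this would inherit BaseSite."""

    @staticmethod
    def crawler(session, link):
        # Add a placeholder credential, then return the final response.
        session.headers['Authorization'] = 'Bearer <token>'
        return session.get(link.url)


response = SampleSite.crawler(StubSession(), Link('https://www.sample.com/'))
```

The hook receives a session that is already configured and only has to return the response object, so any site-specific steps (logging in, setting cookies, following a redirect) can live inside crawler().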

To customise behaviours over selenium, your sites customisation class should have a loader() method, e.g. DefaultSite.loader.

The function takes the WebDriver object with proxy settings and a Link object representing the link to be loaded, then returns the WebDriver object containing the final data of the loading process.
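The loader() contract can be sketched in the same way. This is an illustration only: StubDriver stands in for a selenium WebDriver (mimicking just its get() method and current_url attribute) so the snippet runs without a browser, SampleSite is hypothetical, and the Link object is assumed to expose a url attribute.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Stand-in for darc's Link object; only a ``url`` field is assumed."""
    url: str


class StubDriver:
    """Stand-in for a selenium WebDriver; mimics ``get()`` and ``current_url``."""

    def __init__(self):
        self.current_url = None

    def get(self, url):
        # A real driver would navigate the browser to the page here.
        self.current_url = url


class SampleSite:
    """Hypothetical customisation; with darc this would inherit BaseSite."""

    @staticmethod
    def loader(driver, link):
        # Load the page, then hand the same driver back to the caller.
        driver.get(link.url)
        return driver


driver = SampleSite.loader(StubDriver(), Link('https://www.sample.com/'))
```

Unlike crawler(), which returns a new response object, loader() returns the driver it was given, now holding the loaded page.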

To tell the darc project which sites customisation it should use for a certain hostname, you can register it in the SITEMAP mapping dictionary through register():

darc.sites.SITEMAP: DefaultDict[str, Type[darc.sites._abc.BaseSite]]
import collections

from darc.sites.default import DefaultSite

SITEMAP = collections.defaultdict(lambda: DefaultSite, {
    # 'www.sample.com': SampleSite,  # local customised class
})

The dictionary mapping hostnames to sites customisation classes.

The fallback value is darc.sites.default.DefaultSite.
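The fallback behaviour can be illustrated with a small self-contained sketch. The classes here are empty placeholders and the local register() function is a stand-in written for this example; the exact signature of darc.sites.register() is an assumption and may differ.

```python
import collections


class DefaultSite:
    """Placeholder for darc.sites.default.DefaultSite."""


class SampleSite:
    """Hypothetical customisation class for www.sample.com."""


# Unknown hostnames fall back to DefaultSite.
SITEMAP = collections.defaultdict(lambda: DefaultSite)


def register(site, *hostnames):
    """Illustrative stand-in; not the real darc.sites.register()."""
    for hostname in hostnames:
        SITEMAP[hostname] = site


register(SampleSite, 'www.sample.com')

known = SITEMAP['www.sample.com']      # the registered class
unknown = SITEMAP['unknown.example']   # no entry, so the fallback is used
```

Note that indexing a defaultdict with a missing key also stores the fallback under that key, so 'unknown.example' is present in SITEMAP after the lookup above.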

See also

Please refer to Customisations for more examples and explanations.