Sites Customisation

Because websites may impose authentication or other requirements on access to their content, the darc.sites module provides sites customisation hooks for both the requests and selenium crawling processes.

Important

To create a sites customisation, define a class that inherits from darc.sites.BaseSite and register it with the darc module through darc.sites.register().

To start, define your sites customisation by inheriting from BaseSite and overriding the corresponding crawler() and/or loader() methods.

To customise behaviour for requests, your sites customisation class should have a crawler() method, e.g. DefaultSite.crawler.

The method takes a requests.Session object with proxy settings and a Link object representing the link to be crawled, and returns a requests.Response object containing the final data of the crawling process.

darc.sites.crawler_hook(timestamp, session, link)

Customisation hook for requests sessions.

Parameters
  • timestamp (datetime) – Timestamp of the worker node reference.

  • session (requests.Session) – Session object with proxy settings.

  • link (Link) – Link object to be crawled.

Returns

The final response object with crawled data.

Return type

requests.Response
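As a minimal sketch of the crawler() contract described above, the class below adds an authorisation header before fetching the link. It is illustrative only: SampleSite, the header value and the Link stand-in are hypothetical. In a real deployment the class would inherit from darc.sites.BaseSite and receive darc's own Link object. Writing crawler() as a static method and giving the class a hostname list are assumptions made for this sketch.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical stand-in for darc's Link; only ``url`` is modelled."""
    url: str


class SampleSite:  # assumption: would subclass darc.sites.BaseSite
    """Sites customisation for a host that expects an auth header."""

    # assumption: hostnames used by register() when none are passed
    hostname = ['www.sample.com']

    @staticmethod
    def crawler(timestamp, session, link):
        """Fetch ``link.url`` through the given session and return the response."""
        # ``session`` is expected to behave like requests.Session
        session.headers.update({'Authorization': 'Bearer <token>'})
        return session.get(link.url, allow_redirects=True)
```

Because the method only relies on session.headers and session.get(), it can be exercised with a stub session before being wired into a crawl.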


To customise behaviour for selenium, your sites customisation class should have a loader() method, e.g. DefaultSite.loader.

The method takes a WebDriver object with proxy settings and a Link object representing the link to be loaded, and returns the WebDriver object containing the final data of the loading process.

darc.sites.loader_hook(timestamp, driver, link)

Customisation hook for selenium drivers.

Parameters
  • timestamp (datetime) – Timestamp of the worker node reference.

  • driver (selenium.webdriver.Chrome) – Web driver object with proxy settings.

  • link (Link) – Link object to be loaded.

Returns

The web driver object with loaded data.

Return type

selenium.webdriver.Chrome
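A loader() override can be sketched in the same spirit. The example below is a hedged illustration, not darc's default loader: SampleSite and the Link stand-in are hypothetical, the 10-second wait is an arbitrary choice, and the static-method form is an assumption. It uses only WebDriver.implicitly_wait() and WebDriver.get() from selenium.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """Hypothetical stand-in for darc's Link; only ``url`` is modelled."""
    url: str


class SampleSite:  # assumption: would subclass darc.sites.BaseSite
    """Sites customisation for a page that renders content with JavaScript."""

    hostname = ['www.sample.com']

    @staticmethod
    def loader(timestamp, driver, link):
        """Load ``link.url`` in the given browser and return the same driver."""
        # set a 10 second implicit wait for subsequent element lookups
        driver.implicitly_wait(10)
        driver.get(link.url)
        return driver
```

Returning the same driver object matches the description above: the caller reads the loaded page from the driver it passed in.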


To tell the darc project which sites customisation class it should use for a given hostname, register the class in the SITEMAP mapping dictionary through register():

darc.sites.register(site, *hostname)

Register a new site map entry.

Parameters
  • site (Type[BaseSite]) – Sites customisation class inherited from BaseSite.

  • *hostname (Tuple[str]) – Optional hostnames to register the sites customisation with. Defaults to site.hostname.

Return type

None

darc.sites.SITEMAP: DefaultDict[str, Type[darc.sites._abc.BaseSite]]
import collections

from darc.sites.default import DefaultSite

SITEMAP = collections.defaultdict(lambda: DefaultSite, {
    # 'www.sample.com': SampleSite,  # local customised class
})

The mapping dictionary for hostname to sites customisation classes.

The fallback value is darc.sites.default.DefaultSite.
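The registration and fallback behaviour described above can be modelled without darc installed. The sketch below is not darc's implementation: DefaultSite, SampleSite, SITEMAP and register() are local stand-ins that mimic the documented behaviour, and the hostnames are made up.

```python
import collections


class DefaultSite:
    """Stand-in for darc.sites.default.DefaultSite."""


class SampleSite:
    """Hypothetical customisation class."""
    hostname = ['www.sample.com', 'sample.com']


# unknown hostnames resolve to DefaultSite, as documented for SITEMAP
SITEMAP = collections.defaultdict(lambda: DefaultSite)


def register(site, *hostname):
    """Map each hostname to ``site``; use ``site.hostname`` when none are given."""
    for host in hostname or site.hostname:
        SITEMAP[host] = site


register(SampleSite)                        # uses SampleSite.hostname
register(SampleSite, 'mirror.sample.org')   # explicit hostname instead

print(SITEMAP['sample.com'].__name__)       # SampleSite
print(SITEMAP['unknown.example'].__name__)  # DefaultSite
```

With darc installed, the corresponding call would go through darc.sites.register() with the same argument order shown in its signature.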

darc.sites._get_site(link)

Load sites customisation if any.

If no sites customisation exists for the link, it falls back to the default hooks, DefaultSite.

Parameters

link (Link) – Link object for which to fetch the sites customisation class.

Returns

The sites customisation class.

Return type

Type[BaseSite]
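The lookup can be pictured as a hostname-keyed read of the mapping. The following is a rough sketch under stated assumptions, not the actual _get_site(): it assumes the key is the hostname of the link's URL, get_site() and the Link stand-in are hypothetical, and any special handling darc may apply to particular links is ignored.

```python
import collections
from dataclasses import dataclass
from urllib.parse import urlsplit


class DefaultSite:
    """Stand-in for darc.sites.default.DefaultSite."""


class SampleSite:
    """Hypothetical customisation class."""


@dataclass
class Link:
    """Hypothetical stand-in for darc's Link; only ``url`` is modelled."""
    url: str


SITEMAP = collections.defaultdict(lambda: DefaultSite,
                                  {'www.sample.com': SampleSite})


def get_site(link):
    """Return the customisation class for the hostname of ``link.url``."""
    # assumption: the mapping is keyed by the (lower-cased) hostname
    host = urlsplit(link.url).hostname or ''
    return SITEMAP[host]
```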

See also

Please refer to Customisations for more examples and explanations.