Default Hooks

The darc.sites.default module provides the fallback hooks used when a site has no customisation of its own.

class darc.sites.default.DefaultSite

Bases: BaseSite

Default hooks.

static crawler(timestamp, session, link)

Default crawler hook.

Parameters:
  • timestamp (datetime) – Timestamp of the worker node reference.

  • session (requests.Session) – Session object with proxy settings.

  • link (Link) – Link object to be crawled.

Returns:

The final response object with crawled data.

Return type:

requests.Response
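
As an illustration of the hook's shape (not the actual darc implementation), the sketch below fetches the link through the supplied session and returns the response. It assumes that Link exposes its target address as a `url` attribute; `crawler_sketch` and `FakeSession` are hypothetical names, and the fake session stands in for requests.Session so the example runs without network access.

```python
from datetime import datetime
from types import SimpleNamespace


def crawler_sketch(timestamp, session, link):
    """Hypothetical crawler hook: fetch ``link.url`` through ``session``."""
    # Assumption: the Link object carries its address in ``link.url``.
    response = session.get(link.url)
    return response


class FakeSession:
    """Minimal stand-in for requests.Session (records requested URLs)."""

    def __init__(self):
        self.requested = []

    def get(self, url, **kwargs):
        self.requested.append(url)
        # Return an object shaped loosely like requests.Response.
        return SimpleNamespace(url=url, status_code=200)


session = FakeSession()
link = SimpleNamespace(url="https://example.com/")
response = crawler_sketch(datetime.now(), session, link)
```

With a real requests.Session, the same call would go through whatever proxy settings the session already carries, which is why the hook receives a preconfigured session rather than creating one.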

static loader(timestamp, driver, link)

Default loader hook.

When loading, if SE_WAIT is set to a valid duration, the function sleeps for that long to give the page time to finish loading its contents.

Parameters:
  • timestamp (datetime) – Timestamp of the worker node reference.

  • driver (selenium.webdriver.Chrome) – Web driver object with proxy settings.

  • link (Link) – Link object to be loaded.

Returns:

The web driver object with loaded data.

Return type:

selenium.webdriver.Chrome

Note

Internally, selenium waits for the browser to finish loading the page before returning (i.e. until the web API event DOMContentLoaded fires). However, some scripts may keep running after that event and need extra time to complete.
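
The load-then-wait behaviour of the loader hook can be sketched as follows. This is a minimal illustration under stated assumptions, not the darc source: `loader_sketch` and `FakeDriver` are hypothetical names, SE_WAIT is assumed to be a number of seconds (with None or a non-positive value meaning "do not wait"), and Link is assumed to expose a `url` attribute.

```python
import time
from datetime import datetime
from types import SimpleNamespace

# Assumption: SE_WAIT is a duration in seconds; the value here is only
# for illustration.
SE_WAIT = 0.01


def loader_sketch(timestamp, driver, link, wait=SE_WAIT):
    """Hypothetical loader hook: open the page, then pause for ``wait``."""
    # Assumption: the Link object carries its address in ``link.url``.
    driver.get(link.url)

    # Sleep only when the wait is a positive number, so late-running
    # scripts get a chance to finish before the driver is returned.
    if wait is not None and wait > 0:
        time.sleep(wait)
    return driver


class FakeDriver:
    """Minimal stand-in for selenium.webdriver.Chrome (records visits)."""

    def __init__(self):
        self.visited = []

    def get(self, url):
        self.visited.append(url)


driver = FakeDriver()
link = SimpleNamespace(url="https://example.com/")
loaded = loader_sketch(datetime.now(), driver, link)
```

A fixed sleep is the simplest way to cover scripts that run after the page-load event; it trades a constant delay per page for not having to know which element or condition signals that a given site has finished rendering.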