Sites Customisation
As websites may impose authentication or other requirements on
access to their content, the darc.sites module provides sites
customisation hooks for both the requests and selenium
crawling processes.
Important
To create a sites customisation, define your class by inheriting
darc.sites.BaseSite and register it to the darc
module through darc.sites.register().
To start with, you just need to define your sites customisation by
inheriting BaseSite and overriding the corresponding
crawler() and/or
loader() methods.
To customise behaviours over requests, your sites customisation
class should have a crawler() method, e.g.
DefaultSite.crawler.
The function takes the requests.Session object with proxy settings and
a Link object representing the link to be
crawled, then returns a requests.Response object containing the final
data of the crawling process.
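A minimal sketch of such a crawler() method might look as follows. To stay runnable without darc or requests installed, it is written against duck-typed objects: it assumes only that the session offers get() and a headers mapping (as requests.Session does) and that the link object exposes its address as a url attribute. SampleSite, its hostname and the X-Sample-Token header are hypothetical, and the method is a static method on the assumption that the hook calls it on the class itself. A real customisation would inherit darc.sites.BaseSite.

```python
class SampleSite:
    """Hypothetical customisation; a real one would inherit darc.sites.BaseSite."""

    # Hostnames this customisation is meant for (illustrative value).
    hostname = ['www.sample.com']

    @staticmethod
    def crawler(timestamp, session, link):
        """Fetch ``link`` through ``session`` and return the final response."""
        # Site-specific preparation, e.g. an authentication header.
        # The header name and value are placeholders, not part of darc.
        session.headers['X-Sample-Token'] = 'placeholder'
        # Assumption: the link object exposes the target address as ``link.url``.
        return session.get(link.url)
```

Whatever the method returns is treated as the final response for the link, so any login or redirect handling belongs before the last request.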
darc.sites.crawler_hook(timestamp, session, link)
Customisation as to requests sessions.
- Parameters
timestamp (datetime.datetime) – Timestamp of the worker node reference.
session (requests.Session) – Session object with proxy settings.
link (darc.link.Link) – Link object to be crawled.
- Returns
The final response object with crawled data.
- Return type
requests.Response
See also
darc.sites.SITEMAP
To customise behaviours over selenium, your sites customisation
class should have a loader() method, e.g.
DefaultSite.loader.
The function takes the WebDriver
object with proxy settings and a Link object representing
the link to be loaded, then returns the WebDriver
object containing the final data of the loading process.
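A loader() method can be sketched in the same spirit. The example below assumes only that the driver offers get() and execute_script(), as selenium's WebDriver does, and that the link object exposes a url attribute. SampleSite and the scrolling step are hypothetical illustrations, and a real customisation would inherit darc.sites.BaseSite.

```python
class SampleSite:
    """Hypothetical customisation; a real one would inherit darc.sites.BaseSite."""

    # Hostnames this customisation is meant for (illustrative value).
    hostname = ['www.sample.com']

    @staticmethod
    def loader(timestamp, driver, link):
        """Load ``link`` in ``driver`` and return the same driver."""
        # Navigate the browser to the page.
        # Assumption: the link object exposes the target address as ``link.url``.
        driver.get(link.url)
        # Illustrative site-specific step: scroll to the bottom of the page so
        # that lazily loaded content has a chance to appear.
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        # The driver now holds the loaded page and is handed back to the caller.
        return driver
```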
darc.sites.loader_hook(timestamp, driver, link)
Customisation as to selenium drivers.
- Parameters
timestamp (datetime.datetime) – Timestamp of the worker node reference.
driver (selenium.webdriver.Chrome) – Web driver object with proxy settings.
link (darc.link.Link) – Link object to be loaded.
- Returns
The web driver object with loaded data.
- Return type
selenium.webdriver.Chrome
See also
darc.sites.SITEMAP
To tell the darc project which sites customisation
class it should use for a certain hostname, you can register
that class in the SITEMAP mapping dictionary
through register():
darc.sites.register(site, *hostname)
Register new site map.
- Parameters
site (Type[darc.sites._abc.BaseSite]) – Sites customisation class inherited from BaseSite.
*hostname (Tuple[str]) – Optional list of hostnames the sites customisation should be registered with. By default, we use site.hostname.
- Return type
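The registration behaviour described above can be illustrated with a small stand-alone model. This is a sketch under stated assumptions, not darc's actual implementation: BaseSite, DefaultSite, SITEMAP and register() below are local stand-ins that mimic the documented interface, and SampleSite with its hostnames is hypothetical.

```python
import collections


class BaseSite:
    """Stand-in for darc.sites.BaseSite (not the real class)."""
    hostname = None


class DefaultSite(BaseSite):
    """Stand-in for the fallback customisation."""


class SampleSite(BaseSite):
    """Hypothetical customisation for an example host."""
    hostname = ['www.sample.com']


# Stand-in for the hostname-to-class mapping with DefaultSite as fallback.
SITEMAP = collections.defaultdict(lambda: DefaultSite)


def register(site, *hostname):
    """Map each hostname to ``site``; use ``site.hostname`` if none are given."""
    names = hostname or tuple(site.hostname or ())
    for name in names:
        SITEMAP[name] = site


register(SampleSite)                       # uses SampleSite.hostname
register(SampleSite, 'mirror.sample.com')  # explicit extra hostname
```

After these two calls, both www.sample.com and mirror.sample.com resolve to SampleSite in this model, while any other hostname still yields DefaultSite.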
darc.sites.SITEMAP: DefaultDict[str, Type[darc.sites._abc.BaseSite]]
    import collections

    from darc.sites.default import DefaultSite

    SITEMAP = collections.defaultdict(lambda: DefaultSite, {
        # 'www.sample.com': SampleSite,  # local customised class
    })
The mapping dictionary for hostname to sites customisation classes.
The fallback value is darc.sites.default.DefaultSite.
darc.sites._get_site(link)
Load sites customisation if any.
If the sites customisation does not exist, it will fall back to the default hooks, DefaultSite.
- Parameters
link (darc.link.Link) – Link object to fetch sites customisation class.
- Returns
The sites customisation class.
- Return type
Type[darc.sites._abc.BaseSite]
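The lookup with fallback can be sketched as below. This is an illustration rather than darc's own code: Link, DefaultSite, SampleSite, SITEMAP and get_site() are local stand-ins, and extracting the hostname from a url attribute with urllib.parse is an assumption about how a link could be resolved to a hostname.

```python
import collections
from dataclasses import dataclass
from urllib.parse import urlsplit


class DefaultSite:
    """Stand-in for the fallback customisation."""


class SampleSite:
    """Hypothetical customisation for an example host."""


# Stand-in mapping: one registered hostname, DefaultSite for everything else.
SITEMAP = collections.defaultdict(lambda: DefaultSite, {
    'www.sample.com': SampleSite,
})


@dataclass
class Link:
    """Stand-in for darc.link.Link carrying only a URL."""
    url: str


def get_site(link):
    """Return the customisation class for ``link``, or the fallback."""
    # urlsplit() lower-cases the hostname; it is None when the URL has none.
    hostname = urlsplit(link.url).hostname or ''
    return SITEMAP[hostname]


known = get_site(Link('http://www.sample.com/login'))  # SampleSite
unknown = get_site(Link('http://unknown.example/'))    # DefaultSite
```

Note that indexing a defaultdict with a missing key also stores the default under that key, so unknown hostnames accumulate in the stand-in mapping as they are looked up.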
See also
Please refer to Customisations for more examples and explanations.