Source Parsing

The darc.parse module provides auxiliary functions to read robots.txt, sitemaps and HTML documents. It also contains utility functions to check whether a proxy type, hostname or content type is in any of the black and white lists.

darc.parse._check(temp_list)[source]

Check hostname and proxy type of links.

Parameters

temp_list (List[str]) – List of links to be checked.

Returns

List of links that match the requirements.

Return type

List[str]

Note

If CHECK_NG is True, the function will directly call _check_ng() instead.

darc.parse._check_ng(temp_list)[source]

Check content type of links through HEAD requests.

Parameters

temp_list (List[str]) – List of links to be checked.

Returns

List of links that match the requirements.

Return type

List[str]

darc.parse.check_robots(link)[source]

Check if link is allowed in robots.txt.

Parameters

link (darc.link.Link) – The link object to be checked.

Returns

Whether link is allowed in robots.txt.

Return type

bool

Note

The root path of a URL will always return True.
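The check above can be approximated with the standard library alone. The following is a minimal sketch, not the darc implementation: the helper name is_allowed, its parameters and the sample rules are assumptions made for illustration, and it takes the robots.txt content as a string rather than a darc.link.Link.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def is_allowed(url, robots_text, user_agent="*"):
    """Return True if the robots.txt rules permit fetching ``url``.

    Hypothetical helper; not part of darc.
    """
    # Mirror the documented note: the root path is always allowed.
    if urlsplit(url).path in ("", "/"):
        return True
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


# Sample rules, assumed for illustration only.
ROBOTS = """\
User-agent: *
Disallow: /private/
"""

blocked = is_allowed("https://example.com/private/page", ROBOTS)
allowed = is_allowed("https://example.com/public", ROBOTS)
```

Short-circuiting on the root path keeps the entry page of a site reachable even when robots.txt disallows everything else.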

Extract links from HTML document.

Parameters
  • link (str) – Original link of the HTML document.

  • html (Union[str, bytes]) – Content of the HTML document.

  • check (bool) – Whether to perform checks on extracted links; defaults to CHECK.

Returns

An iterator of extracted links.

Return type

Iterator[str]
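To illustrate the link extraction described above, here is a minimal sketch that collects anchor targets and resolves them against the original link. It uses the standard library html.parser module, which may differ from the parser darc actually uses; the names iter_links and _AnchorCollector are hypothetical, and the optional checks are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _AnchorCollector(HTMLParser):
    """Collect the href value of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def iter_links(base_url, html):
    """Yield absolute links found in ``html``, resolved against ``base_url``."""
    if isinstance(html, bytes):
        # Assumption: treat raw bytes as UTF-8 and replace undecodable bytes.
        html = html.decode("utf-8", errors="replace")
    collector = _AnchorCollector()
    collector.feed(html)
    collector.close()
    for href in collector.hrefs:
        yield urljoin(base_url, href)


page = '<a href="/a">x</a><a href="b.html">y</a><a>no target</a>'
links = list(iter_links("https://example.com/dir/page.html", page))
```

Resolving each href with urljoin turns relative targets into absolute links, so later checks can work on complete URLs.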

darc.parse.get_content_type(response)[source]

Get content type from response.

Parameters

response (requests.Response) – Response object.

Returns

The content type from response.

Return type

str

Note

If the Content-Type header is not defined in response, the function will utilise magic to detect its content type.
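The header-then-fallback behaviour in the note can be sketched as follows. This is an illustration under stated assumptions rather than the darc code: detect_content_type is a hypothetical name, it takes a plain header mapping and body instead of a requests.Response, the sniff callback stands in for the magic-based detection, and the final application/octet-stream default is an assumption.

```python
def detect_content_type(headers, body, sniff=None):
    """Return a normalised MIME type for a response (hypothetical helper)."""
    # Header names are case-insensitive, so compare them lowercased.
    raw = next(
        (value for name, value in headers.items() if name.lower() == "content-type"),
        None,
    )
    if raw is None and sniff is not None:
        # Fallback: guess from the body when the header is missing.
        raw = sniff(body)
    if raw is None:
        raw = "application/octet-stream"  # assumed last-resort default
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    return raw.split(";", 1)[0].strip().lower()


from_header = detect_content_type({"Content-Type": "Text/HTML; charset=UTF-8"}, b"")
from_body = detect_content_type({}, b"%PDF-1.7", sniff=lambda data: "application/pdf")
```

With the python-magic package installed, a callback such as lambda data: magic.from_buffer(data, mime=True) could serve as sniff.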

darc.parse.get_sitemap(link, text, host=None)[source]

Fetch links to other sitemaps from a sitemap.

Parameters
  • link (str) – Original link to the sitemap.

  • text (str) – Content of the sitemap.

  • host (Optional[str]) – Hostname of the URL to the sitemap; the value may differ from the one in link.

Returns

List of links to sitemaps.

Return type

List[darc.link.Link]

Note

As specified in the sitemap protocol, a sitemap may contain links to other sitemaps; see https://www.sitemaps.org/protocol.html#index
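A sitemap index lists its nested sitemaps as sitemap/loc elements in the sitemaps.org namespace. The sketch below reads them with xml.etree.ElementTree; the function name nested_sitemaps is hypothetical, and it returns plain strings instead of darc.link.Link objects.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# XML namespace defined by the sitemap protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def nested_sitemaps(base_url, text):
    """Return absolute links to the sitemaps listed in a sitemap index."""
    root = ET.fromstring(text)
    links = []
    for loc in root.findall(f"{SITEMAP_NS}sitemap/{SITEMAP_NS}loc"):
        if loc.text and loc.text.strip():
            links.append(urljoin(base_url, loc.text.strip()))
    return links


# Sample document, assumed for illustration only.
INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap1.xml</loc></sitemap>
  <sitemap><loc> /sitemap2.xml </loc></sitemap>
</sitemapindex>"""

found = nested_sitemaps("https://example.com/sitemap.xml", INDEX)
```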

darc.parse.match_host(host)[source]

Check if hostname is in the black list.

Parameters

host (str) – Hostname to be checked.

Returns

Whether host is in the black list.

Return type

bool

Note

If host is None, then it will always return True.
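One plausible way to combine a white list, a black list and a default is sketched below. The evaluation order (white list first, then black list, then a fallback value), the use of regular expression patterns, and every name here (host_is_blocked, WHITE_LIST, BLACK_LIST, FALLBACK) are assumptions for illustration; the source only states that None is always treated as a match.

```python
import re

# Hypothetical configuration; the real lists come from darc's settings.
WHITE_LIST = [r"example\.com"]
BLACK_LIST = [r".*\.ads\.example"]
FALLBACK = False  # assumed result when neither list matches


def host_is_blocked(host, white=WHITE_LIST, black=BLACK_LIST, fallback=FALLBACK):
    """Return True if ``host`` should be treated as black-listed."""
    if host is None:
        # Documented behaviour: a missing hostname always matches.
        return True
    if any(re.fullmatch(pattern, host, re.IGNORECASE) for pattern in white):
        return False
    if any(re.fullmatch(pattern, host, re.IGNORECASE) for pattern in black):
        return True
    return fallback


results = [host_is_blocked(h) for h in (None, "example.com", "tracker.ads.example", "other.org")]
```

Using re.fullmatch means a pattern must cover the entire hostname, so a white-listed example.com does not also admit example.com.evil.test.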

darc.parse.match_mime(mime)[source]

Check if content type is in the black list.

Parameters

mime (str) – Content type to be checked.

Returns

Whether mime is in the black list.

Return type

bool

darc.parse.match_proxy(proxy)[source]

Check if proxy type is in the black list.

Parameters

proxy (str) – Proxy type to be checked.

Returns

Whether proxy is in the black list.

Return type

bool

Note

If proxy is script, then it will always return True.

darc.parse.read_robots(link, text, host=None)[source]

Read robots.txt to fetch links to sitemaps.

Parameters
  • link (str) – Original link to robots.txt.

  • text (str) – Content of robots.txt.

  • host (Optional[str]) – Hostname of the URL to robots.txt; the value may differ from the one in link.

Returns

List of links to sitemaps.

Return type

List[darc.link.Link]

Note

If no sitemap link is specified in robots.txt, the fallback link /sitemap.xml will be used.

https://www.sitemaps.org/protocol.html#submit_robots
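The fallback rule can be sketched with urllib.robotparser, whose site_maps() method (available from Python 3.8) returns the declared Sitemap entries or None. The function name sitemap_links is hypothetical, and the sketch returns strings instead of darc.link.Link objects.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser


def sitemap_links(robots_url, robots_text):
    """Return sitemap links declared in robots.txt, or the fallback link."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    declared = parser.site_maps()  # None when no Sitemap line is present
    if declared:
        return [urljoin(robots_url, entry) for entry in declared]
    # No Sitemap directive: fall back to /sitemap.xml on the same host.
    return [urljoin(robots_url, "/sitemap.xml")]


declared = sitemap_links(
    "https://example.com/robots.txt",
    "User-agent: *\nSitemap: https://example.com/map.xml\n",
)
fallback = sitemap_links("https://example.com/robots.txt", "User-agent: *\nDisallow:\n")
```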

darc.parse.read_sitemap(link, text, check=False)[source]

Read sitemap.

Parameters
  • link (str) – Original link to the sitemap.

  • text (str) – Content of the sitemap.

  • check (bool) – Whether to perform checks on extracted links; defaults to CHECK.

Returns

An iterator of extracted links.

Return type

Iterator[str]
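Page links in a regular sitemap live in url/loc elements of the sitemaps.org namespace. The sketch below yields them lazily, matching the Iterator[str] return type; iter_page_links is a hypothetical name and the optional checks are omitted.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# XML namespace defined by the sitemap protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def iter_page_links(base_url, text):
    """Yield absolute page links listed in a sitemap's <url><loc> entries."""
    root = ET.fromstring(text)
    for loc in root.findall(f"{NS}url/{NS}loc"):
        if loc.text and loc.text.strip():
            yield urljoin(base_url, loc.text.strip())


# Sample document, assumed for illustration only.
URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

pages = list(iter_page_links("https://example.com/sitemap.xml", URLSET))
```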