Source Parsing

The darc.parse module provides auxiliary functions to read robots.txt, sitemaps and HTML documents. It also contains utility functions to check whether a proxy type, hostname or content type is in any of the black and white lists.

darc.parse._check(temp_list)[source]

Check hostname and proxy type of links.

Parameters

temp_list (List[darc.link.Link]) – List of links to be checked.

Returns

List of links that match the requirements.

Return type

List[darc.link.Link]

Note

If CHECK_NG is True, the function will directly call _check_ng() instead.
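The filtering described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the Link dataclass is a hypothetical stand-in for darc.link.Link, and check_links, host_blocked and proxy_blocked are names assumed for this example only.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Link:
    """Hypothetical stand-in for darc.link.Link (only the fields used here)."""
    url: str
    host: Optional[str]
    proxy: str


def check_links(
    links: List[Link],
    host_blocked: Callable[[Optional[str]], bool],
    proxy_blocked: Callable[[str], bool],
) -> List[Link]:
    """Keep only links whose hostname and proxy type are both allowed."""
    return [
        link for link in links
        if not host_blocked(link.host) and not proxy_blocked(link.proxy)
    ]


links = [
    Link("https://example.com/", "example.com", "null"),
    Link("https://ads.example.net/", "ads.example.net", "null"),
    Link("javascript:void(0)", None, "script"),
]
kept = check_links(
    links,
    # Illustrative predicates; real ones would consult the configured lists.
    host_blocked=lambda host: host is None or host.startswith("ads."),
    proxy_blocked=lambda proxy: proxy == "script",
)
print([link.url for link in kept])
```

Passing the two predicates in as callables keeps the sketch independent of how the black and white lists are actually configured.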

darc.parse._check_ng(temp_list)[source]

Check content type of links through HEAD requests.

Parameters

temp_list (List[darc.link.Link]) – List of links to be checked.

Returns

List of links that match the requirements.

Return type

List[darc.link.Link]
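A content-type check of this kind might look like the sketch below. It is an illustration under stated assumptions: filter_by_content_type and its parameters are hypothetical names, the HEAD request is injected as a callable so the example runs offline, and dropping links whose request fails or returns no content type is a choice made for this sketch, not documented behaviour.

```python
from typing import Callable, Dict, Iterable, List, Optional


def filter_by_content_type(
    urls: Iterable[str],
    head_content_type: Callable[[str], Optional[str]],
    mime_blocked: Callable[[str], bool],
) -> List[str]:
    """Keep URLs whose reported content type is not blocked."""
    kept = []
    for url in urls:
        try:
            mime = head_content_type(url)
        except OSError:
            # Assumption of this sketch: unreachable links are dropped.
            continue
        if mime is not None and not mime_blocked(mime):
            kept.append(url)
    return kept


# Offline stand-in for a HEAD request: a table of canned answers.
FAKE_HEADERS: Dict[str, Optional[str]] = {
    "https://example.com/page": "text/html",
    "https://example.com/movie": "video/mp4",
    "https://example.com/unknown": None,
}


def fake_head(url: str) -> Optional[str]:
    if url not in FAKE_HEADERS:
        raise ConnectionError(url)  # ConnectionError is a subclass of OSError
    return FAKE_HEADERS[url]


result = filter_by_content_type(
    ["https://example.com/page", "https://example.com/movie",
     "https://example.com/unknown", "https://example.com/down"],
    head_content_type=fake_head,
    mime_blocked=lambda mime: mime.startswith("video/"),
)
print(result)
```

With the requests library, the injected callable could be something like `lambda url: requests.head(url, timeout=10).headers.get("Content-Type")`; its request exceptions derive from IOError, so the `except OSError` clause would still apply.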

darc.parse.check_robots(link)[source]

Check if link is allowed in robots.txt.

Parameters

link (darc.link.Link) – The link object to be checked.

Returns

If link is allowed in robots.txt.

Return type

bool

Note

The function always returns True for the root path of a URL.
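The robots.txt check, including the root-path rule, can be approximated with the standard library's urllib.robotparser. This is a sketch only: allowed_by_robots is a hypothetical helper that takes the robots.txt text directly, whereas the real function receives a darc.link.Link and may obtain the file differently.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def allowed_by_robots(url: str, robots_txt: str, user_agent: str = "*") -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    # The root path is always treated as allowed, mirroring the note above.
    if urlsplit(url).path in ("", "/"):
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(allowed_by_robots("https://example.com/public", ROBOTS))
print(allowed_by_robots("https://example.com/private/data", ROBOTS))
print(allowed_by_robots("https://example.com/", "User-agent: *\nDisallow: /"))
```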

Extract links from HTML document.

Parameters
  • link (darc.link.Link) – Original link of the HTML document.

  • html (Union[str, bytes]) – Content of the HTML document.

  • check (bool) – Whether to perform checks on extracted links; defaults to CHECK.

Returns

List of extracted links.

Return type

List[darc.link.Link]
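Link extraction from an HTML document can be illustrated with the standard library alone. The sketch below is an assumption-laden stand-in: extract_hrefs and _AnchorCollector are hypothetical names, only anchor tags are considered, plain URL strings are returned instead of darc.link.Link objects, and the optional check step is omitted.

```python
from html.parser import HTMLParser
from typing import List, Union
from urllib.parse import urljoin


class _AnchorCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(base_url: str, html: Union[str, bytes]) -> List[str]:
    """Return absolute, de-duplicated link targets in document order."""
    if isinstance(html, bytes):
        html = html.decode("utf-8", errors="replace")
    collector = _AnchorCollector()
    collector.feed(html)
    collector.close()
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return list(dict.fromkeys(urljoin(base_url, h) for h in collector.hrefs))


page = b'<a href="/a">A</a> <a href="b.html">B</a> <a href="/a">again</a> <a>none</a>'
found = extract_hrefs("https://example.com/dir/index.html", page)
print(found)
```

Resolving each href against the original link is what turns relative references such as b.html into absolute URLs that can be checked and queued.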

darc.parse.get_content_type(response)[source]

Get content type from response.

Parameters

response (requests.Response) – Response object.

Returns

The content type from response.

Return type

str

Note

If the Content-Type header is not defined in response, the function will utilise magic to detect its content type.
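The header-first, sniff-second logic can be sketched as below. This is not the library's code: content_type_of and sniff are hypothetical names, a tiny signature table stands in for the magic-based detection mentioned in the note, and stripping parameters such as charset from the header is an assumption of this example.

```python
from typing import Mapping

# Tiny stand-in for magic-based detection: a few well-known file signatures.
_SIGNATURES = (
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
)


def sniff(body: bytes) -> str:
    """Guess a MIME type from the leading bytes of the body."""
    for signature, mime in _SIGNATURES:
        if body.startswith(signature):
            return mime
    head = body[:64].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "text/html"
    return "application/octet-stream"


def content_type_of(headers: Mapping[str, str], body: bytes) -> str:
    """Prefer the Content-Type header; fall back to sniffing the body."""
    # A plain dict is case-sensitive; requests' header mapping is not.
    raw = headers.get("Content-Type")
    if raw:
        # Drop parameters such as "; charset=utf-8" (assumption of this sketch).
        return raw.split(";", 1)[0].strip().lower()
    return sniff(body)


print(content_type_of({"Content-Type": "Text/HTML; charset=utf-8"}, b""))
print(content_type_of({}, b"%PDF-1.7 ..."))
```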

darc.parse.match_host(host)[source]

Check if the hostname is in the black list.

Parameters

host (str) – Hostname to be checked.

Returns

If host is in the black list.

Return type

bool

Note

If host is None, then it will always return True.
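One plausible way to combine a white list, a black list and the None rule is shown below. The names WHITE_LIST, BLACK_LIST, FALLBACK and host_blocked are assumptions made for this sketch, as are the use of regular-expression patterns and the precedence of the white list over the black list.

```python
import re
from typing import Optional, Sequence

# Illustrative patterns only; real deployments would load these from settings.
WHITE_LIST = [r".*\.example\.org"]
BLACK_LIST = [r"ads\..*", r".*\.tracker\.test"]
FALLBACK = False  # verdict for hosts matching neither list (assumed)


def host_blocked(
    host: Optional[str],
    white: Sequence[str] = WHITE_LIST,
    black: Sequence[str] = BLACK_LIST,
    fallback: bool = FALLBACK,
) -> bool:
    """Return True if host should be rejected."""
    if host is None:
        return True  # a missing hostname is always treated as blocked
    if any(re.fullmatch(p, host, re.IGNORECASE) for p in white):
        return False  # the white list takes precedence in this sketch
    if any(re.fullmatch(p, host, re.IGNORECASE) for p in black):
        return True
    return fallback


for candidate in (None, "ads.example.org", "ads.example.com", "news.example.com"):
    print(candidate, host_blocked(candidate))
```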

darc.parse.match_mime(mime)[source]

Check if the content type is in the black list.

Parameters

mime (str) – Content type to be checked.

Returns

If mime is in the black list.

Return type

bool

darc.parse.match_proxy(proxy)[source]

Check if the proxy type is in the black list.

Parameters

proxy (str) – Proxy type to be checked.

Returns

If proxy is in the black list.

Return type

bool

Note

If proxy is script, then it will always return True.
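Because proxy types are short tokens rather than patterns, the check can be sketched with exact set membership. Everything here is assumed for illustration: the proxy_blocked name, the list contents, the case-insensitive comparison and the fallback value.

```python
from typing import AbstractSet


def proxy_blocked(
    proxy: str,
    white: AbstractSet[str] = frozenset({"tor"}),  # illustrative values
    black: AbstractSet[str] = frozenset({"i2p"}),  # illustrative values
    fallback: bool = False,
) -> bool:
    """Return True if links of this proxy type should be rejected."""
    proxy = proxy.casefold()
    if proxy == "script":
        return True  # script links are always treated as blocked
    if proxy in white:
        return False
    if proxy in black:
        return True
    return fallback


print(proxy_blocked("script"), proxy_blocked("Tor"), proxy_blocked("i2p"))
```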