Source Parsing

The darc.parse module provides auxiliary functions to read robots.txt, sitemaps and HTML documents. It also contains utility functions to check whether the proxy type, hostname and content type are in any of the black and white lists.

darc.parse._check(temp_list)[source]

Check hostname and proxy type of links.

Parameters

temp_list (List[Link]) – List of links to be checked.

Return type

List[Link]

Returns

List of links that match the requirements.

Note

If CHECK_NG is True, the function will directly call _check_ng() instead.

darc.parse._check_ng(temp_list)[source]

Check content type of links through HEAD requests.

Parameters

temp_list (List[Link]) – List of links to be checked.

Return type

List[Link]

Returns

List of links that match the requirements.

darc.parse.check_robots(link)[source]

Check if link is allowed in robots.txt.

Parameters

link (Link) – The link object to be checked.

Return type

bool

Returns

Whether the link is allowed in robots.txt.

Note

For the root path of a URL, the function always returns True.
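The behaviour described above can be approximated with the standard library alone. The following is a minimal sketch, not the darc implementation: the helper name is_allowed, the robots_txt string argument and the user_agent default are assumptions made for illustration, and the real function takes a Link object rather than a plain URL.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def is_allowed(url: str, robots_txt: str, user_agent: str = "*") -> bool:
    """Return True if ``url`` may be fetched according to ``robots_txt``.

    Hypothetical helper: name and signature are illustrative only.
    """
    # Mirror the note above: the root path is always treated as allowed,
    # whatever the robots.txt rules say.
    if urlsplit(url).path in ("", "/"):
        return True

    parser = RobotFileParser()
    # parse() accepts the robots.txt content as an iterable of lines.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed("https://example.com/", ROBOTS))
print(is_allowed("https://example.com/private/data", ROBOTS))
print(is_allowed("https://example.com/public", ROBOTS))
```

Handling the root path before consulting the parser keeps the special case explicit, so a blanket "Disallow: /" rule does not block the entry page of a site.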

Extract links from HTML document.

Parameters
  • link (Link) – Original link of the HTML document.

  • html (Union[str, bytes]) – Content of the HTML document.

  • check (bool) – Whether to perform checks on extracted links; defaults to CHECK.

Return type

List[Link]

Returns

List of extracted links.
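To illustrate what extracting links from an HTML document involves, here is a minimal sketch using only the standard library. It is an assumption-laden stand-in, not the darc code: the names _AnchorCollector and extract_hrefs are hypothetical, only anchor tags are considered, the input is assumed to be str rather than bytes, and plain URL strings are returned instead of Link objects.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class _AnchorCollector(HTMLParser):
    """Collect the raw ``href`` values of all ``<a>`` tags (hypothetical helper)."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without a usable href attribute.
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(base_url: str, html: str) -> List[str]:
    """Return absolute URLs for every anchor found in ``html``."""
    collector = _AnchorCollector()
    collector.feed(html)
    collector.close()
    # Resolve relative references against the URL of the document itself.
    return [urljoin(base_url, href) for href in collector.hrefs]


page = '<a href="/about">About</a> <a href="other.html">Other</a> <a>no href</a>'
print(extract_hrefs("https://example.com/dir/page.html", page))
```

Resolving each href against the original link is the reason the function needs the document's own URL as well as its content.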

darc.parse.extract_links_from_text(link, text)[source]

Extract links from raw text source.

Parameters
  • link (Link) – Original link of the source document.

  • text (str) – Content of source text document.

Return type

List[Link]

Returns

List of extracted links.

Important

The extraction is NOT fully reliable: no TLD checks are performed on the extracted links, and there is no guarantee that every link will be extracted.

The URL patterns used to extract links are defined by darc.parse.URL_PAT, and you may register your own expressions through the DARC_URL_PAT environment variable.
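The following sketch shows how pattern-based extraction behaves, including the reliability caveat. It copies the two built-in expressions documented for darc.parse.URL_PAT; the helper name find_urls is hypothetical, and the sketch returns plain strings rather than Link objects.

```python
import re
from typing import List

# The two built-in patterns as documented for darc.parse.URL_PAT.
URL_PAT = [
    re.compile(r'(?P<url>((https?|wss?|irc):)?(//)?\w+(\.\w+)+/?\S*)', re.UNICODE),
    re.compile(r'(?P<url>(bitcoin|data|ed2k|magnet|mailto|script|tel):\w+)', re.ASCII),
]


def find_urls(text: str) -> List[str]:
    """Return every substring of ``text`` matched by one of the patterns."""
    found: List[str] = []
    for pattern in URL_PAT:
        for match in pattern.finditer(text):
            found.append(match.group("url"))
    return found


print(find_urls("Visit https://example.com/index.html or call tel:5551234 today."))
# No TLD check is applied, so a plain file name is also reported as a link.
print(find_urls("see notes.txt"))
```

The second call shows why the results need further checking: anything shaped like word.word satisfies the first pattern.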

darc.parse.get_content_type(response)[source]

Get content type from response.

Parameters

response (requests.Response) – Response object.

Return type

str

Returns

The content type from response.

Note

If the Content-Type header is not defined in response, the function will utilise magic to detect its content type.
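The header-first, detection-second logic in the note can be sketched as below. This is only an illustration under stated assumptions: content_type_of and sniff_stub are hypothetical names, the detector is passed in as a callable so the example does not depend on a magic binding, and dropping parameters such as the charset is a choice of this sketch rather than documented behaviour.

```python
from typing import Callable, Mapping


def content_type_of(
    headers: Mapping[str, str],
    body: bytes,
    detect: Callable[[bytes], str],
) -> str:
    """Return the declared content type, or a detected one if none is declared.

    Hypothetical helper. A plain dict is case-sensitive, unlike the headers
    mapping of a requests.Response.
    """
    declared = headers.get("Content-Type")
    if declared is None:
        # Header missing: fall back to inspecting the body itself.
        return detect(body)
    # Assumption of this sketch: drop parameters such as "; charset=utf-8".
    return declared.split(";", 1)[0].strip().casefold()


def sniff_stub(body: bytes) -> str:
    """Hypothetical stand-in for a magic-based detector."""
    if body.startswith(b"%PDF-"):
        return "application/pdf"
    return "application/octet-stream"


print(content_type_of({"Content-Type": "Text/HTML; charset=utf-8"}, b"", sniff_stub))
print(content_type_of({}, b"%PDF-1.7 ...", sniff_stub))
```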

darc.parse.match_host(host)[source]

Check if hostname is in the black list.

Parameters

host (Optional[str]) – Hostname to be checked.

Return type

bool

Returns

Whether host is in the black list.

Note

If host is None, the function always returns True.
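A list-based hostname check of this kind might look like the sketch below. Everything beyond the None rule is an assumption for illustration: the list contents, the names WHITE_LIST, BLACK_LIST, FALLBACK and is_blocked_host, the use of full regular-expression matches, and the precedence of the white list over the black list are not taken from the documentation above.

```python
import re
from typing import List, Optional

# Hypothetical lists; real deployments would load these from configuration.
WHITE_LIST: List[re.Pattern] = [re.compile(r".*\.example\.org")]
BLACK_LIST: List[re.Pattern] = [re.compile(r".*\.onion")]
# Hypothetical result for hosts that appear in neither list.
FALLBACK = False


def is_blocked_host(host: Optional[str]) -> bool:
    """Return True if ``host`` should be treated as black-listed."""
    # As in the note above: a missing hostname is always rejected.
    if host is None:
        return True
    # Assumption: an explicit white-list entry wins over the black list.
    if any(pattern.fullmatch(host) for pattern in WHITE_LIST):
        return False
    if any(pattern.fullmatch(host) for pattern in BLACK_LIST):
        return True
    return FALLBACK


for candidate in (None, "abc.onion", "www.example.org", "other.net"):
    print(candidate, is_blocked_host(candidate))
```

The same shape would apply to content types and proxy types, with a different special case in place of the None check.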

darc.parse.match_mime(mime)[source]

Check if content type is in the black list.

Parameters

mime (str) – Content type to be checked.

Return type

bool

Returns

Whether mime is in the black list.

darc.parse.match_proxy(proxy)[source]

Check if proxy type is in the black list.

Parameters

proxy (str) – Proxy type to be checked.

Return type

bool

Returns

Whether proxy is in the black list.

Note

If proxy is script, the function always returns True.

darc.parse.URL_PAT: List[re.Pattern]

Regular expression patterns to match all reasonable URLs.

Currently, we have two builtin patterns:

  1. HTTP(S) and other regular URLs, e.g. WebSocket, IRC, etc.

     re.compile(r'(?P<url>((https?|wss?|irc):)?(//)?\w+(\.\w+)+/?\S*)', re.UNICODE)

  2. Bitcoin accounts, data URIs, (ED2K) magnet links, email addresses, telephone numbers, JavaScript functions, etc.

     re.compile(r'(?P<url>(bitcoin|data|ed2k|magnet|mailto|script|tel):\w+)', re.ASCII)

Environ

DARC_URL_PAT

See also

The patterns are used in darc.parse.extract_links_from_text().