Source Parsing

The darc.parse module provides auxiliary functions to read robots.txt, sitemaps and HTML documents. It also contains utility functions to check whether a proxy type, hostname or content type is in any of the black and white lists.

darc.parse._check(temp_list)[source]

Check hostname and proxy type of links.

Parameters:

temp_list (List[Link]) – List of links to be checked.

Return type:

List[Link]

Returns:

List of links that match the requirements.

Note

If CHECK_NG is True, the function will directly call _check_ng() instead.

darc.parse._check_ng(temp_list)[source]

Check content type of links through HEAD requests.

Parameters:

temp_list (List[Link]) – List of links to be checked.

Return type:

List[Link]

Returns:

List of links that match the requirements.

darc.parse.check_robots(link)[source]

Check if link is allowed in robots.txt.

Parameters:

link (Link) – The link object to be checked.

Return type:

bool

Returns:

Whether the link is allowed in robots.txt.

Note

The root path of a URL will always return True.
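The check described above can be sketched with the standard library's urllib.robotparser. This is only an illustration, not the darc implementation: the function name is_allowed, the ROBOTS_TXT sample and the plain URL string (instead of a Link object) are assumptions made for the example.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a crawler would fetch or cache the real file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(url: str, robots_txt: str = ROBOTS_TXT, user_agent: str = "*") -> bool:
    """Return True if ``url`` may be fetched according to ``robots_txt``."""
    # Mirror the note above: the root path is always treated as allowed.
    if urlsplit(url).path in ("", "/"):
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

With the sample rules, is_allowed("https://example.com/private/data") is rejected, while the root path and any path outside /private/ are accepted.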

darc.parse.extract_links(link, html, check=False)[source]

Extract links from HTML document.

Parameters:
  • link (Link) – Original link of the HTML document.

  • html (Union[str, bytes]) – Content of the HTML document.

  • check (bool) – Whether to perform checks on extracted links; defaults to CHECK.

Return type:

List[Link]

Returns:

List of extracted links.
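As a rough illustration of HTML link extraction, the sketch below collects the href of every anchor tag and resolves it against the URL of the page. It uses the standard library's html.parser as a stand-in for whatever HTML parser darc actually uses. The names _LinkCollector and extract_hrefs are hypothetical, and the sketch works on plain strings rather than Link objects and performs no checks.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the raw ``href`` values of all ``<a>`` tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # Skip anchors without a usable href attribute.
            if name == "href" and value:
                self.hrefs.append(value)


def extract_hrefs(base_url: str, html: str) -> List[str]:
    """Return absolute URLs for every anchor in ``html``, resolved against ``base_url``."""
    collector = _LinkCollector()
    collector.feed(html)
    return [urljoin(base_url, href) for href in collector.hrefs]
```

Relative references such as next.html are resolved against the directory of the original page, which is why the original link is needed as a parameter.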

darc.parse.extract_links_from_text(link, text)[source]

Extract links from raw text source.

Parameters:
  • link (Link) – Original link of the source document.

  • text (str) – Content of source text document.

Return type:

List[Link]

Returns:

List of extracted links.

Important

The extraction is NOT fully reliable: no TLD checks are performed on the extracted links, and there is no guarantee that every link will be extracted.

The URL patterns used to extract links are defined by darc.parse.URL_PAT, and you may register your own expressions through the DARC_URL_PAT environment variable.

darc.parse.get_content_type(response)[source]

Get content type from response.

Parameters:

response (requests.Response) – Response object.

Return type:

str

Returns:

The content type from response.

Note

If the Content-Type header is not defined in response, the function will utilise magic to detect its content type.
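The fallback described in the note can be sketched as follows. This is a simplified model under stated assumptions, not the darc code: content_type_of takes a plain header mapping and body instead of a requests.Response, the header lookup is case-sensitive here, stripping parameters such as charset is an assumption, and toy_sniff is a hypothetical stand-in for a libmagic-based detector.

```python
from typing import Callable, Mapping


def content_type_of(
    headers: Mapping[str, str],
    body: bytes,
    sniff: Callable[[bytes], str],
) -> str:
    """Return the media type from ``headers``, or sniff it from ``body``."""
    raw = headers.get("Content-Type")
    if raw is None:
        # No header: fall back to content-based detection.
        return sniff(body)
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    return raw.split(";", 1)[0].strip().casefold()


def toy_sniff(body: bytes) -> str:
    """Hypothetical stand-in for a libmagic-based detector."""
    if body.lstrip().lower().startswith(b"<!doctype html"):
        return "text/html"
    return "application/octet-stream"
```

Passing the detector in as a parameter keeps the sketch free of third-party dependencies; in practice the detector would inspect the response body with magic.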

darc.parse.match_host(host)[source]

Check if the hostname is in the black list.

Parameters:

host (Optional[str]) – Hostname to be checked.

Return type:

bool

Returns:

Whether host is in the black list.

Note

If host is None, then it will always return True.
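One plausible way to combine a white list and a black list is sketched below. The list contents, the FALLBACK value, the precedence of the white list over the black list and the name host_is_blocked are all assumptions made for illustration; the real lists and the exact precedence come from the darc configuration and source.

```python
import re
from typing import List, Optional

# Hypothetical lists; the real ones come from the darc configuration.
HOST_WHITE_LIST: List[re.Pattern] = [re.compile(r".*\.example\.org", re.IGNORECASE)]
HOST_BLACK_LIST: List[re.Pattern] = [re.compile(r"ads\..*", re.IGNORECASE)]
FALLBACK = False  # assumed verdict for hosts that appear on neither list


def host_is_blocked(host: Optional[str]) -> bool:
    """Return True if ``host`` should be skipped."""
    # Mirror the note above: a missing hostname always counts as blocked.
    if host is None:
        return True
    # Assumed precedence: an explicit white list entry wins over the black list.
    if any(pattern.fullmatch(host) for pattern in HOST_WHITE_LIST):
        return False
    if any(pattern.fullmatch(host) for pattern in HOST_BLACK_LIST):
        return True
    return FALLBACK
```

The same shape would apply to content types and proxy types, each with its own pair of lists.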

darc.parse.match_mime(mime)[source]

Check if the content type is in the black list.

Parameters:

mime (str) – Content type to be checked.

Return type:

bool

Returns:

Whether mime is in the black list.

darc.parse.match_proxy(proxy)[source]

Check if the proxy type is in the black list.

Parameters:

proxy (str) – Proxy type to be checked.

Return type:

bool

Returns:

Whether proxy is in the black list.

Note

If proxy is script, then it will always return True.

darc.parse.URL_PAT: List[re.Pattern]

Regular expression patterns to match all reasonable URLs.

Currently, we have two builtin patterns:

  1. HTTP(S) and other regular URLs, e.g. WebSocket, IRC, etc.

     re.compile(r'(?P<url>((https?|wss?|irc):)?(//)?\w+(\.\w+)+/?\S*)', re.UNICODE)

  2. Bitcoin accounts, data URIs, (ED2K) magnet links, email addresses, telephone numbers, JavaScript functions, etc.

     re.compile(r'(?P<url>(bitcoin|data|ed2k|magnet|mailto|script|tel):\w+)', re.ASCII)

Environ:

DARC_URL_PAT

See also

The patterns are used in darc.parse.extract_links_from_text().
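To show how such patterns behave on raw text, the sketch below applies the two builtin expressions quoted above with re.finditer. The helper name find_urls is hypothetical, and the sketch returns plain strings rather than Link objects.

```python
import re
from typing import List

# The two builtin patterns quoted above.
URL_PAT: List[re.Pattern] = [
    re.compile(r'(?P<url>((https?|wss?|irc):)?(//)?\w+(\.\w+)+/?\S*)', re.UNICODE),
    re.compile(r'(?P<url>(bitcoin|data|ed2k|magnet|mailto|script|tel):\w+)', re.ASCII),
]


def find_urls(text: str) -> List[str]:
    """Return every substring of ``text`` matched by any pattern, in pattern order."""
    found: List[str] = []
    for pattern in URL_PAT:
        for match in pattern.finditer(text):
            found.append(match.group("url"))
    return found


print(find_urls("Visit https://example.com/index.html or call tel:12345"))
# → ['https://example.com/index.html', 'tel:12345']
```

Because the first pattern accepts any dotted word sequence, a file name such as notes.txt is also reported as a link, which is one reason the text-based extraction is less reliable than HTML parsing.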