Source Parsing
The darc.parse module provides auxiliary functions to read robots.txt, sitemaps and HTML documents. It also contains utility functions to check whether the proxy type, hostname and content type are in any of the black and white lists.
- darc.parse._check(temp_list)
Check hostname and proxy type of links.
- Parameters
- Return type
- Returns
List of links that match the requirements.
Note
If CHECK_NG is True, the function will directly call _check_ng() instead.
- darc.parse.check_robots(link)
Check if link is allowed in robots.txt.
- Parameters
link (Link) – The link object to be checked.
- Return type
- Returns
Whether link is allowed in robots.txt.
Note
For the root path of a URL, the function always returns True.
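The check described above can be approximated with the standard library. The following is a minimal sketch, not the darc implementation: the helper name sketch_check_robots is hypothetical, it takes a plain URL string and the robots.txt body instead of a Link object, and it assumes the wildcard user agent.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def sketch_check_robots(url, robots_text, user_agent="*"):
    """Return True if ``url`` may be fetched according to ``robots_text``.

    Hypothetical helper for illustration only; darc itself works on
    Link objects and its own stored copy of robots.txt.
    """
    # Mirror the documented behaviour: the root path is always allowed.
    if urlsplit(url).path in ("", "/"):
        return True

    parser = RobotFileParser()
    # parse() expects an iterable of lines, not a single string.
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


ROBOTS = "User-agent: *\nDisallow: /private/\n"
allowed_root = sketch_check_robots("https://example.com/", ROBOTS)
allowed_public = sketch_check_robots("https://example.com/public", ROBOTS)
allowed_private = sketch_check_robots("https://example.com/private/x", ROBOTS)
```

Short-circuiting on the root path keeps the entry page reachable even when robots.txt disallows everything, which matches the note on this function.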
- darc.parse.extract_links_from_text(link, text)
Extract links from raw text source.
- Parameters
- Return type
- Returns
List of extracted links.
Important
The extraction is NOT fully reliable: we do not perform TLD checks on the extracted links, and we cannot guarantee that all links will be extracted.
The URL patterns used to extract links are defined by darc.parse.URL_PAT, and you may register your own expressions via DARC_URL_PAT.
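To illustrate pattern-based extraction, here is a minimal sketch under stated assumptions. The helper sketch_extract_links is hypothetical and takes a base URL string rather than a Link object; resolving matches against the base with urljoin and dropping duplicates are assumptions, not confirmed darc behaviour. The two patterns are copied from the builtin list documented for darc.parse.URL_PAT.

```python
import re
from urllib.parse import urljoin

# Copies of the two builtin patterns documented for darc.parse.URL_PAT.
URL_PAT = [
    re.compile(r'(?P<url>((https?|wss?|irc):)?(//)?\w+(\.\w+)+/?\S*)', re.UNICODE),
    re.compile(r'(?P<url>(bitcoin|data|ed2k|magnet|mailto|script|tel):\w+)', re.ASCII),
]


def sketch_extract_links(base_url, text):
    """Collect every pattern match in ``text``, in order of discovery.

    Hypothetical helper for illustration only.
    """
    links = []
    for pattern in URL_PAT:
        for match in pattern.finditer(text):
            # Assumption: resolve each match against the page it came from.
            url = urljoin(base_url, match.group('url'))
            if url not in links:  # keep first occurrence only
                links.append(url)
    return links


sample = "Visit https://www.example.com/index.html or call tel:12345 now"
links = sketch_extract_links("https://example.org/", sample)
```

Because the first pattern ends in a greedy \S*, trailing punctuation that directly follows a URL in running text is captured as part of the match, which is one reason the results should be treated as candidates rather than validated links.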
- darc.parse.get_content_type(response)
Get content type from response.
- Parameters
response (requests.Response) – Response object.
- Return type
- Returns
The content type from response.
Note
If the Content-Type header is not defined in response, the function will utilise magic to detect its content type.
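The header-first, detection-second fallback can be sketched as follows. This is an illustration rather than the darc source: sketch_content_type and its detect parameter are hypothetical, stripping header parameters such as charset and lowercasing the result are assumptions, and the magic branch assumes the third-party python-magic package.

```python
from types import SimpleNamespace


def sketch_content_type(response, detect=None):
    """Return the media type of ``response`` as a lowercase string.

    ``response`` only needs ``headers`` (a mapping) and ``content``
    (bytes), as a requests.Response provides. Hypothetical helper.
    """
    content_type = response.headers.get("Content-Type")
    if content_type is None:
        if detect is None:
            # Assumption: python-magic is installed; imported lazily so
            # the header path works without it.
            import magic

            def detect(data):
                return magic.from_buffer(data, mime=True)
        content_type = detect(response.content)
    # Assumption: drop parameters such as "; charset=utf-8".
    return content_type.split(";", 1)[0].strip().lower()


with_header = SimpleNamespace(
    headers={"Content-Type": "Text/HTML; charset=utf-8"},
    content=b"<html></html>",
)
header_type = sketch_content_type(with_header)
```

Passing a custom detect callable makes the fallback testable without python-magic, for example sketch_content_type(resp, detect=lambda data: "application/octet-stream").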
- darc.parse.URL_PAT: List[re.Pattern]
Regular expression patterns to match all reasonable URLs.
Currently, we have two builtin patterns:
HTTP(S) and other regular URLs, e.g. WebSocket, IRC, etc.
re.compile(r'(?P<url>((https?|wss?|irc):)?(//)?\w+(\.\w+)+/?\S*)', re.UNICODE),
Bitcoin accounts, data URIs, (ED2K) magnet links, email addresses, telephone numbers, JavaScript functions, etc.
re.compile(r'(?P<url>(bitcoin|data|ed2k|magnet|mailto|script|tel):\w+)', re.ASCII)
- Environ
See also
The patterns are used in darc.parse.extract_links_from_text().