URL Utilities

The Link class is the key data structure of the darc project. It contains all the information required to identify a URL’s proxy type, hostname, path prefix when saving, etc.

The link module also provides several wrapper functions for the urllib.parse module.

Bases: object

Parsed link.

Parameters
Returns

Parsed link object.

Return type

Link

Note

Link is a dataclass object. It is safely hashable: its hash is computed as hash(url).

__hash__()[source]

Provide hash support to the Link object.

Return type

int

asdict()[source]

Convert to a dict instance.

Return type

Dict[str, Any]

base: str

base folder for saving files

host: Optional[str]

URL’s hostname

name: str

hashed link for saving files

proxy: str

proxy type

url: str

original link

url_backref: Optional[Link] = None

optional Link instance from which current link was extracted

url_parse: ParseResult

parsed URL from urllib.parse.urlparse()
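The attributes and methods listed above can be pictured with a minimal sketch. LinkSketch is a hypothetical stand-in written for illustration, not the darc class itself; only the field names, the hash(url) behaviour and the asdict() method are taken from this page, and the values used in the example (including the placeholder name) are assumptions.

```python
import dataclasses
from typing import Any, Dict, Optional
from urllib.parse import ParseResult, urlparse


@dataclasses.dataclass
class LinkSketch:
    """Hypothetical stand-in mirroring the documented Link fields."""

    url: str                     # original link
    proxy: str                   # proxy type
    url_parse: ParseResult       # parsed URL
    host: Optional[str]          # URL's hostname
    base: str                    # base folder for saving files
    name: str                    # hashed link for saving files
    url_backref: Optional["LinkSketch"] = None

    def __hash__(self) -> int:
        # An explicit __hash__ is kept by @dataclass, so instances can be
        # stored in sets and used as dict keys.
        return hash(self.url)

    def asdict(self) -> Dict[str, Any]:
        return dataclasses.asdict(self)


url = "https://www.example.com/"
link = LinkSketch(
    url=url,
    proxy="null",
    url_parse=urlparse(url),
    host="www.example.com",
    base="data/null/https/www.example.com",  # assumed layout, for illustration
    name="0" * 64,                            # placeholder, not a real digest
)
seen = {link}
```

Because the hash depends only on url, membership tests such as link in seen stay cheap even though the object carries a full parse result.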

Parse link.

Parameters
Keyword Arguments

backref – optional Link instance from which current link was extracted

Return type

Link

Returns

The parsed link object.

Note

If host is provided, it will override the hostname of the original link.

The parsing process of proxy type is as follows:

  1. If host is None and the parse result from urllib.parse.urlparse() has no netloc (or hostname) specified, then set hostname to (null); otherwise, keep it as is.

  2. If the scheme is data, then the link is a data URI, set hostname as data and proxy as data.

  3. If the scheme is javascript, then the link is JavaScript code, set proxy as script.

  4. If the scheme is bitcoin, then the link is a Bitcoin address, set proxy as bitcoin.

  5. If the scheme is ethereum, then the link is an Ethereum address, set proxy as ethereum.

  6. If the scheme is ed2k, then the link is an ED2K magnet link, set proxy as ed2k.

  7. If the scheme is magnet, then the link is a magnet link, set proxy as magnet.

  8. If the scheme is mailto, then the link is an email address, set proxy as mail.

  9. If the scheme is irc, then the link is an IRC link, set proxy as irc.

  10. If the scheme is neither http nor https, then set proxy to the scheme.

  11. If the host is None, set hostname to (null) and set proxy to null.

  12. If the host is an onion (.onion) address, set proxy to tor.

  13. If the host is an I2P (.i2p) address, or any of localhost:7657 and localhost:7658, set proxy to i2p.

  14. If the host is localhost on ZERONET_PORT, and the path is not /, i.e. NOT root path, set proxy to zeronet; and set the first part of its path as hostname.

    Example:

    For a ZeroNet address, e.g., http://127.0.0.1:43110/1HeLLo4uzjaLetFx6NH3PMwFP3qbRbTf3D, parse_link() will parse the hostname as 1HeLLo4uzjaLetFx6NH3PMwFP3qbRbTf3D.

  15. If the host is localhost on FREENET_PORT, and the path is not /, i.e. NOT root path, set proxy to freenet; and set the first part of its path as hostname.

    Example:

    For a Freenet address, e.g., http://127.0.0.1:8888/USK@nwa8lHa271k2QvJ8aa0Ov7IHAV-DFOCFgmDt3X6BpCI,DuQSUZiI~agF8c-6tjsFFGuZ8eICrzWCILB60nT8KKo,AQACAAE/sone/77/, parse_link() will parse the hostname as USK@nwa8lHa271k2QvJ8aa0Ov7IHAV-DFOCFgmDt3X6BpCI,DuQSUZiI~agF8c-6tjsFFGuZ8eICrzWCILB60nT8KKo,AQACAAE.

  16. If the host is a proxied onion (.onion.sh) address, set proxy to tor2web.

  17. If none of the cases above is satisfied, the proxy will be set to null, marking it as a plain normal link.
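The steps above can be sketched as a single function. This is an illustrative approximation only, not the darc implementation: the function name classify, the port values (read off the example URLs above), the set of names treated as localhost, and the handling of an empty path as the root path are all assumptions. The sketch also assumes absolute URLs.

```python
from typing import Optional, Tuple
from urllib.parse import urlparse

# Assumptions for this sketch: ports are taken from the example URLs.
ZERONET_PORT = 43110
FREENET_PORT = 8888
LOCAL_HOSTS = {"localhost", "127.0.0.1"}

# Steps 3-9: schemes that map directly to a proxy type.
SCHEME_TO_PROXY = {
    "javascript": "script",
    "bitcoin": "bitcoin",
    "ethereum": "ethereum",
    "ed2k": "ed2k",
    "magnet": "magnet",
    "mailto": "mail",
    "irc": "irc",
}


def classify(url: str, host: Optional[str] = None) -> Tuple[str, str]:
    """Return (proxy, hostname) following the numbered steps."""
    parsed = urlparse(url)
    scheme = parsed.scheme
    hostname = host or parsed.hostname             # step 1
    shown = hostname or "(null)"

    if scheme == "data":                           # step 2
        return "data", "data"
    if scheme in SCHEME_TO_PROXY:                  # steps 3-9
        return SCHEME_TO_PROXY[scheme], shown
    if scheme not in ("http", "https"):            # step 10
        return scheme, shown
    if hostname is None:                           # step 11
        return "null", "(null)"

    name = hostname.lower()
    try:
        port = parsed.port
    except ValueError:                             # malformed port number
        port = None
    local = name in LOCAL_HOSTS
    first = parsed.path.strip("/").split("/")[0]   # "" for the root path

    if name.endswith(".onion"):                    # step 12
        return "tor", hostname
    if name.endswith(".i2p") or (local and port in (7657, 7658)):  # step 13
        return "i2p", hostname
    if local and port == ZERONET_PORT and first:   # step 14
        return "zeronet", first
    if local and port == FREENET_PORT and first:   # step 15
        return "freenet", first
    if name.endswith(".onion.sh"):                 # step 16
        return "tor2web", hostname
    return "null", hostname                        # step 17
```

Under these assumptions, the ZeroNet example address from step 14 yields the proxy zeronet with 1HeLLo4uzjaLetFx6NH3PMwFP3qbRbTf3D as hostname. The order of the checks matters: scheme-based rules run before any host-based rule, so a non-HTTP link never reaches the Tor, I2P, ZeroNet or Freenet checks.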

The base of a parsed Link object is defined as

<root>/<proxy>/<scheme>/<hostname>/

where root is PATH_DB.

The name of a parsed Link object is the SHA-256 hash (cf. hashlib.sha256()) of the original link.
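As a rough illustration of how base and name could be derived, consider the sketch below. The helper names link_base and link_name, the value of PATH_DB, the UTF-8 encoding of the URL and the use of the hexadecimal digest are assumptions made for this example; the page only states the path layout and the use of SHA-256.

```python
import hashlib
import os

PATH_DB = "data"  # hypothetical root folder, stands in for the real PATH_DB


def link_base(proxy: str, scheme: str, hostname: str, root: str = PATH_DB) -> str:
    """Build <root>/<proxy>/<scheme>/<hostname> (illustrative helper)."""
    return os.path.join(root, proxy, scheme, hostname)


def link_name(url: str) -> str:
    """Hex SHA-256 digest of the original link (UTF-8 encoding assumed)."""
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


base = link_base("tor", "http", "example.onion")
name = link_name("http://example.onion/")
```

Hashing the full original link gives every URL a fixed-length, filesystem-safe file name, while the base folder groups saved files by proxy type, scheme and host.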

darc.link.quote(string, safe='/', encoding=None, errors=None)[source]

Wrapper function for urllib.parse.quote().

Parameters
Return type

str

Returns

The quoted string.

Note

The function suppresses possible errors when calling urllib.parse.quote(). If an error occurs, it returns the original string.
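The behaviour described in this note can be approximated as follows. The name safe_quote is hypothetical, and catching every Exception is an assumption, since the page does not say which errors are suppressed.

```python
import urllib.parse


def safe_quote(string, safe="/", encoding=None, errors=None):
    """Quote a string, falling back to the input on failure (sketch)."""
    try:
        return urllib.parse.quote(string, safe, encoding=encoding, errors=errors)
    except Exception:  # assumption: any error is suppressed
        return string


quoted = safe_quote("a b/c")
# A lone surrogate cannot be encoded as UTF-8, so quote() raises
# UnicodeEncodeError and the original string comes back unchanged.
fallback = safe_quote("\ud800")
```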

darc.link.unquote(string, encoding='utf-8', errors='replace')[source]

Wrapper function for urllib.parse.unquote().

Parameters
  • string (str) – string to be unquoted

  • encoding (str) – string encoding

  • errors (str) – encoding error handler

Return type

str

Returns

The unquoted string.

Note

The function suppresses possible errors when calling urllib.parse.unquote(). If an error occurs, it returns the original string.

darc.link.urljoin(base, url, allow_fragments=True)[source]

Wrapper function for urllib.parse.urljoin().

Parameters
  • base (AnyStr) – base URL

  • url (AnyStr) – URL to be joined

  • allow_fragments (bool) – whether to allow fragments

Return type

AnyStr

Returns

The joined URL.

Note

The function suppresses possible errors when calling urllib.parse.urljoin(). If an error occurs, it returns base/url directly.
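A minimal sketch of this fallback is given below. It assumes that "base/url" means the two strings joined with a slash and that only ValueError is suppressed; the name safe_urljoin is hypothetical and the sketch handles str input only.

```python
import urllib.parse


def safe_urljoin(base: str, url: str, allow_fragments: bool = True) -> str:
    """Join two URLs, falling back to plain concatenation (sketch)."""
    try:
        return urllib.parse.urljoin(base, url, allow_fragments=allow_fragments)
    except ValueError:  # assumption: e.g. an unbalanced IPv6 bracket in base
        return f"{base}/{url}"


joined = safe_urljoin("http://example.com/a/b", "c")
# "http://[::1" has an opening bracket without a closing one, which makes
# the standard library raise ValueError("Invalid IPv6 URL").
fallback = safe_urljoin("http://[::1", "page")
```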

darc.link.urlparse(url, scheme='', allow_fragments=True)[source]

Wrapper function for urllib.parse.urlparse().

Parameters
  • url (str) – URL to be parsed

  • scheme (str) – URL scheme

  • allow_fragments (bool) – whether to allow fragments

Return type

ParseResult

Returns

The parse result.

Note

The function suppresses possible errors when calling urllib.parse.urlparse(). If an error occurs, it returns urllib.parse.ParseResult(scheme=scheme, netloc='', path=url, params='', query='', fragment='') directly.
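The fallback result can be illustrated with a short sketch. The name safe_urlparse is hypothetical, and suppressing only ValueError is an assumption; the fallback value itself is the one quoted in the note above.

```python
import urllib.parse
from urllib.parse import ParseResult


def safe_urlparse(url: str, scheme: str = "", allow_fragments: bool = True) -> ParseResult:
    """Parse a URL, returning a path-only result on failure (sketch)."""
    try:
        return urllib.parse.urlparse(url, scheme, allow_fragments=allow_fragments)
    except ValueError:  # assumption: only ValueError is suppressed
        return ParseResult(scheme=scheme, netloc="", path=url,
                           params="", query="", fragment="")


ok = safe_urlparse("https://www.example.com/index.html?q=1")
# The malformed URL ends up whole in the path field, so callers still
# receive a ParseResult and can inspect the raw input.
bad = safe_urlparse("http://[::1")
```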

darc.link.urlsplit(url, scheme='', allow_fragments=True)[source]

Wrapper function for urllib.parse.urlsplit().

Parameters
  • url (str) – URL to be split

  • scheme (str) – URL scheme

  • allow_fragments (bool) – whether to allow fragments

Return type

SplitResult

Returns

The split result.

Note

The function suppresses possible errors when calling urllib.parse.urlsplit(). If an error occurs, it returns urllib.parse.SplitResult(scheme=scheme, netloc='', path=url, query='', fragment='') directly. Note that SplitResult, unlike ParseResult, has no params field.