darc.db._db_operation(operation, *args, **kwargs)[source]

Retry operation on database.

Parameters:
  • operation (Callable[..., TypeVar(_T)]) – Callable / method to perform.

  • *args (Any) – Arbitrary positional arguments.

  • **kwargs (Any) – Arbitrary keyword arguments.

Return type:

TypeVar(_T)

Returns:

Any return value from a successful operation call.

darc.db._drop_hostname_db(link)[source]

Remove link from the hostname database.

The function updates the HostnameQueueModel table.

Parameters:

link (Link) – Link to be removed.

Return type:

None

darc.db._drop_hostname_redis(link)[source]

Remove link from the hostname database.

The function updates the queue_hostname database.

Parameters:

link (Link) – Link to be removed.

Return type:

None

darc.db._drop_requests_db(link)[source]

Remove link from the requests database.

The function updates the RequestsQueueModel table.

Parameters:

link (Link) – Link to be removed.

Return type:

None

darc.db._drop_requests_redis(link)[source]

Remove link from the requests database.

The function updates the queue_requests database.

Parameters:

link (Link) – Link to be removed.

Return type:

None

darc.db._drop_selenium_db(link)[source]

Remove link from the selenium database.

The function updates the SeleniumQueueModel table.

Parameters:

link (Link) – Link to be removed.

Return type:

None

darc.db._drop_selenium_redis(link)[source]

Remove link from the selenium database.

The function updates the queue_selenium database.

Parameters:

link (Link) – Link to be removed.

Return type:

None

darc.db._gen_arg_msg(*args, **kwargs)[source]

Sanitise arguments representation string.

Parameters:
  • *args (Any) – Arbitrary arguments.

  • **kwargs (Any) – Arbitrary keyword arguments.

Return type:

str

Returns:

Sanitised arguments representation string.
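As a rough illustration of what such a helper might do, the sketch below builds a call-style argument string and truncates long representations. The names gen_arg_msg and _short, and the 40-character limit, are assumptions made for this example; the actual sanitisation rules of darc.db._gen_arg_msg are not described here.

```python
def _short(value, limit=40):
    """Return repr(value), truncated to ``limit`` characters (hypothetical rule)."""
    text = repr(value)
    if len(text) <= limit:
        return text
    return text[:limit - 3] + "..."


def gen_arg_msg(*args, **kwargs):
    """Build a call-style argument string from positional and keyword arguments."""
    parts = [_short(arg) for arg in args]
    parts.extend(f"{key}={_short(value)}" for key, value in kwargs.items())
    return ", ".join(parts)


message = gen_arg_msg(1, "a", key=2)
```

A string like this is convenient for log and warning messages, where a full dump of a large argument would drown out the rest of the line.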

darc.db._have_hostname_db(link)[source]

Check if current link is a new host.

The function checks the HostnameQueueModel table.

Parameters:

link (Link) – Link to check against.

Return type:

Tuple[bool, bool]

Returns:

A tuple of two bool values indicating, respectively, whether the link's host is already known and whether it needs a forced refetch.

darc.db._have_hostname_redis(link)[source]

Check if current link is a new host.

The function checks the queue_hostname database.

Parameters:

link (Link) – Link to check against.

Return type:

Tuple[bool, bool]

Returns:

A tuple of two bool values indicating, respectively, whether the link's host is already known and whether it needs a forced refetch.

darc.db._load_requests_db()[source]

Load link from the requests database.

The function reads the RequestsQueueModel table.

Return type:

List[Link]

Returns:

List of loaded links from the requests database.

Note

At runtime, the function loads at most MAX_POOL links to limit memory usage.

darc.db._load_requests_redis()[source]

Load link from the requests database.

The function reads the queue_requests database.

Return type:

List[Link]

Returns:

List of loaded links from the requests database.

Note

At runtime, the function loads at most MAX_POOL links to limit memory usage.

darc.db._load_selenium_db()[source]

Load link from the selenium database.

The function reads the SeleniumQueueModel table.

Return type:

List[Link]

Returns:

List of loaded links from the selenium database.

Note

At runtime, the function loads at most MAX_POOL links to limit memory usage.

darc.db._load_selenium_redis()[source]

Load link from the selenium database.

The function reads the queue_selenium database.

Parameters:

check – Whether to perform checks on loaded links; defaults to CHECK.

Return type:

List[Link]

Returns:

List of loaded links from the selenium database.

Note

At runtime, the function loads at most MAX_POOL links to limit memory usage.

darc.db._redis_command(command, *args, **kwargs)[source]

Wrapper function for Redis command.

Parameters:
  • command (str) – Command name.

  • *args (Any) – Arbitrary arguments for the Redis command.

  • **kwargs (Any) – Arbitrary keyword arguments for the Redis command.

Return type:

Any

Returns:

Values returned from the Redis command.

Warns:

RedisCommandFailed – Issued on each round in which the command fails.

See also

Between retries, the function sleeps for RETRY_INTERVAL second(s), unless that value is None.
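The retry behaviour described above can be sketched roughly as follows. This is not the darc implementation: retry_command, CommandFailed (a stand-in for RedisCommandFailed) and the max_attempts guard are hypothetical names introduced for this example, and the sketch takes a callable rather than a Redis command name.

```python
import time
import warnings


class CommandFailed(Warning):
    """Hypothetical stand-in for the RedisCommandFailed warning."""


def retry_command(func, *args, retry_interval=None, max_attempts=None, **kwargs):
    """Call ``func`` until it succeeds, warning on every failed round."""
    attempt = 0
    while True:
        attempt += 1
        try:
            return func(*args, **kwargs)
        except Exception as error:
            warnings.warn(f"attempt {attempt} failed: {error!r}", CommandFailed)
            # Optional safety net (an assumption): give up after max_attempts.
            if max_attempts is not None and attempt >= max_attempts:
                raise
            # Sleep between rounds only when an interval is configured.
            if retry_interval is not None:
                time.sleep(retry_interval)
```

Emitting a warning rather than raising keeps a long-running crawler alive through transient outages while still leaving a trace of each failure.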

darc.db._redis_get_lock(key)[source]

Get a lock for Redis operations.

Parameters:

key (Literal['queue_hostname', 'queue_requests', 'queue_selenium']) – Lock target key.

Return type:

Union[Redlock, AbstractContextManager[TypeVar(T_co, covariant=True)]]

Returns:

A new pottery.redlock.Redlock object for the given key, which mimics the behavior of threading.Lock.

See also

If REDIS_LOCK is False, returns a contextlib.nullcontext instead.
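The switch between a real lock and a no-op context manager can be illustrated with the standard library alone. In this sketch threading.Lock stands in for pottery.redlock.Redlock, and get_lock, _locks and the enabled parameter are hypothetical names chosen for the example.

```python
import contextlib
import threading

REDIS_LOCK = False  # mirrors the documented default

_locks = {}  # one lock per target key (stand-in for Redis-backed locks)


def get_lock(key, enabled=REDIS_LOCK):
    """Return a context manager guarding operations on ``key``."""
    if not enabled:
        # Locking disabled: the ``with`` block runs without any guard.
        return contextlib.nullcontext()
    return _locks.setdefault(key, threading.Lock())


with get_lock("queue_requests"):
    pass  # queue operations would go here
```

Because both branches return a context manager, calling code can always use a with statement and never needs to check whether locking is enabled.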

darc.db._save_requests_db(entries, single=False, score=None, nx=False, xx=False)[source]

Save link to the requests database.

The function updates the RequestsQueueModel table.

Parameters:
  • entries (Union[Link, List[Link]]) – Links to be added to the requests database. It can be either a list of links, or a single link (if single is set to True).

  • single (bool) – Whether entries is a single link rather than a list of links.

  • score (Optional[float]) – Score for the Redis sorted set.

  • nx (bool) – Only create new elements and not to update scores for elements that already exist.

  • xx (bool) – Only update scores of elements that already exist. New elements will not be added.

Return type:

None

darc.db._save_requests_redis(entries, single=False, score=None, nx=False, xx=False)[source]

Save link to the requests database.

The function updates the queue_requests database.

Parameters:
  • entries (Union[Link, List[Link]]) – Links to be added to the requests database. It can be either a list of links, or a single link (if single is set to True).

  • single (bool) – Whether entries is a single link rather than a list of links.

  • score (Optional[float]) – Score for the Redis sorted set.

  • nx (bool) – Forces ZADD to only create new elements and not to update scores for elements that already exist.

  • xx (bool) – Forces ZADD to only update scores of elements that already exist. New elements will not be added.

Return type:

None
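The effect of the nx and xx flags can be illustrated with a plain dictionary standing in for the Redis sorted set. The zadd helper below is a hypothetical model written for this example, not a Redis client call.

```python
def zadd(store, member, score, nx=False, xx=False):
    """Model ZADD's NX/XX flags on a dict mapping member -> score.

    Returns True if the store was modified, False otherwise.
    """
    if nx and xx:
        raise ValueError("nx and xx are mutually exclusive")
    exists = member in store
    if nx and exists:
        return False  # NX: never update an existing member
    if xx and not exists:
        return False  # XX: never create a new member
    store[member] = score
    return True


queue = {}
zadd(queue, "https://example.com/", 1.0, nx=True)  # created
zadd(queue, "https://example.com/", 5.0, nx=True)  # ignored, score stays 1.0
zadd(queue, "https://example.com/", 5.0, xx=True)  # updated to 5.0
```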

darc.db._save_selenium_db(entries, single=False, score=None, nx=False, xx=False)[source]

Save link to the selenium database.

The function updates the SeleniumQueueModel table.

Parameters:
  • entries (Union[Link, List[Link]]) – Links to be added to the selenium database. It can be either a list of links, or a single link (if single is set to True).

  • single (bool) – Whether entries is a single link rather than a list of links.

  • score (Optional[float]) – Score for the Redis sorted set.

  • nx (bool) – Only create new elements and not to update scores for elements that already exist.

  • xx (bool) – Only update scores of elements that already exist. New elements will not be added.

Return type:

None

darc.db._save_selenium_redis(entries, single=False, score=None, nx=False, xx=False)[source]

Save link to the selenium database.

The function updates the queue_selenium database.

Parameters:
  • entries (Union[Link, List[Link]]) – Links to be added to the selenium database. It can be either an iterable of links, or a single link (if single is set to True).

  • single (bool) – Whether entries is a single link rather than an iterable of links.

  • score (Optional[float]) – Score for the Redis sorted set.

  • nx (bool) – Forces ZADD to only create new elements and not to update scores for elements that already exist.

  • xx (bool) – Forces ZADD to only update scores of elements that already exist. New elements will not be added.

Return type:

None

When entries is a list of Link instances, the function tries to perform bulk updates to ease memory consumption. The bulk size is defined by BULK_SIZE.

Notes

The entries will be dumped through pickle so that darc does not need to parse them again.
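The two notes above, bulk updates and pickle serialisation, can be sketched as follows. The helpers iter_bulks and dump_bulk are hypothetical names for this example, and plain strings stand in for Link instances.

```python
import pickle
from itertools import islice

BULK_SIZE = 100  # mirrors the documented default


def iter_bulks(entries, size=BULK_SIZE):
    """Yield successive lists of at most ``size`` entries."""
    iterator = iter(entries)
    while True:
        bulk = list(islice(iterator, size))
        if not bulk:
            return
        yield bulk


def dump_bulk(bulk):
    """Serialise each entry with pickle so it can be restored without re-parsing."""
    return [pickle.dumps(entry) for entry in bulk]


links = [f"https://example.com/{index}" for index in range(250)]
payloads = [dump_bulk(bulk) for bulk in iter_bulks(links)]
```

Processing one bulk at a time keeps only BULK_SIZE serialised entries in flight, rather than the whole list.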

darc.db.drop_hostname(link)[source]

Remove link from the hostname database.

Parameters:

link (Link) – Link to be removed.

Return type:

None

darc.db.drop_requests(link)[source]

Remove link from the requests database.

Parameters:

link (Link) – Link to be removed.

Return type:

None

darc.db.drop_selenium(link)[source]

Remove link from the selenium database.

Parameters:

link (Link) – Link to be removed.

Return type:

None

darc.db.have_hostname(link)[source]

Check if current link is a new host.

Parameters:

link (Link) – Link to check against.

Return type:

Tuple[bool, bool]

Returns:

A tuple of two bool values indicating, respectively, whether the link's host is already known and whether it needs a forced refetch.

darc.db.load_requests(check=False)[source]

Load link from the requests database.

Parameters:

check (bool) – Whether to perform checks on loaded links; defaults to CHECK.

Return type:

List[Link]

Returns:

List of loaded links from the requests database.

Note

At runtime, the function loads at most MAX_POOL links to limit memory usage.

darc.db.load_selenium(check=False)[source]

Load link from the selenium database.

Parameters:

check (bool) – Whether to perform checks on loaded links; defaults to CHECK.

Return type:

List[Link]

Returns:

List of loaded links from the selenium database.

Note

At runtime, the function loads at most MAX_POOL links to limit memory usage.

darc.db.save_requests(entries, single=False, score=None, nx=False, xx=False)[source]

Save link to the requests database.

The function updates the queue_requests database.

Parameters:
  • entries (Union[Link, List[Link]]) – Links to be added to the requests database. It can be either a list of links, or a single link (if single is set to True).

  • single (bool) – Whether entries is a single link rather than a list of links.

  • score (Optional[float]) – Score for the Redis sorted set.

  • nx (bool) – Only create new elements and not to update scores for elements that already exist.

  • xx (bool) – Only update scores of elements that already exist. New elements will not be added.

Return type:

None

Notes

The entries will be dumped through pickle so that darc does not need to parse them again.

When entries is a list of Link instances, the function tries to perform bulk updates to ease memory consumption. The bulk size is defined by BULK_SIZE.

darc.db.save_selenium(entries, single=False, score=None, nx=False, xx=False)[source]

Save link to the selenium database.

Parameters:
  • entries (Union[Link, List[Link]]) – Links to be added to the selenium database. It can be either a list of links, or a single link (if single is set to True).

  • single (bool) – Whether entries is a single link rather than a list of links.

  • score (Optional[float]) – Score for the Redis sorted set.

  • nx (bool) – Only create new elements and not to update scores for elements that already exist.

  • xx (bool) – Only update scores of elements that already exist. New elements will not be added.

Return type:

None

Notes

The entries will be dumped through pickle so that darc does not need to parse them again.

When entries is a list of Link instances, the function tries to perform bulk updates to ease memory consumption. The bulk size is defined by BULK_SIZE.

darc.db.BULK_SIZE: int
Default:

100

Environ:

DARC_BULK_SIZE

Bulk size for updating Redis databases.

darc.db.LOCK_TIMEOUT: float | None
Default:

10

Environ:

DARC_LOCK_TIMEOUT

Lock blocking timeout.

Note

If set to infinity (inf), no timeout will be applied.

See also

Get a lock from darc.db._redis_get_lock().
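One plausible way to derive such a setting from the environment is sketched below. The get_timeout helper is hypothetical, and mapping inf to None is an assumption suggested by the float | None annotation rather than confirmed behaviour.

```python
import math
import os


def get_timeout(environ=os.environ, name="DARC_LOCK_TIMEOUT", default="10"):
    """Read a timeout in seconds; an infinite value is treated as "no timeout"."""
    value = float(environ.get(name, default))
    # Assumption: inf is represented as None, meaning block indefinitely.
    return None if math.isinf(value) else value


LOCK_TIMEOUT = get_timeout()
```

Since float() accepts the string "inf", setting DARC_LOCK_TIMEOUT=inf is enough to disable the timeout in this sketch.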

darc.db.MAX_POOL: int
Default:

1_000

Environ:

DARC_MAX_POOL

Maximum number of links to load from the database.

Note

If set to infinity (inf), no limit will be applied.
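A minimal sketch of how such a cap might be applied when loading links, with infinity meaning "no limit". The load_limited helper is a hypothetical name, and a plain iterable stands in for the database queue.

```python
import math
from itertools import islice

MAX_POOL = 1_000  # mirrors the documented default


def load_limited(queue, max_pool=MAX_POOL):
    """Return at most ``max_pool`` items from ``queue``; inf disables the cap."""
    limit = None if math.isinf(max_pool) else int(max_pool)
    # islice with a stop of None consumes the whole iterable.
    return list(islice(queue, limit))


pool = load_limited(range(5_000))
everything = load_limited(range(5_000), max_pool=math.inf)
```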

darc.db.REDIS_LOCK: bool
Default:

False

Environ:

DARC_REDIS_LOCK

Whether to use a Redis (Lua) lock to ensure process-safe and thread-safe operations.

See also

Toggles the behaviour of darc.db._redis_get_lock().

darc.db.RETRY_INTERVAL: int
Default:

10

Environ:

DARC_RETRY

Interval, in seconds, between retries after a Redis command failure.

Note

If set to infinity (inf), no interval will be applied.

See also

Controls the retry behaviour of darc.db._redis_command().