darc.db._db_operation(operation, *args, **kwargs)[source]

Retry operation on database.

Parameters
  • operation (Callable[..., TypeVar(_T)]) – Callable / method to perform.

  • *args – Arbitrary positional arguments.

  • **kwargs (Any) – Arbitrary keyword arguments.

Return type

TypeVar(_T)

Returns

Any return value from a successful operation call.

darc.db._drop_hostname_db(link)[source]

Remove link from the hostname database.

The function updates the HostnameQueueModel table.

Parameters

link (Link) – Link to be removed.

Return type

None

darc.db._drop_hostname_redis(link)[source]

Remove link from the hostname database.

The function updates the queue_hostname database.

Parameters

link (Link) – Link to be removed.

Return type

None

darc.db._drop_requests_db(link)[source]

Remove link from the requests database.

The function updates the RequestsQueueModel table.

Parameters

link (Link) – Link to be removed.

Return type

None

darc.db._drop_requests_redis(link)[source]

Remove link from the requests database.

The function updates the queue_requests database.

Parameters

link (Link) – Link to be removed.

Return type

None

darc.db._drop_selenium_db(link)[source]

Remove link from the selenium database.

The function updates the SeleniumQueueModel table.

Parameters

link (Link) – Link to be removed.

Return type

None

darc.db._drop_selenium_redis(link)[source]

Remove link from the selenium database.

The function updates the queue_selenium database.

Parameters

link (Link) – Link to be removed.

Return type

None

darc.db._gen_arg_msg(*args, **kwargs)[source]

Sanitise arguments representation string.

Parameters
  • *args (Any) – Arbitrary arguments.

  • **kwargs (Any) – Arbitrary keyword arguments.

Return type

str

Returns

Sanitised arguments representation string.
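The source does not say what the sanitising consists of. As a rough sketch only, the helper below builds an argument string for log messages and truncates long values; the name format_call_args and the max_len parameter are hypothetical and not part of darc.

```python
from typing import Any


def format_call_args(*args: Any, max_len: int = 30, **kwargs: Any) -> str:
    """Build a compact ``a, b, key=value`` string suitable for log messages."""

    def short(value: Any) -> str:
        # Truncate long representations so one huge argument cannot flood the log.
        text = repr(value)
        return text if len(text) <= max_len else text[:max_len - 3] + "..."

    parts = [short(arg) for arg in args]
    parts += [f"{name}={short(value)}" for name, value in kwargs.items()]
    return ", ".join(parts)
```

For example, format_call_args(1, "a", key=2) produces the string 1, 'a', key=2.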

darc.db._have_hostname_db(link)[source]

Check if current link is a new host.

The function checks the HostnameQueueModel table.

Parameters

link (Link) – Link to check against.

Return type

Tuple[bool, bool]

Returns

A tuple of two bool values indicating whether the link is a known host and whether it needs a forced refetch, respectively.

darc.db._have_hostname_redis(link)[source]

Check if current link is a new host.

The function checks the queue_hostname database.

Parameters

link (Link) – Link to check against.

Return type

Tuple[bool, bool]

Returns

A tuple of two bool values indicating whether the link is a known host and whether it needs a forced refetch, respectively.

darc.db._load_requests_db()[source]

Load link from the requests database.

The function reads the RequestsQueueModel table.

Return type

List[Link]

Returns

List of loaded links from the requests database.

Note

At runtime, the function loads at most MAX_POOL links to limit memory usage.
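A minimal sketch of this cap, assuming the links come from some iterable source. The function name load_capped is hypothetical; only MAX_POOL, its default, and the inf behaviour come from this page.

```python
import itertools
import math
from typing import Iterable, List

MAX_POOL = 1_000  # documented default of darc.db.MAX_POOL


def load_capped(source: Iterable[str], max_pool: float = MAX_POOL) -> List[str]:
    """Return at most ``max_pool`` items from ``source``."""
    if math.isinf(max_pool):
        # An infinite limit means no cap is applied.
        return list(source)
    # islice stops consuming the source once the cap is reached.
    return list(itertools.islice(source, int(max_pool)))
```

Because islice is lazy, the remainder of the source is never materialised in memory.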

darc.db._load_requests_redis()[source]

Load link from the requests database.

The function reads the queue_requests database.

Return type

List[Link]

Returns

List of loaded links from the requests database.

Note

At runtime, the function loads at most MAX_POOL links to limit memory usage.

darc.db._load_selenium_db()[source]

Load link from the selenium database.

The function reads the SeleniumQueueModel table.

Return type

List[Link]

Returns

List of loaded links from the selenium database.

Note

At runtime, the function loads at most MAX_POOL links to limit memory usage.

darc.db._load_selenium_redis()[source]

Load link from the selenium database.

The function reads the queue_selenium database.

Parameters

check – Whether to perform checks on loaded links; defaults to CHECK.

Return type

List[Link]

Returns

List of loaded links from the selenium database.

Note

At runtime, the function loads at most MAX_POOL links to limit memory usage.

darc.db._redis_command(command, *args, **kwargs)[source]

Wrapper function for Redis command.

Parameters
  • command (str) – Command name.

  • *args – Arbitrary arguments for the Redis command.

  • **kwargs (Any) – Arbitrary keyword arguments for the Redis command.

Return type

Any

Returns

Values returned from the Redis command.

Warns

RedisCommandFailed – Issued on every round in which the command fails.

See also

Between retries, the function sleeps for RETRY_INTERVAL second(s) unless that value is None.
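The warn-and-retry behaviour described above could look roughly like the sketch below. It is not darc's implementation: it takes a callable rather than a Redis command name, and run_with_retry, CommandFailedWarning and max_rounds are hypothetical names introduced for illustration.

```python
import time
import warnings
from typing import Any, Callable, Optional

RETRY_INTERVAL: Optional[float] = 10  # documented default of darc.db.RETRY_INTERVAL


class CommandFailedWarning(Warning):
    """Stand-in for the RedisCommandFailed warning category."""


def run_with_retry(command: Callable[..., Any], *args: Any,
                   retry_interval: Optional[float] = RETRY_INTERVAL,
                   max_rounds: int = 5, **kwargs: Any) -> Any:
    """Call ``command`` repeatedly, warning on each failed round."""
    for round_no in range(1, max_rounds + 1):
        try:
            return command(*args, **kwargs)
        except Exception as error:  # a real wrapper would catch Redis errors only
            warnings.warn(f"round {round_no} failed: {error!r}",
                          CommandFailedWarning, stacklevel=2)
            if round_no == max_rounds:
                raise
            if retry_interval is not None:
                # Sleep only when an interval is configured.
                time.sleep(retry_interval)
    raise ValueError("max_rounds must be at least 1")
```

The cap on rounds is an assumption made so the sketch terminates; the source does not say whether the real function ever gives up.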

darc.db._redis_get_lock(key)[source]

Get a lock for Redis operations.

Parameters

key (Literal['queue_hostname', 'queue_requests', 'queue_selenium']) – Lock target key.

Return type

Union[Redlock, AbstractContextManager[TypeVar(T_co, covariant=True)]]

Returns

A new pottery.redlock.Redlock object for the given key, which mimics the behavior of threading.Lock.

See also

If REDIS_LOCK is False, returns a contextlib.nullcontext instead.
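The toggle between a real lock and a no-op context manager can be sketched as follows. To stay self-contained, threading.Lock stands in for pottery.redlock.Redlock; get_lock, use_lock and the _locks registry are hypothetical names.

```python
import contextlib
import threading
from typing import ContextManager, Dict

REDIS_LOCK = False  # documented default of darc.db.REDIS_LOCK

_locks: Dict[str, threading.Lock] = {}  # one stand-in lock per queue key


def get_lock(key: str, use_lock: bool = REDIS_LOCK) -> ContextManager:
    """Return a lock for ``key``, or a no-op context manager when locking is off."""
    if not use_lock:
        # nullcontext lets callers write ``with get_lock(key):`` unconditionally.
        return contextlib.nullcontext()
    return _locks.setdefault(key, threading.Lock())
```

Either way the caller's code has the same shape: with get_lock('queue_requests'): followed by the queue operations.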

darc.db._save_requests_db(entries: Link, single: Literal[True], score: Optional[float] = None, nx: bool = False, xx: bool = False) -> None[source]
darc.db._save_requests_db(entries: List[Link], single: Literal[False] = False, score: Optional[float] = None, nx: bool = False, xx: bool = False) -> None

Save link to the requests database.

The function updates the RequestsQueueModel table.

Parameters
  • entries (Union[Link, List[Link]]) – Links to be added to the requests database. It can be either a list of links, or a single link (if single is set to True).

  • single (bool) – Indicates whether entries is a list of links or a single link.

  • score (Optional[float]) – Score for the Redis sorted set.

  • nx (bool) – Only create new elements and not to update scores for elements that already exist.

  • xx (bool) – Only update scores of elements that already exist. New elements will not be added.

Return type

None

darc.db._save_requests_redis(entries, single=False, score=None, nx=False, xx=False)[source]

Save link to the requests database.

The function updates the queue_requests database.

Parameters
  • entries (Union[Link, List[Link]]) – Links to be added to the requests database. It can be either a list of links, or a single link (if single is set to True).

  • single (bool) – Indicates whether entries is a list of links or a single link.

  • score (Optional[float]) – Score for the Redis sorted set.

  • nx (bool) – Forces ZADD to only create new elements and not to update scores for elements that already exist.

  • xx (bool) – Forces ZADD to only update scores of elements that already exist. New elements will not be added.

Return type

None
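To make the nx and xx flags concrete, here is a small in-memory model of the ZADD semantics described above. It uses a plain dict instead of a Redis sorted set, and zadd_model is a hypothetical name; it only illustrates which members are written.

```python
from typing import Dict, Iterable


def zadd_model(zset: Dict[str, float], members: Iterable[str], score: float,
               nx: bool = False, xx: bool = False) -> int:
    """Apply ZADD-like nx/xx rules to ``zset`` and return how many members were added."""
    if nx and xx:
        raise ValueError("nx and xx are mutually exclusive")
    added = 0
    for member in members:
        exists = member in zset
        if nx and exists:
            continue  # nx: never touch members that already exist
        if xx and not exists:
            continue  # xx: never create new members
        zset[member] = score
        if not exists:
            added += 1
    return added
```

With queue = {'a': 1.0}, calling zadd_model(queue, ['a', 'b'], 5.0, nx=True) adds only 'b' and leaves the score of 'a' unchanged.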

darc.db._save_selenium_db(entries: Link, single: Literal[True], score: Optional[float] = None, nx: bool = False, xx: bool = False) -> None[source]
darc.db._save_selenium_db(entries: List[Link], single: Literal[False] = False, score: Optional[float] = None, nx: bool = False, xx: bool = False) -> None

Save link to the selenium database.

The function updates the SeleniumQueueModel table.

Parameters
  • entries (Union[Link, List[Link]]) – Links to be added to the selenium database. It can be either a list of links, or a single link (if single is set to True).

  • single (bool) – Indicates whether entries is a list of links or a single link.

  • score (Optional[float]) – Score for the Redis sorted set.

  • nx (bool) – Only create new elements and not to update scores for elements that already exist.

  • xx (bool) – Only update scores of elements that already exist. New elements will not be added.

Return type

None

darc.db._save_selenium_redis(entries, single=False, score=None, nx=False, xx=False)[source]

Save link to the selenium database.

The function updates the queue_selenium database.

Parameters
  • entries (Union[Link, List[Link]]) – Links to be added to the selenium database. It can be either an iterable of links, or a single link (if single is set to True).

  • single (bool) – Indicates whether entries is an iterable of links or a single link.

  • score (Optional[float]) – Score for the Redis sorted set.

  • nx (bool) – Forces ZADD to only create new elements and not to update scores for elements that already exist.

  • xx (bool) – Forces ZADD to only update scores of elements that already exist. New elements will not be added.

Return type

None

Notes

The entries will be dumped through pickle so that darc does not need to parse them again.

When entries is a list of Link instances, darc tries to perform bulk updates to ease memory consumption. The bulk size is defined by BULK_SIZE.

darc.db.drop_hostname(link)[source]

Remove link from the hostname database.

Parameters

link (Link) – Link to be removed.

Return type

None

darc.db.drop_requests(link)[source]

Remove link from the requests database.

Parameters

link (Link) – Link to be removed.

Return type

None

darc.db.drop_selenium(link)[source]

Remove link from the selenium database.

Parameters

link (Link) – Link to be removed.

Return type

None

darc.db.have_hostname(link)[source]

Check if current link is a new host.

Parameters

link (Link) – Link to check against.

Return type

Tuple[bool, bool]

Returns

A tuple of two bool values indicating whether the link is a known host and whether it needs a forced refetch, respectively.

darc.db.load_requests(check=False)[source]

Load link from the requests database.

Parameters

check (bool) – Whether to perform checks on loaded links; defaults to CHECK.

Return type

List[Link]

Returns

List of loaded links from the requests database.

Note

At runtime, the function loads at most MAX_POOL links to limit memory usage.

darc.db.load_selenium(check=False)[source]

Load link from the selenium database.

Parameters

check (bool) – Whether to perform checks on loaded links; defaults to CHECK.

Return type

List[Link]

Returns

List of loaded links from the selenium database.

Note

At runtime, the function loads at most MAX_POOL links to limit memory usage.

darc.db.save_requests(entries: Link, single: Literal[True], score: Optional[float] = None, nx: bool = False, xx: bool = False) -> None[source]
darc.db.save_requests(entries: List[Link], single: Literal[False] = False, score: Optional[float] = None, nx: bool = False, xx: bool = False) -> None

Save link to the requests database.

The function updates the queue_requests database.

Parameters
  • entries (Union[Link, List[Link]]) – Links to be added to the requests database. It can be either a list of links, or a single link (if single is set to True).

  • single (bool) – Indicates whether entries is a list of links or a single link.

  • score (Optional[float]) – Score for the Redis sorted set.

  • nx (bool) – Only create new elements and not to update scores for elements that already exist.

  • xx (bool) – Only update scores of elements that already exist. New elements will not be added.

Notes

The entries will be dumped through pickle so that darc does not need to parse them again.

When entries is a list of Link instances, darc tries to perform bulk updates to ease memory consumption. The bulk size is defined by BULK_SIZE.

Return type

None
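The two notes above, pickling entries and writing them in bulks of BULK_SIZE, can be sketched together. This is an illustration under assumptions rather than darc's code: iter_bulks is a hypothetical helper and plain Python objects stand in for Link instances.

```python
import pickle
from typing import Any, Iterator, List, Sequence

BULK_SIZE = 100  # documented default of darc.db.BULK_SIZE


def iter_bulks(entries: Sequence[Any], bulk_size: int = BULK_SIZE) -> Iterator[List[bytes]]:
    """Yield pickled entries in chunks of at most ``bulk_size`` items."""
    for start in range(0, len(entries), bulk_size):
        chunk = entries[start:start + bulk_size]
        # Each entry is serialised once, so a reader can unpickle it directly
        # instead of parsing the link again.
        yield [pickle.dumps(entry) for entry in chunk]
```

Each yielded chunk would then be sent to the database in one call, so only one bulk of serialised entries is held in memory at a time.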

darc.db.save_selenium(entries: Link, single: Literal[True], score: Optional[float] = None, nx: bool = False, xx: bool = False) -> None[source]
darc.db.save_selenium(entries: List[Link], single: Literal[False] = False, score: Optional[float] = None, nx: bool = False, xx: bool = False) -> None

Save link to the selenium database.

Parameters
  • entries (Union[Link, List[Link]]) – Links to be added to the selenium database. It can be either a list of links, or a single link (if single is set to True).

  • single (bool) – Indicates whether entries is a list of links or a single link.

  • score (Optional[float]) – Score for the Redis sorted set.

  • nx (bool) – Only create new elements and not to update scores for elements that already exist.

  • xx (bool) – Only update scores of elements that already exist. New elements will not be added.

Notes

The entries will be dumped through pickle so that darc does not need to parse them again.

When entries is a list of Link instances, darc tries to perform bulk updates to ease memory consumption. The bulk size is defined by BULK_SIZE.

Return type

None
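The paired signatures above follow the typing.overload pattern, where the Literal type of single tells a type checker whether entries is one link or a list. A minimal sketch, with str standing in for Link and the hypothetical names save_links and SAVED:

```python
from typing import List, Literal, Optional, overload

SAVED: List[str] = []  # in-memory stand-in for the database


@overload
def save_links(entries: str, single: Literal[True],
               score: Optional[float] = None) -> None: ...
@overload
def save_links(entries: List[str], single: Literal[False] = False,
               score: Optional[float] = None) -> None: ...


def save_links(entries, single=False, score=None):
    """Store one link (single=True) or a list of links (single=False)."""
    # Normalise both call styles to a list before storing.
    links = [entries] if single else list(entries)
    SAVED.extend(links)
```

Both save_links('a', True) and save_links(['b', 'c']) type-check, while a type checker would flag save_links('a') because a bare string does not match the list overload.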

darc.db.BULK_SIZE: int
Default

100

Environ

DARC_BULK_SIZE

Bulk size for updating Redis databases.

darc.db.LOCK_TIMEOUT: Optional[float]
Default

10

Environ

DARC_LOCK_TIMEOUT

Lock blocking timeout.

Note

If the value is infinite (inf), no timeout is applied.

See also

Get a lock from darc.db.get_lock().
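As a sketch of how such a setting might be read, the helper below takes the value from the environment, falls back to the documented default, and maps an infinite value to None. The mapping to None is an assumption based on the Optional[float] type, and read_timeout is a hypothetical name.

```python
import math
import os
from typing import Mapping, Optional


def read_timeout(environ: Mapping[str, str] = os.environ,
                 name: str = "DARC_LOCK_TIMEOUT",
                 default: float = 10.0) -> Optional[float]:
    """Read a timeout in seconds; an infinite value means no timeout."""
    # float() accepts "inf", so DARC_LOCK_TIMEOUT=inf parses without special casing.
    value = float(environ.get(name, default))
    return None if math.isinf(value) else value
```

Passing the environment as a parameter keeps the helper easy to test without modifying os.environ.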

darc.db.MAX_POOL: int
Default

1_000

Environ

DARC_MAX_POOL

Maximum number of links loaded from the database at a time.

Note

If the value is infinite (inf), no limit is applied.

darc.db.REDIS_LOCK: bool
Default

False

Environ

DARC_REDIS_LOCK

Whether to use a Redis (Lua) lock to ensure process- and thread-safe operations.

See also

Toggles the behaviour of darc.db.get_lock().

darc.db.RETRY_INTERVAL: int
Default

10

Environ

DARC_RETRY

Interval between retries after a Redis command failure.

Note

If the value is infinite (inf), no interval is applied.

See also

Toggles the behaviour of darc.db.redis_command().