Submission Data Models

The darc.model.web module defines the data models used to store the data crawled by the darc project.

See also

Please refer to darc.submit module for more information about data submission.

Hostname Records

The darc.model.web.hostname module defines the data model representing hostnames, specifically from new_host submission.

See also

Please refer to darc.submit.submit_new_host() for more information.

class darc.model.web.hostname.HostnameModel(*args, **kwargs)

Bases: darc.model.abc.BaseModelWeb

Data model for a hostname record.

Important

The alive flag of a hostname is toggled when crawler() successfully requests a URL with that hostname.

DoesNotExist

alias of HostnameModelDoesNotExist

alive

Whether the hostname is still active.

The hostname is considered inactive only if all of its subsidiary URLs are inactive.

discovery: datetime.datetime = <DateTimeField: HostnameModel.discovery>

Timestamp of first new_host submission.

hostname: str = <TextField: HostnameModel.hostname>

Hostname (c.f. link.host).

hosts
id = <AutoField: HostnameModel.id>
last_seen: datetime.datetime = <DateTimeField: HostnameModel.last_seen>

Timestamp of last related submission.

proxy: darc.model.utils.Proxy = <IntEnumField: HostnameModel.proxy>

Proxy type (c.f. link.proxy).
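
As a rough illustration of how an integer-backed enumeration column behaves, the sketch below stores an enum member as its integer value and restores it on read. ProxyKind and its members are hypothetical and are not the members of darc.model.utils.Proxy; only the round trip through int is the point here.

```python
from enum import IntEnum


class ProxyKind(IntEnum):
    """Hypothetical proxy types, used for illustration only."""
    NULL = 0
    TOR = 1
    I2P = 2


# Value that would be written to the integer column.
stored = int(ProxyKind.TOR)

# Value restored from the column when the row is read back.
loaded = ProxyKind(stored)

print(stored, loaded.name)
```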

robots
since

The hostname has been active/inactive since this timestamp.

The timestamp is taken as the earliest timestamp among the related subsidiary active/inactive URLs.
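
The derived alive and since attributes can be read as an aggregation over the hostname's URLs. The following is a minimal sketch of that rule under stated assumptions, not the darc implementation: URLRecord, hostname_alive() and hostname_since() are hypothetical names, and restricting the minimum to URLs whose state matches the hostname's current state is one reading of the description above.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass
class URLRecord:
    """Hypothetical stand-in for a URL record with its state and timestamp."""
    alive: bool
    since: datetime


def hostname_alive(urls: List[URLRecord]) -> bool:
    # The hostname is inactive only if every subsidiary URL is inactive.
    return any(url.alive for url in urls)


def hostname_since(urls: List[URLRecord]) -> datetime:
    # Assumption: consider only URLs that share the hostname's current
    # state, then take the earliest of their timestamps.
    state = hostname_alive(urls)
    return min(url.since for url in urls if url.alive == state)


urls = [
    URLRecord(alive=False, since=datetime(2020, 1, 1)),
    URLRecord(alive=True, since=datetime(2020, 3, 1)),
    URLRecord(alive=True, since=datetime(2020, 2, 1)),
]
print(hostname_alive(urls))
print(hostname_since(urls))
```

In this sketch a hostname with no URLs at all has no defined since value, and min() raises ValueError for it.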

sitemaps
urls

URL Records

The darc.model.web.url module defines the data model representing URLs, specifically from requests and selenium submission.

See also

Please refer to darc.submit.submit_requests() and darc.submit.submit_selenium() for more information.

class darc.model.web.url.URLModel(*args, **kwargs)

Bases: darc.model.abc.BaseModelWeb

Data model for a requested URL.

Important

The alive flag of a URL is toggled when crawler() successfully requests that URL and the status code is ok.
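
One way to picture the toggle is as a small state update applied after each crawl attempt. The sketch below is an assumption about how alive and the since timestamp of the same record could move together; URLState and record_result() are hypothetical names and do not appear in darc.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class URLState:
    """Hypothetical in-memory view of the alive and since fields."""
    alive: bool
    since: datetime


def record_result(state: URLState, ok: bool, now: datetime) -> URLState:
    # Assumption: `since` only moves when the state actually flips, so it
    # always marks the start of the current active/inactive period.
    if state.alive != ok:
        state.alive = ok
        state.since = now
    return state


state = URLState(alive=False, since=datetime(2020, 1, 1))
record_result(state, ok=True, now=datetime(2020, 1, 5))  # flips to active
record_result(state, ok=True, now=datetime(2020, 1, 9))  # state unchanged
print(state)
```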

DoesNotExist

alias of URLModelDoesNotExist

alive: bool = <BooleanField: URLModel.alive>

If the URL is still active.

discovery: datetime.datetime = <DateTimeField: URLModel.discovery>

Timestamp of first submission.

hash: str = <CharField: URLModel.hash>

SHA-256 hash value (c.f. Link.name).
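
For orientation, a SHA-256 value of this kind can be produced with the standard library. The sketch below assumes the hash is taken over the UTF-8 encoded URL string and stored as a hexadecimal digest; url_hash() is a hypothetical helper, and the exact input darc hashes is not stated here.

```python
import hashlib


def url_hash(url: str) -> str:
    # Assumption: hash the UTF-8 bytes of the URL and keep the hex digest.
    return hashlib.sha256(url.encode("utf-8")).hexdigest()


digest = url_hash("https://www.example.com/")
print(digest)
```

A hex-encoded SHA-256 digest is always 64 characters long, which is why a fixed-width character field suits this column.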

hostname: darc.model.web.hostname.HostnameModel = <ForeignKeyField: URLModel.hostname>

Hostname (c.f. link.host).

hostname_id = <ForeignKeyField: URLModel.hostname>
id = <AutoField: URLModel.id>
last_seen: datetime.datetime = <DateTimeField: URLModel.last_seen>

Timestamp of last submission.

proxy: darc.model.utils.Proxy = <IntEnumField: URLModel.proxy>

Proxy type (c.f. link.proxy).

requests
selenium
since: datetime.datetime = <DateTimeField: URLModel.since>

The URL has been active/inactive since this timestamp.

url: str = <TextField: URLModel.url>

Original URL (c.f. link.url).

robots.txt Records

The darc.model.web.robots module defines the data model representing robots.txt data, specifically from new_host submission.

See also

Please refer to darc.submit.submit_new_host() for more information.

class darc.model.web.robots.RobotsModel(*args, **kwargs)

Bases: darc.model.abc.BaseModelWeb

Data model for robots.txt data.

DoesNotExist

alias of RobotsModelDoesNotExist

document: str = <TextField: RobotsModel.document>

Document data as str.

host: darc.model.web.hostname.HostnameModel = <ForeignKeyField: RobotsModel.host>

Hostname (c.f. link.host).

host_id = <ForeignKeyField: RobotsModel.host>
id = <AutoField: RobotsModel.id>
timestamp: datetime.datetime = <DateTimeField: RobotsModel.timestamp>

Timestamp of the submission.

sitemap.xml Records

The darc.model.web.sitemap module defines the data model representing sitemap.xml data, specifically from new_host submission.

See also

Please refer to darc.submit.submit_new_host() for more information.

class darc.model.web.sitemap.SitemapModel(*args, **kwargs)

Bases: darc.model.abc.BaseModelWeb

Data model for sitemap.xml data.

DoesNotExist

alias of SitemapModelDoesNotExist

document: str = <TextField: SitemapModel.document>

Document data as str.

host: darc.model.web.hostname.HostnameModel = <ForeignKeyField: SitemapModel.host>

Hostname (c.f. link.host).

host_id = <ForeignKeyField: SitemapModel.host>
id = <AutoField: SitemapModel.id>
timestamp: datetime.datetime = <DateTimeField: SitemapModel.timestamp>

Timestamp of the submission.

hosts.txt Records

The darc.model.web.hosts module defines the data model representing hosts.txt data, specifically from new_host submission.

See also

Please refer to darc.submit.submit_new_host() for more information.

class darc.model.web.hosts.HostsModel(*args, **kwargs)

Bases: darc.model.abc.BaseModelWeb

Data model for hosts.txt data.

DoesNotExist

alias of HostsModelDoesNotExist

document: str = <TextField: HostsModel.document>

Document data as str.

host: darc.model.web.hostname.HostnameModel = <ForeignKeyField: HostsModel.host>

Hostname (c.f. link.host).

host_id = <ForeignKeyField: HostsModel.host>
id = <AutoField: HostsModel.id>
timestamp: datetime.datetime = <DateTimeField: HostsModel.timestamp>

Timestamp of the submission.

Crawler Records

The darc.model.web.requests module defines the data models representing crawler records, specifically from requests submission.

See also

Please refer to darc.submit.submit_requests() for more information.

class darc.model.web.requests.RequestsHistoryModel(*args, **kwargs)

Bases: darc.model.abc.BaseModelWeb

Data model for history records from requests submission.

DoesNotExist

alias of RequestsHistoryModelDoesNotExist

cookies: List[Dict[str, Any]] = <JSONField: RequestsHistoryModel.cookies>

Response cookies.

document: bytes = <BlobField: RequestsHistoryModel.document>

Document data as bytes.

id = <AutoField: RequestsHistoryModel.id>
index: int = <IntegerField: RequestsHistoryModel.index>

History index number.

method: str = <CharField: RequestsHistoryModel.method>

Request method (normally GET).

model: darc.model.web.requests.RequestsModel = <ForeignKeyField: RequestsHistoryModel.model>

Original record.

model_id = <ForeignKeyField: RequestsHistoryModel.model>
reason: str = <TextField: RequestsHistoryModel.reason>

Response reason string.

request: Dict[str, str] = <JSONField: RequestsHistoryModel.request>

Request headers.

response: Dict[str, str] = <JSONField: RequestsHistoryModel.response>

Response headers.

status_code: int = <IntegerField: RequestsHistoryModel.status_code>

Status code.

timestamp: datetime.datetime = <DateTimeField: RequestsHistoryModel.timestamp>

Timestamp of the submission.

url: str = <TextField: RequestsHistoryModel.url>

Request URL.

class darc.model.web.requests.RequestsModel(*args, **kwargs)

Bases: darc.model.abc.BaseModelWeb

Data model for documents from requests submission.

DoesNotExist

alias of RequestsModelDoesNotExist

cookies: List[Dict[str, Any]] = <JSONField: RequestsModel.cookies>

Response cookies.

document: bytes = <BlobField: RequestsModel.document>

Document data as bytes.

history
id = <AutoField: RequestsModel.id>
is_html: bool = <BooleanField: RequestsModel.is_html>

Whether the document is HTML or miscellaneous data.

method: str = <CharField: RequestsModel.method>

Request method (normally GET).

mime_type: str = <CharField: RequestsModel.mime_type>

Content type.
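
The is_html and mime_type fields are closely related, since both can be derived from the Content-Type response header. The sketch below shows one plausible derivation using only the standard library; classify() and HTML_TYPES are hypothetical, and the set of media types treated as HTML is an assumption rather than darc's documented rule.

```python
from email.message import EmailMessage
from typing import Tuple

# Assumption: these two media types are the ones treated as HTML.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def classify(content_type_header: str) -> Tuple[str, bool]:
    # Let the stdlib header parser strip parameters such as charset
    # and normalise the media type to lower case.
    msg = EmailMessage()
    msg["Content-Type"] = content_type_header
    mime_type = msg.get_content_type()
    return mime_type, mime_type in HTML_TYPES


print(classify("text/html; charset=utf-8"))
print(classify("image/png"))
```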

reason: str = <TextField: RequestsModel.reason>

Response reason string.

request: Dict[str, str] = <JSONField: RequestsModel.request>

Request headers.

response: Dict[str, str] = <JSONField: RequestsModel.response>

Response headers.

session: List[Dict[str, Any]] = <JSONField: RequestsModel.session>

Session cookies.

status_code: int = <IntegerField: RequestsModel.status_code>

Status code.

timestamp: datetime.datetime = <DateTimeField: RequestsModel.timestamp>

Timestamp of the submission.

url: darc.model.web.url.URLModel = <ForeignKeyField: RequestsModel.url>

Original URL (c.f. link.url).

url_id = <ForeignKeyField: RequestsModel.url>

Loader Records

The darc.model.web.selenium module defines the data model representing loader records, specifically from selenium submission.

See also

Please refer to darc.submit.submit_selenium() for more information.

class darc.model.web.selenium.SeleniumModel(*args, **kwargs)

Bases: darc.model.abc.BaseModelWeb

Data model for documents from selenium submission.

DoesNotExist

alias of SeleniumModelDoesNotExist

document: str = <TextField: SeleniumModel.document>

Document data as str.

id = <AutoField: SeleniumModel.id>
screenshot: Optional[bytes] = <BlobField: SeleniumModel.screenshot>

Screenshot in PNG format as bytes.
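
Because the field is optional and holds raw bytes, a consumer may want to check that a value is present and looks like a PNG before using it. The sketch below is one such check; has_png_screenshot() is a hypothetical helper, and treating None as "no screenshot stored" is an assumption based on the Optional[bytes] annotation.

```python
from typing import Optional

# Fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def has_png_screenshot(screenshot: Optional[bytes]) -> bool:
    # None is taken to mean that no screenshot was stored for the record.
    return screenshot is not None and screenshot.startswith(PNG_SIGNATURE)


print(has_png_screenshot(None))
print(has_png_screenshot(PNG_SIGNATURE + b"rest of the image data"))
```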

timestamp: datetime.datetime = <DateTimeField: SeleniumModel.timestamp>

Timestamp of the submission.

url: darc.model.web.url.URLModel = <ForeignKeyField: SeleniumModel.url>

Original URL (c.f. link.url).

url_id = <ForeignKeyField: SeleniumModel.url>