Submission Data Models

The darc.model.web module defines the data models that store the data crawled by the darc project.

See also

Please refer to darc.submit module for more information about data submission.

Hostname Records

The darc.model.web.hostname module defines the data model representing hostnames, specifically from new_host submission.

See also

Please refer to darc.submit.submit_new_host() for more information.

class darc.model.web.hostname.HostnameModel(*args, **kwargs)

Bases: BaseModelWeb

Data model for a hostname record.

Important

The alive state of a hostname is toggled when crawler() successfully requests a URL with that hostname.

DoesNotExist

alias of HostnameModelDoesNotExist

property alive: bool

If the hostname is still active.

A hostname is considered inactive only if all of its subsidiary URLs are inactive.
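
The rule above can be sketched in plain Python. This is only an illustration, not the darc implementation: UrlRecord and hostname_alive are hypothetical names standing in for URLModel rows and the alive property, and treating a hostname with no URLs as inactive is an assumption of the sketch.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class UrlRecord:
    """Hypothetical stand-in for a URLModel row."""
    url: str
    alive: bool


def hostname_alive(urls: Iterable[UrlRecord]) -> bool:
    """Return True if at least one subsidiary URL is still active.

    Equivalently, the hostname is inactive only when every URL is inactive.
    """
    return any(record.alive for record in urls)
```

For example, a hostname with one inactive and one active URL would still count as active under this reading.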

discovery: datetime = <DateTimeField: HostnameModel.discovery>

Timestamp of first new_host submission.

hostname: str = <CharField: HostnameModel.hostname>

Hostname (c.f. link.host). Each label of a hostname is limited to 63 bytes, and the fully qualified domain name (FQDN) is limited to 255 bytes.
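
As a rough illustration of these limits, the check below validates lengths only. The function name within_length_limits and the two constants are hypothetical, not part of darc, and the sketch does not check which characters are allowed.

```python
MAX_LABEL_BYTES = 63   # per-label limit quoted above
MAX_FQDN_BYTES = 255   # whole-name limit quoted above


def within_length_limits(hostname: str) -> bool:
    """Check only the length limits of a hostname, not its character set."""
    name = hostname.rstrip('.')  # ignore a trailing root dot
    if not name:
        return False
    if len(name.encode('utf-8')) > MAX_FQDN_BYTES:
        return False
    # Every label must be non-empty and fit in 63 bytes.
    return all(
        0 < len(label.encode('utf-8')) <= MAX_LABEL_BYTES
        for label in name.split('.')
    )
```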

hosts: List[HostsModel]

hosts.txt for the hostname, back reference from HostsModel.host.

id = <AutoField: HostnameModel.id>
last_seen: datetime = <DateTimeField: HostnameModel.last_seen>

Timestamp of last related submission.

proxy: Proxy = <IntEnumField: HostnameModel.proxy>

Proxy type (c.f. link.proxy).

robots: List[RobotsModel]

robots.txt for the hostname, back reference from RobotsModel.host.

property since: datetime

Timestamp since which the hostname has been active/inactive.

The timestamp is taken as the earliest timestamp among the related subsidiary active/inactive URLs.
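
One possible reading of this rule is sketched below, assuming the hostname counts as active when any URL is active and that the timestamp is the earliest since value among URLs sharing the hostname's current state. UrlRecord and hostname_since are hypothetical names, not darc API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass
class UrlRecord:
    """Hypothetical stand-in for the alive/since fields of a URLModel row."""
    alive: bool
    since: datetime


def hostname_since(urls: Iterable[UrlRecord]) -> Optional[datetime]:
    """Earliest 'since' among URLs that share the hostname's current state."""
    records = list(urls)
    # Assumption: the hostname is active if any subsidiary URL is active.
    alive = any(record.alive for record in records)
    candidates = [record.since for record in records if record.alive == alive]
    return min(candidates, default=None)
```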

sitemaps: List[SitemapModel]

sitemap.xml for the hostname, back reference from SitemapModel.host.

urls: List[URLModel]

URLs with the same hostname, back reference from URLModel.hostname.

URL Records

The darc.model.web.url module defines the data model representing URLs, specifically from requests and selenium submission.

See also

Please refer to darc.submit.submit_requests() and darc.submit.submit_selenium() for more information.

class darc.model.web.url.URLModel(*args, **kwargs)

Bases: BaseModelWeb

Data model for a requested URL.

Important

The alive state of a URL is toggled when crawler() successfully requests that URL and the status code is ok.
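
A minimal sketch of how such a state change might be tracked is shown below. It is not the darc implementation: UrlState and record_request are hypothetical, treating any status code below 400 as "ok" is an assumption, and so is refreshing last_seen on every request.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class UrlState:
    """Hypothetical stand-in for the alive/since/last_seen fields."""
    alive: bool
    since: datetime
    last_seen: datetime


def record_request(state: UrlState, status_code: int, now: datetime) -> UrlState:
    """Update the state after one request; 'since' moves only on a change."""
    ok = status_code < 400  # assumption: "ok" means no client/server error
    if ok != state.alive:
        state.alive = ok
        state.since = now   # the URL entered a new state at this moment
    state.last_seen = now   # assumption: every request refreshes last_seen
    return state
```

With this reading, repeated successful requests leave since untouched and only move last_seen forward.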

DoesNotExist

alias of URLModelDoesNotExist

classmethod get_by_url(url)

Select by URL.

Parameters:

url (str) – URL to select.

Return type:

URLModel

Returns:

Selected URL model.

alive: bool = <BooleanField: URLModel.alive>

If the URL is still active.

children
property childrent: List[URLModel]

Back reference to the URLs that were identified from this URL.

discovery: datetime = <DateTimeField: URLModel.discovery>

Timestamp of first submission.

hash: str = <CharField: URLModel.hash>

SHA-256 hash value (c.f. Link.name).
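
For illustration, a hash of this kind can be computed with the standard library. The helper name url_hash is hypothetical, and hashing the UTF-8 encoded URL text is an assumption about how the value is derived.

```python
import hashlib


def url_hash(url: str) -> str:
    """Return the SHA-256 hex digest of a URL (assumed UTF-8 encoded)."""
    return hashlib.sha256(url.encode('utf-8')).hexdigest()
```

The hex digest is always 64 characters long, which is convenient for a fixed-width character field.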

hostname: HostnameModel = <ForeignKeyField: URLModel.hostname>

Hostname (c.f. link.host).

hostname_id = <ForeignKeyField: URLModel.hostname>
id = <AutoField: URLModel.id>
last_seen: datetime = <DateTimeField: URLModel.last_seen>

Timestamp of last submission.

parents
proxy: Proxy = <IntEnumField: URLModel.proxy>

Proxy type (c.f. link.proxy).

requests: List[RequestsModel]

requests submission record, back reference from RequestsModel.url.

selenium: List[SeleniumModel]

selenium submission record, back reference from SeleniumModel.url.

since: datetime = <DateTimeField: URLModel.since>

The URL is active/inactive since this timestamp.

url: str = <TextField: URLModel.url>

Original URL (c.f. link.url).

class darc.model.web.url.URLThroughModel(*args, **kwargs)

Bases: BaseModelWeb

Data model for the map of the URL extraction chain.

DoesNotExist

alias of URLThroughModelDoesNotExist

child: List[URLModel] = <ForeignKeyField: URLThroughModel.child>

Back reference to which URLs were identified from the URL.

child_id = <ForeignKeyField: URLThroughModel.child>
id = <AutoField: URLThroughModel.id>
parent: List[URLModel] = <ForeignKeyField: URLThroughModel.parent>

Back reference to where the URL was identified.

parent_id = <ForeignKeyField: URLThroughModel.parent>
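
Conceptually, each row of this model is one parent/child edge in the extraction chain. The sketch below uses plain (parent, child) tuples in place of database rows to show how the two directions of lookup could be derived; build_chain is a hypothetical helper, not darc API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # (parent URL, child URL), one per through-row


def build_chain(links: Iterable[Edge]) -> Tuple[Dict[str, List[str]], Dict[str, List[str]]]:
    """Group edges into children-by-parent and parents-by-child maps."""
    children: Dict[str, List[str]] = defaultdict(list)
    parents: Dict[str, List[str]] = defaultdict(list)
    for parent, child in links:
        children[parent].append(child)   # URLs identified from `parent`
        parents[child].append(parent)    # pages where `child` was identified
    return dict(children), dict(parents)
```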

robots.txt Records

The darc.model.web.robots module defines the data model representing robots.txt data, specifically from new_host submission.

See also

Please refer to darc.submit.submit_new_host() for more information.

class darc.model.web.robots.RobotsModel(*args, **kwargs)

Bases: BaseModelWeb

Data model for robots.txt data.

DoesNotExist

alias of RobotsModelDoesNotExist

document: str = <TextField: RobotsModel.document>

Document data as str.

host: HostnameModel = <ForeignKeyField: RobotsModel.host>

Hostname (c.f. link.host).

host_id = <ForeignKeyField: RobotsModel.host>
id = <AutoField: RobotsModel.id>
timestamp: datetime = <DateTimeField: RobotsModel.timestamp>

Timestamp of the submission.

sitemap.xml Records

The darc.model.web.sitemap module defines the data model representing sitemap.xml data, specifically from new_host submission.

See also

Please refer to darc.submit.submit_new_host() for more information.

class darc.model.web.sitemap.SitemapModel(*args, **kwargs)

Bases: BaseModelWeb

Data model for sitemap.xml data.

DoesNotExist

alias of SitemapModelDoesNotExist

document: str = <TextField: SitemapModel.document>

Document data as str.

host: HostnameModel = <ForeignKeyField: SitemapModel.host>

Hostname (c.f. link.host).

host_id = <ForeignKeyField: SitemapModel.host>
id = <AutoField: SitemapModel.id>
timestamp: datetime = <DateTimeField: SitemapModel.timestamp>

Timestamp of the submission.

hosts.txt Records

The darc.model.web.hosts module defines the data model representing hosts.txt data, specifically from new_host submission.

See also

Please refer to darc.submit.submit_new_host() for more information.

class darc.model.web.hosts.HostsModel(*args, **kwargs)

Bases: BaseModelWeb

Data model for hosts.txt data.

DoesNotExist

alias of HostsModelDoesNotExist

document: str = <TextField: HostsModel.document>

Document data as str.

host: HostnameModel = <ForeignKeyField: HostsModel.host>

Hostname (c.f. link.host).

host_id = <ForeignKeyField: HostsModel.host>
id = <AutoField: HostsModel.id>
timestamp: datetime = <DateTimeField: HostsModel.timestamp>

Timestamp of the submission.

Crawler Records

The darc.model.web.requests module defines the data models representing crawler records, specifically from requests submission.

See also

Please refer to darc.submit.submit_requests() for more information.

class darc.model.web.requests.RequestsHistoryModel(*args, **kwargs)

Bases: BaseModelWeb

Data model for history records from requests submission.

DoesNotExist

alias of RequestsHistoryModelDoesNotExist

cookies: Cookies = <JSONField: RequestsHistoryModel.cookies>

Response cookies.

document: bytes = <BlobField: RequestsHistoryModel.document>

Document data as bytes.

id = <AutoField: RequestsHistoryModel.id>
index: int = <IntegerField: RequestsHistoryModel.index>

History index number.

method: str = <CharField: RequestsHistoryModel.method>

Request method (normally GET).

model: RequestsModel = <ForeignKeyField: RequestsHistoryModel.model>

Original record.

model_id = <ForeignKeyField: RequestsHistoryModel.model>
reason: str = <TextField: RequestsHistoryModel.reason>

Response reason string.

request: Headers = <JSONField: RequestsHistoryModel.request>

Request headers.

response: Headers = <JSONField: RequestsHistoryModel.response>

Response headers.

status_code: int = <IntegerField: RequestsHistoryModel.status_code>

Status code.

timestamp: datetime = <DateTimeField: RequestsHistoryModel.timestamp>

Timestamp of the submission.

url: str = <TextField: RequestsHistoryModel.url>

Request URL.

class darc.model.web.requests.RequestsModel(*args, **kwargs)

Bases: BaseModelWeb

Data model for documents from requests submission.

DoesNotExist

alias of RequestsModelDoesNotExist

cookies: Cookies = <JSONField: RequestsModel.cookies>

Response cookies.

document: bytes = <BlobField: RequestsModel.document>

Document data as bytes.

history: List[RequestsHistoryModel]

List of redirect history, back reference from RequestsHistoryModel.model.
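
Because each history record carries an index, the redirect chain can be rebuilt in order regardless of how the rows are returned. The sketch below uses a hypothetical HistoryEntry dataclass in place of RequestsHistoryModel rows; redirect_chain is likewise an illustrative name.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class HistoryEntry:
    """Hypothetical stand-in for a few RequestsHistoryModel fields."""
    index: int
    url: str
    status_code: int


def redirect_chain(history: Iterable[HistoryEntry]) -> List[str]:
    """Return 'status url' strings ordered by the history index."""
    ordered = sorted(history, key=lambda entry: entry.index)
    return [f'{entry.status_code} {entry.url}' for entry in ordered]
```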

id = <AutoField: RequestsModel.id>
is_html: bool = <BooleanField: RequestsModel.is_html>

Whether the document is HTML, as opposed to miscellaneous data.

method: str = <CharField: RequestsModel.method>

Request method (normally GET).

mime_type: str = <CharField: RequestsModel.mime_type>

Content type.
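
As an illustration of how mime_type and is_html could relate, the sketch below extracts the media type from a Content-Type header value and tests it against a small set of HTML types. How darc actually derives these fields is not stated here; parse_mime_type, looks_like_html and HTML_TYPES are hypothetical.

```python
from email.message import Message

# Assumption: these media types are treated as HTML documents.
HTML_TYPES = {'text/html', 'application/xhtml+xml'}


def parse_mime_type(content_type: str) -> str:
    """Return the lowercased 'type/subtype' part of a Content-Type value."""
    message = Message()
    message['Content-Type'] = content_type
    # Parameters such as charset are dropped by get_content_type().
    return message.get_content_type()


def looks_like_html(content_type: str) -> bool:
    """Decide whether a response with this Content-Type is HTML."""
    return parse_mime_type(content_type) in HTML_TYPES
```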

reason: str = <TextField: RequestsModel.reason>

Response reason string.

request: Headers = <JSONField: RequestsModel.request>

Request headers.

response: Headers = <JSONField: RequestsModel.response>

Response headers.

session: Cookies = <JSONField: RequestsModel.session>

Session cookies.

status_code: int = <IntegerField: RequestsModel.status_code>

Status code.

timestamp: datetime = <DateTimeField: RequestsModel.timestamp>

Timestamp of the submission.

url: URLModel = <ForeignKeyField: RequestsModel.url>

Original URL (c.f. link.url).

url_id = <ForeignKeyField: RequestsModel.url>

Loader Records

The darc.model.web.selenium module defines the data model representing loader records, specifically from selenium submission.

See also

Please refer to darc.submit.submit_selenium() for more information.

class darc.model.web.selenium.SeleniumModel(*args, **kwargs)

Bases: BaseModelWeb

Data model for documents from selenium submission.

DoesNotExist

alias of SeleniumModelDoesNotExist

document: str = <TextField: SeleniumModel.document>

Document data as str.

id = <AutoField: SeleniumModel.id>
screenshot: Optional[bytes] = <BlobField: SeleniumModel.screenshot>

Screenshot in PNG format as bytes.
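
Since the field is optional and holds raw bytes, a consumer may want a quick sanity check before treating the value as an image. The helper below is a hypothetical sketch that tests only for the standard 8-byte PNG signature; it does not validate the rest of the file.

```python
from typing import Optional

# The fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def is_png(screenshot: Optional[bytes]) -> bool:
    """Return True if the value is present and starts with the PNG signature."""
    return screenshot is not None and screenshot.startswith(PNG_SIGNATURE)
```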

timestamp: datetime = <DateTimeField: SeleniumModel.timestamp>

Timestamp of the submission.

url: URLModel = <ForeignKeyField: SeleniumModel.url>

Original URL (c.f. link.url).

url_id = <ForeignKeyField: SeleniumModel.url>