Submission Data Models¶
The darc.model.web module defines the data models
used to store the data crawled by the darc project.
See also
Please refer to darc.submit module for more information
about data submission.
Hostname Records¶
The darc.model.web.hostname module defines the data model
representing hostnames, specifically from new_host submission.
See also
Please refer to darc.submit.submit_new_host() for more
information.
- class darc.model.web.hostname.HostnameModel(*args, **kwargs)[source]¶
Bases: BaseModelWeb
Data model for a hostname record.
Important
The alive state of a hostname is toggled if
crawler() successfully requested a URL with such hostname.
- DoesNotExist¶
alias of HostnameModelDoesNotExist
- property alive: bool¶
Whether the hostname is still active.
A hostname is considered inactive only if all of its subsidiary URLs are inactive.
- discovery: datetime = <DateTimeField: HostnameModel.discovery>¶
Timestamp of first new_host submission.
- hostname: str = <CharField: HostnameModel.hostname>¶
Hostname (c.f. link.host). Each label of a hostname is limited to 63 bytes, and the fully qualified domain name (FQDN) to 255 characters.
- hosts: List[HostsModel]¶
hosts.txt for the hostname, back reference from HostsModel.host.
- id = <AutoField: HostnameModel.id>¶
- last_seen: datetime = <DateTimeField: HostnameModel.last_seen>¶
Timestamp of last related submission.
- proxy: Proxy = <IntEnumField: HostnameModel.proxy>¶
Proxy type (c.f. link.proxy).
- robots: List[RobotsModel]¶
robots.txt for the hostname, back reference from RobotsModel.host.
- property since: datetime¶
Timestamp since which the hostname has been active/inactive.
The timestamp is taken as the earliest timestamp of the related subsidiary active/inactive URLs.
- sitemaps: List[SitemapModel]¶
sitemap.xml for the hostname, back reference from SitemapModel.sitemaps.
- urls: List[URLModel]¶
URLs with the same hostname, back reference from
URLModel.hostname.
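The alive and since properties above derive the state of a hostname from its URLs. The following is a minimal sketch of that aggregation using plain dataclasses instead of the database models. The URLRecord class and its alive and since fields are hypothetical stand-ins introduced for illustration, and the actual implementation may differ (for instance, how a hostname without any URL is treated is an assumption here).

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class URLRecord:
    """Hypothetical stand-in for a URL record (not the real URLModel)."""
    url: str
    alive: bool
    since: datetime  # when the URL entered its current active/inactive state


def hostname_alive(urls: List[URLRecord]) -> bool:
    """A hostname is inactive only if all of its URLs are inactive."""
    return any(record.alive for record in urls)


def hostname_since(urls: List[URLRecord]) -> Optional[datetime]:
    """Earliest timestamp among the URLs sharing the hostname's state."""
    alive = hostname_alive(urls)
    stamps = [record.since for record in urls if record.alive == alive]
    # Assumption: a hostname without matching URLs has no such timestamp.
    return min(stamps) if stamps else None


records = [
    URLRecord('http://example.com/', alive=False, since=datetime(2020, 1, 1)),
    URLRecord('http://example.com/a', alive=True, since=datetime(2020, 3, 1)),
    URLRecord('http://example.com/b', alive=True, since=datetime(2020, 2, 1)),
]
print(hostname_alive(records), hostname_since(records))
```

In this sketch one active URL is enough to mark the hostname as active, and only the URLs in that same state contribute to the since timestamp.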
URL Records¶
The darc.model.web.url module defines the data model
representing URLs, specifically from requests and
selenium submission.
See also
Please refer to darc.submit.submit_requests() and
darc.submit.submit_selenium() for more information.
- class darc.model.web.url.URLModel(*args, **kwargs)[source]¶
Bases: BaseModelWeb
Data model for a requested URL.
Important
The alive state of a URL is toggled if
crawler() successfully requested such URL and the status code is ok.
- DoesNotExist¶
alias of URLModelDoesNotExist
- children¶
- hostname: HostnameModel = <ForeignKeyField: URLModel.hostname>¶
Hostname (c.f. link.host).
- hostname_id = <ForeignKeyField: URLModel.hostname>¶
- id = <AutoField: URLModel.id>¶
- parents¶
- proxy: Proxy = <IntEnumField: URLModel.proxy>¶
Proxy type (c.f. link.proxy).
- requests: List[RequestsModel]¶
requests submission record, back reference from RequestsModel.url.
- selenium: List[SeleniumModel]¶
selenium submission record, back reference from SeleniumModel.url.
- class darc.model.web.url.URLThroughModel(*args, **kwargs)[source]¶
Bases: BaseModelWeb
Data model for the map of the URL extraction chain.
- DoesNotExist¶
alias of URLThroughModelDoesNotExist
- child: List[URLModel] = <ForeignKeyField: URLThroughModel.child>¶
Back reference to the URLs that were identified from the URL.
- child_id = <ForeignKeyField: URLThroughModel.child>¶
- id = <AutoField: URLThroughModel.id>¶
- parent: List[URLModel] = <ForeignKeyField: URLThroughModel.parent>¶
Back reference to where the URL was identified.
- parent_id = <ForeignKeyField: URLThroughModel.parent>¶
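URLThroughModel links each parent URL to the child URLs extracted from it. As a rough illustration of such a many-to-many mapping, the sketch below uses the standard sqlite3 module. The table names, column names and helper functions are hypothetical and chosen for illustration only; this is not the schema that darc generates.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE url (id INTEGER PRIMARY KEY, url TEXT UNIQUE);
    -- through table: one row per (parent, child) extraction link
    CREATE TABLE url_through (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL REFERENCES url (id),
        child_id INTEGER NOT NULL REFERENCES url (id)
    );
""")


def add_url(url: str) -> int:
    """Insert a URL and return its primary key."""
    cursor = conn.execute('INSERT INTO url (url) VALUES (?)', (url,))
    return cursor.lastrowid


def link(parent_id: int, child_id: int) -> None:
    """Record that the child URL was extracted from the parent URL."""
    conn.execute(
        'INSERT INTO url_through (parent_id, child_id) VALUES (?, ?)',
        (parent_id, child_id),
    )


def children(parent_id: int) -> List[str]:
    """URLs that were identified from the given URL."""
    rows = conn.execute(
        'SELECT url.url FROM url '
        'JOIN url_through ON url.id = url_through.child_id '
        'WHERE url_through.parent_id = ? ORDER BY url.url',
        (parent_id,),
    )
    return [row[0] for row in rows]


def parents(child_id: int) -> List[str]:
    """URLs from which the given URL was identified."""
    rows = conn.execute(
        'SELECT url.url FROM url '
        'JOIN url_through ON url.id = url_through.parent_id '
        'WHERE url_through.child_id = ? ORDER BY url.url',
        (child_id,),
    )
    return [row[0] for row in rows]


index = add_url('http://example.com/')
about = add_url('http://example.com/about')
news = add_url('http://example.com/news')
link(index, about)
link(index, news)
link(about, news)
```

Keeping the links in a separate table lets one URL have several parents and several children without duplicating the URL rows themselves.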
robots.txt Records¶
The darc.model.web.robots module defines the data model
representing robots.txt data, specifically from new_host
submission.
See also
Please refer to darc.submit.submit_new_host() for more
information.
- class darc.model.web.robots.RobotsModel(*args, **kwargs)[source]¶
Bases: BaseModelWeb
Data model for robots.txt data.
- DoesNotExist¶
alias of RobotsModelDoesNotExist
- host_id = <ForeignKeyField: RobotsModel.host>¶
- id = <AutoField: RobotsModel.id>¶
- timestamp: datetime = <DateTimeField: RobotsModel.timestamp>¶
Timestamp of the submission.
sitemap.xml Records¶
The darc.model.web.sitemap module defines the data model
representing sitemap.xml data, specifically from new_host
submission.
See also
Please refer to darc.submit.submit_new_host() for more
information.
- class darc.model.web.sitemap.SitemapModel(*args, **kwargs)[source]¶
Bases: BaseModelWeb
Data model for sitemap.xml data.
- DoesNotExist¶
alias of SitemapModelDoesNotExist
- host_id = <ForeignKeyField: SitemapModel.host>¶
- id = <AutoField: SitemapModel.id>¶
- timestamp: datetime = <DateTimeField: SitemapModel.timestamp>¶
Timestamp of the submission.
hosts.txt Records¶
The darc.model.web.hosts module defines the data model
representing hosts.txt data, specifically from new_host
submission.
See also
Please refer to darc.submit.submit_new_host() for more
information.
- class darc.model.web.hosts.HostsModel(*args, **kwargs)[source]¶
Bases: BaseModelWeb
Data model for hosts.txt data.
- DoesNotExist¶
alias of HostsModelDoesNotExist
- host_id = <ForeignKeyField: HostsModel.host>¶
- id = <AutoField: HostsModel.id>¶
- timestamp: datetime = <DateTimeField: HostsModel.timestamp>¶
Timestamp of the submission.
Crawler Records¶
The darc.model.web.requests module defines the data model
representing crawler records, specifically
from requests submission.
See also
Please refer to darc.submit.submit_requests() for more
information.
- class darc.model.web.requests.RequestsHistoryModel(*args, **kwargs)[source]¶
Bases: BaseModelWeb
Data model for history records from requests submission.
- DoesNotExist¶
alias of RequestsHistoryModelDoesNotExist
- cookies: Cookies = <JSONField: RequestsHistoryModel.cookies>¶
Response cookies.
- id = <AutoField: RequestsHistoryModel.id>¶
- model: RequestsModel = <ForeignKeyField: RequestsHistoryModel.model>¶
Original record.
- model_id = <ForeignKeyField: RequestsHistoryModel.model>¶
- request: Headers = <JSONField: RequestsHistoryModel.request>¶
Request headers.
- response: Headers = <JSONField: RequestsHistoryModel.response>¶
Response headers.
- timestamp: datetime = <DateTimeField: RequestsHistoryModel.timestamp>¶
Timestamp of the submission.
- class darc.model.web.requests.RequestsModel(*args, **kwargs)[source]¶
Bases: BaseModelWeb
Data model for documents from requests submission.
- DoesNotExist¶
alias of RequestsModelDoesNotExist
- cookies: Cookies = <JSONField: RequestsModel.cookies>¶
Response cookies.
- history: List[RequestsHistoryModel]¶
List of redirect history, back reference from
RequestsHistoryModel.model.
- id = <AutoField: RequestsModel.id>¶
- request: Headers = <JSONField: RequestsModel.request>¶
Request headers.
- response: Headers = <JSONField: RequestsModel.response>¶
Response headers.
- session: Cookies = <JSONField: RequestsModel.session>¶
Session cookies.
- timestamp: datetime = <DateTimeField: RequestsModel.timestamp>¶
Timestamp of the submission.
- url_id = <ForeignKeyField: RequestsModel.url>¶
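The history attribute of RequestsModel is a back reference: each history record points at its original record through its model field, and the original record collects those entries. The sketch below mimics that relationship with plain dataclasses. RequestsRecord, HistoryRecord and the shape of Headers are hypothetical simplifications, not the actual model classes.

```python
from dataclasses import dataclass, field
from typing import Dict, List

Headers = Dict[str, str]  # assumed shape of the JSON-stored headers


@dataclass
class RequestsRecord:
    """Hypothetical stand-in for the final document record."""
    url: str
    history: List['HistoryRecord'] = field(default_factory=list)


@dataclass
class HistoryRecord:
    """Hypothetical stand-in for one redirect hop."""
    # Excluded from repr and comparison to keep the cycle out of both.
    model: RequestsRecord = field(repr=False, compare=False)
    response: Headers

    def __post_init__(self) -> None:
        # Emulate the back reference: register this hop on its record.
        self.model.history.append(self)


final = RequestsRecord(url='http://example.com/final')
hop = HistoryRecord(model=final, response={'Location': 'http://example.com/final'})
print(len(final.history), final.history[0].model is final)
```

In a relational database the same effect comes from the foreign key on the history table; the list is then the result of a query rather than an in-memory attribute.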
Loader Records¶
The darc.model.web.selenium module defines the data model
representing loader records, specifically
from selenium submission.
See also
Please refer to darc.submit.submit_selenium() for more
information.
- class darc.model.web.selenium.SeleniumModel(*args, **kwargs)[source]¶
Bases: BaseModelWeb
Data model for documents from selenium submission.
- DoesNotExist¶
alias of SeleniumModelDoesNotExist
- id = <AutoField: SeleniumModel.id>¶
- screenshot: Optional[bytes] = <BlobField: SeleniumModel.screenshot>¶
Screenshot in PNG format as bytes.
- timestamp: datetime = <DateTimeField: SeleniumModel.timestamp>¶
Timestamp of the submission.
- url_id = <ForeignKeyField: SeleniumModel.url>¶
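Since the screenshot field is optional and holds raw PNG bytes, a consumer may want to check a value before using it. The helper below is a minimal sketch of such a check based on the fixed 8-byte PNG file signature. The function name is_png is hypothetical and not part of darc.

```python
from typing import Optional

# Fixed 8-byte signature that starts every PNG file.
PNG_SIGNATURE = b'\x89PNG\r\n\x1a\n'


def is_png(screenshot: Optional[bytes]) -> bool:
    """Return True if the value is present and starts with the PNG signature."""
    return screenshot is not None and screenshot.startswith(PNG_SIGNATURE)


print(is_png(PNG_SIGNATURE + b'...'), is_png(None))
```

The check only inspects the header, so it confirms the declared format without validating the rest of the image data.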