Submission Data Models¶
The darc.model.web
module defines the data models
to store the data crawled from the darc
project.
See also
Please refer to darc.submit
module for more information
about data submission.
Hostname Records¶
The darc.model.web.hostname
module defines the data model
representing hostnames, specifically from new_host
submission.
See also
Please refer to darc.submit.submit_new_host()
for more
information.
- class darc.model.web.hostname.HostnameModel(*args, **kwargs)[source]¶
Bases:
BaseModelWeb
Data model for a hostname record.
Important
The alive status of a hostname is toggled when
crawler()
successfully requests a URL with that hostname.
- DoesNotExist¶
alias of
HostnameModelDoesNotExist
- property alive: bool¶
If the hostname is still active.
We consider the hostname inactive only if all of its subsidiary URLs are inactive.
- discovery: datetime = <DateTimeField: HostnameModel.discovery>¶
Timestamp of first
new_host
submission.
-
hostname:
str
= <CharField: HostnameModel.hostname>¶ Hostname (c.f.
link.host
). Each label of the hostname may be at most 63 bytes long, and the fully qualified domain name (FQDN) at most 255 characters.
- hosts: List[HostsModel]¶
hosts.txt
for the hostname, back reference from HostsModel.host
.
- id = <AutoField: HostnameModel.id>¶
- last_seen: datetime = <DateTimeField: HostnameModel.last_seen>¶
Timestamp of last related submission.
-
proxy:
Proxy
= <IntEnumField: HostnameModel.proxy>¶ Proxy type (c.f.
link.proxy
).
- robots: List[RobotsModel]¶
robots.txt
for the hostname, back reference from RobotsModel.host
.
- property since: datetime¶
Timestamp since which the hostname has been active/inactive.
We derive this timestamp from the earliest timestamp among the related subsidiary active/inactive URLs.
- sitemaps: List[SitemapModel]¶
sitemap.xml
for the hostname, back reference from SitemapModel.sitemaps
.
- urls: List[URLModel]¶
URLs with the same hostname, back reference from
URLModel.hostname
.
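The aggregation rules described for the alive and since properties can be sketched in plain Python. This is a minimal illustration, not the project's implementation: UrlSketch and HostnameSketch are hypothetical stand-ins for the real models, and the reading of since as "the earliest timestamp among URLs that share the hostname's current state" is an assumption based on the descriptions above.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class UrlSketch:
    """Hypothetical stand-in for a URL record (not the real URLModel)."""
    url: str
    alive: bool
    since: datetime  # when the URL entered its current active/inactive state


@dataclass
class HostnameSketch:
    """Hypothetical stand-in for a hostname record (not the real HostnameModel)."""
    hostname: str
    urls: List[UrlSketch] = field(default_factory=list)

    @property
    def alive(self) -> bool:
        # The hostname is inactive only if all subsidiary URLs are inactive.
        return any(url.alive for url in self.urls)

    @property
    def since(self) -> datetime:
        # Assumption: use the earliest timestamp among the URLs that are in
        # the same state as the hostname. Raises ValueError if there are no URLs.
        state = self.alive
        return min(url.since for url in self.urls if url.alive == state)


host = HostnameSketch("example.onion", [
    UrlSketch("http://example.onion/", alive=False, since=datetime(2020, 1, 1)),
    UrlSketch("http://example.onion/a", alive=True, since=datetime(2020, 2, 1)),
    UrlSketch("http://example.onion/b", alive=True, since=datetime(2020, 3, 1)),
])
print(host.alive, host.since)
```

With one inactive and two active URLs, the hostname counts as active, and its since value comes from the earlier of the two active URLs.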
URL Records¶
The darc.model.web.url
module defines the data model
representing URLs, specifically from requests
and
selenium
submission.
See also
Please refer to darc.submit.submit_requests()
and
darc.submit.submit_selenium()
for more information.
- class darc.model.web.url.URLModel(*args, **kwargs)[source]¶
Bases:
BaseModelWeb
Data model for a requested URL.
Important
The alive status of a URL is toggled when
crawler()
successfully requests that URL and the status code is ok.
- DoesNotExist¶
alias of
URLModelDoesNotExist
- children¶
-
hostname:
HostnameModel
= <ForeignKeyField: URLModel.hostname>¶ Hostname (c.f.
link.host
).
- hostname_id = <ForeignKeyField: URLModel.hostname>¶
- id = <AutoField: URLModel.id>¶
- parents¶
-
proxy:
Proxy
= <IntEnumField: URLModel.proxy>¶ Proxy type (c.f.
link.proxy
).
-
requests:
List
[RequestsModel
]¶ requests
submission record, back reference from RequestsModel.url
.
-
selenium:
List
[SeleniumModel
]¶ selenium
submission record, back reference from SeleniumModel.url
.
- class darc.model.web.url.URLThroughModel(*args, **kwargs)[source]¶
Bases:
BaseModelWeb
Data model for the map of the URL extraction chain.
- DoesNotExist¶
alias of
URLThroughModelDoesNotExist
-
child:
List
[URLModel
] = <ForeignKeyField: URLThroughModel.child>¶ Back reference to which URLs were identified from the URL.
- child_id = <ForeignKeyField: URLThroughModel.child>¶
- id = <AutoField: URLThroughModel.id>¶
-
parent:
List
[URLModel
] = <ForeignKeyField: URLThroughModel.parent>¶ Back reference to where the URL was identified.
- parent_id = <ForeignKeyField: URLThroughModel.parent>¶
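The parent/child mapping kept by the through model is a self-referential many-to-many relation. The sketch below shows that idea with the standard sqlite3 module rather than the project's own models; the table names (url, url_through), the url column, and the helper functions are all hypothetical and chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE url (
        id  INTEGER PRIMARY KEY,
        url TEXT NOT NULL UNIQUE
    );
    -- One row per "child was extracted from parent" edge.
    CREATE TABLE url_through (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL REFERENCES url(id),
        child_id  INTEGER NOT NULL REFERENCES url(id)
    );
""")
conn.executemany("INSERT INTO url (id, url) VALUES (?, ?)", [
    (1, "http://example.onion/"),
    (2, "http://example.onion/about"),
    (3, "http://example.onion/contact"),
])
conn.executemany(
    "INSERT INTO url_through (parent_id, child_id) VALUES (?, ?)",
    [(1, 2), (1, 3), (2, 3)],
)


def children_of(conn, url_id):
    """URLs that were identified from the given URL."""
    rows = conn.execute(
        "SELECT url.url FROM url_through "
        "JOIN url ON url.id = url_through.child_id "
        "WHERE url_through.parent_id = ? ORDER BY url.id",
        (url_id,),
    )
    return [row[0] for row in rows]


def parents_of(conn, url_id):
    """URLs on which the given URL was identified."""
    rows = conn.execute(
        "SELECT url.url FROM url_through "
        "JOIN url ON url.id = url_through.parent_id "
        "WHERE url_through.child_id = ? ORDER BY url.id",
        (url_id,),
    )
    return [row[0] for row in rows]


print(children_of(conn, 1))
print(parents_of(conn, 3))
```

Storing edges in a separate table lets one URL have many parents and many children without duplicating the URL rows themselves.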
robots.txt
Records¶
The darc.model.web.robots
module defines the data model
representing robots.txt
data, specifically from new_host
submission.
See also
Please refer to darc.submit.submit_new_host()
for more
information.
- class darc.model.web.robots.RobotsModel(*args, **kwargs)[source]¶
Bases:
BaseModelWeb
Data model for
robots.txt
data.
- DoesNotExist¶
alias of
RobotsModelDoesNotExist
- host_id = <ForeignKeyField: RobotsModel.host>¶
- id = <AutoField: RobotsModel.id>¶
- timestamp: datetime = <DateTimeField: RobotsModel.timestamp>¶
Timestamp of the submission.
sitemap.xml
Records¶
The darc.model.web.sitemap
module defines the data model
representing sitemap.xml
data, specifically from new_host
submission.
See also
Please refer to darc.submit.submit_new_host()
for more
information.
- class darc.model.web.sitemap.SitemapModel(*args, **kwargs)[source]¶
Bases:
BaseModelWeb
Data model for
sitemap.xml
data.
- DoesNotExist¶
alias of
SitemapModelDoesNotExist
- host_id = <ForeignKeyField: SitemapModel.host>¶
- id = <AutoField: SitemapModel.id>¶
- timestamp: datetime = <DateTimeField: SitemapModel.timestamp>¶
Timestamp of the submission.
hosts.txt
Records¶
The darc.model.web.hosts
module defines the data model
representing hosts.txt
data, specifically from new_host
submission.
See also
Please refer to darc.submit.submit_new_host()
for more
information.
- class darc.model.web.hosts.HostsModel(*args, **kwargs)[source]¶
Bases:
BaseModelWeb
Data model for
hosts.txt
data.
- DoesNotExist¶
alias of
HostsModelDoesNotExist
- host_id = <ForeignKeyField: HostsModel.host>¶
- id = <AutoField: HostsModel.id>¶
- timestamp: datetime = <DateTimeField: HostsModel.timestamp>¶
Timestamp of the submission.
Crawler Records¶
The darc.model.web.requests
module defines the data model
representing crawler records, specifically
from requests
submission.
See also
Please refer to darc.submit.submit_requests()
for more
information.
- class darc.model.web.requests.RequestsHistoryModel(*args, **kwargs)[source]¶
Bases:
BaseModelWeb
Data model for history records from
requests
submission.
- DoesNotExist¶
alias of
RequestsHistoryModelDoesNotExist
- cookies: Cookies = <JSONField: RequestsHistoryModel.cookies>¶
Response cookies.
- id = <AutoField: RequestsHistoryModel.id>¶
-
model:
RequestsModel
= <ForeignKeyField: RequestsHistoryModel.model>¶ Original record.
- model_id = <ForeignKeyField: RequestsHistoryModel.model>¶
- request: Headers = <JSONField: RequestsHistoryModel.request>¶
Request headers.
- response: Headers = <JSONField: RequestsHistoryModel.response>¶
Response headers.
- timestamp: datetime = <DateTimeField: RequestsHistoryModel.timestamp>¶
Timestamp of the submission.
- class darc.model.web.requests.RequestsModel(*args, **kwargs)[source]¶
Bases:
BaseModelWeb
Data model for documents from
requests
submission.
- DoesNotExist¶
alias of
RequestsModelDoesNotExist
- cookies: Cookies = <JSONField: RequestsModel.cookies>¶
Response cookies.
- history: List[RequestsHistoryModel]¶
List of redirect history, back reference from
RequestsHistoryModel.model
.
- id = <AutoField: RequestsModel.id>¶
- request: Headers = <JSONField: RequestsModel.request>¶
Request headers.
- response: Headers = <JSONField: RequestsModel.response>¶
Response headers.
- session: Cookies = <JSONField: RequestsModel.session>¶
Session cookies.
- timestamp: datetime = <DateTimeField: RequestsModel.timestamp>¶
Timestamp of the submission.
- url_id = <ForeignKeyField: RequestsModel.url>¶
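The history back reference is a one-to-many relation: each history record points at one original record through its model field. A minimal sketch of resolving that relation in plain Python follows. HistorySketch and history_by_model are hypothetical names, and ordering the hops by timestamp is an assumption, since the reference above does not state an order.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass
class HistorySketch:
    """Hypothetical stand-in for one redirect hop (not the real RequestsHistoryModel)."""
    model_id: int              # identifies the original record
    timestamp: datetime        # timestamp of the submission
    response: Dict[str, str]   # response headers


def history_by_model(records: List[HistorySketch]) -> Dict[int, List[HistorySketch]]:
    """Group hops by the record they belong to, oldest hop first (assumed order)."""
    grouped: Dict[int, List[HistorySketch]] = defaultdict(list)
    for record in sorted(records, key=lambda r: r.timestamp):
        grouped[record.model_id].append(record)
    return dict(grouped)


hops = [
    HistorySketch(1, datetime(2020, 1, 1, 0, 0, 2), {"Location": "http://example.onion/final"}),
    HistorySketch(1, datetime(2020, 1, 1, 0, 0, 1), {"Location": "http://example.onion/step"}),
    HistorySketch(2, datetime(2020, 1, 2), {"Location": "http://other.onion/"}),
]
history = history_by_model(hops)
print([hop.response["Location"] for hop in history[1]])
```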
Loader Records¶
The darc.model.web.selenium
module defines the data model
representing loader records, specifically
from selenium
submission.
See also
Please refer to darc.submit.submit_selenium()
for more
information.
- class darc.model.web.selenium.SeleniumModel(*args, **kwargs)[source]¶
Bases:
BaseModelWeb
Data model for documents from
selenium
submission.
- DoesNotExist¶
alias of
SeleniumModelDoesNotExist
- id = <AutoField: SeleniumModel.id>¶
- screenshot: Optional[bytes] = <BlobField: SeleniumModel.screenshot>¶
Screenshot in PNG format as
bytes
.
- timestamp: datetime = <DateTimeField: SeleniumModel.timestamp>¶
Timestamp of the submission.
- url_id = <ForeignKeyField: SeleniumModel.url>¶
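Because the screenshot field is optional and holds raw PNG bytes, a consumer may want a cheap sanity check before decoding it. The helper below is a hypothetical sketch, not part of the project; it relies only on the fixed eight-byte signature that every PNG file starts with.

```python
from typing import Optional

# Every PNG file begins with this fixed eight-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def looks_like_png(data: Optional[bytes]) -> bool:
    """Return True if the value is present and starts with the PNG signature."""
    return data is not None and data.startswith(PNG_SIGNATURE)


print(looks_like_png(PNG_SIGNATURE + b"...image data..."))
print(looks_like_png(None))
```

The check only inspects the header, so it does not guarantee that the rest of the image is well formed.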