Submission Data Models
The darc.model.web module defines the data models used to store the data crawled by the darc project.
See also
Please refer to the darc.submit module for more information about data submission.
Hostname Records
The darc.model.web.hostname module defines the data model representing hostnames, specifically from new_host submissions.
See also
Please refer to darc.submit.submit_new_host() for more information.
- class darc.model.web.hostname.HostnameModel(*args, **kwargs)
  Bases: darc.model.abc.BaseModelWeb
  Data model for a hostname record.
  Important
  The alive attribute of a hostname is toggled if crawler() successfully requested a URL with that hostname.
  - DoesNotExist
    alias of HostnameModelDoesNotExist
  - alive
    Whether the hostname is still active. A hostname is considered inactive only if all of its subsidiary URLs are inactive.
  - discovery: datetime.datetime = <DateTimeField: HostnameModel.discovery>
    Timestamp of the first new_host submission.
  - hosts
  - id = <AutoField: HostnameModel.id>
  - last_seen: datetime.datetime = <DateTimeField: HostnameModel.last_seen>
    Timestamp of the last related submission.
  - proxy: darc.model.utils.Proxy = <IntEnumField: HostnameModel.proxy>
    Proxy type (c.f. link.proxy).
  - robots
  - since
    The hostname has been active/inactive since this timestamp. The timestamp is taken as the earliest timestamp among the related subsidiary active/inactive URLs.
  - sitemaps
  - urls
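The rule for alive and since can be sketched in plain Python. This is a minimal illustration under stated assumptions, not darc's implementation: UrlState and hostname_state are hypothetical names, and the sketch assumes every subsidiary URL carries its own alive flag and since timestamp.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Tuple


@dataclass
class UrlState:
    """Hypothetical stand-in for the alive/since pair of a URL record."""
    alive: bool
    since: datetime


def hostname_state(urls: List[UrlState]) -> Tuple[bool, datetime]:
    """Derive (alive, since) for a hostname from its subsidiary URLs."""
    if not urls:
        raise ValueError("a hostname needs at least one URL record")
    # The hostname is inactive only if all of its URLs are inactive.
    alive = any(url.alive for url in urls)
    # Take the earliest timestamp among the URLs sharing the hostname's state.
    since = min(url.since for url in urls if url.alive == alive)
    return alive, since
```

With one inactive URL and two active ones, the hostname counts as active, and since is the earlier of the two active timestamps.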
URL Records
The darc.model.web.url module defines the data model representing URLs, specifically from requests and selenium submissions.
See also
Please refer to darc.submit.submit_requests() and darc.submit.submit_selenium() for more information.
- class darc.model.web.url.URLModel(*args, **kwargs)
  Bases: darc.model.abc.BaseModelWeb
  Data model for a requested URL.
  Important
  The alive attribute of a URL is toggled if crawler() successfully requested that URL and the status code is ok.
  - DoesNotExist
    alias of URLModelDoesNotExist
  - discovery: datetime.datetime = <DateTimeField: URLModel.discovery>
    Timestamp of the first submission.
  - hostname: darc.model.web.hostname.HostnameModel = <ForeignKeyField: URLModel.hostname>
    Hostname (c.f. link.host).
  - hostname_id = <ForeignKeyField: URLModel.hostname>
  - id = <AutoField: URLModel.id>
  - last_seen: datetime.datetime = <DateTimeField: URLModel.last_seen>
    Timestamp of the last submission.
  - proxy: darc.model.utils.Proxy = <IntEnumField: URLModel.proxy>
    Proxy type (c.f. link.proxy).
  - requests
  - selenium
  - since: datetime.datetime = <DateTimeField: URLModel.since>
    The URL has been active/inactive since this timestamp.
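The relationship between a URL record and its hostname record, and the difference between discovery and last_seen, can be illustrated with the standard-library sqlite3 module. This is a hedged sketch rather than darc's schema: the table layout, the name and link columns, and the upsert and record_submission helpers are all assumptions made for the example. Only the id, hostname_id, discovery and last_seen fields come from the reference above.

```python
import sqlite3
from datetime import datetime

# Hypothetical schema: only id, hostname_id, discovery and last_seen
# mirror documented fields; "name" and "link" are invented here.
SCHEMA = """
CREATE TABLE hostname (
    id        INTEGER PRIMARY KEY,
    name      TEXT UNIQUE NOT NULL,
    discovery TEXT NOT NULL,
    last_seen TEXT NOT NULL
);
CREATE TABLE url (
    id          INTEGER PRIMARY KEY,
    link        TEXT UNIQUE NOT NULL,
    hostname_id INTEGER NOT NULL REFERENCES hostname(id),
    discovery   TEXT NOT NULL,
    last_seen   TEXT NOT NULL
);
"""


def upsert(conn, table, column, value, when, **extra):
    """Insert a row on first sight; afterwards only update last_seen."""
    # table and column are fixed identifiers chosen by this sketch,
    # never user input, so formatting them into the SQL is safe here.
    stamp = when.isoformat()
    row = conn.execute(
        f"SELECT id FROM {table} WHERE {column} = ?", (value,)
    ).fetchone()
    if row is None:
        columns = [column, "discovery", "last_seen", *extra]
        params = [value, stamp, stamp, *extra.values()]
        marks = ", ".join("?" for _ in columns)
        cursor = conn.execute(
            f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({marks})",
            params,
        )
        return cursor.lastrowid
    # discovery stays untouched; last_seen follows the newest submission.
    conn.execute(f"UPDATE {table} SET last_seen = ? WHERE id = ?", (stamp, row[0]))
    return row[0]


def record_submission(conn, host, link, when):
    """Record one submission for a URL and the hostname it belongs to."""
    host_id = upsert(conn, "hostname", "name", host, when)
    return upsert(conn, "url", "link", link, when, hostname_id=host_id)
```

Submitting the same URL twice yields a single row whose discovery keeps the first timestamp while last_seen moves to the second.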
robots.txt Records
The darc.model.web.robots module defines the data model representing robots.txt data, specifically from new_host submissions.
See also
Please refer to darc.submit.submit_new_host() for more information.
- class darc.model.web.robots.RobotsModel(*args, **kwargs)
  Bases: darc.model.abc.BaseModelWeb
  Data model for robots.txt data.
  - DoesNotExist
    alias of RobotsModelDoesNotExist
  - host: darc.model.web.hostname.HostnameModel = <ForeignKeyField: RobotsModel.host>
    Hostname (c.f. link.host).
  - host_id = <ForeignKeyField: RobotsModel.host>
  - id = <AutoField: RobotsModel.id>
  - timestamp: datetime.datetime = <DateTimeField: RobotsModel.timestamp>
    Timestamp of the submission.
sitemap.xml Records
The darc.model.web.sitemap module defines the data model representing sitemap.xml data, specifically from new_host submissions.
See also
Please refer to darc.submit.submit_new_host() for more information.
- class darc.model.web.sitemap.SitemapModel(*args, **kwargs)
  Bases: darc.model.abc.BaseModelWeb
  Data model for sitemap.xml data.
  - DoesNotExist
    alias of SitemapModelDoesNotExist
  - host: darc.model.web.hostname.HostnameModel = <ForeignKeyField: SitemapModel.host>
    Hostname (c.f. link.host).
  - host_id = <ForeignKeyField: SitemapModel.host>
  - id = <AutoField: SitemapModel.id>
  - timestamp: datetime.datetime = <DateTimeField: SitemapModel.timestamp>
    Timestamp of the submission.
hosts.txt Records
The darc.model.web.hosts module defines the data model representing hosts.txt data, specifically from new_host submissions.
See also
Please refer to darc.submit.submit_new_host() for more information.
- class darc.model.web.hosts.HostsModel(*args, **kwargs)
  Bases: darc.model.abc.BaseModelWeb
  Data model for hosts.txt data.
  - DoesNotExist
    alias of HostsModelDoesNotExist
  - host: darc.model.web.hostname.HostnameModel = <ForeignKeyField: HostsModel.host>
    Hostname (c.f. link.host).
  - host_id = <ForeignKeyField: HostsModel.host>
  - id = <AutoField: HostsModel.id>
  - timestamp: datetime.datetime = <DateTimeField: HostsModel.timestamp>
    Timestamp of the submission.
Crawler Records
The darc.model.web.requests module defines the data model representing crawler records, specifically from requests submissions.
See also
Please refer to darc.submit.submit_requests() for more information.
- class darc.model.web.requests.RequestsHistoryModel(*args, **kwargs)
  Bases: darc.model.abc.BaseModelWeb
  Data model for history records from requests submission.
  - DoesNotExist
    alias of RequestsHistoryModelDoesNotExist
  Response cookies.
  - id = <AutoField: RequestsHistoryModel.id>
  - model: darc.model.web.requests.RequestsModel = <ForeignKeyField: RequestsHistoryModel.model>
    Original record.
  - model_id = <ForeignKeyField: RequestsHistoryModel.model>
  - timestamp: datetime.datetime = <DateTimeField: RequestsHistoryModel.timestamp>
    Timestamp of the submission.
- class darc.model.web.requests.RequestsModel(*args, **kwargs)
  Bases: darc.model.abc.BaseModelWeb
  Data model for documents from requests submission.
  - DoesNotExist
    alias of RequestsModelDoesNotExist
  Response cookies.
  - history
  - id = <AutoField: RequestsModel.id>
  - timestamp: datetime.datetime = <DateTimeField: RequestsModel.timestamp>
    Timestamp of the submission.
  - url: darc.model.web.url.URLModel = <ForeignKeyField: RequestsModel.url>
    Original URL (c.f. link.url).
  - url_id = <ForeignKeyField: RequestsModel.url>
Loader Records
The darc.model.web.selenium module defines the data model representing loader records, specifically from selenium submissions.
See also
Please refer to darc.submit.submit_selenium() for more information.
- class darc.model.web.selenium.SeleniumModel(*args, **kwargs)
  Bases: darc.model.abc.BaseModelWeb
  Data model for documents from selenium submission.
  - DoesNotExist
    alias of SeleniumModelDoesNotExist
  - id = <AutoField: SeleniumModel.id>
  - screenshot: Optional[bytes] = <BlobField: SeleniumModel.screenshot>
    Screenshot in PNG format as bytes.
  - timestamp: datetime.datetime = <DateTimeField: SeleniumModel.timestamp>
    Timestamp of the submission.
  - url: darc.model.web.url.URLModel = <ForeignKeyField: SeleniumModel.url>
    Original URL (c.f. link.url).
  - url_id = <ForeignKeyField: SeleniumModel.url>
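Because screenshot is typed Optional[bytes] and is expected to hold PNG data, a consumer may want to check a value before using it. The helper below is a small sketch for that purpose: is_png is a hypothetical name and not part of darc, and it relies only on the fixed 8-byte signature that begins every PNG file.

```python
from typing import Optional

# Every PNG file starts with this fixed 8-byte signature.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def is_png(screenshot: Optional[bytes]) -> bool:
    """Return True if the blob is present and starts with the PNG signature."""
    return screenshot is not None and screenshot.startswith(PNG_SIGNATURE)
```

The check only inspects the header, so it separates a missing or clearly non-PNG blob from a plausible one without decoding the image.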