Installation guide¶

Supported Python versions¶

Scrapy requires Python 3.6+, either the CPython implementation (default) or the PyPy 7.2.0+ implementation (see Alternate Implementations).

Installing Scrapy¶

If you’re using Anaconda or Miniconda, you can install the package from the conda-forge channel, which has up-to-date packages for Linux, Windows and macOS.

To install Scrapy using conda, run:
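    conda install -c conda-forge scrapy

This is the standard conda-forge invocation; it pulls Scrapy and its non-Python dependencies from the conda-forge channel.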
Using a virtual environment (recommended)¶

TL;DR: We recommend installing Scrapy inside a virtual environment on all platforms.

Python packages can be installed either globally (a.k.a. system wide) or in user-space. We do not recommend installing Scrapy system wide.

Instead, we recommend that you install Scrapy within a so-called “virtual environment” (venv). Virtual environments let you avoid conflicts with already-installed Python system packages (which could break some of your system tools and scripts), while still installing packages normally with pip (without sudo and the like).

See Virtual Environments and Packages on how to create your virtual environment.

Once you have created a virtual environment, you can install Scrapy inside it with pip, just like any other Python package. (See the platform-specific guides below for non-Python dependencies that you may need to install beforehand.) A minimal command sequence is sketched below.
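A minimal sketch, assuming python3 and pip are available on your PATH; the environment name scrapy-env is arbitrary:

    python3 -m venv scrapy-env
    source scrapy-env/bin/activate   # on Windows: scrapy-env\Scripts\activate
    pip install scrapy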
There’s a lesson here: for most scraping code, you want it to be resilient to errors due to things not being found on a page, so that even if some parts fail to be scraped, you can at least get some data.

Besides the getall() and get() methods, you can also use the re() method to extract using regular expressions:

    >>> response.css('title::text').re(r'Quotes.*')
    ['Quotes to Scrape']
    >>> response.css('title::text').re(r'Q\w+')
    ['Quotes']
    >>> response.css('title::text').re(r'(\w+) to (\w+)')
    ['Quotes', 'Scrape']
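Relatedly, get() returns None when nothing matches and also accepts a default value, which is what makes the “fail soft” behaviour described above possible. A short sketch (the CSS selector is hypothetical):

    >>> response.css('p.does-not-exist::text').get() is None
    True
    >>> response.css('p.does-not-exist::text').get(default='not-found')
    'not-found'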
CloseSpider extension (issue 4836)

Removed references to Python 2’s unicode type (issue 4547, issue 4703)

We now have an official deprecation policy (issue 4705)

Our documentation policies now cover usage of Sphinx’s versionadded and versionchanged directives, and we have removed usages referencing Scrapy 1.4.0 and earlier versions (issue 3971, issue 4310)

Other documentation cleanups (issue 4090, issue 4782, issue 4800, issue 4801, issue 4809, issue 4816, issue 4825)
Quality assurance¶

Extended typing hints (issue 4243, issue 4691)

Added tests for the check command (issue 4663)

Fixed test failures on Debian (issue 4726, issue 4727, issue 4735)

Improved Windows test coverage (issue 4723)

Switched to formatted string literals where possible (issue 4307, issue 4324, issue 4672)

Modernized super() usage (issue 4707)

Other code and test cleanups (issue 1790, issue 3288, issue 4165, issue 4564, issue 4651, issue 4714, issue 4738, issue 4745, issue 4747, issue 4761, issue 4765, issue 4804, issue 4817, issue 4820, issue 4822, issue 4839)

CookiesMiddleware fixes
Backward-incompatible changes¶

Support for Python 3.5.0 and 3.5.1 has been dropped; Scrapy now refuses to run with a Python version lower than 3.5.2, which introduced typing.Type (issue 4615)
Deprecations¶

TextResponse.body_as_unicode is now deprecated, use TextResponse.text instead

… not downloaded; see FilesPipeline.get_media_requests for more information (issue 2893, issue 4486)

When using Google Cloud Storage for a media pipeline, a warning is now logged if the configured credentials do not grant the required permissions (issue 4346, issue 4508)

Link extractors are now serializable, as long as you do not use lambdas for parameters; for example, you can now pass link extractors in Request.cb_kwargs or Request.meta when persisting scheduled requests (issue 4554)

Upgraded the pickle protocol that Scrapy uses from protocol 2 to protocol 4, improving serialization capabilities and performance (issue 4135, issue 4541)

scrapy.utils.misc.create_instance() now raises a TypeError exception if the resulting instance is None (issue 4528, issue 4532)
Bug fixes¶

CookiesMiddleware no longer discards cookies defined in Request.headers (issue 1992, issue 2400)

CookiesMiddleware no longer re-encodes cookies defined as bytes in the cookies parameter of the __init__ method of Request (issue 2400, issue 3575)

When FEEDS defines multiple URIs, FEED_STORE_EMPTY is False and the crawl yields no items, Scrapy no longer stops feed exports after the first URI (issue 4621, issue 4626)

Spider callbacks defined using coroutine syntax no longer need to return an iterable, and may instead return a Request object, an item, or None (issue 4609)

The startproject command now ensures that the generated project folders and files have the right permissions (issue 4604)

Fixed a KeyError exception that was sometimes raised from scrapy.utils.datatypes.LocalWeakReferencedCache (issue 4597, issue 4599)

When FEEDS defines multiple URIs, log messages about items being stored now contain information from the corresponding feed, instead of always containing information about only one of the feeds (issue 4619, issue 4629)
Documentation¶

Added a new section about accessing cb_kwargs from errbacks (issue 4598, issue 4634)

Covered chompjs in Parsing JavaScript code (issue 4556, issue 4562)

Removed from Coroutines the warning about the API being experimental (issue 4511, issue 4513)

Removed references to unsupported versions of Twisted (issue 4533)

Updated the description of the screenshot pipeline example, which now uses coroutine syntax instead of returning a Deferred (issue 4514, issue 4593)

Removed a misleading import line from the scrapy.utils.log.configure_logging() code example (issue 4510, issue 4587)

The display-on-hover behavior of internal documentation references now also covers links to commands, Request.meta keys, settings and … (…, issue 4572)

Removed remnants of Python 2 support (issue 4550, issue 4553, issue 4568)

Improved code sharing between the crawl and runspider commands (issue 4548, issue 4552)

Replaced chain(*iterable) with chain.from_iterable(iterable) (issue 4635)

You may now run the asyncio tests with Tox on any Python version (issue 4521)

Updated test requirements to reflect an incompatibility with pytest 5.4 and 5.4.1 (issue 4588)

Improved SpiderLoader test coverage for scenarios involving duplicate spider names (issue 4549, issue 4560)

Configured Travis CI to also run the tests with Python 3.5.2 (issue 4518, issue 4615)
New FEEDS setting to export to multiple feeds

New Response.ip_address attribute

AssertionError exceptions triggered by assert statements have been replaced by new exception types, to support running Python in optimized mode (see -O) without changing Scrapy’s behavior in any unexpected ways.

If you catch an AssertionError exception from Scrapy, update your code to catch the corresponding new exception.

Request serialization no longer breaks for callbacks that are spider attributes which are assigned a function with a different name (issue 4500)

None values in allowed_domains no longer cause a TypeError exception (issue 4410)

Zsh completion no longer allows options after arguments (issue 4438)

zope.interface 5.0.0 and later versions are now supported (issue 4447, issue 4448)

Spider.make_requests_from_url, deprecated in Scrapy 1.4.0, now issues a warning when used (issue 4412)

Removed warnings about using old, removed settings (issue 4404)

Removed a warning about importing StringTransport from twisted.test.proto_helpers in Twisted 19.7.0 or newer (issue 4409)

Removed outdated Debian package build files (issue 4384)

Removed object usage as a base class (issue 4430)

Removed code that added support for old versions of Twisted that we no longer support (issue 4472)

Fixed code style issues (issue 4468, issue 4469, issue 4471, issue 4481)

Removed twisted.internet.defer.returnValue() calls (issue 4443, issue 4446, issue 4489)

Python 2 support has been removed

Partial coroutine syntax support and experimental asyncio support

New Response.follow_all method

FTP support for media pipelines

New Response.certificate attribute

IPv6 support through DNS_RESOLVER

scrapy.linkextractors.FilteringLinkExtractor is deprecated, use scrapy.linkextractors.LinkExtractor instead (issue 4045)

The noconnect query string argument of proxy URLs is deprecated and should be removed from proxy URLs (issue 4198)

The next method of scrapy.utils.python.MutableChain is deprecated, use the global next() function or MutableChain.__next__ instead (issue 4153)

Added partial support for Python’s coroutine syntax and experimental support for asyncio and asyncio-powered libraries (issue 4010, issue 4259, issue 4269, issue 4270, issue 4271, issue 4316, issue 4318)

The new Response.follow_all method offers the same functionality as Response.follow but supports an iterable of URLs as input and returns an iterable of requests (issue 2582, issue 4057, issue 4286)

item_error for exceptions raised during item processing by item pipelines

spider_error for exceptions raised from spider callbacks

The FEED_URI setting now supports pathlib.Path values (issue 3731, issue 4074)

A new request_left_downloader signal is sent when a request leaves the downloader (issue 4303)

Scrapy logs a warning when it detects a request callback or errback that uses yield but also returns a value, since the returned value would be lost (issue 3484, issue 3869)

Spider objects now raise an AttributeError exception if they do not have a start_urls attribute nor reimplement start_requests, but have a start_url attribute (issue 4133, issue 4170)

BaseItemExporter subclasses may now use super().__init__(**kwargs) instead of self._configure(kwargs) in their __init__ method, passing dont_fail=True to the parent __init__ method if needed, and accessing kwargs at self._kwargs …

RFPDupeFilter, the default DUPEFILTER_CLASS, no longer writes an extra \r character on each line in Windows, which made the size of the requests.seen file unnecessarily large on that platform (issue 4283)

Z shell auto-completion now looks for .html files, not .http files, and covers the -h command-line switch (issue 4122, issue 4291)

Adding items to a scrapy.utils.datatypes.LocalCache object without a limit defined no longer raises a TypeError exception (issue 4123)

Fixed a typo in the message of the ValueError exception raised when scrapy.utils.misc.create_instance() gets both settings and crawler set to None (issue 4128)

Cross-references within our documentation now display a tooltip when hovered (issue 4173, issue 4183)

Improved the documentation about LinkExtractor.extract_links and simplified Link Extractors (issue 4045)

Clarified how ItemLoader.item works (issue 3574, issue 4099)

Clarified that logging.basicConfig() should not be used when also using CrawlerProcess (issue 2149, issue 2352, issue 3146, issue 3960)

Clarified the requirements for Request objects when using persistence (issue 4124, issue 4139)

Clarified how to install a custom image pipeline (issue 4034, issue 4252)

Fixed the signatures of the file_path method in media pipeline examples (issue 4290)

Fixed logic issues, broken links and typos (issue 4247, issue 4258, issue 4282, issue 4288, issue 4305, issue 4308, issue 4323, issue 4338, issue 4359, issue 4361)

Improved consistency when referring to the __init__ method of an object (issue 4086, issue 4088)

Fixed an inconsistency between code and output in Scrapy at a glance (issue 4213)

Extended intersphinx usage (issue 4147, issue 4172, issue 4185, issue 4194, issue 4197)

We now use a recent version of Python to build the documentation (issue 4140, issue 4249)

Cleaned up documentation (issue 4143, issue 4275)

Improved test coverage (issue 4097, issue 4218, issue 4236)

Started reporting slowest tests, and improved the performance of some of them (issue 4163, issue 4164)

Fixed broken tests and refactored some tests (issue 4014, issue 4095, issue 4244, issue 4268, issue 4372)

Modified the tox configuration to allow running tests with any Python version, run Bandit and Flake8 tests by default, and enforce a minimum tox version programmatically (issue 4179)

Cleaned up code (issue 3937, issue 4208, issue 4209, issue 4210, issue 4212, issue 4369, issue 4376, issue 4378)

ItemLoader.load_item() no longer makes later calls to ItemLoader.get_output_value() or ItemLoader.load_item() return empty data (issue 3804, issue 3819, issue 3897, issue 3976, issue 3998, issue 4036)

Fixed DummyStatsCollector raising a TypeError exception (issue 4007, issue 4052)

FilesPipeline.file_path and ImagesPipeline.file_path no longer choose file extensions that are not registered with IANA (issue 1287, issue 3953, issue 3954)

When using botocore to persist files in S3, all botocore-supported headers are properly mapped now (issue 3904, issue 3905)

FTP passwords in FEED_URI containing percent-escaped characters …

The following modules are deprecated:

scrapy.utils.http (use w3lib.http)

scrapy.utils.markup (use w3lib.html)

scrapy.utils.multipart (use urllib3)

The scrapy.utils.datatypes.MergeDict class is deprecated for Python 3 code bases. Use ChainMap instead. (issue 3878)

The scrapy.utils.gz.is_gzipped function is deprecated. Use scrapy.utils.gz.gzip_magic_number instead.

CrawlerProcess.start(stop_after_crawl=True)[source]¶

This method starts a reactor, adjusts its pool size to REACTOR_THREADPOOL_MAXSIZE, and installs a DNS cache based on DNSCACHE_ENABLED and DNSCACHE_SIZE.

If stop_after_crawl is True, the reactor will be stopped after all crawlers have finished, using join().
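A sketch of running a crawl through CrawlerProcess as described above (the spider, settings values and URL are illustrative):

    import scrapy
    from scrapy.crawler import CrawlerProcess

    class QuotesSpider(scrapy.Spider):
        name = 'quotes'
        start_urls = ['https://quotes.toscrape.com']

        def parse(self, response):
            yield {'title': response.css('title::text').get()}

    process = CrawlerProcess(settings={'REACTOR_THREADPOOL_MAXSIZE': 20})
    process.crawl(QuotesSpider)
    # Starts the Twisted reactor; returns once all crawlers have finished
    process.start(stop_after_crawl=True)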
get(name, default=None)[source]¶

Get a setting value without affecting its original type.

getbool(name, default=False)[source]¶

1, '1', True and 'True' return True, while 0, '0', False, 'False' and None return False.

For example, settings populated through environment variables set to '0' will return False when using this method.

getdict(name, default=None)[source]¶

If value is a BaseSettings instance itself, it will be converted to a dictionary, containing all its current settings values as they would be returned by get(), and losing all information about priority and mutability.

getlist(name, default=None)[source]¶

Get a setting value as a list. If the setting original type is a list, a copy of it will be returned. If it’s a string it will be split by “,”.

For example, settings populated through environment variables set to 'one,two' will return a list ['one', 'two'] when using this method.

getpriority(name)[source]¶

Return the current numerical priority value of a setting, or None if the given name does not exist.

getwithbase(name)[source]¶

Get a composition of a dictionary-like setting and its _BASE counterpart.

maxpriority()[source]¶

set(name, value, priority='project')[source]¶

Store a key/value attribute with a given priority.

Settings should be populated before configuring the Crawler object (through the configure() method), otherwise they won’t have any effect.

Parameters

name (str) – the setting name

value (object) – the value to associate with the setting

priority (str or int) – the priority of the setting. Should be a key of SETTINGS_PRIORITIES or an integer

setmodule(module, priority='project')[source]¶

Store settings from a module with a given priority.

This is a helper function that calls set() for every globally declared uppercase variable of module with the provided priority.

Parameters

module (types.ModuleType or str) – the module or the path of the module

priority (str or int) – the priority of the settings. Should be a key of SETTINGS_PRIORITIES or an integer

update(values, priority='project')[source]¶

Parameters

values (dict or string or BaseSettings) – the settings names and values

priority (str or int) – the priority of the settings. Should be a key of SETTINGS_PRIORITIES or an integer
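A short sketch of how these methods interact (setting names are real Scrapy settings; the values and priorities are illustrative):

    >>> from scrapy.settings import Settings
    >>> settings = Settings({'RETRY_ENABLED': '0', 'DOWNLOAD_DELAY': 'one,two'})
    >>> settings.getbool('RETRY_ENABLED')        # '0' is read as False
    False
    >>> settings.set('DOWNLOAD_DELAY', 2, priority='cmdline')
    >>> settings.update({'DOWNLOAD_DELAY': 5}, priority='spider')  # lower priority: ignored
    >>> settings.getint('DOWNLOAD_DELAY')
    2
    >>> settings.getpriority('DOWNLOAD_DELAY')   # 'cmdline' maps to 40 in SETTINGS_PRIORITIES
    40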
load(spider_name)[source]¶

Get the Spider class with the given name. It’ll look into the previously loaded spiders for a spider class with name spider_name and will raise a KeyError if not found.

list()[source]¶

connect(receiver, signal, **kwargs)[source]¶

Connect a receiver function to a signal.

The signal can be any object, although Scrapy comes with some predefined signals that are documented in the Signals section.

Parameters

receiver (collections.abc.Callable) – the function to be connected

signal (object) – the signal to connect to

disconnect_all(signal, **kwargs)[source]¶

Disconnect all receivers from the given signal.

send_catch_log(signal, **kwargs)[source]¶
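A sketch of using connect() from a spider’s from_crawler, which is the usual place to register signal handlers (the handler body and log message are illustrative):

    from scrapy import signals, Spider

    class MySpider(Spider):
        name = 'example'

        @classmethod
        def from_crawler(cls, crawler, *args, **kwargs):
            spider = super().from_crawler(crawler, *args, **kwargs)
            # Register spider_closed() as a receiver for the spider_closed signal
            crawler.signals.connect(spider.spider_closed, signal=signals.spider_closed)
            return spider

        def spider_closed(self, spider):
            self.logger.info('Spider closed: %s', spider.name)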
class scrapy.contracts.Contract(method, *args)[source]¶

Parameters

method (collections.abc.Callable) – callback function to which the contract is associated

args (list) – list of arguments passed into the docstring (whitespace separated)

… Request callbacks.

    adapter = ItemAdapter(item)
    adapter['field'] = await db.get_some_data(adapter['id'])
    return item

Common uses for coroutines include:

requesting data from websites, databases and other services (in callbacks, pipelines and middlewares);

storing data in databases (in pipelines and middlewares);

scrapy.utils.curl.curl_to_request_kwargs(curl_command, ignore_unknown_options=True)[source]¶

Convert a cURL command syntax to Request kwargs.

Parameters

curl_command (str) – string containing the curl command

ignore_unknown_options (bool) – If true, only a warning is emitted when cURL options are unknown. Otherwise raises an error. (default: True)

Returns

dictionary of Request kwargs
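A sketch of calling this helper (the command string is illustrative; the exact keys in the returned dictionary depend on which cURL options are present):

    >>> from scrapy import Request
    >>> from scrapy.utils.curl import curl_to_request_kwargs
    >>> kwargs = curl_to_request_kwargs("curl 'https://example.org' -H 'Accept: text/html'")
    >>> kwargs['url']
    'https://example.org'
    >>> request = Request(**kwargs)  # the returned kwargs can be fed directly to Request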
class scrapy.extensions.httpcache.DbmCacheStorage[source]¶

A DBM storage backend is also available for the HTTP cache middleware.

By default, it uses the dbm module, but you can change it with the HTTPCACHE_DBM_MODULE setting.

class scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware[source]¶

This middleware sets the HTTP proxy to use for requests, by setting the proxy meta value for Request objects.

Like the Python standard library module urllib.request, it obeys the following environment variables:

http_proxy

https_proxy

no_proxy

You can also set the meta key proxy per-request, to a proxy URL value.

Protego (the default parser):

supports wildcard matching

uses the length based rule

Scrapy uses this parser by default.

RobotFileParser:

is Python’s built-in robots.txt parser

is compliant with Martijn Koster’s 1996 draft specification

lacks support for wildcard matching

doesn’t use the length based rule

abstract allowed(url, user_agent)[source]¶

Return True if user_agent is allowed to crawl url, otherwise return False.

abstract classmethod from_crawler(crawler, robotstxt_body)[source]¶

Parse the content of a robots.txt file as bytes. This must be a class method. It must return a new instance of the parser backend.

Parameters

crawler (Crawler instance) – crawler which made the request

robotstxt_body (bytes) – content of a robots.txt file.
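A minimal sketch of a custom parser backend implementing these two abstract methods (deliberately permissive; the class name is illustrative and not part of Scrapy). Such a backend would be selected through the ROBOTSTXT_PARSER setting:

    from scrapy.robotstxt import RobotParser

    class AllowAllRobotParser(RobotParser):
        """Toy backend that parses nothing and allows every URL."""

        def __init__(self, robotstxt_body):
            self.body = robotstxt_body  # bytes, kept only for reference

        @classmethod
        def from_crawler(cls, crawler, robotstxt_body):
            # Must return a new instance of the parser backend
            return cls(robotstxt_body)

        def allowed(self, url, user_agent):
            # Return True if user_agent may crawl url
            return True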
If the response is HTML or XML, use selectors as usual.

If the response is JSON, use json.loads() to load the desired data from response.text:

    data = json.loads(response.text)

If the desired data is inside HTML or XML code embedded within JSON data, you can load that HTML or XML code into a Selector and then use it as usual:

    selector = Selector(data['html'])

If the response is JavaScript, or HTML with a <script/> element containing the desired data, see Parsing JavaScript code.

If the response is CSS, use a regular expression to extract the desired data from response.text.

If the response is an image or another format based on images (e.g. PDF), read the response as bytes from response.body and use an OCR solution to extract the desired data.

If the JavaScript code is within a <script/> element of an HTML page, use selectors to extract the text within that <script/> element.

You might be able to use a regular expression to extract the desired data in JSON format, which you can then parse with json.loads().

For example, if the JavaScript code contains a separate line like var data = {"field": "value"}; you can extract that data as follows:

    >>> pattern = r'\bvar\s+data\s*=\s*(\{.*?\})\s*;\s*\n'
    >>> json_data = response.css('script::text').re_first(pattern)
    >>> json.loads(json_data)
    {'field': 'value'}

chompjs provides an API to parse JavaScript objects into a dict.

For example, if the JavaScript code contains var data = {field: "value", secondField: "second value"}; you can extract that data as follows:

    >>> import chompjs
    >>> javascript = response.css('script::text').get()
    >>> data = chompjs.parse_js_object(javascript)
    >>> data
Sending e-mail¶

Although Python makes sending e-mails relatively easy via the smtplib library, Scrapy provides its own facility for sending e-mails which is very easy to use and is implemented using Twisted non-blocking IO, to avoid interfering with the non-blocking IO of the crawler. It also provides a simple API for sending attachments and is very easy to configure, with a few settings.

Quick example¶

There are two ways to instantiate the mail sender. You can instantiate it using the standard __init__ method:

    from scrapy.mail import MailSender
    mailer = MailSender()

    mailer.send(to=["someone@example.com"], subject="Some subject", body="Some body", cc=["another@example.com"])

MailSender class reference¶

MailSender is the preferred class to use for sending emails from Scrapy, as it uses Twisted non-blocking IO, like the rest of the framework.

class scrapy.mail.MailSender(smtphost=None, mailfrom=None, smtpuser=None, smtppass=None, smtpport=None)[source]¶

Parameters

smtphost (str or bytes) – the SMTP host to use for sending the emails. If omitted, the MAIL_HOST setting will be used.

mailfrom (str) – the address used to send emails (in the From: header). If omitted, the MAIL_FROM setting will be used.

smtpuser – the SMTP user. If omitted, the MAIL_USER setting will be used. If not given, no SMTP authentication will be performed.

smtpport (int) – the SMTP port to connect to

smtptls (bool) – enforce using SMTP STARTTLS

smtpssl (bool) – enforce using a secure SSL connection

classmethod from_settings(settings)[source]¶

Instantiate using a Scrapy settings object, which will respect …

send(to, subject, body, cc=None, attachs=(), mimetype='text/plain', charset=None)[source]¶

Send email to the given recipients.

Parameters

to (str or list) – the e-mail recipients as a string or as a list of strings

subject (str) – the subject of the e-mail

cc (str or list) – the e-mails to CC as a string or as a list of strings

body (str) – the e-mail body

attachs (collections.abc.Iterable) – an iterable of tuples (attach_name, mimetype, file_object) where attach_name is a string with the name that will appear on the e-mail’s attachment, mimetype is the mimetype of the attachment and file_object is a readable file object with the contents of the attachment

mimetype (str) – the MIME type of the e-mail

charset (str) – the character encoding to use for the e-mail contents
exception scrapy.exceptions.CloseSpider(reason='cancelled')[source]¶

This exception can be raised from a spider callback to request the spider to be closed/stopped. Supported arguments:

For example:

    def parse_page(self, response):
        if 'Bandwidth exceeded' in response.body:
            raise CloseSpider('bandwidth_exceeded')

… received in the signal handler that raises the exception. Also, the response object is marked with "download_stopped" in its Response.flags attribute.

Note

fail is a keyword-only parameter, i.e. raising StopDownload(False) or StopDownload(True) will raise a TypeError.

See the documentation for the bytes_received signal and the Stopping the download of a Response topic for additional information and examples.
│ │ │-
│ │ │
- Parameters │ │ │
-
│ │ │ -
field (
Field
object or adict
instance) – the field being serialized. If the source item object does not define field metadata, field is an empty │ │ │ -dict
.
│ │ │ -name (str) – the name of the field being serialized
│ │ │ +field (
Field
object or adict
instance) – the field being serialized. If the source item object does not define field metadata, field is an empty │ │ │ +dict
.
│ │ │ +name (str) – the name of the field being serialized
│ │ │ value – the value being serialized
│ │ │
│ │ │
-
│ │ │ @@ -453,31 +453,31 @@
│ │ │
-
│ │ │ class
scrapy.exporters.
PythonItemExporter
(*, dont_fail=False, **kwargs)[source]¶
│ │ │ This is a base class for item exporters that extends │ │ │
│ │ │BaseItemExporter
with support for nested items.It serializes items to built-in Python types, so that any serialization │ │ │ -library (e.g.
│ │ │ +library (e.g.json
or msgpack) can be used on top of it.json
or msgpack) can be used on top of it. │ │ │
PythonItemExporter¶
│ │ │-
│ │ │
│ │ ││ │ ├── ./usr/share/doc/python-scrapy-doc/html/topics/extensions.html │ │ │ @@ -554,15 +554,15 @@ │ │ │XmlItemExporter¶
│ │ │-
│ │ │
-
│ │ │ class
scrapy.exporters.
XmlItemExporter
(file, item_element='item', root_element='items', **kwargs)[source]¶
│ │ │ Exports items in XML format to the specified file object.
│ │ │-
│ │ │
- Parameters │ │ │
-
│ │ │
file – the file-like object to use for exporting the data. Its
write
method should │ │ │ acceptbytes
(a disk file opened in binary mode, aio.BytesIO
object, etc)
│ │ │ -root_element (str) – The name of root element in the exported XML.
│ │ │ -item_element (str) – The name of each item element in the exported XML.
│ │ │ +root_element (str) – The name of root element in the exported XML.
│ │ │ +item_element (str) – The name of each item element in the exported XML.
│ │ │
│ │ │
The additional keyword arguments of this
│ │ │__init__
method are passed to the │ │ │BaseItemExporter
__init__
method.A typical output of this exporter would be:
│ │ ││ │ ││ │ │<?xml version="1.0" encoding="utf-8"?> │ │ │ @@ -526,28 +526,28 @@ │ │ │ CSV columns and their order. The
export_empty_fields
attribute has │ │ │ no effect on this exporter. │ │ │-
│ │ │
- Parameters │ │ │
-
│ │ │
file – the file-like object to use for exporting the data. Its
write
method should │ │ │ acceptbytes
(a disk file opened in binary mode, aio.BytesIO
object, etc)
│ │ │ -include_headers_line (str) – If enabled, makes the exporter output a header │ │ │ +
include_headers_line (str) – If enabled, makes the exporter output a header │ │ │ line with the field names taken from │ │ │
BaseItemExporter.fields_to_export
or the first exported item fields.
│ │ │ join_multivalued – The char (or chars) that will be used for joining │ │ │ multi-valued fields, if found.
│ │ │ -errors (str) – The optional string that specifies how encoding and decoding │ │ │ +
errors (str) – The optional string that specifies how encoding and decoding │ │ │ errors are to be handled. For more information see │ │ │ -
io.TextIOWrapper
.
│ │ │ +
io.TextIOWrapper
. │ │ │
│ │ │
The additional keyword arguments of this
│ │ │__init__
method are passed to the │ │ │BaseItemExporter
__init__
method, and the leftover arguments to the │ │ │ -csv.writer()
function, so you can use anycsv.writer()
function │ │ │ +csv.writer()
function, so you can use anycsv.writer()
function │ │ │ argument to customize this exporter.A typical output of this exporter would be:
│ │ ││ │ │ @@ -561,19 +561,19 @@ │ │ │ class│ │ │product,price │ │ │ Color TV,1200 │ │ │ DVD player,200 │ │ │
scrapy.exporters.
PickleItemExporter
(file, protocol=0, **kwargs)[source]¶ │ │ │ │ │ │ │ │ │Exports items in pickle format to the given file-like object.
│ │ │-
│ │ │
- Parameters │ │ │
- │ │ │ │ │ │
For more information, see
│ │ │ +pickle
.For more information, see
│ │ │pickle
.The additional keyword arguments of this
│ │ │__init__
method are passed to the │ │ │BaseItemExporter
__init__
method.Pickle isn’t a human readable format, so no output examples are provided.
│ │ ││ │ │ @@ -603,16 +603,16 @@ │ │ ││ │ │JsonItemExporter¶
│ │ │-
│ │ │
-
│ │ │ class
scrapy.exporters.
JsonItemExporter
(file, **kwargs)[source]¶
│ │ │ Exports items in JSON format to the specified file-like object, writing all │ │ │ objects as a list of objects. The additional
│ │ │ +arguments to the__init__
method arguments are │ │ │ passed to theBaseItemExporter
__init__
method, and the leftover │ │ │ -arguments to theJSONEncoder
__init__
method, so you can use any │ │ │ -JSONEncoder
__init__
method argument to customize this exporter.JSONEncoder
__init__
method, so you can use any │ │ │ +JSONEncoder
__init__
method argument to customize this exporter. │ │ │-
│ │ │
- Parameters │ │ │
file – the file-like object to use for exporting the data. Its
│ │ │write
method should │ │ │ acceptbytes
(a disk file opened in binary mode, aio.BytesIO
object, etc)
│ │ │
A typical output of this exporter would be:
│ │ │ @@ -637,16 +637,16 @@ │ │ │JsonLinesItemExporter¶
│ │ │-
│ │ │
-
│ │ │ class
scrapy.exporters.
JsonLinesItemExporter
(file, **kwargs)[source]¶
│ │ │ Exports items in JSON format to the specified file-like object, writing one │ │ │ JSON-encoded item per line. The additional
│ │ │ +the__init__
method arguments are passed │ │ │ to theBaseItemExporter
__init__
method, and the leftover arguments to │ │ │ -theJSONEncoder
__init__
method, so you can use any │ │ │ -JSONEncoder
__init__
method argument to customize this exporter.JSONEncoder
__init__
method, so you can use any │ │ │ +JSONEncoder
__init__
method argument to customize this exporter. │ │ │-
│ │ │
- Parameters │ │ │
file – the file-like object to use for exporting the data. Its
│ │ │write
method should │ │ │ acceptbytes
(a disk file opened in binary mode, aio.BytesIO
object, etc)
│ │ │
A typical output of this exporter would be:
│ │ │ @@ -661,20 +661,20 @@ │ │ │
│ │ ││ │ │MarshalItemExporter¶
│ │ │-
│ │ │
-
│ │ │ class
scrapy.exporters.
MarshalItemExporter
(file, **kwargs)[source]¶
│ │ │ Exports items in a Python-specific binary format (see │ │ │ -
│ │ │ +marshal
).marshal
). │ │ │-
│ │ │
- Parameters │ │ │
file – The file-like object to use for exporting the data. Its │ │ │ -
│ │ │ +write
method should acceptbytes
(a disk file │ │ │ -opened in binary mode, aBytesIO
object, etc)write
method should acceptbytes
(a disk file │ │ │ +opened in binary mode, aBytesIO
object, etc) │ │ │
│ │ │
│ │ ││ │ │Debugger extension¶
│ │ │ │ │ │ │ │ │ -Invokes a Python debugger inside a running Scrapy process when a SIGUSR2 │ │ │ +
Invokes a Python debugger inside a running Scrapy process when a SIGUSR2 │ │ │ signal is received. After the debugger is exited, the Scrapy process continues │ │ │ running normally.
│ │ │For more info see Debugging in Python.
│ │ │This extension only works on POSIX-compliant platforms (i.e. not Windows).
│ │ ││ │ │FEEDS¶
│ │ ││ │ ││ │ │New in version 2.1.
│ │ │Default:
│ │ │ -{}
A dictionary in which every key is a feed URI (or a
pathlib.Path
│ │ │ +A dictionary in which every key is a feed URI (or a
│ │ │pathlib.Path
│ │ │ object) and each value is a nested dictionary containing configuration │ │ │ parameters for the specific feed.This setting is required for enabling the feed export feature.
│ │ │See Storage backends for supported URI schemes.
│ │ │For instance:
│ │ │{ │ │ │ 'items.json': { │ │ │ @@ -563,15 +563,15 @@ │ │ │
│ │ ││ │ │ │ │ │New in version 2.3.0.
│ │ │ │ │ │encoding
: falls back toFEED_EXPORT_ENCODING
. │ │ │fields
: falls back toFEED_EXPORT_FIELDS
. │ │ │ -indent
: falls back toFEED_EXPORT_INDENT
.
│ │ │ +item_export_kwargs
:dict
with keyword arguments for the corresponding item exporter class. │ │ │
│ │ │item_export_kwargs
:dict
with keyword arguments for the corresponding item exporter class.│ │ ││ │ │New in version 2.4.0.
│ │ │
│ │ │overwrite
: whether to overwrite the file if it already exists │ │ │ (True
) or append to its content (False
).The default value depends on the storage backend:
│ │ │ @@ -715,15 +715,15 @@ │ │ │When generating multiple output files, you must use at least one of the following │ │ │ placeholders in the feed URI to indicate how the different output file names are │ │ │ generated:
│ │ │-
│ │ │
%(batch_time)s
- gets replaced by a timestamp when the feed is being created │ │ │ (e.g.2020-03-28T14-45-08.237134
)
│ │ │
│ │ │ -%(batch_id)d
- gets replaced by the 1-based sequence number of the batch.Use printf-style string formatting to │ │ │ +
Use printf-style string formatting to │ │ │ alter the number format. For example, to make the batch ID a 5-digit │ │ │ number by introducing leading zeroes as needed, use
│ │ │%(batch_id)05d
│ │ │ (e.g.3
becomes00003
,123
becomes00123
).
│ │ │
For instance, if your settings include:
│ │ ││ │ │FEED_EXPORT_BATCH_ITEM_COUNT = 100 │ │ │ @@ -744,26 +744,26 @@ │ │ │
Where the first and second files contain exactly 100 items. The last one contains │ │ │ 100 items or fewer.
│ │ ││ │ │FEED_URI_PARAMS¶
│ │ │Default:
│ │ │None
A string with the import path of a function to set the parameters to apply with │ │ │ -printf-style string formatting to the │ │ │ +printf-style string formatting to the │ │ │ feed URI.
│ │ │The function signature should be as follows:
│ │ │-
│ │ │
-
│ │ │
scrapy.extensions.feedexport.
uri_params
(params, spider)¶
│ │ │ - Return a
│ │ │ +dict
of key-value pairs to apply to the feed URI using │ │ │ -printf-style string formatting.Return a
│ │ │dict
of key-value pairs to apply to the feed URI using │ │ │ +printf-style string formatting.-
│ │ │
- Parameters │ │ │
-
│ │ │ -
params (dict) –
default key-value pairs
│ │ │ +params (dict) –
default key-value pairs
│ │ │Specifically:
│ │ │-
│ │ │
│ │ │batch_id
: ID of the file batch. See │ │ │FEED_EXPORT_BATCH_ITEM_COUNT
.If
│ │ │FEED_EXPORT_BATCH_ITEM_COUNT
is0
,batch_id
│ │ │ is always1
.│ │ ├── ./usr/share/doc/python-scrapy-doc/html/topics/items.html │ │ │ @@ -252,28 +252,28 @@ │ │ ││ │ ││ │ │Item Types¶
│ │ │Scrapy supports the following types of items, via the itemadapter library: │ │ │ dictionaries, Item objects, │ │ │ dataclass objects, and attrs objects.
│ │ ││ │ ││ │ │Dictionaries¶
│ │ │ -As an item type,
│ │ │ +dict
is convenient and familiar.As an item type,
│ │ │dict
is convenient and familiar.│ │ ││ │ │Item objects¶
│ │ │ -Item
provides adict
-like API plus additional features that │ │ │ +
│ │ │Item
provides adict
-like API plus additional features that │ │ │ make it the most feature-complete item type:-
│ │ │
-
│ │ │ class
scrapy.item.
Item
([arg])[source]¶
│ │ │ - Item
objects replicate the standarddict
API, including │ │ │ +
│ │ │Item
objects replicate the standarddict
API, including │ │ │ its__init__
method.
│ │ │Item
allows defining field names, so that:-
│ │ │ -
KeyError
is raised when using undefined field names (i.e. │ │ │ +KeyError
is raised when using undefined field names (i.e. │ │ │ prevents typos going unnoticed)
│ │ │ Item exporters can export all fields by │ │ │ default even if the first scraped object does not have values for all │ │ │ of them
│ │ │
│ │ │ @@ -284,15 +284,15 @@ │ │ │Item
also allows defining field metadata, which can be used to │ │ │ customize serialization.-
│ │ │
copy
()¶
│ │ │
-
│ │ │
-
│ │ │
deepcopy
()¶
│ │ │ - Return a
│ │ │ +deepcopy()
of this item.Return a
│ │ │deepcopy()
of this item.
-
│ │ │
-
│ │ │
fields
¶
│ │ │ A dictionary containing all declared fields for this Item, not only │ │ │ those populated. The keys are the field names and the values are the │ │ │ @@ -311,21 +311,21 @@ │ │ │
│ │ ││ │ ├── ./usr/share/doc/python-scrapy-doc/html/topics/leaks.html │ │ │ @@ -382,15 +382,15 @@ │ │ │ │ │ │Dataclass objects¶
│ │ ││ │ ││ │ │ -New in version 2.2.
│ │ │dataclass()
allows defining item classes with field names, │ │ │ +
│ │ │dataclass()
allows defining item classes with field names, │ │ │ so that item exporters can export all fields by │ │ │ default even if the first scraped object does not have values for all of them.Additionally,
│ │ │dataclass
items also allow to:-
│ │ │
define the type and default value of each defined field.
│ │ │ -define custom field metadata through
dataclasses.field()
, which can be used to │ │ │ +define custom field metadata through
dataclasses.field()
, which can be used to │ │ │ customize serialization.
│ │ │
They work natively in Python 3.7 or later, or using the dataclasses │ │ │ backport in Python 3.6.
│ │ │Example:
│ │ ││ │ ││ │ │from dataclasses import dataclass │ │ │ │ │ │ @@ -341,24 +341,24 @@ │ │ │
│ │ ││ │ │ │ │ │attr.s objects¶
│ │ ││ │ ││ │ │ -New in version 2.2.
│ │ │attr.s()
allows defining item classes with field names, │ │ │ +
│ │ │attr.s()
allows defining item classes with field names, │ │ │ so that item exporters can export all fields by │ │ │ default even if the first scraped object does not have values for all of them.Additionally,
│ │ │attr.s
items also allow to:-
│ │ │
define the type and default value of each defined field.
│ │ │ -define custom field metadata, which can be used to │ │ │ +
define custom field metadata, which can be used to │ │ │ customize serialization.
│ │ │
In order to use this type, the attrs package needs to be installed.
│ │ │ +In order to use this type, the attrs package needs to be installed.
│ │ │Example:
│ │ ││ │ ││ │ │import attr │ │ │ │ │ │ @attr.s │ │ │ class CustomItem: │ │ │ one_field = attr.ib() │ │ │ another_field = attr.ib() │ │ │ @@ -406,15 +406,15 @@ │ │ │ documentation to see which metadata keys are used by each component. │ │ │
It’s important to note that the
│ │ │Field
objects used to declare the item │ │ │ do not stay assigned as class attributes. Instead, they can be accessed through │ │ │ theItem.fields
attribute.-
│ │ │
-
│ │ │ class
scrapy.item.
Field
([arg])[source]¶
│ │ │ - The
Field
class is just an alias to the built-indict
class and │ │ │ +The
│ │ │Field
class is just an alias to the built-indict
class and │ │ │ doesn’t provide any extra functionality or attributes. In other words, │ │ │Field
objects are plain-old Python dicts. A separate class is used │ │ │ to support the item declaration syntax │ │ │ based on class attributes.
│ │ │ @@ -424,15 +424,15 @@ │ │ │ attr.ib for additional information. │ │ ││ │ ││ │ ││ │ │Working with Item objects¶
│ │ │Here are some examples of common tasks performed with items, using the │ │ │
│ │ │ +notice the API is very similar to theProduct
item declared above. You will │ │ │ -notice the API is very similar to thedict
API.dict
API. │ │ ││ │ ││ │ │Creating items¶
│ │ ││ │ │ @@ -498,37 +498,37 @@ │ │ │ ... │ │ │ KeyError: 'Product does not support field: lala' │ │ ││ │ │>>> product = Product(name='Desktop PC', price=1000) │ │ │ >>> print(product) │ │ │ Product(name='Desktop PC', price=1000) │ │ │
│ │ ││ │ │Accessing all populated values¶
│ │ │ -To access all populated values, just use the typical
│ │ │ +dict
API:To access all populated values, just use the typical
│ │ │dict
API:│ │ ││ │ │>>> product.keys() │ │ │ ['price', 'name'] │ │ │
│ │ ││ │ │>>> product.items() │ │ │ [('price', 1000), ('name', 'Desktop PC')] │ │ │
│ │ ││ │ │ @@ -583,15 +583,15 @@ │ │ │ classCopying items¶
│ │ │To copy an item, you must first decide whether you want a shallow copy or a │ │ │ deep copy.
│ │ │ -If your item contains mutable values like lists or dictionaries, │ │ │ +
If your item contains mutable values like lists or dictionaries, │ │ │ a shallow copy will keep references to the same mutable values across all │ │ │ different copies.
│ │ │For example, if you have an item with a list of tags, and you create a shallow │ │ │ copy of that item, both the original item and the copy have the same list of │ │ │ tags. Adding a tag to the list of one of the items will add the tag to the │ │ │ other item as well.
│ │ │If that is not the desired behavior, use a deep copy instead.
│ │ │ -See
│ │ │ +copy
for more information.See
│ │ │copy
for more information.To create a shallow copy of an item, you can either call │ │ │
│ │ │copy()
on an existing item │ │ │ (product2 = product.copy()
) or instantiate your item class from an existing │ │ │ item (product2 = Product(product)
).To create a deep copy, call
│ │ │deepcopy()
instead │ │ │ (product2 = product.deepcopy()
).itemadapter.
ItemAdapter
(item: Any)[source]¶ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │Wrapper class to interact with data container objects. It provides a common interface │ │ │ to extract and set data without having to take the object’s type into account.
│ │ ││ │ ││ │ │Request serialization¶
│ │ │For persistence to work,
│ │ │Request
objects must be │ │ │ -serializable withpickle
, except for thecallback
anderrback
│ │ │ +serializable withpickle
, except for thecallback
anderrback
│ │ │ values passed to their__init__
method, which must be methods of the │ │ │ runningSpider
class.If you wish to log the requests that couldn’t be serialized, you can set the │ │ │
│ │ │SCHEDULER_DEBUG
setting toTrue
in the project’s settings page. │ │ │ It isFalse
by default.-
│ │ │
-
│ │ │
scrapy.utils.trackref.
print_live_refs
(class_name, ignore=NoneType)[source]¶
│ │ │ Print a report of live references, grouped by class name.
│ │ │ │ │ │
-
│ │ │
-
│ │ ├── ./usr/share/doc/python-scrapy-doc/html/topics/link-extractors.html
class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), deny_extensions=None, restrict_xpaths=(), restrict_css=(), tags=('a', 'area'), attrs=('href',), canonicalize=False, unique=True, process_value=None, strip=True)[source]¶

LxmlLinkExtractor is the recommended link extractor with handy filtering options. It is implemented using lxml’s robust HTMLParser.

Parameters

allow (str or list) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be extracted. If not given (or empty), it will match all links.

deny (str or list) – a single regular expression (or list of regular expressions) that the (absolute) urls must match in order to be excluded (i.e. not extracted). It has precedence over the allow parameter. If not given (or empty) it won’t exclude any links.

allow_domains (str or list) – a single value or a list of strings containing domains which will be considered for extracting the links

deny_domains (str or list) – a single value or a list of strings containing domains which won’t be considered for extracting the links

deny_extensions (list) – a single value or list of strings containing extensions that should be ignored when extracting links. If not given, it will default to scrapy.linkextractors.IGNORED_EXTENSIONS.

Changed in version 2.0: IGNORED_EXTENSIONS now includes 7z, 7zip, apk, bz2, cdr, dmg, ico, iso, tar, tar.gz, webm, and xz.

restrict_xpaths (str or list) – an XPath (or list of XPaths) which defines regions inside the response where links should be extracted from. If given, only the text selected by those XPaths will be scanned for links. See examples below.

restrict_css (str or list) – a CSS selector (or list of selectors) which defines regions inside the response where links should be extracted from. Has the same behaviour as restrict_xpaths.

restrict_text (str or list) – a single regular expression (or list of regular expressions) that the link’s text must match in order to be extracted. If not given (or empty), it will match all links. If a list of regular expressions is given, the link will be extracted if it matches at least one.

tags (str or list) – a tag or a list of tags to consider when extracting links. Defaults to ('a', 'area').

attrs (list) – an attribute or list of attributes which should be considered when looking for links to extract (only for those tags specified in the tags parameter). Defaults to ('href',)

canonicalize (bool) – canonicalize each extracted url (using w3lib.url.canonicalize_url). Defaults to False. Note that canonicalize_url is meant for duplicate checking; it can change the URL visible at server side, so the response can be different for requests with canonicalized and raw URLs. If you’re using LinkExtractor to follow links it is more robust to keep the default canonicalize=False.

unique (bool) – whether duplicate filtering should be applied to extracted links.

process_value (collections.abc.Callable) – a function which receives each value extracted from the tag and attributes scanned and can modify the value and return a new one, or return None to ignore the link altogether. If not given, process_value defaults to lambda x: x.

For example, to extract links from this code:

    <a href="javascript:goToPage('../other/page.html'); return false">Link text</a>

    def process_value(value):
        m = re.search("javascript:goToPage\('(.*?)'", value)
        if m:
            return m.group(1)

strip (bool) – whether to strip whitespaces from extracted attributes. According to the HTML5 standard, leading and trailing whitespaces must be stripped from href attributes of <a>, <area> and many other elements, the src attribute of <img>, <iframe> elements, etc., so LinkExtractor strips space chars by default. Set strip=False to turn it off (e.g. if you’re extracting urls from elements or attributes which allow leading/trailing whitespaces).
│ │ │By default, dataclass items require all fields to be │ │ │ passed when created. This could be an issue when using dataclass items with │ │ │ item loaders: unless a pre-populated item is passed to the loader, fields │ │ │ will be populated incrementally using the loader’s
│ │ │add_xpath()
, │ │ │add_css()
andadd_value()
methods.One approach to overcome this is to define items using the │ │ │ -
│ │ │ +field()
function, with adefault
argument:field()
function, with adefault
argument: │ │ │from dataclasses import dataclass, field │ │ │ from typing import Optional │ │ │ │ │ │ @dataclass │ │ │ class InventoryItem: │ │ │ name: Optional[str] = field(default=None) │ │ │ price: Optional[float] = field(default=None) │ │ │ @@ -581,15 +581,15 @@ │ │ │
add_css
(field_name, css, *processors, **kw)[source]¶ │ │ │Similar to
│ │ │ItemLoader.add_value()
but receives a CSS selector │ │ │ instead of a value, which is used to extract a list of unicode strings │ │ │ from the selector associated with thisItemLoader
.See
│ │ │get_css()
forkwargs
.-
│ │ │
- Parameters │ │ │ -
css (str) – the CSS selector to extract data from
│ │ │ + │ │ │css (str) – the CSS selector to extract data from
│ │ │
Examples:
│ │ ││ │ │# HTML snippet: <p class="product-name">Color TV</p> │ │ │ loader.add_css('name', 'p.product-name') │ │ │ # HTML snippet: <p id="price">the price is $1200</p> │ │ │ loader.add_css('price', 'p#price', re='the price is (.*)') │ │ │ @@ -624,15 +624,15 @@ │ │ │
add_xpath(field_name, xpath, *processors, **kw)[source]¶
Similar to ItemLoader.add_value() but receives an XPath instead of a value, which is used to extract a list of strings from the selector associated with this ItemLoader.
See get_xpath() for kwargs.
Parameters
    xpath (str) – the XPath to extract data from
Examples:

# HTML snippet: <p class="product-name">Color TV</p>
loader.add_xpath('name', '//p[@class="product-name"]')
# HTML snippet: <p id="price">the price is $1200</p>
loader.add_xpath('price', '//p[@id="price"]', re='the price is (.*)')
get_css(css, *processors, **kw)[source]¶
Similar to ItemLoader.get_value() but receives a CSS selector instead of a value, which is used to extract a list of unicode strings from the selector associated with this ItemLoader.
Parameters
    css (str) – the CSS selector to extract data from
    re (str or typing.Pattern) – a regular expression to use for extracting data from the selected CSS region
Examples:

# HTML snippet: <p class="product-name">Color TV</p>
loader.get_css('p.product-name')

get_value(value, *processors, **kw)[source]¶
Process the given value by the given processors and keyword arguments.
Available keyword arguments:
Parameters
    re (str or typing.Pattern) – a regular expression to use for extracting data from the given value using the extract_regex() method, applied before processors
Examples:

>>> from itemloaders import ItemLoader
>>> from itemloaders.processors import TakeFirst

get_xpath(xpath, *processors, **kw)[source]¶
Similar to ItemLoader.get_value() but receives an XPath instead of a value, which is used to extract a list of unicode strings from the selector associated with this ItemLoader.
Parameters
    xpath (str) – the XPath to extract data from
    re (str or typing.Pattern) – a regular expression to use for extracting data from the selected XPath region
Examples:

# HTML snippet: <p class="product-name">Color TV</p>
loader.get_xpath('//p[@class="product-name"]')
Logging¶

Note
scrapy.log has been deprecated alongside its functions in favor of explicit calls to the Python standard logging. Keep reading to learn more about the new logging system.

Scrapy uses logging for event logging. We'll provide some simple examples to get you started, but for more advanced use-cases it's strongly suggested to read thoroughly its documentation.
Logging works out of the box, and can be configured to some extent with the Scrapy settings listed in Logging settings.
Scrapy calls scrapy.utils.log.configure_logging() to set some reasonable defaults and handle those settings in Logging settings when running commands, so it's recommended to manually call it if you're running Scrapy from scripts. For example, to emit a message through a module-level logger:

logger = logging.getLogger(__name__)
logger.warning("This is a warning")
│ │ │Scrapy provides a
│ │ │logger
within each Spider │ │ │ @@ -370,16 +370,16 @@ │ │ │ messages will be displayed on the standard error. Lastly, if │ │ │LOG_ENABLED
isFalse
, there won’t be any visible log output.
│ │ │LOG_LEVEL
determines the minimum level of severity to display, those │ │ │ messages with lower severity will be filtered out. It ranges through the │ │ │ possible levels listed in Log levels.
│ │ │LOG_FORMAT
andLOG_DATEFORMAT
specify formatting strings │ │ │ used as layouts for all messages. Those strings can contain any placeholders │ │ │ -listed in logging’s logrecord attributes docs and │ │ │ -datetime’s strftime and strptime directives │ │ │ +listed in logging’s logrecord attributes docs and │ │ │ +datetime’s strftime and strptime directives │ │ │ respectively.If
│ │ │LOG_SHORT_NAMES
is set, then the logs will not display the Scrapy │ │ │ component that prints the log. It is unset by default, hence logs contain the │ │ │ Scrapy component responsible for that log output.│ │ ││ │ │Command-line options¶
│ │ │ @@ -401,15 +401,15 @@ │ │ │ │ │ │ │ │ │ │ │ │ │ │ ││ │ ││ │ │See also
│ │ │-
│ │ │ -
- Module
logging.handlers
Further documentation on available handlers
│ │ │ +- Module
logging.handlers
Further documentation on available handlers
│ │ │
│ │ │
│ │ ││ │ │Custom Log Formats¶
│ │ │A custom log format can be set for different actions by extending │ │ │ @@ -543,15 +543,15 @@ │ │ │
scrapy.utils.log.
configure_logging
(settings=None, install_root_handler=True)[source]¶ │ │ │Initialize logging defaults for Scrapy.
│ │ │-
│ │ │
- Parameters │ │ │
-
│ │ │
settings (dict,
Settings
object orNone
) – settings used to create and configure a handler for the │ │ │ root logger (default: None).
│ │ │ -install_root_handler (bool) – whether to install root logging handler │ │ │ +
install_root_handler (bool) – whether to install root logging handler │ │ │ (default: True)
│ │ │
│ │ │
This function does:
│ │ │-
│ │ │
Route warnings and twisted logging through Python standard logging
│ │ │ @@ -564,17 +564,17 @@
│ │ │ using
settings
argument. Whensettings
is empty or None, defaults │ │ │ are used. │ │ │
│ │ │configure_logging
is automatically called when using Scrapy commands │ │ │ orCrawlerProcess
, but needs to be called explicitly │ │ │ when running custom scripts usingCrawlerRunner
. │ │ │ In that case, its usage is not required but it’s recommended.Another option when running custom scripts is to manually configure the logging. │ │ │ -To do this you can use
│ │ │ +To do this you can uselogging.basicConfig()
to set a basic root handler.logging.basicConfig()
to set a basic root handler. │ │ │Note that
│ │ │CrawlerProcess
automatically callsconfigure_logging
, │ │ │ -so it is recommended to only uselogging.basicConfig()
together with │ │ │ +so it is recommended to only uselogging.basicConfig()
together with │ │ │CrawlerRunner
.This is an example on how to redirect
│ │ │INFO
or higher messages to a file:│ │ │ │ │ ││ │ │import logging │ │ │ │ │ │ logging.basicConfig( │ │ │ filename='log.txt', │ │ │ format='%(levelname)s: %(message)s', │ │ ├── ./usr/share/doc/python-scrapy-doc/html/topics/practices.html │ │ │ @@ -300,15 +300,15 @@ │ │ │ d = runner.crawl(MySpider) │ │ │ d.addBoth(lambda _: reactor.stop()) │ │ │ reactor.run() # the script will block here until the crawling is finished │ │ │
│ │ ││ │ │ │ │ │ -Running multiple spiders in the same process¶
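The lines just above are the tail of a run-from-a-script example. A fuller sketch of that pattern, combining configure_logging() with CrawlerRunner (the spider class and URL are placeholders for this sketch):

from twisted.internet import reactor
import scrapy
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://example.com/']

    def parse(self, response):
        yield {'title': response.css('title::text').get()}

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()  # the script will block here until the crawling is finished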
│ │ │By default, Scrapy runs a single spider per process when you run
│ │ ├── ./usr/share/doc/python-scrapy-doc/html/topics/request-response.html │ │ │ @@ -269,45 +269,45 @@ │ │ │ classscrapy │ │ │ crawl
. However, Scrapy supports running multiple spiders per process using │ │ │ the internal API.scrapy.http.
Request
(*args, **kwargs)[source]¶ │ │ │A
│ │ │Request
object represents an HTTP request, which is usually │ │ │ generated in the Spider and executed by the Downloader, and thus generating │ │ │ aResponse
.-
│ │ │
- Parameters │ │ │
-
│ │ │ -
url (str) –
the URL of this request
│ │ │ -If the URL is invalid, a
│ │ │ +ValueError
exception is raised.url (str) –
the URL of this request
│ │ │ +If the URL is invalid, a
│ │ │ValueError
exception is raised.
│ │ │ -callback (collections.abc.Callable) – the function that will be called with the response of this │ │ │ +
callback (collections.abc.Callable) – the function that will be called with the response of this │ │ │ request (once it’s downloaded) as its first parameter. For more information │ │ │ see Passing additional data to callback functions below. │ │ │ If a Request doesn’t specify a callback, the spider’s │ │ │
parse()
method will be used. │ │ │ Note that if exceptions are raised during processing, errback is called instead.
│ │ │ -method (str) – the HTTP method of this request. Defaults to
'GET'
.
│ │ │ -meta (dict) – the initial values for the
Request.meta
attribute. If │ │ │ +method (str) – the HTTP method of this request. Defaults to
'GET'
.
│ │ │ +meta (dict) – the initial values for the
Request.meta
attribute. If │ │ │ given, the dict passed in this parameter will be shallow copied.
│ │ │ -body (bytes or str) – the request body. If a string is passed, then it’s encoded as │ │ │ +
body (bytes or str) – the request body. If a string is passed, then it’s encoded as │ │ │ bytes using the
encoding
passed (which defaults toutf-8
). If │ │ │body
is not given, an empty bytes object is stored. Regardless of the │ │ │ type of this argument, the final value stored will be a bytes object │ │ │ (never a string orNone
).
│ │ │ -headers (dict) –
the headers of this request. The dict values can be strings │ │ │ +
headers (dict) –
the headers of this request. The dict values can be strings │ │ │ (for single valued headers) or lists (for multi-valued headers). If │ │ │
│ │ │None
is passed as value, the HTTP header will not be sent at all.│ │ │
│ │ ││ │ ││ │ │Caution
│ │ │Cookies set via the
│ │ │Cookie
header are not considered by the │ │ │ CookiesMiddleware. If you need to set cookies for a request, use the │ │ │Request.cookies
parameter. This is a known │ │ │ current limitation that is being worked on.
│ │ │ -the request cookies. These can be sent in two forms.
│ │ │ +the request cookies. These can be sent in two forms.
│ │ │-
│ │ │
Using a dict:
│ │ ││ │ ││ │ │request_with_cookies = Request(url="http://www.example.com", │ │ │ cookies={'currency': 'USD', 'country': 'UY'}) │ │ │
│ │ │ @@ -344,38 +344,38 @@
│ │ │
Caution
│ │ │Cookies set via the
│ │ │Cookie
header are not considered by the │ │ │ CookiesMiddleware. If you need to set cookies for a request, use the │ │ │Request.cookies
parameter. This is a known │ │ │ current limitation that is being worked on.
encoding (str) – the encoding of this request (defaults to
'utf-8'
). │ │ │ + │ │ │ -encoding (str) – the encoding of this request (defaults to
'utf-8'
). │ │ │ This encoding will be used to percent-encode the URL and to convert the │ │ │ body to bytes (if given as a string).priority (int) – the priority of this request (defaults to
0
). │ │ │ + │ │ │ -priority (int) – the priority of this request (defaults to
0
). │ │ │ The priority is used by the scheduler to define the order used to process │ │ │ requests. Requests with a higher priority value will execute earlier. │ │ │ Negative values are allowed in order to indicate relatively low-priority.dont_filter (bool) – indicates that this request should not be filtered by │ │ │ +
│ │ │ -dont_filter (bool) – indicates that this request should not be filtered by │ │ │ the scheduler. This is used when you want to perform an identical │ │ │ request multiple times, to ignore the duplicates filter. Use it with │ │ │ care, or you will get into crawling loops. Default to
False
.errback (collections.abc.Callable) –
a function that will be called if any exception was │ │ │ +
│ │ │ -errback (collections.abc.Callable) –
a function that will be called if any exception was │ │ │ raised while processing the request. This includes pages that failed │ │ │ with 404 HTTP errors and such. It receives a │ │ │
│ │ │Failure
as first parameter. │ │ │ For more information, │ │ │ see Using errbacks to catch exceptions in request processing below.│ │ ││ │ │Changed in version 2.0: The callback parameter is no longer required when the errback │ │ │ parameter is specified.
│ │ │ │ │ │ -flags (list) – Flags sent to the request, can be used for logging or similar purposes.
│ │ │ +cb_kwargs (dict) – A dict with arbitrary data that will be passed as keyword arguments to the Request’s callback.
│ │ │ +flags (list) – Flags sent to the request, can be used for logging or similar purposes.
│ │ │ │ │ │ │ │ │ │ │ │cb_kwargs (dict) – A dict with arbitrary data that will be passed as keyword arguments to the Request’s callback.
-
│ │ │
-
│ │ │
url
¶
│ │ │ A string containing the URL of this request. Keep in mind that this │ │ │ @@ -411,27 +411,27 @@ │ │ │
meta
¶ │ │ │A dict that contains arbitrary metadata for this request. This dict is │ │ │ empty for new Requests, and is usually populated by different Scrapy │ │ │ components (extensions, middlewares, etc). So the data contained in this │ │ │ dict depends on the extensions you have enabled.
│ │ │See Request.meta special keys for a list of special meta keys │ │ │ recognized by Scrapy.
│ │ │ -This dict is shallow copied when the request is │ │ │ +
This dict is shallow copied when the request is │ │ │ cloned using the
│ │ │copy()
orreplace()
methods, and can also be │ │ │ accessed, in your spider, from theresponse.meta
attribute.
-
│ │ │
-
│ │ │
cb_kwargs
¶
│ │ │ A dictionary that contains arbitrary metadata for this request. Its contents │ │ │ will be passed to the Request’s callback as keyword arguments. It is empty │ │ │ for new Requests, which means by default callbacks only get a
│ │ │ -Response
│ │ │ object as argument.This dict is shallow copied when the request is │ │ │ +
This dict is shallow copied when the request is │ │ │ cloned using the
│ │ │copy()
orreplace()
methods, and can also be │ │ │ accessed, in your spider, from theresponse.cb_kwargs
attribute.In case of a failure to process the request, this dict can be accessed as │ │ │
│ │ │failure.request.cb_kwargs
in the request’s errback. For more information, │ │ │ see Accessing additional data in errback functions.
- │ │ │ class
│ │ │scrapy.http.
FormRequest
(url[, formdata, ...])[source]¶The
│ │ │FormRequest
class adds a new keyword parameter to the__init__
method. The │ │ │ remaining arguments are the same as for theRequest
class and are │ │ │ not documented here.-
│ │ │
- Parameters │ │ │ -
formdata (dict or collections.abc.Iterable) – is a dictionary (or iterable of (key, value) tuples) │ │ │ +
│ │ │formdata (dict or collections.abc.Iterable) – is a dictionary (or iterable of (key, value) tuples) │ │ │ containing HTML Form data which will be url-encoded and assigned to the │ │ │ body of the request.
│ │ │
The
│ │ │FormRequest
objects support the following class method in │ │ │ addition to the standardRequest
methods:-
│ │ │ @@ -752,31 +752,31 @@
│ │ │ bug in lxml, which should be fixed in lxml 3.8 and above.
│ │ │
-
│ │ │
- Parameters │ │ │
-
│ │ │
response (
Response
object) – the response containing a HTML form which will be used │ │ │ to pre-populate the form fields
│ │ │ -formname (str) – if given, the form with name attribute set to this value will be used.
│ │ │ -formid (str) – if given, the form with id attribute set to this value will be used.
│ │ │ -formxpath (str) – if given, the first form that matches the xpath will be used.
│ │ │ -formcss (str) – if given, the first form that matches the css selector will be used.
│ │ │ -formnumber (int) – the number of form to use, when the response contains │ │ │ +
formname (str) – if given, the form with name attribute set to this value will be used.
│ │ │ +formid (str) – if given, the form with id attribute set to this value will be used.
│ │ │ +formxpath (str) – if given, the first form that matches the xpath will be used.
│ │ │ +formcss (str) – if given, the first form that matches the css selector will be used.
│ │ │ +formnumber (int) – the number of form to use, when the response contains │ │ │ multiple forms. The first one (and also the default) is
0
.
│ │ │ -formdata (dict) – fields to override in the form data. If a field was │ │ │ +
formdata (dict) – fields to override in the form data. If a field was │ │ │ already present in the response
<form>
element, its value is │ │ │ overridden by the one passed in this parameter. If a value passed in │ │ │ this parameter isNone
, the field will not be included in the │ │ │ request, even if it was present in the response<form>
element.
│ │ │ -clickdata (dict) – attributes to lookup the control clicked. If it’s not │ │ │ +
clickdata (dict) – attributes to lookup the control clicked. If it’s not │ │ │ given, the form data will be submitted simulating a click on the │ │ │ first clickable element. In addition to html attributes, the control │ │ │ can be identified by its zero-based index relative to other │ │ │ submittable inputs inside the form, via the
nr
attribute.
│ │ │ -dont_click (bool) – If True, the form data will be submitted without │ │ │ +
dont_click (bool) – If True, the form data will be submitted without │ │ │ clicking in any element.
│ │ │
│ │ │
The other parameters of this class method are passed directly to the │ │ │
│ │ │FormRequest
__init__
method.Request
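For instance, from_response() is commonly used to simulate a form login; a minimal sketch (the URL, field names and failure check are placeholders for this sketch):

import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        # pre-populate the form from the response and override two fields
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        if 'authentication failed' in response.text:
            self.logger.error('Login failed')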
class and are │ │ │ not documented here. │ │ │Using the
│ │ │JsonRequest
will set theContent-Type
header toapplication/json
│ │ │ andAccept
header toapplication/json, text/javascript, */*; q=0.01
-
│ │ │
- Parameters │ │ │
-
│ │ │ -
data (object) – is any JSON serializable object that needs to be JSON encoded and assigned to body. │ │ │ +
data (object) – is any JSON serializable object that needs to be JSON encoded and assigned to body. │ │ │ if
Request.body
argument is provided this parameter will be ignored. │ │ │ ifRequest.body
argument is not provided and data argument is providedRequest.method
will be │ │ │ set to'POST'
automatically.
│ │ │ -dumps_kwargs (dict) – Parameters that will be passed to underlying
json.dumps()
method which is used to serialize │ │ │ +dumps_kwargs (dict) – Parameters that will be passed to underlying
json.dumps()
method which is used to serialize │ │ │ data into JSON format.
│ │ │
class scrapy.http.Response(*args, **kwargs)[source]¶
A Response object represents an HTTP response, which is usually downloaded (by the Downloader) and fed to the Spiders for processing.
Parameters
    url (str) – the URL of this response
    status (int) – the HTTP status of the response. Defaults to 200.
    headers (dict) – the headers of this response. The dict values can be strings (for single valued headers) or lists (for multi-valued headers).
    body (bytes) – the response body. To access the decoded text as a string, use response.text from an encoding-aware Response subclass, such as TextResponse.
    flags (list) – is a list containing the initial values for the Response.flags attribute. If given, the list will be shallow copied.
    request (scrapy.http.Request) – the initial value of the Response.request attribute. This represents the Request that generated this response.
    certificate (twisted.internet.ssl.Certificate) – an object representing the server's SSL certificate.
    ip_address (ipaddress.IPv4Address or ipaddress.IPv6Address) – The IP address of the server from which the Response originated.
New in version 2.1.0: The ip_address parameter.

urljoin(url)[source]¶
Constructs an absolute url by combining the Response's url with a possible relative url.
This is a wrapper over urljoin(), it's merely an alias for making this call:

urllib.parse.urljoin(response.url, url)

New in version 2.0: The flags parameter.

follow_all(urls, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None) → Generator[scrapy.http.request.Request, None, None][source]¶
New in version 2.0.
Return an iterable of Request instances to follow all links in urls. It accepts the same arguments as the Request.__init__ method, but elements of urls can be relative URLs or Link objects, not only absolute URLs.

TextResponse objects add encoding capabilities to the base Response class, which is meant to be used only for binary data, such as images, sounds or any media file.
TextResponse objects support a new __init__ method argument, in addition to the base Response objects. The remaining functionality is the same as for the Response class and is not documented here.
Parameters
    encoding (str) – is a string which contains the encoding to use for this response. If you create a TextResponse object with a string as body, it will be converted to bytes encoded using this encoding. If encoding is None (default), the encoding will be looked up in the response headers and body instead.
TextResponse objects support the following attributes in addition…

response.xpath('//img/@src')[0]

See A shortcut for creating Requests for usage examples.

follow_all(urls=None, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding=None, priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None, css=None, xpath=None) → Generator[scrapy.http.request.Request, None, None][source]¶
A generator that produces Request instances to follow all links in urls. It accepts the same arguments as the Request's __init__ method, except that each urls element does not need to be an absolute URL, it can be any of the following:
    a relative URL
    a Link object, e.g. the result of…
BeautifulSoup is a very popular web scraping library among Python programmers which constructs a Python object based on the structure of the HTML code and also deals with bad markup reasonably well, but it has one drawback: it's slow.
lxml is an XML parsing library (which also parses HTML) with a pythonic API based on ElementTree. (lxml is not part of the Python standard library.)
Scrapy comes with its own mechanism for extracting data. They're called selectors because they "select" certain parts of the HTML document specified either by XPath or CSS expressions.
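A minimal sketch of a selector in action (the HTML string is a made-up example):

from scrapy.selector import Selector

body = '<html><body><span>good</span></body></html>'
print(Selector(text=body).xpath('//span/text()').get())   # good
print(Selector(text=body).css('span::text').get())        # good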
XPath is a language for selecting nodes in XML documents, which can also be used with HTML.
For a list of available built-in settings see: Built-in settings reference.

Designating the settings¶
When you use Scrapy, you have to tell it which settings you're using. You can do this by using an environment variable, SCRAPY_SETTINGS_MODULE.
The value of SCRAPY_SETTINGS_MODULE should be in Python path syntax, e.g. myproject.settings. Note that the settings module should be on the Python import search path.

Populating the settings¶
Settings can be populated using different mechanisms, each of which having a different precedence. Here is the list of them in decreasing order of precedence:
If the asyncio reactor is enabled (see TWISTED_REACTOR) this setting can be used to specify the asyncio event loop to be used with it. Set the setting to the import path of the desired asyncio event loop class. If the setting is set to None the default asyncio event loop will be used.
If you are installing the asyncio reactor manually using the install_reactor() function, you can use the event_loop_path parameter to indicate the import path of the event loop class to be used.
Note that the event loop class must inherit from asyncio.AbstractEventLoop.

Caution
Please be aware that, when using a non-default event loop (either defined via ASYNCIO_EVENT_LOOP or installed with install_reactor()), Scrapy will call asyncio.set_event_loop(), which will set the specified event loop as the current loop for the current OS thread.
│ │ │Default:
│ │ │'scrapybot'
The name of the bot implemented by this Scrapy project (also known as the │ │ │ @@ -1020,23 +1020,23 @@ │ │ │
Default:
│ │ │None
File name to use for logging output. If
│ │ │None
, standard error will be used.│ │ ││ │ │LOG_FORMAT¶
│ │ │Default:
│ │ │'%(asctime)s [%(name)s] %(levelname)s: %(message)s'
String for formatting log messages. Refer to the │ │ │ -Python logging documentation for the qwhole │ │ │ +Python logging documentation for the qwhole │ │ │ list of available placeholders.
│ │ ││ │ ││ │ │LOG_DATEFORMAT¶
│ │ │Default:
│ │ │'%Y-%m-%d %H:%M:%S'
String for formatting date/time, expansion of the
│ │ │%(asctime)s
placeholder │ │ │ inLOG_FORMAT
. Refer to the │ │ │ -Python datetime documentation for the │ │ │ +Python datetime documentation for the │ │ │ whole list of available directives.│ │ ││ │ │ @@ -1393,18 +1393,18 @@ │ │ │ import path. Also installs the asyncio event loop with the specified import │ │ │ path if the asyncio reactor is enabled │ │ │LOG_FORMATTER¶
│ │ │Default:
│ │ │scrapy.logformatter.LogFormatter
The class to use for formatting log messages for different actions.
│ │ │
follow_all
(urls=None, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding=None, priority=0, dont_filter=False, errback=None, cb_kwargs=None, flags=None, css=None, xpath=None) → Generator[scrapy.http.request.Request, None, None][source]¶ │ │ │If a reactor is already installed, │ │ │
│ │ │install_reactor()
has no effect.
│ │ │CrawlerRunner.__init__
raises │ │ │ -Exception
if the installed reactor does not match the │ │ │ +Exception
if the installed reactor does not match the │ │ │TWISTED_REACTOR
setting; therfore, having top-level │ │ │reactor
imports in project files and imported │ │ │ -third-party libraries will make Scrapy raiseException
when │ │ │ +third-party libraries will make Scrapy raiseException
when │ │ │ it checks which reactor is installed.In order to use the reactor installed by Scrapy:
│ │ ││ │ │ -│ │ │import scrapy │ │ │ from twisted.internet import reactor │ │ │ │ │ │ │ │ │ class QuotesSpider(scrapy.Spider): │ │ │ @@ -1425,15 +1425,15 @@ │ │ │ for quote in response.css('div.quote'): │ │ │ yield {'text': quote.css('span.text::text').get()} │ │ │ │ │ │ def stop(self): │ │ │ self.crawler.engine.close_spider(self, 'timeout') │ │ │
which raises
│ │ │ +Exception
, becomes:which raises
│ │ │Exception
, becomes:│ │ ││ │ │import scrapy │ │ │ │ │ │ │ │ │ class QuotesSpider(scrapy.Spider): │ │ │ name = 'quotes' │ │ │ │ │ │ def __init__(self, *args, **kwargs): │ │ │ @@ -1457,15 +1457,15 @@ │ │ │
The default value of the TWISTED_REACTOR setting is None, which means that Scrapy will not attempt to install any specific reactor, and the default reactor defined by Twisted for the current platform will be used. This is to maintain backward compatibility and avoid possible problems caused by using a non-default reactor.
For additional information, see Choosing a Reactor and GUI Toolkit Integration.
│ │ │Default:
│ │ │2083
Scope:
│ │ │spidermiddlewares.urllength
The maximum URL length to allow for crawled URLs. For more information about │ │ │ the default value for this setting see: https://boutell.com/newfaq/misc/urllength.html
│ │ ├── ./usr/share/doc/python-scrapy-doc/html/topics/signals.html │ │ │ @@ -452,15 +452,15 @@ │ │ │ │ │ │ │ │ │Sent after a spider has been closed. This can be used to release per-spider │ │ │ resources reserved on
This signal supports returning deferreds from its handlers.
Parameters
    spider (Spider object) – the spider which has been closed
    reason (str) – a string which describes the reason why the spider was closed. If it was closed because the spider has completed scraping, the reason is 'finished'. Otherwise, if the spider was manually closed by calling the close_spider engine method, then the reason is the one passed in the reason argument of that method (which defaults to 'cancelled'). If the engine was shutdown (for example, by hitting Ctrl-C to stop it) the reason will be 'shutdown'.
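A minimal sketch of connecting a handler to this signal from inside a spider (the spider name and log message are placeholders for this sketch):

import scrapy
from scrapy import signals

class SignalSpider(scrapy.Spider):
    name = 'signal_example'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super().from_crawler(crawler, *args, **kwargs)
        crawler.signals.connect(spider.handle_spider_closed, signal=signals.spider_closed)
        return spider

    def handle_spider_closed(self, spider, reason):
        spider.logger.info('closed %s: %s', spider.name, reason)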
Parameters
    crawler (Crawler instance) – crawler to which the spider will be bound
    args (list) – arguments passed to the __init__() method
    kwargs (dict) – keyword arguments passed to the __init__() method
Nonetheless, this method sets the crawler and settings attributes in the new instance so they can be accessed later inside the spider's code.

yield scrapy.Request(f'http://www.example.com/categories/{self.category}')

Keep in mind that spider arguments are only strings. The spider will not do any parsing on its own. If you were to set the start_urls attribute from the command line, you would have to parse it on your own into a list using something like ast.literal_eval() or json.loads() and then set it as an attribute (see the sketch below). Otherwise, you would cause iteration over a start_urls string (a very common Python pitfall) resulting in each character being seen as a separate url.
A valid use case is to set the http auth credentials used by HttpAuthMiddleware or the user agent…
│ │ │print a report of the engine status
│ │ │ │ │ │prefs
│ │ │for memory debugging (see Debugging memory leaks)
│ │ │ │ │ │ -p
│ │ │ +a shortcut to the
pprint.pprint()
function │ │ │a shortcut to the
pprint.pprint()
function │ │ │ │ │ │ │ │ │ │ │ │hpy
│ │ │for memory debugging (see Debugging memory leaks)
scrapy.extensions.telnet.
update_telnet_vars
(telnet_vars)¶ │ │ │ │ │ │ │ │ │Sent just before the telnet console is opened. You can hook up to this │ │ │ signal to add, remove or update the variables that will be available in the │ │ │ telnet local namespace. In order to do that, you need to update the │ │ │
│ │ │ │ │ │telnet_vars
dict in your handler.│ │ │Telnet settings¶
-
│ │ │ class
-
│ │ │ class
-
│ │ │
Backward-incompatible changes¶

Deprecation removals¶

Bug fixes¶
…TypeError exception (issue 4410)

Quality assurance¶

Scrapy 2.0.0 (2020-03-03)¶
Highlights:
…next() function or MutableChain.__next__ instead (issue 4153)

New features¶
…(issue 374, issue 3986, issue 3989, issue 4176, issue 4188)

Documentation¶
…(issue 4152, issue 4169)
…TypeError exception (issue 4007, issue 4052)
…ChainMap instead. (issue 3878)

Other changes¶
…will be used and the priority parameter ignored. This allows inserting/updating settings with different priorities with a single command.

Event-driven networking¶
Scrapy is written with Twisted, a popular event-driven networking framework for Python. Thus, it's implemented using non-blocking (aka asynchronous) code for concurrency.
For more information about asynchronous programming and Twisted see these links:
│ │ │New in version 2.0.
│ │ │Scrapy has partial support asyncio
. After you install the asyncio
│ │ │ -reactor, you may use asyncio
and
│ │ │ -asyncio
-powered libraries in any coroutine.
Scrapy has partial support asyncio
. After you install the asyncio
│ │ │ +reactor, you may use asyncio
and
│ │ │ +asyncio
-powered libraries in any coroutine.
Warning
│ │ │ -asyncio
support in Scrapy is experimental. Future Scrapy
│ │ │ +
asyncio
support in Scrapy is experimental. Future Scrapy
│ │ │ versions may introduce related changes without a deprecation
│ │ │ period or warning.
Installing the asyncio reactor¶
│ │ │ -To enable asyncio
support, set the TWISTED_REACTOR
setting to
│ │ │ +
To enable asyncio
support, set the TWISTED_REACTOR
setting to
│ │ │ 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'
.
If you are using CrawlerRunner
, you also need to
│ │ │ install the AsyncioSelectorReactor
│ │ │ reactor manually. You can do that using
│ │ │ install_reactor()
:
install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor')
│ │ │
Detecting check runs¶
When scrapy check is running, the SCRAPY_CHECK environment variable is set to the true string. You can use os.environ to perform any change to your spiders or your settings when scrapy check is used:

import os
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
Coroutines¶
New in version 2.0.
Scrapy has partial support for the coroutine syntax.

Supported callables¶
The following callables may be defined as coroutines using async def, and hence use coroutine syntax (e.g. await, async for, async with):

Coroutines may be used to call asynchronous code. This includes other coroutines, functions that return Deferreds and functions that return awaitable objects such as Future. This means you can use many useful Python libraries providing such code:

import treq
from scrapy import Spider

class MySpider(Spider):
    # ...
    async def parse_with_deferred(self, response):
        additional_response = await treq.get('https://additional.url')
        additional_data = await treq.content(additional_response)
        # ... use response and additional_data to yield items and requests

…
        additional_data = await r.text()
        # ... use response and additional_data to yield items and requests
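The truncated snippet just above uses an asyncio-powered HTTP client; a fuller sketch of that pattern with aiohttp (assuming the asyncio reactor is enabled and aiohttp is installed; the URL and item are placeholders for this sketch):

import aiohttp
import scrapy

class AsyncSpider(scrapy.Spider):
    name = 'async_example'

    async def parse(self, response):
        async with aiohttp.ClientSession() as session:
            async with session.get('https://additional.url') as r:
                additional_data = await r.text()
        # ... use response and additional_data to yield items and requests
        yield {'extra_length': len(additional_data)}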
│ │ │ Note
│ │ │ Many libraries that use coroutines, such as aio-libs, require the
│ │ │ -asyncio
loop and to use them you need to
│ │ │ +asyncio
loop and to use them you need to
│ │ │ enable asyncio support in Scrapy.
│ │ │
│ │ │ Common use cases for asynchronous code include:
│ │ │
│ │ │
│ │ │
│ │ │
│ │ │ DBM storage backend¶
│ │ │
│ │ │
│ │ │
│ │ │
│ │ │
│ │ │ Writing your own storage backend¶
│ │ │ You can implement a cache storage backend by creating a Python class that
│ │ │ @@ -876,15 +876,15 @@
│ │ │
│ │ │ HttpProxyMiddleware¶
│ │ │
│ │ │
│ │ │
│ │ │ RobotFileParser¶
│ │ │ -Based on RobotFileParser
:
│ │ │ +Based on RobotFileParser
:
│ │ │
│ │ │
│ │ │ It is faster than Protego and backward-compatible with versions of Scrapy before 1.8.0.
│ │ │ @@ -1149,31 +1149,31 @@
│ │ │
│ │ │
│ │ │ Handling different response formats¶
│ │ │ Once you have a response with the desired data, how you extract the desired
│ │ │ data from it depends on the type of response:
│ │ │
│ │ │
│ │ │
│ │ │
│ │ │ Once you have a string with the JavaScript code, you can extract the desired
│ │ │ data from it:
│ │ │
│ │ │ -
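For instance, a minimal sketch of pulling a JSON object out of inline JavaScript with the standard library (the variable name and pattern are illustrative assumptions; dedicated libraries such as js2xml or chompjs handle more complex cases):

import json
import re

javascript = 'var data = {"field": "value"};'
match = re.search(r'var data = (\{.*?\});', javascript)
if match:
    data = json.loads(match.group(1))   # {'field': 'value'}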