* fix typo in settings.yml: replace suspend_times with suspended_times
* always use delay defined in settings.yml:
* HTTP status 402 and 403: read the value from settings.yml instead of using the hardcoded value of 1 day.
* startpage engine: a CAPTCHA suspends the engine for one day instead of one week
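A minimal sketch of the lookup described above, assuming the suspension times
are stored in a `suspended_times` mapping in settings.yml. The key names, the
values and the helper `get_suspended_time` are illustrative assumptions, not
SearXNG's actual API.

```python
# Hypothetical excerpt of the parsed settings.yml (key names are assumptions).
SETTINGS = {
    "search": {
        "suspended_times": {
            "access_denied": 86400,      # HTTP status 402 and 403
            "captcha": 86400,            # e.g. a startpage CAPTCHA: one day
            "too_many_requests": 3600,
        }
    }
}

DEFAULT_SUSPENDED_TIME = 86400  # fallback when a key is missing (assumption)


def get_suspended_time(reason, settings=SETTINGS):
    """Return the suspension time in seconds configured for ``reason``."""
    times = settings.get("search", {}).get("suspended_times", {})
    return times.get(reason, DEFAULT_SUSPENDED_TIME)
```

The point of the sketch is that no engine carries its own hardcoded delay; the
value always comes from the settings, with a single fallback.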
This patch hardens the parsing of the Bing response:
1. To fix [2087], check if the selected result item contains a link; otherwise
skip the result item and continue in the result loop. Increment the result
pointer only when a result has been added: the enumerate counter is no
longer valid when result items are skipped.
To test the bugfix use: ``!bi :all cerbot``
2. Limit the XPath selection of result items to direct children nodes (list
items ``li``) of the ordered list (``ol``).
To test the selector use: ``!bi :en pontiac aztek wiki``
In the result list you should find the Wikipedia entry on top,
compare [2068].
[2087] https://github.com/searxng/searxng/issues/2087
[2068] https://github.com/searxng/searxng/issues/2068
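The two points above can be sketched as follows. This is a simplified
illustration using the standard library's ElementTree on made-up markup, not
the engine's real code or Bing's real HTML; `parse_results` and the sample
markup are assumptions.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for a result page (not Bing's real markup).
MARKUP = """
<ol id="b_results">
  <li><h2><a href="https://example.org/a">A</a></h2></li>
  <li><h2>item without a link</h2></li>
  <li><h2><a href="https://example.org/b">B</a></h2>
    <ol><li><h2><a href="https://example.org/nested">nested</a></h2></li></ol>
  </li>
</ol>
"""


def parse_results(markup):
    root = ET.fromstring(markup.strip())
    results = []
    position = 0  # counts only the results that were actually added
    # "./li" selects direct children of the <ol> only, nested items are ignored
    for item in root.findall("./li"):
        link = item.find("./h2/a")
        if link is None:
            continue  # skip result items that do not contain a link
        position += 1
        results.append(
            {"position": position, "url": link.get("href"), "title": link.text}
        )
    return results
```

Using a separate counter instead of `enumerate` keeps the positions contiguous
even when items are skipped.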
Modify the XPath selector to get the wikipedia result plus small fixes.
About result content: especially with the Wikipedia result, we'd get several
paragraph elements, only the first paragraph would be taken and displayed on the
search result
- fix issue reported in #1809
- filter out `None` value from issn and isbn list
- add comments (from publicationName)
- add publisher
Closes: https://github.com/searxng/searxng/issues/1809
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Some result items from core.ac.uk do not have a URL::
Traceback (most recent call last):
File "searx/search/processors/online.py", line 154, in search
search_results = self._search_basic(query, params)
File "searx/search/processors/online.py", line 142, in _search_basic
return self.engine.response(response)
File "SearXNG/searx/engines/core.py", line 73, in response
'url': source['urls'][0].replace('http://', 'https://', 1),
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
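A hedged sketch of one way to guard against result items without a URL. The
function `build_results` and the `title` field are assumptions for
illustration; only the `urls` access and the http-to-https replacement come from
the traceback above.

```python
def build_results(sources):
    """Build result dicts, skipping items that have no URL."""
    results = []
    for source in sources:
        urls = source.get("urls") or []  # missing key, None or empty list
        if not urls:
            continue
        results.append(
            {
                "url": urls[0].replace("http://", "https://", 1),
                "title": source.get("title", ""),
            }
        )
    return results
```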
Google News is being reworked; the content area of a news item has been
removed.
Closes: https://github.com/searxng/searxng/issues/1790
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Fix::
searx/locales.py:docstring of searx.locales.get_engine_locale:17: \
WARNING: Definition list ends without a blank line; unexpected unindent.
Improvement: don't show default values in the generated documentation when they
are more of a mess than useful information (`:meta hide-value:`).
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
no_result_for_http_status contains a list of HTTP status codes.
Responses with one of these status codes are treated as an empty result list.
In all other cases an exception is raised as usual.
Previously raise_for_httperror ignored all HTTP errors,
which made defective engines invisible in the stats.
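A minimal sketch of the behaviour described above. `EngineHTTPError` and
`check_status` are stand-ins invented for this example, not SearXNG's actual
classes or functions.

```python
class EngineHTTPError(Exception):
    """Stand-in for the exception raised on an HTTP error status."""


def check_status(status_code, no_result_for_http_status=()):
    """Return False if the response counts as an empty result list,
    True if it should be parsed; raise for any other HTTP error."""
    if status_code in no_result_for_http_status:
        return False  # treated as "no results", not as an engine error
    if status_code >= 400:
        raise EngineHTTPError(f"HTTP status {status_code}")
    return True
```

With this split, an engine that answers 404 for "nothing found" can list 404,
while a 500 still shows up as an error in the stats.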
The request function should not request a language (aka locale) that is not
supported by qwant. Selecting a locale like zh-TW results in an API error from qwant:
ERROR searx.engines.qwant news: exception : \
API error::locale must be one of the following values: \
en_gb, en_ie, en_us, en_ca, en_my, en_au, en_nz, de_de, de_ch, de_at, fr_fr, \
fr_be, fr_ch, fr_ca, fr_ad, fc_ca, co_fr, es_es, es_ar, es_cl, es_co, es_mx, \
es_pe, es_ad, ca_es, ca_ad, ca_fr, eu_es, eu_fr, it_it, it_ch, pt_pt, pt_ad, \
nl_be, nl_nl
The existing searx.utils.match_language function is unsuitable for this purpose;
it is replaced by the function searx.locales.get_engine_locale, which is based on
methods from the babel package.
qwant's _fetch_supported_languages function has been revised to filter out
languages (aka locales) not supported by qwant.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
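To illustrate the idea of never sending an unsupported locale, here is a
deliberately simplified fallback matcher. It does NOT reproduce the babel-based
logic of searx.locales.get_engine_locale; `pick_engine_locale`, the shortened
`SUPPORTED` set and the default are assumptions made for this sketch.

```python
# Shortened, illustrative subset of the locales listed in the API error above.
SUPPORTED = {"en_gb", "en_us", "de_de", "de_ch", "fr_fr", "fr_ca", "es_es"}


def pick_engine_locale(searxng_locale, supported=SUPPORTED, default="en_us"):
    """Map a SearXNG locale such as 'de-CH', 'fr' or 'zh-TW' to a supported one."""
    parts = searxng_locale.replace("-", "_").lower().split("_")
    lang = parts[0]
    # 1. exact language + region match
    if len(parts) > 1 and f"{lang}_{parts[1]}" in supported:
        return f"{lang}_{parts[1]}"
    # 2. the "home" region of the language, e.g. de -> de_de
    if f"{lang}_{lang}" in supported:
        return f"{lang}_{lang}"
    # 3. any supported region of the same language (first in sorted order)
    same_lang = sorted(loc for loc in supported if loc.startswith(lang + "_"))
    if same_lang:
        return same_lang[0]
    # 4. nothing fits: fall back to the default instead of sending an invalid value
    return default
```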
By using the new property `qwant_categ:`, the category of qwant is no longer bound to
the category of SearXNG.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
This implements the DeepL translation engine. It works much like lingva but
talks directly to the DeepL API. This API only needs a to-lang; from-lang is a
placeholder for now.
There is a free option to use [1].
[1] https://www.deepl.com/pro-api?cta=header-pro-api for registering a free account.
Most engines that support languages (and regions) use the Accept-Language from
the web browser to build a response that fits the language (and region).
- add new engine option: send_accept_language_header
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
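A hedged sketch of how such an option could translate into a request header.
The helper `build_headers` and the exact q-values are assumptions for
illustration; only the option name send_accept_language_header comes from this
commit.

```python
def build_headers(engine_options, language, region=None):
    """Return request headers, adding Accept-Language only if the engine opts in."""
    headers = {}
    if engine_options.get("send_accept_language_header"):
        if region:
            value = f"{language}-{region},{language};q=0.9,*;q=0.5"
        else:
            value = f"{language},*;q=0.5"
        headers["Accept-Language"] = value
    return headers
```

For example, `build_headers({"send_accept_language_header": True}, "de", "CH")`
yields a header that prefers de-CH, then de, then anything else.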
The errors make pyright usage useless since a new error won't be seen [1].
[1] https://github.com/searxng/searxng/pull/1569
```
searx/compat.py:11:27 - error: Expression of type "Type[cached_property[_T@cached_property]]" cannot be assigned to declared type "Type[cached_property]"
"Type[cached_property[_T@cached_property]]" is incompatible with "Type[cached_property]"
Type "Type[cached_property[_T@cached_property]]" cannot be assigned to type "Type[cached_property]" (reportGeneralTypeIssues)
searx/utils.py:69:36 - error: Expression of type "None" cannot be assigned to parameter of type "str"
Type "None" cannot be assigned to type "str" (reportGeneralTypeIssues)
searx/utils.py:573:85 - error: Expression of type "None" cannot be assigned to parameter of type "int"
Type "None" cannot be assigned to type "int" (reportGeneralTypeIssues)
searx/webapp.py:1306:22 - error: Argument of type "str" cannot be assigned to parameter "__a" of type "BytesPath" in function "join"
Type "str" cannot be assigned to type "BytesPath"
"str" is incompatible with "bytes"
"str" is incompatible with protocol "PathLike[bytes]"
"__fspath__" is not present (reportGeneralTypeIssues)
searx/webapp.py:1306:68 - error: Argument of type "Literal['themes']" cannot be assigned to parameter "paths" of type "BytesPath" in function "join"
Type "Literal['themes']" cannot be assigned to type "BytesPath"
"Literal['themes']" is incompatible with "bytes"
"Literal['themes']" is incompatible with protocol "PathLike[bytes]"
"__fspath__" is not present (reportGeneralTypeIssues)
searx/webapp.py:1306:78 - error: Argument of type "str | Any | None" cannot be assigned to parameter "paths" of type "BytesPath" in function "join"
Type "str | Any | None" cannot be assigned to type "BytesPath"
Type "str" cannot be assigned to type "BytesPath"
"str" is incompatible with "bytes"
"str" is incompatible with protocol "PathLike[bytes]"
"__fspath__" is not present (reportGeneralTypeIssues)
searx/webapp.py:1306:85 - error: Argument of type "Literal['img']" cannot be assigned to parameter "paths" of type "BytesPath" in function "join"
Type "Literal['img']" cannot be assigned to type "BytesPath"
"Literal['img']" is incompatible with "bytes"
"Literal['img']" is incompatible with protocol "PathLike[bytes]"
"__fspath__" is not present (reportGeneralTypeIssues)
searx/engines/mongodb.py:8:6 - warning: Import "pymongo" could not be resolved (reportMissingImports)
searx/engines/mysql_server.py:9:8 - warning: Import "mysql.connector" could not be resolved (reportMissingImports)
searx/engines/postgresql.py:9:8 - warning: Import "psycopg2" could not be resolved from source (reportMissingModuleSource)
searx/engines/xpath.py:187:28 - warning: "categories" is not defined (reportUndefinedVariable)
searx/search/__init__.py:184:82 - warning: "flask" is not defined (reportUndefinedVariable)
searx/search/checker/background.py:19:26 - error: Type of "schedule" is partially unknown
Type of "schedule" is "(delay: Any, func: Any, *args: Any) -> Literal[True]" (reportUnknownVariableType)
searx/shared/__init__.py:8:12 - warning: Import "uwsgi" could not be resolved (reportMissingImports)
searx/shared/shared_uwsgi.py:5:8 - warning: Import "uwsgi" could not be resolved (reportMissingImports)
```
The engine name is not only a *name*; it's also an identifier that is used in
logs, HTTP headers and more. Unicode characters in the name of an engine could
cause various issues.
Closes: https://github.com/searxng/searxng/issues/1544
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
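A minimal sketch of such a check, assuming that "no Unicode characters" means
the name must be plain ASCII. `validate_engine_name` is a hypothetical helper,
not the function used in SearXNG.

```python
def validate_engine_name(name):
    """Raise ValueError if the engine name is empty or contains non-ASCII characters."""
    if not name or not name.isascii():
        raise ValueError(f"invalid engine name: {name!r} (ASCII only)")
    return name
```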
Emojipedia is an emoji reference website which documents the meaning and
common usage of emoji characters in the Unicode Standard. It is owned by Zedge
since 2021. Emojipedia is a voting member of The Unicode Consortium.[1]
Cherry picked from @james-still [2][3] and slightly modified to fit SearXNG's
quality gates.
[1] https://en.wikipedia.org/wiki/Emojipedia
[2] 2fc01eb20f
[3] https://github.com/searx/searx/pull/3278
In case content is None, the original code skips extract_text() and
just appends the None value to 'content'. So just add allow_none=True, which
returns None without raising a ValueError in extract_text().
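The effect of allow_none can be illustrated with a simplified stand-in. The
real searx.utils.extract_text also handles lxml elements and lists; this sketch
only covers strings and None.

```python
def extract_text(value, allow_none=False):
    """Simplified stand-in: normalize whitespace of a string.

    None is passed through when allow_none is True, otherwise it is an error.
    """
    if value is None:
        if allow_none:
            return None
        raise ValueError("extract_text(None, allow_none=False)")
    return " ".join(str(value).split())
```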
- fix the issue of fetching more than 7000 *languages*
- improve the request function and filter by language & country
- implement time_range_support & safesearch
- add more fields to the response from dailymotion (allow_embed, length)
- better clean up of HTML tags in the 'content' field.
This is more or less a complete rework based on the '/videos' API from [1].
This patch cleans up the language list in SearXNG that has been polluted by the
ISO-639-3 2 and 3 letter codes from dailymotion languages which have never been
used.
[1] https://developers.dailymotion.com/tools/
Closes: https://github.com/searxng/searxng/issues/1065
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Add player:
- The players only play 30 seconds of the title. Some of the players will be
blocked because of a cross-origin request and some players will link to Apple
when you press the play button.
Avoid exceptions (and, BTW, improve results)
- ERROR searx.engines.genius : list index out of range
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
The 'scrap_img_by_id' function no longer returned anything useful. This
fix allows the google images engine to present the full source image instead of
only the thumbnail.
The function scrap_img_by_id() is replaced by a full rewrite that parses image
URLs with a regular expression. The new function parse_urls_img_from_js(dom)
returns a mapping of data-id to image URL.
Closes: https://github.com/searxng/searxng/issues/909
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Embedded HTML breaks SearXNG architecture. To modularize, HTML is generated in
the templates (oscar & simple) and result parameter 'embedded' is replaced by
'data_src' (and 'audio_src'), a URL for embedded content (<iframe>).
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Embedded HTML breaks SearXNG architecture. To modularize, HTML is generated in
the templates (oscar & simple) and result parameter 'embedded' is replaced by
'data_src', a URL for embedded content (<iframe>).
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
OpenStreetMap images are now loaded from uploads.wikimedia.org instead of
commons.wikimedia.org to prevent redirects.
With `image_proxy` enabled, images from commons.wikimedia.org can't be loaded
since they are redirected. We already discussed this issue in [875] and
@tiekoetter fixed it in PR [878].
Related-to:
- [875] https://github.com/searxng/searxng/issues/875
- [878] https://github.com/searxng/searxng/pull/878
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Wikidata info box images are now loaded from uploads.wikimedia.org instead of commons.wikimedia.org to prevent redirects
Co-authored-by: Markus Heiser <markus.heiser@darmarit.de>
Two different threads (= two different user queries) can call the request
function in a row and then the response function. The namespace will be the same
since this is the same engine.
To keep exactly the same value, ``base_url`` must be stored in params and then
retrieved using ``resp.search_params["base_url"]``.
Suggested-by: @dalf https://github.com/searxng/searxng/pull/862#discussion_r799324861
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
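A sketch of the pattern, assuming an engine that picks one of several base
URLs per request. `BASE_URLS`, the URL layout and the `paths` attribute on the
response object are made up for this example; only the use of
``resp.search_params["base_url"]`` comes from the text above.

```python
import random
from urllib.parse import urlencode

BASE_URLS = ["https://a.example", "https://b.example"]  # hypothetical instances


def request(query, params):
    # Choose the base URL per request and store it in params instead of a
    # module-level global, which another thread could overwrite.
    base_url = random.choice(BASE_URLS)
    params["base_url"] = base_url
    params["url"] = f"{base_url}/search?{urlencode({'q': query})}"
    return params


def response(resp):
    # Read back exactly the value that was used for this request.
    base_url = resp.search_params["base_url"]
    return [{"url": base_url + path} for path in resp.paths]
```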
Currency engine has DuckDuckGo metadata
In the engine selector of the preferences window, the currency search engine has
the same metadata and wikidata URL as DuckDuckGo. I'd assume there should be a
difference of some sort there, clarifying what source the currency engine uses or, if
it's a DuckDuckGo service, at least clarifying that it's a currency service by
DuckDuckGo.
Closes: https://github.com/searxng/searxng/issues/787
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Other optional parameters:
`&sort=crawl_date`
can be appended to search_string to sort results by date.
`&domain=example.org`
can be appended to search_string to get results from just one domain.
Public instances could get timed out for 3600s relatively quickly.
--
Merged from @allendema's commit [1] and slightly modified / see [2].
Related-to: [1] 455b2b4460
Related-to: [2] https://github.com/searx/searx/pull/3040
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
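A hedged sketch of how the two optional parameters could be appended. The
function `build_search_string` and the query parameter name `q` are
assumptions; only `sort=crawl_date` and `domain=...` come from the text above.

```python
from urllib.parse import urlencode


def build_search_string(query, sort_by_date=False, domain=None):
    """Build the query string, appending the optional parameters when requested."""
    args = {"q": query}  # parameter name "q" is an assumption
    if sort_by_date:
        args["sort"] = "crawl_date"
    if domain:
        args["domain"] = domain
    return urlencode(args)
```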
Check 'using_tor_proxy' for each engine individually instead of checking globally
[fix] searx.network: update _rdns test to the last httpx version
Co-authored-by: Alexandre Flament <alex@al-f.net>
In case of CAPTCHA raise a SearxEngineCaptchaException and suspend for 7 days.
When get_sc_code() fails raise a SearxEngineResponseException and suspend for 7
days.
[1] https://github.com/searxng/searxng/pull/695
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Startpage has introduced new anti-scraping measures that make SearXNG instances
run into captchas:
1. some arguments have been removed and a new `sc` has been added.
2. search path changed from `do/search` to `sp/search`
3. POST request is no longer needed
Closes: https://github.com/searxng/searxng/issues/692
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
api.openverse.engineering is a little picky and wants to have a trailing slash
in the path:
/v1/images? --> /v1/images/?
otherwise it redirects, here is the debug log:
DEBUG searx.network.openverse : HTTP Request: GET https://api.openverse.engineering/v1/images?&page=1&page_size=20&format=json&q=foo "HTTP/2 301 Moved Permanently" (text/html; charset=utf-8)
DEBUG searx.network.openverse : HTTP Request: GET https://api.openverse.engineering/v1/images/?&page=1&page_size=20&format=json&q=foo "HTTP/2 200 OK" (application/json)
WARNING searx.engines.openverse : ErrorContext('searx/search/processors/online.py', 105, 'count_error(', None, '1 redirects, maximum: 0', ('200', 'OK', 'api.openverse.engineering')) True
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
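A small sketch of building the request URL with the trailing slash in place.
The parameter names are taken from the debug log above; the helper `build_url`
is an assumption for illustration.

```python
from urllib.parse import urlencode

# The trailing slash avoids the 301 redirect shown in the log above.
BASE_URL = "https://api.openverse.engineering/v1/images/"


def build_url(query, page=1, page_size=20):
    args = {"page": page, "page_size": page_size, "format": "json", "q": query}
    return BASE_URL + "?" + urlencode(args)
```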
The implementation of the etools engine is poor: no date-range support, no
language support, and it is broken by a CAPTCHA.
etools is a metasearch engine; the major search engines it supports (Google,
Bing, Wikipedia, Yahoo) are already available in SearXNG.
While etools does support several engines we currently don't support directly,
support for them should be added directly to SearXNG if there is demand.
In practice: in SearXNG the worse etools results are mixed with good results
from other engines we have (as long as there is no CAPTCHA).
In the best case, what we gain with etools is e.g. results from de.ask.com for
a German query; in all other cases worse results bubble up in
SearXNG's result list.
[1] https://github.com/searxng/searxng/issues/696#issuecomment-1005855499
Closes: https://github.com/searxng/searxng/issues/696
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
The general category is the category that is searched by default.
From a privacy standpoint it doesn't make sense to send all general
queries to specialized search engines that cannot deal with those
queries anyway.
Previously we didn't have a good place to put search engines that don't
fit into any of the tab categories. This commit automatically puts
search engines that don't belong to any tab category in an "other"
category, that is only displayed in the user preferences (and not above
search results).
Previously all categories were displayed as search engine tabs.
This commit changes that so that only the categories listed under
categories_as_tabs in settings.yml are displayed.
This lets us introduce more categories without cluttering up the UI.
Categories not displayed as tabs can still be searched with !bangs.
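The category handling described above can be sketched as follows. The list
`CATEGORIES_AS_TABS` and both helper functions are illustrative assumptions,
not the code from these commits.

```python
# Hypothetical value of categories_as_tabs from settings.yml.
CATEGORIES_AS_TABS = ["general", "images", "videos", "news"]


def with_other_category(engine_categories, tabs=CATEGORIES_AS_TABS):
    """Add 'other' to engines that do not belong to any tab category."""
    if any(category in tabs for category in engine_categories):
        return list(engine_categories)
    return list(engine_categories) + ["other"]


def visible_tabs(all_categories, tabs=CATEGORIES_AS_TABS):
    """Only categories listed in categories_as_tabs are shown as tabs."""
    return [category for category in tabs if category in all_categories]
```

An engine categorized only as, say, "translate" would end up in "other" and
stay reachable through its !bang, without adding a tab above the results.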
Fix pylint issues from commit (3d96a983)
[format.python] initial formatting of the python code
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Disable the python code formatting from python-black where the readability of
the code suffers from the formatting.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Follow-up queries for the pages needed to be fixed.
- Split the search term into one for the initial query and one for follow-up queries.
- Set some headers in HTTP requests that Bing needs for paging support.
- IMO //div[@class="sa_cc"] no longer matches in a Bing response.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Fix remarks from pylint and improved code-style. In preparation for a bug-fix
of the Bing (Web) engine I add this engine to the pylint-list.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
In the video search, google sometimes also includes news. E.g. in the DE
language, when you search for `!gov paris`, google adds an article from a German
newspaper (FAZ). I assume these are sponsored links (not tagged as advertisement?).
Those links do not have an image, so this patch ignores *video links* without an
image ID.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>