searxng/searx/engines/duden.py

# SPDX-License-Identifier: AGPL-3.0-or-later
"""
Duden (https://www.duden.de), a German dictionary
"""

import re
from urllib.parse import quote, urljoin
from lxml import html
from searx.utils import extract_text, eval_xpath, eval_xpath_list, eval_xpath_getindex
from searx.network import raise_for_httperror

# about
about = {
    "website": 'https://www.duden.de',
    "wikidata_id": 'Q73624591',
    "official_api_documentation": None,
    "use_official_api": False,
    "require_api_key": False,
    "results": 'HTML',
    "language": 'de',
}

categories = ['dictionaries']
paging = True

# search-url
base_url = 'https://www.duden.de/'
search_url = base_url + 'suchen/dudenonline/{query}?search_api_fulltext=&page={offset}'


def request(query, params):
    '''pre-request callback
    params<dict>:
      method  : POST/GET
      headers : {}
      data    : {} # if method == POST
      url     : ''
      category: 'search category'
      pageno  : 1 # number of the requested page
    '''

    offset = params['pageno'] - 1
    if offset == 0:
        search_url_fmt = base_url + 'suchen/dudenonline/{query}'
        params['url'] = search_url_fmt.format(query=quote(query))
    else:
        params['url'] = search_url.format(offset=offset, query=quote(query))
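    # Past the last page of results, duden.de answers with an HTTP redirect to
    # spelling corrections, whatever the page number is, so allow one redirect.
    params['soft_max_redirects'] = 1
    # A 404 means "no results"; response() handles it instead of raising.
    params['raise_for_httperror'] = False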
    return params


def response(resp):
    '''post-response callback
    resp: HTTP response object
    '''
    results = []

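    # Duden expects a German word and answers 404 when it finds nothing;
    # treat that as an empty result list rather than an error.
    if resp.status_code == 404:
        return results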

    raise_for_httperror(resp)

    dom = html.fromstring(resp.text)

    number_of_results_element = eval_xpath_getindex(
        dom, '//a[@class="active" and contains(@href,"/suchen/dudenonline")]/span/text()', 0, default=None
    )
    if number_of_results_element is not None:
        number_of_results_string = re.sub('[^0-9]', '', number_of_results_element)
        # int('') raises ValueError, so skip labels that contain no digits
        if number_of_results_string:
            results.append({'number_of_results': int(number_of_results_string)})

    for result in eval_xpath_list(dom, '//section[not(contains(@class, "essay"))]'):
        url = eval_xpath_getindex(result, './/h2/a', 0).get('href')
        url = urljoin(base_url, url)
        title = eval_xpath(result, 'string(.//h2/a)').strip()
        content = extract_text(eval_xpath(result, './/p'))
        # append result
        results.append({'url': url, 'title': title, 'content': content})

    return results
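
The URL and result-count logic above can be tried in isolation with the standard library only. The following is a minimal sketch, not part of the engine: the helper names `build_url` and `parse_result_count` are hypothetical, and the sample label `'(1.234)'` is an assumed counter format chosen for illustration, not one confirmed from duden.de.

```python
import re
from urllib.parse import quote

BASE_URL = 'https://www.duden.de/'


def build_url(query, pageno):
    """Mirror request(): page 1 has no query string, later pages pass a zero-based offset."""
    offset = pageno - 1
    url = BASE_URL + 'suchen/dudenonline/' + quote(query)
    if offset > 0:
        url += '?search_api_fulltext=&page={}'.format(offset)
    return url


def parse_result_count(text):
    """Keep only the digits of a counter label; return None when there are none."""
    digits = re.sub('[^0-9]', '', text)
    return int(digits) if digits else None


print(build_url('Haus', 1))
print(build_url('Straße', 3))
print(parse_result_count('(1.234)'))  # hypothetical label format
```

Note that `quote` percent-encodes non-ASCII characters such as `ß` as UTF-8 bytes, and that the third results page maps to `page=2` because the offset is zero-based.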