Partial reverse engineering of the Google engines including a improved language
and region handling based on the engine.traits_v1 data.
When ever possible the implementations of the Google engines try to make use of
the async REST APIs. The get_lang_info() has been generalized to a
get_google_info() function / especially the region handling has been improved by
adding the cr parameter.
searx/data/engine_traits.json
Add data type "traits_v1" generated by the fetch_traits() functions from:
- Google (WEB),
- Google images,
- Google news,
- Google scholar and
- Google videos
and remove data from obsolete data type "supported_languages".
A traits.custom type that maps region codes to *supported_domains* is fetched
from https://www.google.com/supported_domains
searx/autocomplete.py:
Reversed engineered autocomplete from Google WEB. Supports Google's languages and
subdomains. The old API suggestqueries.google.com/complete has been replaced
by the async REST API: https://{subdomain}/complete/search?{args}
searx/engines/google.py
Reverse engineering and extensive testing ..
- fetch_traits(): Fetch languages & regions from Google properties.
- always use the async REST API (formally known as 'use_mobile_ui')
- use *supported_domains* from traits
- improved the result list by fetching './/div[@data-content-feature]'
and parsing the type of the various *content features* --> thumbnails are
added
searx/engines/google_images.py
Reverse engineering and extensive testing ..
- fetch_traits(): Fetch languages & regions from Google properties.
- use *supported_domains* from traits
- if exists, freshness_date is added to the result
- issue 1864: result list has been improved a lot (due to the new cr parameter)
searx/engines/google_news.py
Reverse engineering and extensive testing ..
- fetch_traits(): Fetch languages & regions from Google properties.
*supported_domains* is not needed but a ceid list has been added.
- different region handling compared to Google WEB
- fixed for various languages & regions (due to the new ceid parameter) /
avoid CONSENT page
- Google News do no longer support time range
- result list has been fixed: XPath of pub_date and pub_origin
searx/engines/google_videos.py
- fetch_traits(): Fetch languages & regions from Google properties.
- use *supported_domains* from traits
- add paging support
- implement a async request ('asearch': 'arc' & 'async':
'use_ac:true,_fmt:html')
- simplified code (thanks to '_fmt:html' request)
- issue 1359: fixed xpath of video length data
searx/engines/google_scholar.py
- fetch_traits(): Fetch languages & regions from Google properties.
- use *supported_domains* from traits
- request(): include patents & citations
- response(): fixed CAPTCHA detection (Scholar has its own CATCHA manager)
- hardening XPath to iterate over results
- fixed XPath of pub_type (has been change from gs_ct1 to gs_cgt2 class)
- issue 1769 fixed: new request implementation is no longer incompatible
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
- fetch_traits(): Fetch languages from peertube's search-index source code.
[mod] Include migration of the request methode from 'supported_languages'
to 'traits' (EngineTraits) object.
[fix] old supported_languages_url is no longer valid since the sources
has been moved to a different path.
- fixed code to pass pylint
- request(): complete re-implementation based on the API docs [1]
- response(): complete re-implementation, adds serveral fields missed before
- add source code documentation
[1] https://docs.joinpeertube.org/api-rest-reference.html#tag/Search/operation/searchVideos
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Tineye becomes active as soon as a https:// signature is found in the search
term, but most of the time a reverse image search is not requested when a URL is
specified, often the URL is just from a C&P.
The frequent requests to tineye lead in the end to the SearXNG instance being
blocked by tineye and the user seeing unexpected error messages.
BTW: many maintainers have disabled this engine in their local SearXNG settings.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
* fix type in settings.yml: replace suspend_times by suspended_times
* always use delay defined in settings.yml:
* HTTP status 402 and 403: read the value from settings.yml instead of using the hardcoded value of 1 day.
* startpage engine: CAPTCHA suspend the engine for one day instead of one week
Make suspended_time changeable in settings.yml
Allow different values to be set for different exceptions.
Co-authored-by: Alexandre Flament <alex@al-f.net>
- Add documentation to the plugin
- Harmonize FastText language model with SearXNG's language model
Reosurces::
import fasttext # --> +10 MB
fasttext.load_model(str(data_dir / 'lid.176.ftz')) # --> +4MB
Suggested-by: @dalf
- To speed up and simplify the deployment use fasttext-wheel instead of fasttext
- Building numpy on the Alpine Linux of docker-images takes ages --> install
py3-numpy from Alpines package manager (apk)
- Alpine Linux on docker-images (musl libc) do not support fasttext-wheel (gnu
libc) --> patch Dockerfile and build from fastetxt:
sed -i s/fasttext-wheel/fasttext/ requirements.txt
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
settings.yml:
* The default URL was unix:///usr/local/searxng-redis/run/redis.sock?db=0
* The default URL is now "false"
The default URL makes the log difficult to deal with:
if the admin didn't install a Redis instance, the logs record a false error.
It worked before because SearXNG initialized the Redis connection when the limiter started.
In this commit, SearXNG initializes Redis in searx/webapp.py
so various components can use Redis without taking care of the initialization step.
no_result_for_http_status contains a list of HTTP status.
These HTTP status are seen an empty result list.
In other cases an exception is thrown as usual.
Previously raise_for_httperror were ignoring all HTTP error,
which make defective engines invisible in the stats.
By using new property `qwant_categ:` the category of qwant is no longer bound to
the category of SearXNG.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Neeva is "the world's first ad-free, private search engine" and uses data from Apple, Bing, Yelp and "others".
They claim to crawl "hundreds of millions" of URLs a day (https://twitter.com/Neeva/status/1536447373903335426).
This implements the Deepl Translation engine. It works nearly like lingva but
directly to the deepl API. This api only needs a to-lang, from-lang is a fake
by now.
There is a free option to use [1].
[1] https://www.deepl.com/pro-api?cta=header-pro-api for registering a free account.
yep.com is still in beta, the api.yep.com does not have paging support. There
is only a 'limit' argument with a maximum of 100 results.
yep.com seems fast; there is nor need for a timeout of 12 sec.
The API returns JSON nevertheless what the HTTP header is, the "show more"
button on yep.com's web site does not set a special HTTP Accept header.
FYI: The index does not support languages, the WEB UI does not offer a language
selection of the results and the entire index seems in English.
Closes: https://github.com/searxng/searxng/issues/1619
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Most engines that support languages (and regions) use the Accept-Language from
the WEB browser to build a response that fits to the language (and region).
- add new engine option: send_accept_language_header
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
The engine name is not only a *name* its also a identifier that is used in
logs, HTTP headers and more. Unicode characters in the name of an engine could
cause various issues.
Closes: https://github.com/searxng/searxng/issues/1544
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Emojipedia is an emoji reference website which documents the meaning and
common usage of emoji characters in the Unicode Standard. It is owned by Zedge
since 2021. Emojipedia is a voting member of The Unicode Consortium.[1]
Cherry picked from @james-still [2[3] and slightly modified to fit SearXNG's
quality gates.
[1] https://en.wikipedia.org/wiki/Emojipedia
[2] 2fc01eb20f
[3] https://github.com/searx/searx/pull/3278
Add a new setting: general.donation_url
By default the value is https://docs.searxng.org/donate.html
When the value is false, the link is hidden
When the value is true, the link goes to the infopage donation,
the administrator can create a custom page.
Since PR 932 [1][2] static files can't be delivered by HTTP server any longer.
This patch makes the hash paramter in the URL of static files:
/static/themes/simple/css/searxng.min.css?5fde34a74bc438c7b56ec8c6501e131cc9914bd8
optional. By default the hash parameter is disabled.
HINT:
Instances that do not deliver static files by their HTTP server and have a
long expire time [3] should enable this option.
----
This is only a interim solution, on the long run:
make static.build.commit
creates files including the file name:
css/searxng-5fde34a74bc438c7b56ec8c6501e131cc9914bd8.min.css
and a mapping.json with this content[4]
[1] https://github.com/searxng/searxng/issues/964
[2] https://github.com/searxng/searxng/pull/932#issuecomment-1067039518
[3] 5583336440
[4] https://github.com/searxng/searxng/pull/932#issuecomment-1067216426
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Two different threads ( = two different user queries) can call the request
function in a row and then the response function. The namespace will be same
since this is the same engine.
To keep exactly the same value ``base_url`` must be stored in params and then
retrieve using ``resp.search_params["base_url"]``.
Suggested-by: @dalf https://github.com/searxng/searxng/pull/862#discussion_r799324861
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
can replace filtron:
* rate limite the number of request per IP and per (IP, User-Agent)
* block some bots
use Redis
data stored in Redis never contains the IP addresses, only HMAC using the secret_key
Co-authored-by: Markus Heiser <markus.heiser@darmarit.de>
Other optional parameter ..
`&sort=crawl_date`
can be appended to search_string to sort results by date.
`&domain=example.org`
can be implemented to search_string to get results from just one domain.
Public instances could get relatively fast timed-out for 3600s.
--
Merged from @allendema's commit [1] and slightly modfied / see [2].
Related-to: [1] 455b2b4460
Related-to: [2] https://github.com/searx/searx/pull/3040
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
The implementation of the etools engine is poor. No date-range support, no
language support and it is broken by a CAPTCHA.
etools is a metasearch engine, the major search engines it supports (google,
bing, wikipedia, Yahoo) are already available in SeaarXNG.
While etools does support several engines we currently don't support directly,
support for them should be added directly to SearXNG if there is demand.
In practice: in SearXNG the worse etools results will be mixed with good results
from other engines we have (as long as there is no captcha).
At best case, what we win with etools is in e.g. results from de.ask.com in a
query from a german request .. in all other cases worse results are bubble up in
SearXNG's result list.
[1] https://github.com/searxng/searxng/issues/696#issuecomment-1005855499
Closes: https://github.com/searxng/searxng/issues/696
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
* allow not to record metrics (response time, etc...)
* this commit doesn't change the UI. If the metrics are disabled
/stats and /stats/errors will return empty response.
in /preferences, the columns response time and reliability will be empty.
The tab icon names are currently hard coded in the templates.
This commit lets us introduce an icon property in the future, e.g:
categories_as_tabs:
general:
icon: search-outline
These dictionaries are no longer part of the general category,
so they're no longer queried by default -> we can enable them
by default without degrading general query performance.
The general category is the category that is searched by default.
From a privacy standpoint it doesn't make sense to send all general
queries to specialized search engines that cannot deal with those
queries anyway.
Add a redis connector, the default DB connector is a socket at::
unix:///usr/local/searxng-redis/run/redis.sock?db=0
To set up a redis instance simply use::
$ ./manage redis.build
$ sudo -H ./manage redis.install
A hint for developers:
To get access rights to this instance, your developer account needs to be added
to the *searxng-redis* group::
$ sudo -H ./manage redis.addgrp "${USER}"
# don't forget to logout & login to get member of group
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Previously all categories were displayed as search engine tabs.
This commit changes that so that only the categories listed under
categories_as_tabs in settings.yml are displayed.
This lets us introduce more categories without cluttering up the UI.
Categories not displayed as tabs can still be searched with !bangs.
Since digg no longer works, we do nat have a active engine in the social-media
category. Enable reddit by default to have at least one engine back in this
category.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
* disable by default
* settings.yml: ui.query_in_title
* in /preferences: privacy tab
when enabled, the result page's title contains the user query.
previously:
* oscar theme: the query was always included
* simple theme: the query was included with the GET method
The key of the dictionary 'searx.data.ENGINES_LANGUAGES' is the *engine name*
configured in settings.xml. When multiple engines are configured to use the
same origin engine (e.g. `engine: google`)::
- name: google
engine: google
use_mobile_ui: false
...
- name: google italian
engine: google
use_mobile_ui: false
language: it
...
- name: google mobile ui
engine: google
shortcut: gomui
use_mobile_ui: true
There exists no entry for ENGINES_LANGUAGES[engine.name] (e.g. `name: google
mobile ui` or `name: google italian`). This issue can be solved by recreate the
ENGINES_LANGUAGES::
make data.languages
But this is nothing an SearXNG admin would like to do when just configuring
additional engines, since this just doubles entries in ENGINES_LANGUAGES and
BTW: `make data.languages` has various external requirements which might be not
installed or not available, on a production host.
With this patch, if engine.name fails, ENGINES_LANGUAGES[engine.engine] is used
to get the engine.supported_languages (e.g. `google` for the engine named
`google mobile`).
For an engine, when there is `language: ...` in the YAML settings, the engine
supports only one language, in this case engine.supported_languages should
contains this value defined in settings.yml (e.g. `it` for the engine named
`google italian`).
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Closes: https://github.com/searxng/searxng/issues/384
Suggestions should be added too.
suggestion_xpath: //div[@class="text-gray h6"]/a
You can try it with:
!brave recurzuoin
Suggested-by: @allendema in https://github.com/searx/searx/issues/2857#issuecomment-904837023
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
This commit remove the need to update the brand for GIT_URL and GIT_BRANCH:
there are read from the git repository.
It is possible to call python -m searx.version freeze to freeze the current version.
Useful when the code is installed outside git (distro package, docker, etc...)
Deny formats has been implemented in 6ed4616d.
To harden SearXNG instances by default, other formats than HTML should be
denied. Most of JSON, RSS and CSV requests are bots [1]::
Bots are the only users of this feature on a public instance, and they abuse
it too much that the engines rate limit pretty quickly the IP address of the
instance.
[1] https://github.com/searxng/searxng/issues/95
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Do not merge this patch in master branch of SearXNG! This branch exists only
for testing the feature branch fix-searx.sh @return42.
This patch changes the buildenv to::
GIT_URL='https://github.com/return42/searxng'
GIT_BRANCH='fix-searx.sh'
SEARX_PORT='7777'
SEARX_BIND_ADDRESS='127.0.0.12'
To test installation procedure, clone feature branch (fix-searx.sh)::
$ cd ~/Downloads
$ git clone --branch fix-searx.sh https://github.com/return42/searxng searxng
$ cd searxng
$ ./utils/searx.sh install all
...
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Not all settings from the 'brand:' section of the YAML files are needed in the
shell scripts. This patch reduce the variables in ./utils/brand.env to what is
needed. The following ('brand:' settings) can be removed from this file:
- ISSUE_URL
- DOCS_URL
- PUBLIC_INSTANCES
- WIKI_URL
Tasks running outside of an *installed instance*, need the following settings
from the YAML configuration:
- GIT_URL <--> brand.git_url
- GIT_BRANCH <--> brand.git_branch
- SEARX_URL <--> server.base_url (aka PUBLIC_URL)
- SEARX_PORT <--> server.port
- SEARX_BIND_ADDRESS <--> server.bind_address
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
modified docs/admin/engines/settings.rst
- Fix documentation and add section 'brand'.
- Add remarks about **buildenv** variables.
- Add remarks about settings from environment variables $SEARX_DEBUG,
$SEARX_PORT, $SEARX_BIND_ADDRESS and $SEARX_SECRET
modified docs/admin/installation-searx.rst & docs/build-templates/searx.rst
Fix template location /templates/etc/searx/settings.yml
modified docs/dev/makefile.rst
Add description of the 'make buildenv' target and describe
- we have all SearXNG setups are centralized in the settings.yml file
- why some tasks need a utils/brand.env (aka instance's buildenv)
modified manage
Settings file from repository's working tree are used by default and
ask user if a /etc/searx/settings.yml file exists.
modified searx/settings.yml
Add comments about when it is needed to run 'make buildenv'
modified searx/settings_defaults.py
Default for server:port is taken from enviroment variable SEARX_PORT.
modified utils/build_env.py
- Some defaults in the settings.yml are taken from the environment,
e.g. SEARX_BIND_ADDRESS (searx.settings_defaults.SHEMA). When the
'brand.env' file is created these enviroment variables should be
unset first.
- The CONTACT_URL enviroment is not needed in the utils/brand.env
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Qwant is a fast and reliable search engine and AFAIK there is no CAPTCHA. Let
us enable Qwant engines by default.
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
The implementation uses the Qwant API (https://api.qwant.com/v3). The API is
undocumented but can be reverse engineered by reading the network log of
https://www.qwant.com/ queries.
This implementation is used by different qwant engines in the settings.yml::
- name: qwant
categories: general
...
- name: qwant news
categories: news
...
- name: qwant images
categories: images
...
- name: qwant videos
categories: videos
...
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
The engine was added in commit a4b07460 but now it shows new issues [1].
In the 90'th of the last century, dogpile had its own WEB index, but nowadays it
is a meta-search engine [2]
Powered by technology, Dogpile returns all the best results from leading
search engines including Google and Yahoo!
Using dogpile as an engine in SearXNG needs more investigation, a XPath solution
like we have is not enough. It is questionable whether it still makes sense to
investigate more into a meta-search engine with a ReCAPTCHA in front.
With this patch the dogpile engine is removed
[1] https://github.com/searxng/searxng/issues/202
[2] https://www.dogpile.com/support/aboutus
Signed-off-by: Markus Heiser <markus.heiser@darmarit.de>
Engine just for Podcasts. An API which returns Podcasts and their Info like:
website, author etc.
Upstream query example: https://gpodder.net/search.json?q=linux
Added synonyme.woxikon.de using the xpath engine. Adds a site which returns
word synonyms although just in German.
Depending on the query not all synonyms are shown because of not the best xpath
selection. But should do the job just fine.
Upstream example query: https://synonyme.woxikon.de/synonyme/test.php