Ponysearch/searx/engines/soundcloud.py

"""
 Soundcloud (Music)

 @website     https://soundcloud.com
 @provide-api yes (https://developers.soundcloud.com/)

 @using-api   yes
 @results     JSON
 @stable      yes
 @parse       url, title, content, publishedDate, embedded
"""

import re
from json import loads
from lxml import html
from dateutil import parser
from searx import logger
from searx.poolrequests import get as http_get
from searx.url_utils import quote_plus, urlencode

try:
    from cStringIO import StringIO
except ImportError:
    from io import StringIO

# engine dependent config
categories = ['music']
paging = True

# search-url
# missing attributes: user_id, app_version, app_locale
url = 'https://api-v2.soundcloud.com/'
search_url = url + 'search?{query}'\
                         '&variant_ids='\
                         '&facet=model'\
                         '&limit=20'\
                         '&offset={offset}'\
                         '&linked_partitioning=1'\
                         '&client_id={client_id}'   # noqa

embedded_url = '<iframe width="100%" height="166" ' +\
    'scrolling="no" frameborder="no" ' +\
    'data-src="https://w.soundcloud.com/player/?url={uri}"></iframe>'

cid_re = re.compile(r'client_id:"([^"]*)"', re.I | re.U)
guest_client_id = ''


def get_client_id():
    """Scrape a guest client_id from SoundCloud's app scripts; return '' on failure."""
    response = http_get("https://soundcloud.com")

    if response.ok:
        tree = html.fromstring(response.content)
        # collect the urls of the app_js scripts referenced by soundcloud.com
        script_tags = tree.xpath("//script[contains(@src, '/assets/app')]")
        app_js_urls = [script_tag.get('src') for script_tag in script_tags if script_tag is not None]

        for app_js_url in app_js_urls:
            # fetch each app_js script and search it for the client_id
            response = http_get(app_js_url)
            if response.ok:
                cids = cid_re.search(response.content.decode("utf-8"))
                if cids is not None and len(cids.groups()):
                    return cids.groups()[0]
    logger.warning("Unable to fetch guest client_id from SoundCloud, check parser!")
    return ""


def init():
    global guest_client_id
    # fetch the guest client_id (used as api-key) once at engine start-up
    guest_client_id = get_client_id()


# do search-request
def request(query, params):
    offset = (params['pageno'] - 1) * 20

    params['url'] = search_url.format(query=urlencode({'q': query}),
                                      offset=offset,
                                      client_id=guest_client_id)

    return params


# get response from search-request
def response(resp):
    results = []

    search_res = loads(resp.text)

    # parse results
    for result in search_res.get('collection', []):
        if result['kind'] in ('track', 'playlist'):
            title = result['title']
            # description may be null in the API response
            content = result['description'] or ''
            publishedDate = parser.parse(result['last_modified'])
            uri = quote_plus(result['uri'])
            embedded = embedded_url.format(uri=uri)

            # append result
            results.append({'url': result['permalink_url'],
                            'title': title,
                            'publishedDate': publishedDate,
                            'embedded': embedded,
                            'content': content})

    # return results
    return results