Decoding encoded Google News URLs

I saved a search in https://news.google.com/, but Google does not use the actual links on its results page. Instead, you will find links like this:

https://news.google.com/articles/CBMiUGh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvd3NvcC1tYWluLWV2ZW50LXRpcHMtbmluZS1jaGFtcGlvbnMtMzEyODcuaHRt0gEA?hl=en-US&gl=US&ceid=US%3Aen

I want to get the 'real link' that this resolves to, using Python. If you plug the above URL into your browser, for a split second you will see

Opening https://www.pokernews.com/strategy/wsop-main-event-tips-nine-champions-31287.htm

I tried a few things using the Requests module but 'no cigar'.

If it can't be done, are these Google links permanent? Can they always be used to open up the web page?

UPDATE 1:

After posting this question I used a hack to solve the problem. I simply used urllib again to open the Google URL and then parsed the source to find the 'real URL'.

It was exciting to see TDG's answer, as it would help my program run faster. But Google is being cryptic, and it did not work for every link.

For this morning's news feed, it bombed on the 4th news item:

 RESTART: C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\rssFeed1.py 
cp1252
cp1252
>>> 1
Tommy Angelo Presents: The Butoff
CBMiTWh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvdG9tbXktYW5nZWxvLXByZXNlbnRzLXRoZS1idXRvZmYtMzE4ODEuaHRt0gEA
b'\x08\x13"Mhttps://www.pokernews.com/strategy/tommy-angelo-presents-the-butoff-31881.htm\xd2\x01\x00'
Flopped Set of Nines: Get All In on Flop or Wait?
CBMiXGh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvZmxvcHBlZC1zZXQtb2YtbmluZXMtZ2V0LWFsbC1pbi1vbi1mbG9wLW9yLXdhaXQtMzE4ODAuaHRt0gEA
b'\x08\x13"\\https://www.pokernews.com/strategy/flopped-set-of-nines-get-all-in-on-flop-or-wait-31880.htm\xd2\x01\x00'
What Not to Do Online: Don’t Just Stop Thinking and Shove
CBMiZWh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvd2hhdC1ub3QtdG8tZG8tb25saW5lLWRvbi10LWp1c3Qtc3RvcC10aGlua2luZy1hbmQtc2hvdmUtMzE4NzAuaHRt0gEA
b'\x08\x13"ehttps://www.pokernews.com/strategy/what-not-to-do-online-don-t-just-stop-thinking-and-shove-31870.htm\xd2\x01\x00'
Hold’em with Holloway, Vol. 77: Joseph Cheong Gets Crazy with a Pair of Ladies
CBMiV2h0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvaG9sZC1lbS13aXRoLWhvbGxvd2F5LXZvbC03Ny1qb3NlcGgtY2hlb25nLTMxODU4Lmh0bdIBAA
Traceback (most recent call last):
  File "C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\rssFeed1.py", line 68, in <module>
    GetGoogleNews("https://news.google.com/search?q=site%3Ahttps%3A%2F%2Fwww.pokernews.com%2Fstrategy&hl=en-US&gl=US&ceid=US%3Aen", 'news')
  File "C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\rssFeed1.py", line 34, in GetGoogleNews
    real_URL = base64.b64decode(coded)
  File "C:\Users\Mike\AppData\Local\Programs\Python\Python36-32\lib\base64.py", line 87, in b64decode
    return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
>>>

UPDATE 2:

After reading up on base64, I think the 'Incorrect padding' message means that the length of the input string must be divisible by 4. So I added 'aa' to

CBMiV2h0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvaG9sZC1lbS13aXRoLWhvbGxvd2F5LXZvbC03Ny1qb3NlcGgtY2hlb25nLTMxODU4Lmh0bdIBAA

and did not get the error message:

>>> t = s + 'aa'
>>> len(t)/4
32.0
>>> base64.b64decode(t)
b'\x08\x13"Whttps://www.pokernews.com/strategy/hold-em-with-holloway-vol-77-joseph-cheong-31858.htm\xd2\x01\x00\x06\x9a'
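
For reference, base64 pads with '=' rather than with arbitrary characters; the stray \x06\x9a at the end of the output above is what the two added 'a' characters decode to. A minimal sketch of padding with '=' instead, assuming the ID uses the URL-safe base64 alphabet (the helper name decode_unpadded is made up for illustration):

```python
import base64


def decode_unpadded(encoded: str) -> bytes:
    """Decode base64 text whose trailing '=' padding was stripped."""
    # Add just enough '=' to bring the length up to a multiple of 4.
    padding = "=" * (-len(encoded) % 4)
    # Assumption: the ID uses the URL-safe alphabet ('-' and '_').
    return base64.urlsafe_b64decode(encoded + padding)


s = "CBMiV2h0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvaG9sZC1lbS13aXRoLWhvbGxvd2F5LXZvbC03Ny1qb3NlcGgtY2hlb25nLTMxODU4Lmh0bdIBAA"
print(decode_unpadded(s))
```

With this padding the decoded bytes should end at \xd2\x01\x00, like the other items in the feed.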

Cuman answered 2/7, 2018 at 8:23 Comment(0)

Basically, it is a base64-encoded string. If you run the following code snippet:

import base64
coded = 'CBMiUGh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvd3NvcC1tYWluLWV2ZW50LXRpcHMtbmluZS1jaGFtcGlvbnMtMzEyODcuaHRt0gEA'
url = base64.b64decode(coded)
print(url)

You'll get the following output:

b'\x08\x13"Phttps://www.pokernews.com/strategy/wsop-main-event-tips-nine-champions-31287.htm\xd2\x01\x00'

So it looks like your url with some extras. If all the extras are the same, it will be easy to filter out the url. If not - you'll have to handle every one separately.
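
In the decoded samples on this page the extras are not constant: the byte right after b'\x08\x13"' matches the length of the URL (here P is 80, and the URL above is 80 characters long). A rough sketch that uses this to cut out the URL, assuming the length is stored as a protobuf-style varint and that the layout stays as seen in these samples; extract_url is a hypothetical helper:

```python
import base64

_HEADER = b'\x08\x13"'  # prefix seen in every decoded sample on this page


def extract_url(decoded: bytes) -> str:
    """Return the URL stored right after the header and its length prefix."""
    if not decoded.startswith(_HEADER):
        raise ValueError("unexpected prefix")
    pos = len(_HEADER)
    # Assumption: the length is a varint (7 bits per byte, high bit = "more").
    length = shift = 0
    while True:
        byte = decoded[pos]
        pos += 1
        length |= (byte & 0x7F) << shift
        shift += 7
        if not byte & 0x80:
            break
    return decoded[pos:pos + length].decode("utf-8")


coded = 'CBMiUGh0dHBzOi8vd3d3LnBva2VybmV3cy5jb20vc3RyYXRlZ3kvd3NvcC1tYWluLWV2ZW50LXRpcHMtbmluZS1jaGFtcGlvbnMtMzEyODcuaHRt0gEA'
print(extract_url(base64.urlsafe_b64decode(coded)))
```

Reading the length instead of slicing at a fixed offset keeps the trailing bytes out of the result even when the URL length changes from item to item.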

Hundredth answered 24/8, 2018 at 14:44 Comment(4)

Looks like a bug in the base64 library. If you paste the string into base64decode.org, it gets decoded just fine. I suggest you open a question about this case on SO. – Hundredth 25/8, 2018 at 11:6

I padded the offending encoded string with 'aa' and now it works. I think Python's base64.b64decode wants the input to have a length divisible by 4. – Cuman 25/8, 2018 at 13:27

My news feed really runs fast now due to your observation and assistance. Thanks so much. Not sure if this is good to know, but this code works: real_URL = base64.b64decode(coded)[4:].decode('utf-8', "backslashreplace").split('\\')[0] – Cuman 26/8, 2018 at 3:22

@TDG: No, the additional characters are not the same, but the solution suggested by CopyPastelt works. – Horseleech 19/4 at 14:26

In 2024, I tried using base64 decode, but it didn't work. So, I found a solution in TypeScript and converted it to Python.

Github Link

Nordgren answered 4/6 at 20:33 Comment(0)

I use the following code which you can put in a new module, e.g. gnews.py. This answer is applicable to the RSS feeds provided by Google News, and may otherwise need a slight adjustment. Note that I cache the returned value.

Steps used:

Find the base64 text in the encoded URL, and fix its padding.
Find the first URL in the decoded base64 text.

"""Decode encoded Google News entry URLs."""
import base64
import functools
import re

# Ref: https://stackoverflow.com/a/59023463/

_ENCODED_URL_PREFIX = "https://news.google.com/__i/rss/rd/articles/"
_ENCODED_URL_RE = re.compile(fr"^{re.escape(_ENCODED_URL_PREFIX)}(?P<encoded_url>[^?]+)")
_DECODED_URL_RE = re.compile(rb'^\x08\x13".+?(?P<primary_url>http[^\xd2]+)\xd2\x01')


@functools.lru_cache(2048)
def _decode_google_news_url(url: str) -> str:
    match = _ENCODED_URL_RE.match(url)
    encoded_text = match.groupdict()["encoded_url"]  # type: ignore
    encoded_text += "==="  # Fix incorrect padding. Ref: https://stackoverflow.com/a/49459036/
    decoded_text = base64.urlsafe_b64decode(encoded_text)

    match = _DECODED_URL_RE.match(decoded_text)
    primary_url = match.groupdict()["primary_url"]  # type: ignore
    primary_url = primary_url.decode()
    return primary_url


def decode_google_news_url(url: str) -> str:  # Not cached because not all Google News URLs are encoded.
    """Return the decoded URL of a Google News entry, or the URL unchanged if it is not encoded."""
    return _decode_google_news_url(url) if url.startswith(_ENCODED_URL_PREFIX) else url

Usage example:

>>> decode_google_news_url('https://news.google.com/__i/rss/rd/articles/CBMiQmh0dHBzOi8vd3d3LmV1cmVrYWxlcnQub3JnL3B1Yl9yZWxlYXNlcy8yMDE5LTExL2RwcGwtYmJwMTExODE5LnBocNIBAA?oc=5')
'https://www.eurekalert.org/pub_releases/2019-11/dppl-bbp111819.php'
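
Note that the helper above raises AttributeError when either regex fails to match, for example on an ID that is not a plain base64-wrapped URL (another answer here reports base64 decoding not working in 2024). A hedged sketch of a more forgiving variant that returns None instead; try_decode_google_news_url and its two regexes are illustrative and not part of the module above:

```python
import base64
import binascii
import re
from typing import Optional

# Assumption: the ID is the path segment after "/articles/" in both URL styles.
_ID_RE = re.compile(r"^https://news\.google\.com/(?:__i/rss/rd/)?articles/(?P<id>[^?/]+)")
# First run of printable ASCII that starts with http:// or https://.
_URL_RE = re.compile(rb"https?://[\x21-\x7e]+")


def try_decode_google_news_url(url: str) -> Optional[str]:
    """Return the embedded URL, or None if the ID does not contain one."""
    match = _ID_RE.match(url)
    if match is None:
        return None
    encoded = match.group("id")
    try:
        decoded = base64.urlsafe_b64decode(encoded + "=" * (-len(encoded) % 4))
    except (binascii.Error, ValueError):
        return None
    found = _URL_RE.search(decoded)
    return found.group().decode("ascii") if found else None
```

A caller can then fall back to the original link (or to fetching the page) whenever None comes back.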

Hedley answered 24/11, 2019 at 23:31 Comment(0)

As pointed out in this Stack Overflow answer (link), just adding '==' to the end of the string will do the trick.

Achromatin answered 13/6, 2023 at 20:16 Comment(4)

While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. - From Review – Synchrocyclotron 16/6, 2023 at 15:21

The link you posted doesn’t say that. It says to pad the end of the string with =s, which may be one, two or three depending on the size of the string. – Mok 25/7 at 15:10

@bfontaine, read the response carefully – Achromatin 29/7 at 12:38

@Achromatin my bad, you’re right! – Mok 29/7 at 13:35
