PHP UTF-encoded URL-string

When I type a URL like http://www.example.com/?query=Траливали into the Firefox address bar, it is automatically encoded to http://www.example.com/?query=%D2%F0%E0%EB%E8%E2%E0%EB%E8.

But a URL like http://www.example.com/#ajax_call?query=Траливали is not converted.

Other browsers, such as IE8, do not convert the query at all.

The question is: how to detect (in PHP) if query is encoded? How to decode it?

I've tried:

  1. $str = iconv('cp1251', 'utf-8', urldecode($str) );

  2. $str = utf8_decode(urldecode($str));

  3. $str = (urldecode($str));

  4. many functions from http://php.net/manual/en/function.urldecode.php

Nothing works.

Test:

$str = $_GET['str'];

d('%D2%F0%E0%EB%E8%E2%E0%EB%E8' == urldecode('%D2%F0%E0%EB%E8%E2%E0%EB%E8'));

d('%D2%F0%E0%EB%E8%E2%E0%EB%E8' == $str);

d('Траливали' == $str);

d(urldecode($str));

d(utf8_decode(urldecode($str)));

!!! d('%D2%F0%E0%EB%E8%E2%E0%EB%E8' == urlencode($str)); !!!

Returns:

[false] [false] [false] ��������� ???? [true]

A workaround of sorts: send the query as part of the URL path (http://www.example.com/Траливали/) and parse it with mod_rewrite.

Photic answered 30/7, 2010 at 2:57 Comment(9)
Note that there are 2 steps here: from the browser to your script, and from the script to your browser. Both steps need to be done properly if you want to see your data come out as you want it. So it depends on what your script needs to do. See my updated answer for some suggestions. – Momentary
Regarding the update: are you saving the file in the same encoding? (I presume utf-8 for the connection?) Try testing d('%...' == rawurlencode($str)) – Momentary
I added some tests, rawurlencode gives the same result as urlencode. – Photic
Just tried it copy/pasting from your "Траливали" string, works like a charm here, comparing $str == 'Траливали'. Are you sure you are saving the php script in the right encoding? What happens if you put echo 'Траливали'; in your script? Does it appear on screen correctly? – Momentary
Yes, it shows correctly. Script is saved as UTF-8 without BOM (using Notepad++). – Photic
You need to take things a little slower and evaluate step by step. Looking at your example url: are you sure "Траливали" is "%D2%F0%E0%EB%E8%E2%E0%EB%E8" in utf-8? Here it comes out as "%D0%A2%D1%80%D0%B0%D0%BB%D0%B8%D0%B2%D0%B0%D0%BB%D0%B8". Could that be the problem? – Momentary
%D2%F0%E0%EB%E8%E2%E0%EB%E8 is the string automatically generated by Firefox. – Photic
Yes, and it is a 1251-encoded string, not utf8. – Fleabitten
"The question is: how to detect (in PHP) if query is encoded? How to decode it?" – Photic

It is not converted because a query that comes after the fragment is not a query at all: everything after # belongs to the fragment.

RFC 3986 defines a URI as composed of the following parts:

     foo://example.com:8042/over/there?name=ferret#nose
     \_/   \______________/\_________/ \_________/ \__/
      |           |            |            |        |
   scheme     authority       path        query   fragment

The order cannot be changed. Therefore,

URL1: http://www.example.com/?query=Траливали#ajax_call

will be handled properly while

URL2: http://www.example.com/#ajax_call?query=Траливали

will not. In URL2, IE actually handles the URL properly: it treats #ajax_call?query=Траливали as the fragment, with no query. The fragment always comes last and is never sent to the server.

IE will properly encode the query component of URL1 as it will detect it as a query.
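
As a rough illustration of the component order described above, PHP's own parse_url() can be used to see how the two URL shapes are split. This is only a sketch: the URLs are shortened stand-ins for URL1 and URL2, with the query value already percent-encoded as UTF-8.

```php
<?php
// Shortened stand-ins for URL1 and URL2 (value is percent-encoded UTF-8).
$url1 = 'http://www.example.com/?query=%D0%A2%D1%80%D0%B0#ajax_call';
$url2 = 'http://www.example.com/#ajax_call?query=%D0%A2%D1%80%D0%B0';

$parts1 = parse_url($url1);
$parts2 = parse_url($url2);

// URL1: the query and the fragment are separate components.
var_dump($parts1['query']);        // string "query=%D0%A2%D1%80%D0%B0"
var_dump($parts1['fragment']);     // string "ajax_call"

// URL2: there is no query component; everything after # is the fragment.
var_dump(isset($parts2['query'])); // bool(false)
var_dump($parts2['fragment']);     // string "ajax_call?query=%D0%A2%D1%80%D0%B0"
```

Only the part reported as the query would ever reach $_GET; the fragment stays in the browser.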

As for decoding in PHP, percent-escapes such as %D2 are decoded automatically when PHP populates $_GET['query']. The $_GET variable was not populated in your case because, according to the standard, URL2 has no query.

One last thing: 'Траливали' == $_GET['query'] will only be true if your PHP script itself is saved as UTF-8 and the browser sent the query as UTF-8. Your text editor should be able to tell you the encoding of your file.
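
To see why such a comparison can fail, it may help to look at the raw bytes. The sketch below decodes the two percent-encoded strings quoted in this thread (the one Firefox generated and the UTF-8 one from the comments) and compares them; the variable names are only illustrative.

```php
<?php
// The string Firefox produced in the question (single-byte, windows-1251).
$fromFirefox = rawurldecode('%D2%F0%E0%EB%E8%E2%E0%EB%E8');
// The same word percent-encoded as UTF-8, as quoted in the comments.
$asUtf8 = rawurldecode('%D0%A2%D1%80%D0%B0%D0%BB%D0%B8%D0%B2%D0%B0%D0%BB%D0%B8');

var_dump(strlen($fromFirefox));     // int(9): one byte per letter
var_dump(strlen($asUtf8));          // int(18): two bytes per letter
var_dump(bin2hex($fromFirefox));    // string "d2f0e0ebe8e2e0ebe8"
var_dump($fromFirefox === $asUtf8); // bool(false): same word, different bytes
```

A string literal in a UTF-8 source file holds the 18-byte form, so it can never equal the 9-byte value without a conversion step.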

Diaphysis answered 30/7, 2010 at 3:9 Comment(20)
Yes, indeed. Thank you for such a good reply. But it is common practice to use the fragment for Ajax addresses. And it is the source of the problem, not a solution. – Photic
@topright: It is the solution. I'm not saying to drop the fragment altogether, I'm saying that your fragment should always be last. Rewrite your links to respect that. PHP does not handle a query after the fragment as it does not expect it to be there (it's illegal according to RFC 3986). IE does not even bother to try encoding it as it is expecting a fragment (which is limited to ASCII characters only). – Diaphysis
It's not. The problem occurs even without a query in the fragment. – Photic
@topright: when doing 'Траливали' == $_GET['query'], you need to make sure your PHP file is also encoded in UTF-8... Check that in your text editor. – Diaphysis
"if your PHP script itself is encoded in UTF-8". You are right. My script is encoded as UTF-8 without BOM (using Notepad++). – Photic
@topright #ajax_call?query=Траливали means that the fragment consists of the text ajax_call?query=Траливали. The fragment is not sent to the server. In other words, anything you put after # in the URL is never sent to the server. – Imperial
@topright: fragments are great for ajax as they are stored in the history yet do not waste bandwidth by sending useless data to the server. That is why they are used in AJAX scenarios where the fragment is parsed client-side. What you are trying to do will not work with fragments (they are never sent to PHP), which is why we tell you to use queries instead. You choose to ignore that advice. – Diaphysis
The fragment is sent to the server via an Ajax call. The server receives Траливали that way. – Photic
Anyway, do you understand my question? – Photic
@topright No, your question just got confusing. Where did the heretofore unmentioned AJAX call come from and how does it send the fragment? – Imperial
@topright: No, they are never sent to the server. Not when using AJAX, not when using a regular GET. Please read RFC 3986 Section 3.5 and Wikipedia. Fragments in JavaScript applications are processed client-side, not server-side. – Diaphysis
Don't believe me? Try it out... echo $_SERVER['REQUEST_URI']; will give you exactly the request as seen by Apache. You'll quickly notice the fragment is missing. Also check your logs... There will be no fragment. – Diaphysis
@Imperial I think it would be better not to think of this as a fragment but as some bit of data being sent through an AJAX call. And yes, the whole question is an incredible mess. – Fleabitten
@Col But it all depends on whether Траливали is part of the fragment, or if it's posted in the AJAX request body. The former won't work, the latter should. – Imperial
@Imperial I vote for the latter, as it will make a little sense of the question :) – Fleabitten
Let's reformulate this. Of course, the fragment is not sent to the server as it is. But the fragment contains part of the url (path and query). JavaScript uses it to build the url. Ajax sends this query (taken from the fragment) to the server. It is common practice and I'm surprised that some of you don't know it. – Photic
"the whole question is incredible mess. – Col. Shrapnel" My question is (quote): "how to detect (in PHP) if query is encoded? How to decode it?" :) – Photic
@topright: See, now the question is clear, and I'm willing to bet that the problem lies in your JavaScript Fragment-To-Query code.... Can you post that bit of code? – Diaphysis
@Andrew Moore: The problem occurs with or without using Ajax. – Photic
@topright: $str = mb_convert_encoding($_GET['query'], 'UTF-8', 'Windows-1251');. Firefox encodes in cp1251 by default. urldecode is handled transparently by PHP. – Diaphysis
rawurldecode($_GET['query']);

but this should actually have been done already by php ;-)

Edit: you're stating "nothing works". What are you trying? If the text doesn't appear on screen as you want it when you echo $_GET['query'];, for example, your problem might be the encoding you are specifying for the page sent back to the browser.

Include a line

header("Content-Type: text/html; charset=utf-8");

and see if it helps.

Momentary answered 30/7, 2010 at 3:1 Comment(2)
Please show the entire script then and show us what exactly fails. – Momentary
I added some tests in the post. – Photic

How the fragment is encoded is, unfortunately, browser-dependent:

Is fragment ID (hash) encoded by applying RFC-mandated URL escaping rules?
MSIE: NO
Firefox: PARTLY
Safari: YES
Opera: NO
Chrome: NO
Android: YES

As to the question of what encoding the browser uses to encode international (read: non-ASCII) characters before converting them to %nn escape sequences, "most browsers deal with this by sending UTF-8 data by default on any text entered in the URL bar by hand, and using page encoding on all followed links." (same source).

Fulllength answered 30/7, 2010 at 3:24 Comment(2)
Not that it really matters how the fragment is encoded, as it is only processed client side. – Diaphysis
@And How so? For javascript "á" != "%C3%A1" – Fulllength

You could use UTF8::autoconvert_request() for this.

Take a look at http://code.google.com/p/php5-utf8/ for more information.

Sphagnum answered 3/6, 2011 at 19:20 Comment(0)

URLs are limited to certain ASCII characters. Characters that are not URL-friendly are supposed to be URL-encoded (the %hh encoding you see). Some browsers might automatically encode URLs that are typed into the address bar.

Coretta answered 30/7, 2010 at 3:3 Comment(3)
-1: There is no problem with passing UTF-8 in the query. Multibyte characters will simply be encoded as two or more bytes, each of them percent-encoded, which will then be decoded properly. – Diaphysis
But the browser is still encoding the url behind the scenes. The server should see a well-formed url which the webapp will be able to decode. – Coretta
The browser does not need to understand the charset to URL encode. It simply reads each byte (8 bits) and transforms it into a hexadecimal value. Any character not considered printable ASCII is encoded by the user-agent per RFC 3986. – Diaphysis

The answer is easy: the string is always encoded, as stated in the HTTP standard.
What Firefox displays does not matter.

Also, as PHP decodes the query string automatically, no decoding is required either.

Note that '%D2%F0%E0%EB%E8%E2%E0%EB%E8' is a single-byte encoding, so your page is probably served in windows-1251. At least, that is what the HTTP header tells the browser.
AJAX, on the other hand, always uses UTF-8.

So you just have to either use a single encoding (UTF-8) for your pages or distinguish AJAX calls from regular ones.
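
If the two kinds of request cannot be told apart reliably, one possible fallback is to inspect the bytes themselves. The sketch below is only a heuristic under stated assumptions: query_to_utf8() is a hypothetical helper (not a PHP built-in), it needs the mbstring extension, and it assumes that anything which is not valid UTF-8 was sent as windows-1251.

```php
<?php
/**
 * Hypothetical helper: return $value as UTF-8.
 * Assumption: input that is not valid UTF-8 arrived in $fallback encoding.
 */
function query_to_utf8(string $value, string $fallback = 'Windows-1251'): string
{
    if (mb_check_encoding($value, 'UTF-8')) {
        return $value; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($value, 'UTF-8', $fallback);
}

// What PHP would place in $_GET for the two encodings seen in this thread.
$fromAddressBar = rawurldecode('%D2%F0%E0%EB%E8%E2%E0%EB%E8');
$fromAjax = rawurldecode('%D0%A2%D1%80%D0%B0%D0%BB%D0%B8%D0%B2%D0%B0%D0%BB%D0%B8');

var_dump(query_to_utf8($fromAddressBar) === query_to_utf8($fromAjax)); // bool(true)
```

In a real script the call would look like query_to_utf8($_GET['query'] ?? ''). The check is not foolproof, since a short windows-1251 string can occasionally also be valid UTF-8, so serving every page as UTF-8 remains the more robust option.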

As for the fragment: do not use the fragment value to send it to the server. Keep the value in a JS variable and use it twice: to set the fragment and to send it to the server using JSON.

Fleabitten answered 30/7, 2010 at 3:33 Comment(0)

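RFC 1738 states that only alphanumerics, the special characters $-_.+!*'(), and the reserved characters ;/?:@=& (when used for their reserved purposes) may appear unencoded within a URL. Everything else is encoded by the HTTP client, i.e. the Web browser. PHP already decodes the query string when it fills $_GET, so calling rawurldecode() on those values again is unnecessary, and it is not always harmless: a value that contains a literal % followed by two hex digits would be altered by the second pass.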
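
One caveat about decoding twice is worth checking by hand. The sketch below uses a made-up search value, the literal text 100%41, to show what a second rawurldecode() pass does to data that has already been decoded once.

```php
<?php
// Made-up example: the user searched for the literal text "100%41".
$typed = '100%41';

$sent  = rawurlencode($typed); // what travels in the URL: "100%2541"
$once  = rawurldecode($sent);  // what PHP places in $_GET: "100%41"
$twice = rawurldecode($once);  // a second pass turns "%41" into "A"

var_dump($sent);  // string "100%2541"
var_dump($once);  // string "100%41"
var_dump($twice); // string "100A"
```

The first decode restores the original text; the second one silently changes it.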

Overtone answered 30/7, 2010 at 9:22 Comment(0)
