handling filename* parameters with spaces via RFC 5987 results in '+' in filenames

I am dealing with some legacy code (so no, I can't just use a URL with an encoded filename component) that allows a user to download a file from our website. Since our filenames are often in many different languages, they are all stored as UTF-8. I wrote some code to handle the RFC 5987 conversion to a proper filename* parameter. This works great until I have a filename with both non-ASCII characters and spaces. Per the RFC, the space character is not part of attr-char, so it gets encoded as %20. I have recent versions of Chrome as well as Firefox, and they all convert %20 to + on download. I have tried leaving the space unencoded and putting the encoded filename in quotes, with the same result. I have sniffed the response coming from the server to verify that the servlet container wasn't mucking with my headers, and they look correct to me. The RFC even has examples that contain %20. Am I missing something, or do all of these browsers have a bug related to this?

Many thanks in advance. The code I use to encode the filename is below.

Peter

public static boolean bcsrch(final char[] chars, final char c) {
    final int len = chars.length;
    int base = 0;
    int last = len - 1; /* Last element in table */
    int p;

    while (last >= base) {
        p = base + ((last - base) >> 1);

        if (c == chars[p])
            return true; /* Key found */
        else if (c < chars[p])
            last = p - 1;
        else
            base = p + 1;
    }

    return false; /* Key not found */
}

public static String rfc5987_encode(final String s) {
    final int len = s.length();
    final StringBuilder sb = new StringBuilder(len << 1);
    final char[] digits = {'0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F'};
    final char[] attr_char = {'!','#','$','&','\'','+','-','.','0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','^','_','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','|', '~'};
    for (int i = 0; i < len; ++i) {
        final char c = s.charAt(i);
        if (bcsrch(attr_char, c))
            sb.append(c);
        else {
            final char[] encoded = {'%', 0, 0};
            encoded[1] = digits[0x0f & (c >>> 4)];
            encoded[2] = digits[c & 0x0f];
            sb.append(encoded);
        }
    }

    return sb.toString();
}

Update

Here is a screen shot of the download dialog I get for a file with Chinese characters with spaces as mentioned in my comment.

[Image: screen cap of download dialog]

Buttonwood answered 2/7, 2012 at 23:8 Comment(8)
Here is a sample header that is causing this issue: Content-Disposition:attachment; filename*=UTF-8''Museum%20%5A%69%86.jpg - Buttonwood
See greenbytes.de/tech/tc2231/#attwithquotedsemicolon - that test case has a space character in a quoted-string and appears to work in Firefox. Are we testing different things? - Purser
That looks like something else. That test checks for a semicolon within a quoted string. My problem is that I have a filename with Chinese characters as well as spaces, so I am using the filename* form, and in token form rather than quoted form, since I read some docs that recommended not using quotes with % escapes. With the example from my comment above, the Chinese characters are recognized and converted properly, but the %20 is getting mapped to +. - Buttonwood
Peter: I was just trying to understand what the test case does on your system; could you please check? Also, your code looks incorrect in that it doesn't actually convert into UTF-8 first; your example above simply has three (encoded) US-ASCII characters after the whitespace. - Purser
Julian: Ugh. I remember reading that UTF-8 conversion was needed first and forgot to add it. That is most likely the cause. Recoding now to test. - Buttonwood
Yup, that was it. I spaced on the UTF-16 to UTF-8 conversion. Sorry for wasting your time, Julian. I will clean up my rewritten code and post an update with notes and solution tomorrow. Hopefully someone out there will benefit from my mistake. Thanks again! - Buttonwood
No, the problem was that you treated characters as bytes instead of converting them to bytes first. - Oaken
Balus: Correct. In my mind the UTF-8 conversion implies conversion to bytes. Also, merely converting the raw chars to bytes would not be correct. - Buttonwood

So, as Julian pointed out in the comments, I made a rookie Java error and forgot the character-to-byte conversion: I percent-encoded the low byte of each char's code unit instead of the bytes of the character's UTF-8 representation, so the encoding was completely incorrect. RFC 5987 clearly states this conversion as a requirement. Once the encoding is correct, the browser recognizes the filename* parameter properly and the filename used for the download is correct.
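To make the difference concrete, here is a minimal sketch (not the code from the question) that percent-encodes the three Chinese characters from the sample filename both ways. The class and method names are made up for illustration, and for simplicity both methods escape every character rather than passing attr-char through.

```java
import java.nio.charset.StandardCharsets;

class CharVsByteDemo {
    /** The mistake: escapes only the low 8 bits of each UTF-16 code unit. */
    public static String encodeLowByteOfChars(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            sb.append(String.format("%%%02X", s.charAt(i) & 0xFF));
        }
        return sb.toString();
    }

    /** The fix: escapes every byte of the UTF-8 representation. */
    public static String encodeUtf8Bytes(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            sb.append(String.format("%%%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String museum = "\u535A\u7269\u9986"; // the three Chinese characters in the filename
        System.out.println(encodeLowByteOfChars(museum)); // %5A%69%86
        System.out.println(encodeUtf8Bytes(museum));      // %E5%8D%9A%E7%89%A9%E9%A6%86
    }
}
```

The first output matches the broken header from my comment on the question, which is why the browser saw three ASCII characters instead of UTF-8 sequences.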

Below is the corrected escaping code, which operates on the UTF-8 bytes of the string. The filename that was giving me trouble, now properly encoded, looks like this:

Content-Disposition:attachment; filename*=UTF-8''Museum%20%E5%8D%9A%E7%89%A9%E9%A6%86.jpg

import java.io.UnsupportedEncodingException;
import java.util.Arrays;

public static String rfc5987_encode(final String s) throws UnsupportedEncodingException {
    // Work on the UTF-8 bytes of the string, not on its UTF-16 chars.
    final byte[] s_bytes = s.getBytes("UTF-8");
    final int len = s_bytes.length;
    final StringBuilder sb = new StringBuilder(len << 1);
    final char[] digits = {'0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F'};
    // attr-char from RFC 5987; the array must stay sorted for Arrays.binarySearch.
    final byte[] attr_char = {'!','#','$','&','+','-','.','0','1','2','3','4','5','6','7','8','9','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','^','_','`','a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','|','~'};
    for (int i = 0; i < len; ++i) {
        final byte b = s_bytes[i];
        if (Arrays.binarySearch(attr_char, b) >= 0)
            sb.append((char) b); // allowed as-is
        else {
            // Everything else becomes %XX with uppercase hex digits.
            sb.append('%');
            sb.append(digits[0x0f & (b >>> 4)]);
            sb.append(digits[b & 0x0f]);
        }
    }

    return sb.toString();
}
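As a usage note, here is a rough sketch of building the whole header value. RFC 6266 suggests sending a plain filename alongside filename* so that user agents without filename* support still get something usable. The class name, the method names, and the underscore-substitution fallback are all assumptions made for illustration; they are not part of the code above or of any library.

```java
import java.nio.charset.StandardCharsets;

class ContentDispositionSketch {
    // Non-alphanumeric members of attr-char (RFC 5987).
    private static final String ATTR_CHAR_SYMBOLS = "!#$&+-.^_`|~";

    static boolean isAttrChar(int b) {
        return (b >= 'a' && b <= 'z') || (b >= 'A' && b <= 'Z')
                || (b >= '0' && b <= '9') || ATTR_CHAR_SYMBOLS.indexOf(b) >= 0;
    }

    /** Builds the ext-value: charset, empty language tag, percent-encoded UTF-8 bytes. */
    public static String encodeExtValue(String s) {
        StringBuilder sb = new StringBuilder("UTF-8''");
        for (byte raw : s.getBytes(StandardCharsets.UTF_8)) {
            int b = raw & 0xFF; // treat the byte as unsigned
            if (isAttrChar(b)) {
                sb.append((char) b);
            } else {
                sb.append('%')
                  .append(Character.toUpperCase(Character.forDigit(b >> 4, 16)))
                  .append(Character.toUpperCase(Character.forDigit(b & 0x0F, 16)));
            }
        }
        return sb.toString();
    }

    /** Hypothetical fallback: anything outside printable ASCII, plus '"' and '\', becomes '_'. */
    public static String asciiFallback(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean safe = c >= 0x20 && c < 0x7F && c != '"' && c != '\\';
            sb.append(safe ? c : '_');
        }
        return sb.toString();
    }

    public static String attachmentHeaderValue(String filename) {
        return "attachment; filename=\"" + asciiFallback(filename)
                + "\"; filename*=" + encodeExtValue(filename);
    }
}
```

With the filename from the question, attachmentHeaderValue should produce a quoted ASCII filename with the three Chinese characters replaced by underscores, followed by the same filename* value shown above.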
Buttonwood answered 3/7, 2012 at 9:14 Comment(2)
Would like to point out that character ' isn't in the allowed list of rfc5987. - Slipperwort
.. but character ` (grave accent symbol, backtick) is. Have edited the answer. - Slipperwort

2022 update

This answer adds to an answer from 10 years ago by pointing to an Apache library that has methods to encode and decode strings according to RFC 5987.

An RFC5987 encoder and decoder are available in the class org.apache.cxf.attachment.Rfc5987Util.

I was able to import the jar into my Maven project by adding the dependency:

<dependency>
   <groupId>org.apache.cxf</groupId>
   <artifactId>cxf-core</artifactId>
   <version>3.5.2</version>
</dependency>

(check for the latest version at https://jar-download.com/artifacts/org.apache.cxf/cxf-core)

Test example

@Test
public void verifyRfc5987EncodingandDecoding() throws UnsupportedEncodingException {
   final String s = "!\"$£%^&*()_-+={[}]:@~;'#,./<>?\\|✓éèæðŃœ";

   assertThat(Rfc5987Util.decode(
         Rfc5987Util.encode(s, "UTF-8"),
               "UTF-8"),
         equalTo(s));
}
Untitled answered 28/4, 2022 at 16:26 Comment(0)

In addition to @matt-wallis's answer: if you're already using org.springframework:spring-web in your project, you might want to use the ContentDisposition builder:

String contentDispositionHeaderValue = ContentDisposition.attachment()
    .filename(someFilename, StandardCharsets.UTF_8)
    .build()
    .toString();
response.addHeader("Content-Disposition", contentDispositionHeaderValue);

See https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/http/ContentDisposition.html

Cassaundracassava answered 15/3, 2023 at 9:36 Comment(0)
