How can I Read and Transfer chunks of file with Hadoop WebHDFS?

I need to transfer big files (at least 14MB) from the Cosmos instance of the FIWARE Lab to my backend.

I used the Spring RestTemplate as a client interface for the Hadoop WebHDFS REST API described here, but I ran into an I/O exception:

Exception in thread "main" org.springframework.web.client.ResourceAccessException: I/O error on GET request for "http://cosmos.lab.fiware.org:14000/webhdfs/v1/user/<user.name>/<path>?op=open&user.name=<user.name>":Truncated chunk ( expected size: 14744230; actual size: 11285103); nested exception is org.apache.http.TruncatedChunkException: Truncated chunk ( expected size: 14744230; actual size: 11285103)
    at org.springframework.web.client.RestTemplate.doExecute(RestTemplate.java:580)
    at org.springframework.web.client.RestTemplate.execute(RestTemplate.java:545)
    at org.springframework.web.client.RestTemplate.exchange(RestTemplate.java:466)

This is the actual code that generates the Exception:

RestTemplate restTemplate = new RestTemplate();
restTemplate.setRequestFactory(new HttpComponentsClientHttpRequestFactory());
restTemplate.getMessageConverters().add(new ByteArrayHttpMessageConverter()); 
HttpEntity<?> entity = new HttpEntity<>(headers);

UriComponentsBuilder builder = 
    UriComponentsBuilder.fromHttpUrl(hdfs_path)
        .queryParam("op", "OPEN")
        .queryParam("user.name", user_name);

ResponseEntity<byte[]> response =
    restTemplate
        .exchange(builder.build().encode().toUri(), HttpMethod.GET, entity, byte[].class);

FileOutputStream output = new FileOutputStream(new File(local_path));
IOUtils.write(response.getBody(), output);
output.close();

I think this is due to a transfer timeout on the Cosmos instance, so I tried sending a curl request to the path with the offset, buffersize and length parameters, but they seem to be ignored: I got the whole file.
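
For reference, this is roughly the chunked read I was trying to achieve (only a sketch that reuses the restTemplate, entity, hdfs_path, user_name and local_path variables from the snippet above; the 4 MB chunk size is arbitrary, and it obviously only works if the server honours the offset and length parameters):

// Hypothetical chunked download: fetch the file in fixed-size pieces by
// passing offset/length to the WebHDFS OPEN operation.
long fileSize = 14744230L;        // total size in bytes, known in advance (e.g. from the content-length header)
int chunkSize = 4 * 1024 * 1024;  // arbitrary 4 MB per request

try (FileOutputStream output = new FileOutputStream(new File(local_path))) {
    for (long offset = 0; offset < fileSize; offset += chunkSize) {
        URI chunkUri = UriComponentsBuilder.fromHttpUrl(hdfs_path)
            .queryParam("op", "OPEN")
            .queryParam("user.name", user_name)
            .queryParam("offset", offset)
            .queryParam("length", Math.min(chunkSize, fileSize - offset))
            .build().encode().toUri();

        ResponseEntity<byte[]> chunk =
            restTemplate.exchange(chunkUri, HttpMethod.GET, entity, byte[].class);
        output.write(chunk.getBody());
    }
}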

Thanks in advance.

Hadfield answered 28/11, 2015 at 18:17 Comment(2)
Maybe looking at py webhdfs will give you some clues -> github.com/pywebhdfs/pywebhdfs/blob/master/pywebhdfs/… – Mcmasters
Thanks, but it doesn't help. The problem is that the optional length parameter of the OPEN operation (see def read_file(self, path, **kwargs) in your link) is totally ignored by the server. – Hadfield

OK, I found a solution. I don't understand why, but the transfer succeeds if I use a Jetty HttpClient instead of the RestTemplate (and therefore the Apache HttpClient). This works now:

ContentExchange exchange = new ContentExchange(true) {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();

    // Buffer the response body as it streams in.
    protected void onResponseContent(Buffer content) throws IOException {
        bos.write(content.asArray(), 0, content.length());
    }

    // Once the response is complete, write the buffered bytes to disk.
    protected void onResponseComplete() throws IOException {
        if (getResponseStatus() == HttpStatus.OK_200) {
            FileOutputStream output = new FileOutputStream(new File(<local_path>));
            IOUtils.write(bos.toByteArray(), output);
            output.close();
        }
    }
};

UriComponentsBuilder builder = UriComponentsBuilder.fromHttpUrl(<hdfs_path>)
                .queryParam("op", "OPEN")
                .queryParam("user.name", <user_name>);

exchange.setURL(builder.build().encode().toUriString());
exchange.setMethod("GET");
exchange.setRequestHeader("X-Auth-Token", <token>);

HttpClient client = new HttpClient();
client.setConnectorType(HttpClient.CONNECTOR_SELECT_CHANNEL);
client.setMaxConnectionsPerAddress(200);
client.setThreadPool(new QueuedThreadPool(250)); 
client.start();
client.send(exchange);
exchange.waitForDone();

Is there any known bug in the Apache HttpClient for chunked file transfers?

Was I doing something wrong in my RestTemplate request?
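
For comparison, a streaming variant of my original RestTemplate call (just a sketch reusing the builder, <token> and <local_path> placeholders from above; I have not verified that it avoids the TruncatedChunkException) would copy the body straight to disk instead of buffering it into a byte[]:

RestTemplate restTemplate = new RestTemplate();
restTemplate.setRequestFactory(new HttpComponentsClientHttpRequestFactory());

// Stream the response body to disk as it arrives instead of holding
// the whole file in memory as a byte[].
restTemplate.execute(
    builder.build().encode().toUri(),
    HttpMethod.GET,
    request -> request.getHeaders().add("X-Auth-Token", <token>),
    response -> {
        FileOutputStream output = new FileOutputStream(new File(<local_path>));
        IOUtils.copy(response.getBody(), output);
        output.close();
        return null;
    });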

UPDATE: I still don't have a solution

After a few tests I see that my problem is not solved. I found out that the Hadoop version installed on the Cosmos instance is quite old (Hadoop 0.20.2-cdh3u6), and I read that WebHDFS doesn't support partial file transfers with the length parameter (it was introduced in v0.23.3). These are the headers I received from the server when I sent a GET request using curl:

Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: HEAD, POST, GET, OPTIONS, DELETE
Access-Control-Allow-Headers: origin, content-type, X-Auth-Token, Tenant-ID, Authorization
server: Apache-Coyote/1.1
set-cookie: hadoop.auth="u=<user>&p=<user>&t=simple&e=1448999699735&s=rhxMPyR1teP/bIJLfjOLWvW2pIQ="; Version=1; Path=/
Content-Type: application/octet-stream; charset=utf-8
content-length: 172934567
date: Tue, 01 Dec 2015 09:54:59 GMT
connection: close

As you can see, the Connection header is set to close. Actually, the connection is usually closed whenever the GET request lasts more than 120 seconds, even if the file transfer has not been completed.

In conclusion, I can say that Cosmos is totally useless if it doesn't support large file transfers.

Please correct me if I'm wrong, or if you know a workaround.

Hadfield answered 30/11, 2015 at 8:7 Comment(0)
