Strange Jackson Illegal character ((CTRL-CHAR, code 0)) Exception in Map Reduce Combiner

I have a Map-Reduce job with a mapper which takes a record and converts it into an object, an instance of MyObject, which is marshalled to JSON using Jackson. The value is just another Text field in the record.

The relevant piece of the mapper is something like the following:

ObjectMapper mapper = new ObjectMapper();
MyObject val = new MyObject();
val.setA(stringA);
val.setB(stringB);
Writer strWriter = new StringWriter();
mapper.writeValue(strWriter, val);
key.set(strWriter.toString());

The outputs of the mapper are sent to a Combiner which unmarshalls the JSON object and aggregates key-value pairs. It is conceptually very simple and is something like:

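public void reduce(Text key, Iterable<IntWritable> values, Context cxt)
    throws IOException, InterruptedException {
    int count = 0;
    // Unmarshal the JSON key back into a MyObject
    MyObject x = _mapper.readValue(key.toString(), MyObject.class);
    // Count the values for this key ("int" is a reserved word, so the loop variable is "v")
    for (IntWritable v : values) ++count;
    ...
    cxt.write(key, value); // emit (key, value); value is built in the elided code above
}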

The MyObject class consists of two fields (both strings), get/set methods and a default constructor. One of the fields stores snippets of text based on a web crawl, but is always a string.

public class MyObject {
  private String A;
  private String B;

  public MyObject() {}

  public String getA() {
    return A;
  }
  public void setA(String A) {
    this.A = A;
  }
  public String getB() {
    return B;
  } 
  public void setB(String B) {
    this.B = B;
  }
}

My MapReduce job appears to be running fine until it reaches certain records, which I cannot easily access (because the mapper is generating the records from a crawl), and the following exception is being thrown:

Error: com.fasterxml.jackson.core.JsonParseException: Illegal character ((CTRL-CHAR, code 0)): only regular white space (\r, \n, \t) is allowed between tokens
 at [Source: java.io.StringReader@5ae2bee7; line: 1, column: 3]

Would anyone have any suggestions about the cause of this?

See answered 18/7, 2014 at 19:42 Comment(4)
Use okhttp 1.5.1. Hope it will solve your issue. - Mediatize
I realize you said you don't have easy access, but I suggest front-ending the crawl, removing spurious control characters like 0 (NULL) from the stream, and then passing it to Jackson. I have seen financial feeds for various securities carry spurious data like this that always needs to be culled. It is most likely a defect on the sending side. - Beaulahbeaulieu
At a low level, something is injecting null bytes (byte 0) into the stream, and the parser does not accept them (they are invalid in JSON). You need to figure out how and why this happens; it could be many things, including concurrency issues or timing (trying to parse content before it is loaded into the buffer). - Anselme
If possible, add a few log lines so that you can see exactly which records are failing. Also, since you are crawling data, you might be having the same problem as here (GZip encoding): #8092024 - Heavy
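The logging suggested in the comments above could be sketched roughly as follows. This is a minimal, JDK-only illustration rather than the asker's code: the class ControlCharProbe and the method firstIllegalControlChar are hypothetical names, and the idea is simply to scan the JSON text (for example key.toString() in the combiner) before handing it to Jackson, so that the offending record and character position can be logged.

```java
final class ControlCharProbe {

    /**
     * Returns the index of the first raw control character in the text,
     * or -1 if there is none. Tab, line feed and carriage return are
     * skipped because JSON allows them as whitespace between tokens.
     */
    static int firstIllegalControlChar(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical sample: a NUL character right after the opening brace.
        String suspect = "{\0\"a\":1}";
        int idx = firstIllegalControlChar(suspect);
        if (idx >= 0) {
            // Log the code point and position so the failing record can be identified.
            System.err.printf("control char U+%04X at index %d%n",
                    (int) suspect.charAt(idx), idx);
        }
    }
}
```

In a combiner, a check like this could run just before readValue, logging or skipping the key when the returned index is not -1.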
E
2
  • You can use StringEscapeUtils from the Apache Commons library (for example, its escapeJson method) to escape the string.
  • Or you can selectively replace the control characters in the string before JSON marshalling.

You can also refer to this post: Illegal character - CTRL-CHAR
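The second option above, selectively removing control characters before marshalling, might look like the sketch below. It uses only the JDK; ControlCharStripper and stripControlChars are hypothetical names, and which characters to keep (here tab, line feed and carriage return) is an assumption that may need adjusting for the crawl data.

```java
final class ControlCharStripper {

    /**
     * Returns a copy of the input with control characters below U+0020
     * removed, keeping tab, line feed and carriage return.
     * A null input is returned unchanged.
     */
    static String stripControlChars(String s) {
        if (s == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean keep = c >= 0x20 || c == '\t' || c == '\n' || c == '\r';
            if (keep) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Hypothetical crawl snippet containing NUL and BEL characters.
        String dirty = "some\0 crawled\7 text";
        System.out.println(stripControlChars(dirty));
    }
}
```

In the mapper from the question, this would be applied to the field values before they are set, for instance val.setA(ControlCharStripper.stripControlChars(stringA)).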

Eucharis answered 2/3, 2017 at 10:12 Comment(0)

I was getting a UTF-16 encoded response, and after each simple character in the byte[] there was a zero byte, which the parser reports as ((CTRL-CHAR, code 0)).

Decoding the bytes with the correct charset worked for me:

StringUtils.toEncodedString(responseBodyAsByteArray, StandardCharsets.UTF_16LE)
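
For readers without Commons Lang on the classpath, the same idea can be sketched with the JDK's String constructor, which as far as I can tell is what toEncodedString delegates to. The class and method names below are hypothetical; the second method deliberately decodes with a single-byte charset to show where the code 0 characters come from.

```java
import java.nio.charset.StandardCharsets;

final class Utf16DecodeDemo {

    /** Decodes the response body as UTF-16LE, the encoding it was actually sent in. */
    static String decodeUtf16Le(byte[] body) {
        return new String(body, StandardCharsets.UTF_16LE);
    }

    /**
     * Decodes the same bytes with a single-byte charset, for illustration only.
     * Every high-order zero byte of an ASCII character becomes a NUL character.
     */
    static String decodeAsLatin1(byte[] body) {
        return new String(body, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Hypothetical body: the JSON text "{}" encoded as UTF-16LE (bytes 7B 00 7D 00).
        byte[] body = "{}".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(decodeUtf16Le(body).length());  // two characters
        System.out.println(decodeAsLatin1(body).length()); // four characters, two of them NUL
    }
}
```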
Domeniga answered 28/10, 2021 at 11:43 Comment(0)
