How to get data off of a character

I am working on a project in Unity, which uses C#. I am trying to retrieve special characters such as é, but the console just displays a blank character: "". For instance, translating "How are you?" should return "Cómo Estás?", but it returns "Cmo Ests". I put the returned string "Cmo Ests" into a character array and found that each missing letter is a non-null blank character. I am using Encoding.UTF8, and when I do:

char ch = '\u00e9';
print (ch);

it prints "é". I have tried getting the bytes of a given string using:

byte[] utf8bytes = System.Text.Encoding.UTF8.GetBytes(temp);

When translating "How are you?", this returns a byte array, but for special characters such as é I get the bytes 239, 191, 189, which is the UTF-8 encoding of the replacement character (U+FFFD).

What information do I need to retrieve from the characters in order to accurately determine which character each one is? Do I need to do something with the data that Google gives me, or is it something else? I need a general solution that I can place in my program and that will work for any input string. If anyone can help, it would be greatly appreciated.

Here is the code that is referenced:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using UnityEngine;
using System.Collections;
using System.Net;
using HtmlAgilityPack;


public class Dictionary{
string[] formatParams;
HtmlDocument doc;
string returnString;
char[] letters;
public char[] charString;
public Dictionary(){
    formatParams = new string[2];
    doc = new HtmlDocument();
    returnString = "";
}

public string Translate(String input, String languagePair, Encoding encoding)
    {
        formatParams[0]= input;
        formatParams[1]= languagePair;
        string url = String.Format("http://www.google.com/translate_t?hl=en&ie=UTF8&text={0}&langpair={1}", formatParams);

        string result = String.Empty;

        using (WebClient webClient = new WebClient())
        {
            webClient.Encoding = encoding;
            result = webClient.DownloadString(url);
        }       
        doc.LoadHtml(result);
        input = alter (input);
        string temp = doc.DocumentNode.SelectSingleNode("//span[@title='"+input+"']").InnerText;
        charString = temp.ToCharArray();
        return temp;
    }
// Use this for initialization
void Start () {

}
string alter(string inputString){
    returnString = "";
    letters = inputString.ToCharArray();
    for(int i=0; i<inputString.Length;i++){
        if(letters[i]=='\''){
            returnString = returnString + "&#39;";  
        }else{
            returnString = returnString + letters[i];   
        }
    }
    return returnString;
}
}
Neume answered 9/11, 2012 at 15:35 Comment(11)
You should include the code that's generating the response. – Fragonard
I don't see what the problem is, honestly. What I see in your question is you getting exactly what you're asking for. If you ask for UTF8 bytes, you're going to get UTF8 bytes. 239, 191, 189 are the UTF8 encoding for your single unicode character. If you need to translate from utf8 to unicode, do that: #11294494 – Elias
What does your print() method do? If you're trying to treat your UTF8 bytes as characters, you'll have problems. UTF8 characters can be more than 1 byte long. – Finance
@Elias The problem is, 239 191 189 is a generic missing character code, so an é and an ó will have the same code. I need to know how to distinguish between the two. – Neume
@Neil print() is the same thing as Console.WriteLine() or System.out.println(). – Neume
I have edited your title. Please see, "Should questions include “tags” in their titles?", where the consensus is "no, they should not". – Altruist
Can you give us some examples of what you pass in to your Translate method? – Neuburger
There are several issues with your approach. First of all, UTF-8 is a multibyte encoding. This means that any non-ASCII character (char code > 127) is encoded as a sequence of two or more bytes. So your sequence 239, 191, 189 represents a single character, not three; specifically it is U+FFFD, the replacement character, which means the original letter was already lost when the response was decoded. If you use UTF-16, every character in the Basic Multilingual Plane (including é and ó) is a single 2-byte unit that maps to an unsigned short (0-65535); only characters outside that plane need two units. – Andradite
The char type in C# is a two-byte type, so it is effectively an unsigned short holding one UTF-16 code unit. This contrasts with languages such as C/C++, where the char type is a 1-byte type. – Andradite
The unity tag is for Microsoft Unity. Please don't misuse it. – Phoney
This is NOT Microsoft Unity. I am using a third-party, 3D development software Unity. – Neume

Maybe you should use another API/URL. The function below uses a different URL that returns JSON data and seems to work better:

    // Requires a reference to the System.Web.Extensions assembly.
    using System;
    using System.Collections.Generic;
    using System.Net;
    using System.Web.Script.Serialization;

    public static string Translate(string input, string fromLanguage, string toLanguage)
    {
        using (WebClient webClient = new WebClient())
        {
            // EscapeDataString also escapes '&' and '?', so the input cannot break the query string
            string url = string.Format("http://translate.google.com/translate_a/t?client=j&text={0}&sl={1}&tl={2}", Uri.EscapeDataString(input), fromLanguage, toLanguage);
            string result = webClient.DownloadString(url);

            // I used JavaScriptSerializer but another JSON parser would work
            JavaScriptSerializer serializer = new JavaScriptSerializer();
            Dictionary<string, object> dic = (Dictionary<string, object>)serializer.DeserializeObject(result);
            Dictionary<string, object> sentences = (Dictionary<string, object>)((object[])dic["sentences"])[0];
            return (string)sentences["trans"];
        }
    }

If I run this in a Console App:

    Console.WriteLine(Translate("How are you?", "en", "es"));

It will display

¿Cómo estás?
Meperidine answered 20/11, 2012 at 18:24 Comment(3)
When trying to put this into the program, it says that it is missing the namespace. I tried "using System.web;" but it still says that the namespace is missing. What namespace do I have to use to get this to work? – Neume
You need to add an assembly reference to System.Web.Extensions – Meperidine
@CameronBarge I've made some edits to Simon's post (they are being peer-reviewed), but in general you need to include the System.Web.Extensions assembly (i.e. in "references") and have "usings" for System.Net and System.Web.Script.Serialization. – Almazan

You actually pretty much have it. Just insert the coded letter with a \u and it works.

string mystr = "C\u00f3mo Est\u00e1s?";
Keynesianism answered 9/11, 2012 at 16:8 Comment(1)
Thank you, but this is for a single case. I need to have a general solution. – Neume

I don't know much about the GoogleTranslate API, but my first thought is that you've got a Unicode Normalization problem.

Have a look at System.String.Normalize() and its friends.

Unicode is very complicated, so I'll oversimplify! Many symbols can be represented in more than one way in Unicode. For example, 'é' could be represented as 'é' (one character), as an 'e' followed by a combining accent character (two characters), or, depending on what comes back from the API, as something else altogether.

The Normalize function converts your string to one with the same textual meaning but potentially a different binary value, which may fix your output problem.
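
As a minimal sketch of what normalization does (the class and helper names here are made up for illustration), the following composes 'e' plus a combining acute accent into the single character U+00E9 using String.Normalize with NormalizationForm.FormC:

```csharp
using System;
using System.Text;

public static class NormalizeDemo
{
    // Returns the UTF-16 code units of a string as "U+XXXX" tokens.
    public static string CodeUnits(string s)
    {
        var sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.Append("U+").Append(((int)c).ToString("X4"));
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // 'e' followed by U+0301 (combining acute accent): two chars, one visible symbol.
        string decomposed = "e\u0301";
        string composed = decomposed.Normalize(NormalizationForm.FormC);

        Console.WriteLine(CodeUnits(decomposed)); // U+0065 U+0301
        Console.WriteLine(CodeUnits(composed));   // U+00E9
        Console.WriteLine(composed == "\u00e9");  // True
    }
}
```

Note that normalization only helps if the accent is still present in some form. If the string already contains U+FFFD (bytes 239, 191, 189), the original letter is gone and Normalize cannot bring it back.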

Finance answered 20/11, 2012 at 15:9 Comment(0)

I had the same problem while working on one of my projects [Language Resource Localization Translation].

I was doing the same thing, using System.Text.Encoding.UTF8.GetBytes(), and because of the UTF-8 encoding I was receiving special characters such as your 239, 191, 189 in the result string.

Please take a look at my solution; I hope it helps.

Don't set an encoding at all. Google Translate will return characters such as á correctly in the string itself. Do some string manipulation and read the string as it is.

Generic solution [works for every language that Google Translate supports]

try
{
    //Don't use UtF Encoding 
    // use default webclient encoding

    var url = String.Format("http://www.google.com/translate_t?hl=en&text={0}&langpair={1}", "►" + txtNewResourceValue.Text.Trim() + "◄", "en|" + item.Text.Substring(0, 2));                    

     var webClient = new WebClient();
     string result = webClient.DownloadString(url); //get all data from google translate in UTF8 coding..

      int start = result.IndexOf("id=result_box");
      int end = result.IndexOf("id=spell-place-holder");
      int length = end - start;
      result = result.Substring(start, length);
      result = reverseString(result);

      start = result.IndexOf(";8669#&");//◄
      end = result.IndexOf(";8569#&");  //►
      length = end - start;

      result = result.Substring(start +7 , length - 8);
      objDic2.Text =  reverseString(result);

       //hard code substring; finding the correct translation within the string.
        dictList.Add(objDic2);
}
catch (Exception ex)
 {
  lblMessages.InnerHtml = "<strong>Google translate exception occured no resource   saved..." + ex.Message + "</strong>";
                error = true;
}

public static string reverseString(string s)
{
    char[] arr = s.ToCharArray();
    Array.Reverse(arr);
    return new string(arr);

}

As you can see from the code, no encoding is set, and I send two special marker characters, "►" + txtNewResourceValue.Text.Trim() + "◄", to mark the start and end of the translation returned by Google.

I have also checked through my language utility tool: I get "Cómo Estás?" when sending "How are you" to Google Translate... :)
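
One possible explanation for why dropping the UTF-8 setting helps is that the response body is not actually UTF-8. As a hedged sketch (the class and method names are hypothetical, and the sample bytes and header are invented for illustration), the raw bytes can be decoded with whatever charset the Content-Type header declares instead of relying on a default:

```csharp
using System;
using System.Net.Mime;
using System.Text;

public static class CharsetDemo
{
    // Decodes a response body using the charset named in a Content-Type
    // header value, falling back to UTF-8 when none is declared.
    public static string Decode(byte[] body, string contentTypeHeader)
    {
        Encoding encoding = Encoding.UTF8;
        if (!string.IsNullOrEmpty(contentTypeHeader))
        {
            string charset = new ContentType(contentTypeHeader).CharSet;
            if (!string.IsNullOrEmpty(charset))
                encoding = Encoding.GetEncoding(charset);
        }
        return encoding.GetString(body);
    }

    public static void Main()
    {
        // Hypothetical response: "Cómo" with 'ó' sent as the single Latin-1 byte 0xF3.
        byte[] body = { (byte)'C', 0xF3, (byte)'m', (byte)'o' };
        Console.WriteLine(Decode(body, "text/html; charset=ISO-8859-1") == "C\u00f3mo"); // True
    }
}
```

With WebClient, the bytes would come from DownloadData(url) and the header from ResponseHeaders[HttpResponseHeader.ContentType], read after the download completes.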

Best regards [Shaz]

---------------------------Edited-------------------------

public string Translate(String input, String languagePair) {

    try
    {


        //Don't use UtF Encoding 
        // use default webclient encoding
        //input        [string to translate]
        //Languagepair [eg|es]

        var url = String.Format("http://www.google.com/translate_t?hl=en&text={0}&langpair={1}", "►" + input.Trim() + "◄", languagePair);

        var webClient = new WebClient();
        string result = webClient.DownloadString(url); //get all data from google translate 

        int start = result.IndexOf("id=result_box");
        int end = result.IndexOf("id=spell-place-holder");
        int length = end - start;
        result = result.Substring(start, length);
        result = reverseString(result);

        start = result.IndexOf(";8669#&");//◄
        end = result.IndexOf(";8569#&");  //►
        length = end - start;

        result = result.Substring(start + 7, length - 8);

        //return translated string
        return reverseString(result); 


    }
    catch (Exception ex)
    {
        return "Google translate exception occurred no resource   saved..." + ex.Message;

    }
}
Selfcongratulation answered 22/11, 2012 at 11:42 Comment(2)
Thank you for your reply. Could you provide me a little more insight as to where this should go in my code, e.g. method name and parameters? Any help would be appreciated. – Neume
@Cameron please have a look at the newly edited code, which should work for you. If you have any questions, please let me know. – Selfcongratulation

There are several issues with your approach. First of all, UTF-8 is a multibyte encoding. This means that any non-ASCII character (char code > 127) is encoded as a sequence of two or more bytes. So your sequence 239, 191, 189 represents a single character, not three; specifically it is U+FFFD, the replacement character, which means the original letter was already lost when the response was decoded. If you use UTF-16, every character in the Basic Multilingual Plane (including é and ó) is a single 2-byte unit that maps to an unsigned short (0-65535); only characters outside that plane need two units.

The char type in C# is a two-byte type, so it is effectively an unsigned short holding one UTF-16 code unit. This contrasts with languages such as C/C++, where the char type is a 1-byte type.
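
To make the byte values concrete, here is a small sketch (the class and helper names are made up). It assumes, as one plausible cause, that the server sent 'é' as a single Latin-1 byte and the client decoded it as UTF-8. That reproduces the 239, 191, 189 sequence from the question:

```csharp
using System;
using System.Text;

public static class ReplacementCharDemo
{
    // Formats bytes as decimal values separated by commas.
    public static string Dump(byte[] bytes)
    {
        return string.Join(", ", Array.ConvertAll(bytes, b => b.ToString()));
    }

    public static void Main()
    {
        // 'é' (U+00E9) is one char in C#, but two bytes in UTF-8.
        Console.WriteLine(Dump(Encoding.UTF8.GetBytes("\u00e9")));  // 195, 169

        // Assumption: the server sent 'é' as the single Latin-1 byte 0xE9.
        byte[] latin1 = Encoding.GetEncoding("ISO-8859-1").GetBytes("\u00e9");

        // 0xE9 alone is not valid UTF-8, so the decoder substitutes U+FFFD.
        string decoded = Encoding.UTF8.GetString(latin1);
        Console.WriteLine(decoded == "\uFFFD");                     // True
        Console.WriteLine(Dump(Encoding.UTF8.GetBytes(decoded)));   // 239, 191, 189
    }
}
```

Once the string holds U+FFFD, every lost letter looks the same, which is why é and ó cannot be told apart afterwards. The fix has to happen at the decoding step.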

So in your case, unless you really need byte[] arrays, you should use char[] arrays. Or, if you want to encode the characters so that they can be used in HTML, you can iterate through the characters, check whether the character code is > 127, and replace it with its hexadecimal numeric character reference (&#xHHHH;).
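
The HTML approach could be sketched roughly as follows. The class and method names are hypothetical, and the code assumes the input string was decoded correctly in the first place:

```csharp
using System;
using System.Text;

public static class HtmlEscapeDemo
{
    // Replaces every non-ASCII code point with a hexadecimal numeric
    // character reference such as &#xE9; and leaves ASCII untouched.
    public static string ToNumericReferences(string input)
    {
        var sb = new StringBuilder();
        for (int i = 0; i < input.Length; i++)
        {
            // ConvertToUtf32 combines a surrogate pair into one code point
            // (it throws on an unpaired surrogate).
            int codePoint = char.ConvertToUtf32(input, i);
            if (char.IsHighSurrogate(input[i])) i++; // skip the low surrogate

            if (codePoint > 127)
                sb.Append("&#x").Append(codePoint.ToString("X")).Append(';');
            else
                sb.Append((char)codePoint);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        Console.WriteLine(ToNumericReferences("C\u00f3mo est\u00e1s?")); // C&#xF3;mo est&#xE1;s?
    }
}
```

This does not escape HTML-significant ASCII characters such as & or <; it only covers the non-ASCII case described above.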

Andradite answered 26/11, 2012 at 16:38 Comment(0)
