Java에서 HTML 문자 엔티티를 이스케이프 해제하는 방법은 무엇입니까?

development

Java에서 HTML 문자 엔티티를 이스케이프 해제하는 방법은 무엇입니까?

big-blog 2020. 6. 30. 08:02

Java에서 HTML 문자 엔티티를 이스케이프 해제하는 방법은 무엇입니까?

기본적으로 주어진 Html 문서를 해독하고 " "-> " ", ">"-> 와 같은 모든 특수 문자를 바꿉니다 ">".

.NET에서는을 사용할 수 있습니다 HttpUtility.HtmlDecode.

Java에서 동등한 기능은 무엇입니까?

나는 이것을 위해 Apache Commons StringEscapeUtils.unescapeHtml4 () 를 사용했다.

엔티티 이스케이프가 포함 된 문자열을 이스케이프하면 이스케이프에 해당하는 실제 유니 코드 문자가 포함 된 문자열로 이스케이프됩니다. HTML 4.0 엔티티를 지원합니다.

다른 답변에서 언급 된 라이브러리는 훌륭한 솔루션이지만 프로젝트에서 실제 html을 이미 파고 들었다면 Jsoup프로젝트는 "암페어 앤 파운드 FFFF 세미콜론" 항목을 관리하는 것보다 더 많은 것을 제공 해야합니다.

// textValue: <p>This is a&nbsp;sample. \"Granny\" Smith &#8211;.<\/p>\r\n
// becomes this: This is a sample. "Granny" Smith –.
// with one line of code:
// Jsoup.parse(textValue).getText(); // for older versions of Jsoup
Jsoup.parse(textValue).text();

// Another possibility may be the static unescapeEntities method:
boolean strictMode = true;
String unescapedString = org.jsoup.parser.Parser.unescapeEntities(textValue, strictMode);

또한 최고의 DOM, CSS 및 jquery와 유사한 메소드를 사용하여 데이터를 추출하고 조작하기위한 편리한 API를 얻을 수 있습니다. 오픈 소스이며 MIT 라이센스입니다.

내 프로젝트에서 Apache Commons StringEscapeUtils.unescapeHtml3 ()을 시도했지만 성능에 만족하지 못했습니다. 결과적으로 불필요한 작업이 많이 발생합니다. 우선, 문자열에서 이스케이프 처리 할 것이없는 경우에도 모든 호출에 대해 StringWriter를 할당합니다. 해당 코드를 다르게 다시 작성했는데 이제 훨씬 빠르게 작동합니다. 구글에서 이것을 찾는 사람은 그것을 사용하는 것을 환영합니다.

다음 코드는 모든 HTML 3 기호와 숫자 이스케이프를 이스케이프 처리합니다 (Apache unescapeHtml3과 동일). HTML 4가 필요한 경우지도에 더 많은 항목을 추가 할 수 있습니다.

package com.example;

import java.io.StringWriter;
import java.util.HashMap;

public class StringUtils {

    public static final String unescapeHtml3(final String input) {
        StringWriter writer = null;
        int len = input.length();
        int i = 1;
        int st = 0;
        while (true) {
            // look for '&'
            while (i < len && input.charAt(i-1) != '&')
                i++;
            if (i >= len)
                break;

            // found '&', look for ';'
            int j = i;
            while (j < len && j < i + MAX_ESCAPE + 1 && input.charAt(j) != ';')
                j++;
            if (j == len || j < i + MIN_ESCAPE || j == i + MAX_ESCAPE + 1) {
                i++;
                continue;
            }

            // found escape 
            if (input.charAt(i) == '#') {
                // numeric escape
                int k = i + 1;
                int radix = 10;

                final char firstChar = input.charAt(k);
                if (firstChar == 'x' || firstChar == 'X') {
                    k++;
                    radix = 16;
                }

                try {
                    int entityValue = Integer.parseInt(input.substring(k, j), radix);

                    if (writer == null) 
                        writer = new StringWriter(input.length());
                    writer.append(input.substring(st, i - 1));

                    if (entityValue > 0xFFFF) {
                        final char[] chrs = Character.toChars(entityValue);
                        writer.write(chrs[0]);
                        writer.write(chrs[1]);
                    } else {
                        writer.write(entityValue);
                    }

                } catch (NumberFormatException ex) { 
                    i++;
                    continue;
                }
            }
            else {
                // named escape
                CharSequence value = lookupMap.get(input.substring(i, j));
                if (value == null) {
                    i++;
                    continue;
                }

                if (writer == null) 
                    writer = new StringWriter(input.length());
                writer.append(input.substring(st, i - 1));

                writer.append(value);
            }

            // skip escape
            st = j + 1;
            i = st;
        }

        if (writer != null) {
            writer.append(input.substring(st, len));
            return writer.toString();
        }
        return input;
    }

    private static final String[][] ESCAPES = {
        {"\"",     "quot"}, // " - double-quote
        {"&",      "amp"}, // & - ampersand
        {"<",      "lt"}, // < - less-than
        {">",      "gt"}, // > - greater-than

        // Mapping to escape ISO-8859-1 characters to their named HTML 3.x equivalents.
        {"\u00A0", "nbsp"}, // non-breaking space
        {"\u00A1", "iexcl"}, // inverted exclamation mark
        {"\u00A2", "cent"}, // cent sign
        {"\u00A3", "pound"}, // pound sign
        {"\u00A4", "curren"}, // currency sign
        {"\u00A5", "yen"}, // yen sign = yuan sign
        {"\u00A6", "brvbar"}, // broken bar = broken vertical bar
        {"\u00A7", "sect"}, // section sign
        {"\u00A8", "uml"}, // diaeresis = spacing diaeresis
        {"\u00A9", "copy"}, // © - copyright sign
        {"\u00AA", "ordf"}, // feminine ordinal indicator
        {"\u00AB", "laquo"}, // left-pointing double angle quotation mark = left pointing guillemet
        {"\u00AC", "not"}, // not sign
        {"\u00AD", "shy"}, // soft hyphen = discretionary hyphen
        {"\u00AE", "reg"}, // ® - registered trademark sign
        {"\u00AF", "macr"}, // macron = spacing macron = overline = APL overbar
        {"\u00B0", "deg"}, // degree sign
        {"\u00B1", "plusmn"}, // plus-minus sign = plus-or-minus sign
        {"\u00B2", "sup2"}, // superscript two = superscript digit two = squared
        {"\u00B3", "sup3"}, // superscript three = superscript digit three = cubed
        {"\u00B4", "acute"}, // acute accent = spacing acute
        {"\u00B5", "micro"}, // micro sign
        {"\u00B6", "para"}, // pilcrow sign = paragraph sign
        {"\u00B7", "middot"}, // middle dot = Georgian comma = Greek middle dot
        {"\u00B8", "cedil"}, // cedilla = spacing cedilla
        {"\u00B9", "sup1"}, // superscript one = superscript digit one
        {"\u00BA", "ordm"}, // masculine ordinal indicator
        {"\u00BB", "raquo"}, // right-pointing double angle quotation mark = right pointing guillemet
        {"\u00BC", "frac14"}, // vulgar fraction one quarter = fraction one quarter
        {"\u00BD", "frac12"}, // vulgar fraction one half = fraction one half
        {"\u00BE", "frac34"}, // vulgar fraction three quarters = fraction three quarters
        {"\u00BF", "iquest"}, // inverted question mark = turned question mark
        {"\u00C0", "Agrave"}, // А - uppercase A, grave accent
        {"\u00C1", "Aacute"}, // Б - uppercase A, acute accent
        {"\u00C2", "Acirc"}, // В - uppercase A, circumflex accent
        {"\u00C3", "Atilde"}, // Г - uppercase A, tilde
        {"\u00C4", "Auml"}, // Д - uppercase A, umlaut
        {"\u00C5", "Aring"}, // Е - uppercase A, ring
        {"\u00C6", "AElig"}, // Ж - uppercase AE
        {"\u00C7", "Ccedil"}, // З - uppercase C, cedilla
        {"\u00C8", "Egrave"}, // И - uppercase E, grave accent
        {"\u00C9", "Eacute"}, // Й - uppercase E, acute accent
        {"\u00CA", "Ecirc"}, // К - uppercase E, circumflex accent
        {"\u00CB", "Euml"}, // Л - uppercase E, umlaut
        {"\u00CC", "Igrave"}, // М - uppercase I, grave accent
        {"\u00CD", "Iacute"}, // Н - uppercase I, acute accent
        {"\u00CE", "Icirc"}, // О - uppercase I, circumflex accent
        {"\u00CF", "Iuml"}, // П - uppercase I, umlaut
        {"\u00D0", "ETH"}, // Р - uppercase Eth, Icelandic
        {"\u00D1", "Ntilde"}, // С - uppercase N, tilde
        {"\u00D2", "Ograve"}, // Т - uppercase O, grave accent
        {"\u00D3", "Oacute"}, // У - uppercase O, acute accent
        {"\u00D4", "Ocirc"}, // Ф - uppercase O, circumflex accent
        {"\u00D5", "Otilde"}, // Х - uppercase O, tilde
        {"\u00D6", "Ouml"}, // Ц - uppercase O, umlaut
        {"\u00D7", "times"}, // multiplication sign
        {"\u00D8", "Oslash"}, // Ш - uppercase O, slash
        {"\u00D9", "Ugrave"}, // Щ - uppercase U, grave accent
        {"\u00DA", "Uacute"}, // Ъ - uppercase U, acute accent
        {"\u00DB", "Ucirc"}, // Ы - uppercase U, circumflex accent
        {"\u00DC", "Uuml"}, // Ь - uppercase U, umlaut
        {"\u00DD", "Yacute"}, // Э - uppercase Y, acute accent
        {"\u00DE", "THORN"}, // Ю - uppercase THORN, Icelandic
        {"\u00DF", "szlig"}, // Я - lowercase sharps, German
        {"\u00E0", "agrave"}, // а - lowercase a, grave accent
        {"\u00E1", "aacute"}, // б - lowercase a, acute accent
        {"\u00E2", "acirc"}, // в - lowercase a, circumflex accent
        {"\u00E3", "atilde"}, // г - lowercase a, tilde
        {"\u00E4", "auml"}, // д - lowercase a, umlaut
        {"\u00E5", "aring"}, // е - lowercase a, ring
        {"\u00E6", "aelig"}, // ж - lowercase ae
        {"\u00E7", "ccedil"}, // з - lowercase c, cedilla
        {"\u00E8", "egrave"}, // и - lowercase e, grave accent
        {"\u00E9", "eacute"}, // й - lowercase e, acute accent
        {"\u00EA", "ecirc"}, // к - lowercase e, circumflex accent
        {"\u00EB", "euml"}, // л - lowercase e, umlaut
        {"\u00EC", "igrave"}, // м - lowercase i, grave accent
        {"\u00ED", "iacute"}, // н - lowercase i, acute accent
        {"\u00EE", "icirc"}, // о - lowercase i, circumflex accent
        {"\u00EF", "iuml"}, // п - lowercase i, umlaut
        {"\u00F0", "eth"}, // р - lowercase eth, Icelandic
        {"\u00F1", "ntilde"}, // с - lowercase n, tilde
        {"\u00F2", "ograve"}, // т - lowercase o, grave accent
        {"\u00F3", "oacute"}, // у - lowercase o, acute accent
        {"\u00F4", "ocirc"}, // ф - lowercase o, circumflex accent
        {"\u00F5", "otilde"}, // х - lowercase o, tilde
        {"\u00F6", "ouml"}, // ц - lowercase o, umlaut
        {"\u00F7", "divide"}, // division sign
        {"\u00F8", "oslash"}, // ш - lowercase o, slash
        {"\u00F9", "ugrave"}, // щ - lowercase u, grave accent
        {"\u00FA", "uacute"}, // ъ - lowercase u, acute accent
        {"\u00FB", "ucirc"}, // ы - lowercase u, circumflex accent
        {"\u00FC", "uuml"}, // ь - lowercase u, umlaut
        {"\u00FD", "yacute"}, // э - lowercase y, acute accent
        {"\u00FE", "thorn"}, // ю - lowercase thorn, Icelandic
        {"\u00FF", "yuml"}, // я - lowercase y, umlaut
    };

    private static final int MIN_ESCAPE = 2;
    private static final int MAX_ESCAPE = 6;

    private static final HashMap<String, CharSequence> lookupMap;
    static {
        lookupMap = new HashMap<String, CharSequence>();
        for (final CharSequence[] seq : ESCAPES) 
            lookupMap.put(seq[1].toString(), seq[0]);
    }

}

Java에서 HTML 이스케이프 처리를 위해 다음 라이브러리를 사용할 수도 있습니다. unbescape .

이 방법으로 HTML을 이스케이프 할 수 있습니다.

final String unescapedText = HtmlEscape.unescapeHtml(escapedText);

이것은 나를 위해 일을했다.

import org.apache.commons.lang.StringEscapeUtils;
...
String decodedXML= StringEscapeUtils.unescapeHtml(encodedXML);

또는

import org.apache.commons.lang3.StringEscapeUtils;
...
String decodedXML= StringEscapeUtils.unescapeHtml4(encodedXML);

도움이 되었기를 바랍니다 :)

외부 라이브러리가없는 매우 간단하지만 비효율적 인 솔루션은 다음과 같습니다.

public static String unescapeHtml3( String str ) {
    try {
        HTMLDocument doc = new HTMLDocument();
        new HTMLEditorKit().read( new StringReader( "<html><body>" + str ), doc, 0 );
        return doc.getText( 1, doc.getLength() );
    } catch( Exception ex ) {
        return str;
    }
}

디코딩 할 문자열 수가 적은 경우에만 사용해야합니다.

가장 신뢰할 수있는 방법은

String cleanedString = StringEscapeUtils.unescapeHtml4(originalString);

에서 org.apache.commons.lang3.StringEscapeUtils.

그리고 공백을 피하기 위해

cleanedString = cleanedString.trim();

이렇게하면 웹 양식의 복사 및 붙여 넣기로 인한 공백이 DB에 유지되지 않습니다.

필자의 경우 모든 변수의 모든 엔티티를 테스트하여 replace 메소드를 사용하면 코드는 다음과 같습니다.

text = text.replace("&Ccedil;", "Ç");
text = text.replace("&ccedil;", "ç");
text = text.replace("&Aacute;", "Á");
text = text.replace("&Acirc;", "Â");
text = text.replace("&Atilde;", "Ã");
text = text.replace("&Eacute;", "É");
text = text.replace("&Ecirc;", "Ê");
text = text.replace("&Iacute;", "Í");
text = text.replace("&Ocirc;", "Ô");
text = text.replace("&Otilde;", "Õ");
text = text.replace("&Oacute;", "Ó");
text = text.replace("&Uacute;", "Ú");
text = text.replace("&aacute;", "á");
text = text.replace("&acirc;", "â");
text = text.replace("&atilde;", "ã");
text = text.replace("&eacute;", "é");
text = text.replace("&ecirc;", "ê");
text = text.replace("&iacute;", "í");
text = text.replace("&ocirc;", "ô");
text = text.replace("&otilde;", "õ");
text = text.replace("&oacute;", "ó");
text = text.replace("&uacute;", "ú");

제 경우에는 이것이 아주 잘 작동했습니다.

Consider using the HtmlManipulator Java class. You may need to add some items (not all entities are in the list).

The Apache Commons StringEscapeUtils as suggested by Kevin Hakanson did not work 100% for me; several entities like &#145 (left single quote) were translated into '222' somehow. I also tried org.jsoup, and had the same problem.

Incase you want to mimic what php function htmlspecialchars_decode does use php function get_html_translation_table() to dump the table and then use the java code like,

static Map<String,String> html_specialchars_table = new Hashtable<String,String>();
static {
        html_specialchars_table.put("&lt;","<");
        html_specialchars_table.put("&gt;",">");
        html_specialchars_table.put("&amp;","&");
}
static String htmlspecialchars_decode_ENT_NOQUOTES(String s){
        Enumeration en = html_specialchars_table.keys();
        while(en.hasMoreElements()){
                String key = en.nextElement();
                String val = html_specialchars_table.get(key);
                s = s.replaceAll(key, val);
        }
        return s;
}

참고URL : https://stackoverflow.com/questions/994331/how-to-unescape-html-character-entities-in-java

'development' 카테고리의 다른 글

Safari의 html5 localStorage 오류 :“QUOTA_EXCEEDED_ERR : DOM 예외 22 : 할당량을 초과 한 스토리지에 무언가를 추가하려고했습니다.” (0)	2020.06.30
PHP에서 "풀에 메모리를 할당 할 수 없습니다"는 무엇입니까? (0)	2020.06.30
PHP로 RSS / Atom 피드를 파싱하는 가장 좋은 방법 (0)	2020.06.30
'T'유형의 값은로 변환 할 수 없습니다 (0)	2020.06.30
시스템에서 부스트 버전을 확인하는 방법은 무엇입니까? (0)	2020.06.30

현재글Java에서 HTML 문자 엔티티를 이스케이프 해제하는 방법은 무엇입니까?

big-blog

Java에서 HTML 문자 엔티티를 이스케이프 해제하는 방법은 무엇입니까?

Java에서 HTML 문자 엔티티를 이스케이프 해제하는 방법은 무엇입니까?

'development' 카테고리의 다른 글

'development'의 다른글

티스토리툴바

Java에서 HTML 문자 엔티티를 이스케이프 해제하는 방법은 무엇입니까?

Java에서 HTML 문자 엔티티를 이스케이프 해제하는 방법은 무엇입니까?

'development' 카테고리의 다른 글

'development'의 다른글

관련글

티스토리툴바