
What's the difference between UTF-8 and UTF-8 without BOM?

big-blog 2020. 9. 29. 08:02


What's different between UTF-8 and UTF-8 without a BOM? Which is better?


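The UTF-8 BOM is a sequence of bytes (EF BB BF) at the start of a text stream that allows the reader to more reliably guess that a file is encoded in UTF-8.

Normally, the BOM is used to signal the endianness of an encoding, but since endianness is irrelevant to UTF-8, the BOM is unnecessary.

According to the Unicode standard, the BOM is not recommended for UTF-8 files:

2.6 Encoding Schemes

... Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature. See the "Byte Order Mark" subsection in Section 16.8, Specials, for more information.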


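The other excellent answers have already established that:

  • There is no official difference between UTF-8 and UTF-8 with a BOM.
  • A UTF-8 string with a BOM starts with the following three bytes: EF BB BF
  • Those bytes, if present, must be ignored when extracting the string from the file/stream.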
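
As a minimal sketch of the last point, Python's standard `utf-8-sig` codec skips a leading EF BB BF when decoding and otherwise behaves like plain `utf-8`. The helper name `decode_utf8` is only illustrative.

```python
import codecs


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8 bytes, dropping a leading BOM if one is present."""
    # "utf-8-sig" ignores an initial EF BB BF; input without a BOM is
    # decoded exactly as with the plain "utf-8" codec.
    return data.decode("utf-8-sig")


with_bom = codecs.BOM_UTF8 + "ABC".encode("utf-8")
without_bom = "ABC".encode("utf-8")

print(with_bom[:3].hex(" "))     # ef bb bf
print(decode_utf8(with_bom))     # ABC
print(decode_utf8(without_bom))  # ABC
```

Decoding the same bytes with plain `utf-8` would instead keep U+FEFF as the first character of the string.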

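As additional information, though: the BOM for UTF-8 can be a good way to "smell" whether a string was encoded in UTF-8. But the same bytes could also be a legitimate string in some other encoding.

For example, the data [EF BB BF 41 42 43] could be either:

  • The legitimate ISO-8859-1 string "ï»¿ABC"
  • The legitimate UTF-8 string "ABC"

So while it can be cool to recognize the encoding of a file's content by looking at the first bytes, you should not rely on it, as the example above shows.

Encodings should be known, not divined.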
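
The ambiguity in the example above can be reproduced in a few lines of Python (a small illustration, not a detection routine): the same six bytes decode without error under both encodings.

```python
data = bytes.fromhex("EF BB BF 41 42 43")

# Every byte value is a valid ISO-8859-1 character, so this never fails.
as_latin1 = data.decode("iso-8859-1")

# Read as UTF-8 with a signature, the first three bytes are dropped.
as_utf8 = data.decode("utf-8-sig")

print(repr(as_latin1))  # 'ï»¿ABC'
print(repr(as_utf8))    # 'ABC'
```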


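There are at least three problems with putting a BOM in UTF-8 encoded files.

  1. Files that hold no text are no longer empty, because they always contain the BOM.
  2. Files that hold text within the ASCII subset of UTF-8 are no longer ASCII themselves, because the BOM is not ASCII. This breaks some existing tools, and it may be impossible for users to replace such legacy tools.
  3. Several files can no longer be simply concatenated, because each file now has a BOM at the beginning.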
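
The three problems can be observed directly. The sketch below uses a hypothetical helper, `save_with_bom`, that mimics an editor which always writes a UTF-8 BOM; byte concatenation stands in for something like `cat part1 part2`.

```python
import codecs


def save_with_bom(text: str) -> bytes:
    # Hypothetical helper: mimics an editor that always writes a UTF-8 BOM.
    return codecs.BOM_UTF8 + text.encode("utf-8")


empty = save_with_bom("")
ascii_only = save_with_bom("abc")
part1 = save_with_bom("hello ")
part2 = save_with_bom("world")

# Only the first BOM is stripped; the second one stays inside the text.
joined = (part1 + part2).decode("utf-8-sig")

print(len(empty))            # 3
print(ascii_only.isascii())  # False
print(repr(joined))          # 'hello \ufeffworld'
```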

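And, as others have mentioned, having a BOM is neither sufficient nor necessary to detect that something is UTF-8:

  • It is not sufficient, because an arbitrary byte sequence can happen to start with the exact sequence that constitutes the BOM.
  • It is not necessary, because you can simply read the bytes as if they were UTF-8; if that succeeds, it is, by definition, valid UTF-8.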
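
Both points can be sketched with a plain validity check. The function name `is_valid_utf8` is illustrative; it relies only on the standard strict UTF-8 decoder.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Not necessary: valid UTF-8 is recognised without any BOM.
print(is_valid_utf8("naïve".encode("utf-8")))    # True
# Legacy 8-bit text with non-ASCII characters usually fails the check.
print(is_valid_utf8("naïve".encode("latin-1")))  # False
# Not sufficient: these bytes start with EF BB BF but are not UTF-8.
print(is_valid_utf8(b"\xef\xbb\xbf\xff"))        # False
```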

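This is an old question with many good answers, but one thing should be added.

All the answers are very general. What I'd like to add are examples of BOM usage that actually cause real problems, yet many people don't know about them.

BOM breaks scripts

Shell scripts, Perl scripts, Python scripts, Ruby scripts, Node.js scripts, or any other executable that needs to be run by an interpreter all start with a shebang line that looks like one of these: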

#!/bin/sh
#!/usr/bin/python
#!/usr/local/bin/perl
#!/usr/bin/env node

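It tells the system which interpreter to run when such a script is invoked. If the script is encoded in UTF-8, one may be tempted to include a BOM at the beginning. But the "#!" characters are not just characters. They are in fact a magic number that happens to be composed of two ASCII characters. If you put something (like a BOM) before those characters, the file will look as if it had a different magic number, and that can lead to problems.

See Wikipedia, article: Shebang, section: Magic number:

The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 and 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[14] for this reason and for wider interoperability and philosophical concerns. Additionally, a byte order mark is not necessary in UTF-8, as that encoding does not have endianness issues; it serves only to identify the encoding as UTF-8. [emphasis added]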
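
To make the magic-number argument concrete, here is a deliberately simplified model of that check. It is not how any particular kernel implements "exec", and `has_shebang_magic` is a made-up name; it only compares the first two bytes with 0x23 0x21.

```python
import codecs


def has_shebang_magic(data: bytes) -> bool:
    # Simplified model: the file must begin with the bytes 0x23 0x21 ("#!").
    return data[:2] == b"#!"


script = b"#!/bin/sh\necho hello\n"
script_with_bom = codecs.BOM_UTF8 + script

print(has_shebang_magic(script))           # True
print(has_shebang_magic(script_with_bom))  # False
print(script_with_bom[:2].hex(" "))        # ef bb
```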

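BOM is illegal in JSON

See RFC 7159, Section 8.1:

Implementations MUST NOT add a byte order mark to the beginning of a JSON text.

BOM is redundant in JSON

Not only is it illegal in JSON, it is also not needed to determine the character encoding, because there are more reliable ways to unambiguously determine both the character encoding and the endianness used in any JSON stream (see this answer for details).

BOM breaks JSON parsers

Not only is it illegal in JSON and not needed, it actually breaks all software that determines the encoding using the method presented in RFC 4627:

To determine the encoding and endianness of JSON, examine the first four bytes for NUL bytes: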

00 00 00 xx - UTF-32BE
00 xx 00 xx - UTF-16BE
xx 00 00 00 - UTF-32LE
xx 00 xx 00 - UTF-16LE
xx xx xx xx - UTF-8

Now, if the file starts with a BOM, it will look like this:

00 00 FE FF - UTF-32BE
FE FF 00 xx - UTF-16BE
FF FE 00 00 - UTF-32LE
FF FE xx 00 - UTF-16LE
EF BB BF xx - UTF-8

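Note that:

  1. UTF-32BE doesn't start with three NULs, so it won't be recognized.
  2. In UTF-32LE the first byte is not followed by three NULs, so it won't be recognized.
  3. UTF-16BE has only one NUL in the first four bytes, so it won't be recognized.
  4. UTF-16LE has only one NUL in the first four bytes, so it won't be recognized.

Depending on the implementation, all of these may be interpreted incorrectly as UTF-8 and then misinterpreted or rejected as invalid UTF-8, or not recognized at all.

Additionally, if the implementation tests for valid JSON as I recommend, it will reject even input that is indeed encoded as UTF-8, because it doesn't start with an ASCII character below 128 as the RFC requires.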
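
The NUL-pattern method from RFC 4627 (section 3) can be sketched as follows. This is an illustrative implementation written for this explanation, not code from any real parser; `detect_json_encoding` and the "unknown" result are assumptions made for the example.

```python
import codecs


def detect_json_encoding(data: bytes) -> str:
    """Guess the encoding from the NUL pattern of the first four bytes."""
    if len(data) < 4:
        return "utf-8"  # assumption: too short to show a pattern
    nul = tuple(b == 0 for b in data[:4])
    patterns = {
        (True, True, True, False): "utf-32-be",    # 00 00 00 xx
        (True, False, True, False): "utf-16-be",   # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",    # xx 00 00 00
        (False, True, False, True): "utf-16-le",   # xx 00 xx 00
        (False, False, False, False): "utf-8",     # xx xx xx xx
    }
    return patterns.get(nul, "unknown")


doc = '{"a": 1}'
boms = {
    "utf-8": codecs.BOM_UTF8,
    "utf-16-le": codecs.BOM_UTF16_LE,
    "utf-16-be": codecs.BOM_UTF16_BE,
    "utf-32-le": codecs.BOM_UTF32_LE,
    "utf-32-be": codecs.BOM_UTF32_BE,
}

for name, bom in boms.items():
    plain = doc.encode(name)
    print(name, detect_json_encoding(plain), detect_json_encoding(bom + plain))
```

Without a BOM every sample is detected correctly. With a BOM, the UTF-16 and UTF-32 samples come back as "unknown", and the UTF-8 sample is still labelled "utf-8" but its first character is now U+FEFF rather than an ASCII character.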

Other data formats

A BOM in JSON is not needed, is illegal, and breaks software that works correctly according to the RFC. It should be a no-brainer to just not use it, and yet there are always people who insist on breaking JSON by using BOMs, comments, different quoting rules or different data types. Of course anyone is free to use things like BOMs or anything else if they need to; just don't call it JSON then.

For data formats other than JSON, take a look at how the data really looks. If the only encodings are UTF-* and the first character must be an ASCII character lower than 128, then you already have all the information needed to determine both the encoding and the endianness of your data. Adding BOMs, even as an optional feature, would only make it more complicated and error-prone.

Other uses of BOM

As for the uses outside of JSON or scripts, I think there are already very good answers here. I wanted to add more detailed info specifically about scripting and serialization because it is an example of BOM characters causing real problems.


What's different between UTF-8 and UTF-8 without BOM?

Short answer: In UTF-8, a BOM is encoded as the bytes EF BB BF at the beginning of the file.

Long answer:

Originally, it was expected that Unicode would be encoded in UTF-16/UCS-2. The BOM was designed for this encoding form. When you have 2-byte code units, it's necessary to indicate which order those two bytes are in, and a common convention for doing this is to include the character U+FEFF as a "Byte Order Mark" at the beginning of the data. The character U+FFFE is permanently unassigned so that its presence can be used to detect the wrong byte order.

UTF-8 has the same byte order regardless of platform endianness, so a byte order mark isn't needed. However, it may occur (as the byte sequence EF BB BF) in data that was converted to UTF-8 from UTF-16, or as a "signature" to indicate that the data is UTF-8.
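
A short illustration of the above: encoding U+FEFF with Python's standard codecs shows the byte order in UTF-16 and the fixed three-byte form in UTF-8, and reading the mark with the wrong byte order yields the unassigned U+FFFE.

```python
BOM = "\ufeff"

print(BOM.encode("utf-16-be").hex(" "))  # fe ff
print(BOM.encode("utf-16-le").hex(" "))  # ff fe
print(BOM.encode("utf-8").hex(" "))      # ef bb bf

# Little-endian bytes read as big-endian give U+FFFE, which signals
# that the byte order was guessed wrongly.
swapped = BOM.encode("utf-16-le").decode("utf-16-be")
print(hex(ord(swapped)))                 # 0xfffe
```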

Which is better?

Without. As Martin Cote answered, the Unicode standard does not recommend it. It causes problems with non-BOM-aware software.

A better way to detect whether a file is UTF-8 is to perform a validity check. UTF-8 has strict rules about what byte sequences are valid, so the probability of a false positive is negligible. If a byte sequence looks like UTF-8, it probably is.


UTF-8 with BOM is better identified. I have reached this conclusion the hard way. I am working on a project where one of the results is a CSV file, including Unicode characters.

If the CSV file is saved without a BOM, Excel thinks it's ANSI and shows gibberish. Once you add "EF BB BF" at the front (for example, by re-saving it using Notepad with UTF-8; or Notepad++ with UTF-8 with BOM), Excel opens it fine.
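
If the CSV is produced by a program, the BOM can be written by the encoder rather than by re-saving in an editor. A minimal sketch in Python, assuming the standard `csv` module and the `utf-8-sig` codec (the function name `csv_bytes_for_excel` is made up for this example):

```python
import csv
import io


def csv_bytes_for_excel(rows) -> bytes:
    """Serialise rows as CSV and encode as UTF-8 with a leading BOM."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerows(rows)
    # "utf-8-sig" prepends EF BB BF when encoding.
    return buf.getvalue().encode("utf-8-sig")


data = csv_bytes_for_excel([["name", "city"], ["Zoë", "Köln"]])
print(data[:3].hex(" "))  # ef bb bf
```

When writing straight to disk, opening the file with `open(path, "w", encoding="utf-8-sig", newline="")` has the same effect.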

Use of the BOM as a UTF-8 signature is discussed in section 6 of RFC 3629: "UTF-8, a transformation format of ISO 10646", November 2003 at http://tools.ietf.org/html/rfc3629. Note that the RFC permits the signature but does not recommend prepending it in general (this last info found at: http://www.herongyang.com/Unicode/Notepad-Byte-Order-Mark-BOM-FEFF-EFBBBF.html)


BOM tends to boom (no pun intended (sic)) somewhere, someplace. And when it booms (for example, doesn't get recognized by browsers, editors, etc.), it shows up as the weird characters ï»¿ at the start of the document (for example, HTML file, JSON response, RSS, etc.) and causes the kind of embarrassments like the recent encoding issue experienced during the talk of Obama on Twitter.

It's very annoying when it shows up at places hard to debug or when testing is neglected. So it's best to avoid it unless you must use it.


Question: What's different between UTF-8 and UTF-8 without a BOM? Which is better?

Here are some excerpts from the Wikipedia article on the byte order mark (BOM) that I believe offer a solid answer to this question.

On the meaning of the BOM and UTF-8:

The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use. Byte order has no meaning in UTF-8, so its only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8.

Argument for NOT using a BOM:

The primary motivation for not using a BOM is backwards-compatibility with software that is not Unicode-aware... Another motivation for not using a BOM is to encourage UTF-8 as the "default" encoding.

Argument FOR using a BOM:

The argument for using a BOM is that without it, heuristic analysis is required to determine what character encoding a file is using. Historically such analysis, to distinguish various 8-bit encodings, is complicated, error-prone, and sometimes slow. A number of libraries are available to ease the task, such as Mozilla Universal Charset Detector and International Components for Unicode.

Programmers mistakenly assume that detection of UTF-8 is equally difficult (it is not because of the vast majority of byte sequences are invalid UTF-8, while the encodings these libraries are trying to distinguish allow all possible byte sequences). Therefore not all Unicode-aware programs perform such an analysis and instead rely on the BOM.

In particular, Microsoft compilers and interpreters, and many pieces of software on Microsoft Windows such as Notepad will not correctly read UTF-8 text unless it has only ASCII characters or it starts with the BOM, and will add a BOM to the start when saving text as UTF-8. Google Docs will add a BOM when a Microsoft Word document is downloaded as a plain text file.

On which is better, WITH or WITHOUT the BOM:

The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it “SHOULD forbid use of U+FEFF as a signature.”

My Conclusion:

Use the BOM only if compatibility with a software application is absolutely essential.

Also note that while the referenced Wikipedia article indicates that many Microsoft applications rely on the BOM to correctly detect UTF-8, this is not the case for all Microsoft applications. For example, as pointed out by @barlop, when using the Windows Command Prompt with UTF-8†, commands such as type and more do not expect the BOM to be present. If the BOM is present, it can be problematic, as it is for other applications.


† The chcp command offers support for UTF-8 (without the BOM) via code page 65001.


Quoted at the bottom of the Wikipedia page on BOM: http://en.wikipedia.org/wiki/Byte-order_mark#cite_note-2

"Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature"


It should be noted that for some files you must not have the BOM even on Windows. Examples are SQL*plus or VBScript files. If such a file contains a BOM, you get an error when you try to execute it.


This question already has a million-and-one answers and many of them are quite good, but I wanted to try and clarify when a BOM should or should not be used.

As mentioned, any use of the UTF BOM (Byte Order Mark) in determining whether a string is UTF-8 or not is educated guesswork. If there is proper metadata available (like charset="utf-8"), then you already know what you're supposed to be using, but otherwise you'll need to test and make some assumptions. This involves checking whether the file a string comes from begins with the hexadecimal byte code, EF BB BF.

If a byte code corresponding to the UTF-8 BOM is found, the probability is high enough to assume it's UTF-8 and you can go from there. When forced to make this guess, however, additional error checking while reading would still be a good idea in case something comes up garbled. You should only assume a BOM is not UTF-8 (i.e. latin-1 or ANSI) if the input definitely shouldn't be UTF-8 based on its source. If there is no BOM, however, you can simply determine whether it's supposed to be UTF-8 by validating against the encoding.

Why is a BOM not recommended?

  1. Non-Unicode-aware or poorly compliant software may assume it's latin-1 or ANSI and won't strip the BOM from the string, which can obviously cause issues.
  2. It's not really needed (just check if the contents are compliant and always use UTF-8 as the fallback when no compliant encoding can be found)
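
The decision order described in this answer (metadata first, then the BOM, then validation) could be sketched like this. The function name and the `legacy` fallback parameter are assumptions made for the example; which legacy code page to fall back to depends entirely on where the data comes from.

```python
import codecs
from typing import Optional


def guess_text_encoding(data: bytes, declared: Optional[str] = None,
                        legacy: str = "cp1252") -> str:
    """Pick a codec name for decoding, following the order described above."""
    if declared:
        # 1. Proper metadata (e.g. charset="utf-8") wins.
        return declared
    if data.startswith(codecs.BOM_UTF8):
        # 2. A BOM makes UTF-8 very likely; decoding may still fail later.
        return "utf-8-sig"
    try:
        # 3. No BOM: validate against the encoding.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Assumption for this sketch: fall back to a legacy code page.
        return legacy


print(guess_text_encoding(codecs.BOM_UTF8 + b"abc"))  # utf-8-sig
print(guess_text_encoding("é".encode("utf-8")))       # utf-8
print(guess_text_encoding("é".encode("cp1252")))      # cp1252
```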

When should you encode with a BOM?

If you're unable to record the metadata in any other way (through a charset tag or file system meta), and the programs being used like BOMs, you should encode with a BOM. This is especially true on Windows where anything without a BOM is generally assumed to be using a legacy code page. The BOM tells programs like Office that, yes, the text in this file is Unicode; here's the encoding used.

When it comes down to it, the only files I ever really have problems with are CSV. Depending on the program, it either must, or must not have a BOM. For example, if you're using Excel 2007+ on Windows, it must be encoded with a BOM if you want to open it smoothly and not have to resort to importing the data.


UTF-8 with BOM only helps if the file actually contains some non-ASCII characters. If it is included and there aren't any, then it will possibly break older applications that would have otherwise interpreted the file as plain ASCII. These applications will definitely fail when they come across a non ASCII character, so in my opinion the BOM should only be added when the file can, and should, no longer be interpreted as plain ASCII.

Edit: Just want to make it clear that I prefer to not have the BOM at all. Add it in only if some old rubbish breaks without it and replacing that legacy application is not feasible.

Don't make anything expect a BOM for UTF8.


UTF-8 without BOM has no BOM, which doesn't make it any better than UTF-8 with BOM, except when the consumer of the file needs to know (or would benefit from knowing) whether the file is UTF-8-encoded or not.

The BOM is usually useful to determine the endianness of the encoding, which is not required for most use cases.

Also, the BOM can be unnecessary noise/pain for those consumers that don't know or care about it, and can result in user confusion.


I look at this from a different perspective. I think UTF-8 with BOM is better as it provides more information about the file. I use UTF-8 without BOM only if I face problems.

I have been using multiple languages (even Cyrillic) on my pages for a long time, and when the files are saved without a BOM and I re-open them for editing with an editor (as cherouvim also noted), some characters are corrupted.

Note that Windows' classic Notepad automatically saves files with a BOM when you try to save a newly created file with UTF-8 encoding.

I personally save server side scripting files (.asp, .ini, .aspx) with BOM and .html files without BOM.


When you want to display information encoded in UTF-8 you may not face problems. Declare for example an HTML document as UTF-8 and you will have everything displayed in your browser that is contained in the body of the document.

But this is not the case when we have text, CSV and XML files, either on Windows or Linux.

For example, a text file on Windows or Linux, one of the simplest things imaginable, is (usually) not UTF-8.

Save it as XML and declare it as UTF-8:

<?xml version="1.0" encoding="UTF-8"?>

It will not display (it will not be read) correctly, even if it's declared as UTF-8.

I had a string of data containing French letters, that needed to be saved as XML for syndication. Without creating a UTF-8 file from the very beginning (changing options in IDE and "Create New File") or adding the BOM at the beginning of the file

$file="\xEF\xBB\xBF".$string;

I was not able to save the French letters in an XML file.


One practical difference is that if you write a shell script for Mac OS X and save it as plain UTF-8 (that is, with a BOM), you will get the response:

#!/bin/bash: No such file or directory

in response to the shebang line specifying which shell you wish to use:

#!/bin/bash

If you save as UTF-8, no BOM (say in BBEdit) all will be well.


As mentioned above, UTF-8 with BOM may cause problems with non-BOM-aware (or compatible) software. I once edited HTML files encoded as UTF-8 + BOM with the Mozilla-based KompoZer, as a client required that WYSIWYG program.

Invariably the layout would get destroyed when saving. It took me some time to fiddle my way around this. These files then worked well in Firefox, but showed a CSS quirk in Internet Explorer that destroyed the layout again. After fiddling with the linked CSS files for hours to no avail, I discovered that Internet Explorer didn't like the BOMfed HTML file. Never again.

Also, I just found this in Wikipedia:

The shebang characters are represented by the same two bytes in extended ASCII encodings, including UTF-8, which is commonly used for scripts and other text files on current Unix-like systems. However, UTF-8 files may begin with the optional byte order mark (BOM); if the "exec" function specifically detects the bytes 0x23 0x21, then the presence of the BOM (0xEF 0xBB 0xBF) before the shebang will prevent the script interpreter from being executed. Some authorities recommend against using the byte order mark in POSIX (Unix-like) scripts,[15] for this reason and for wider interoperability and philosophical concerns


The Unicode Byte Order Mark (BOM) FAQ provides a concise answer:

Q: How I should deal with BOMs?

A: Here are some guidelines to follow:

  1. A particular protocol (e.g. Microsoft conventions for .txt files) may require use of the BOM on certain Unicode data streams, such as files. When you need to conform to such a protocol, use a BOM.

  2. Some protocols allow optional BOMs in the case of untagged text. In those cases,

    • Where a text data stream is known to be plain text, but of unknown encoding, BOM can be used as a signature. If there is no BOM, the encoding could be anything.

    • Where a text data stream is known to be plain Unicode text (but not which endian), then BOM can be used as a signature. If there is no BOM, the text should be interpreted as big-endian.

  3. Some byte oriented protocols expect ASCII characters at the beginning of a file. If UTF-8 is used with these protocols, use of the BOM as encoding form signature should be avoided.

  4. Where the precise type of the data stream is known (e.g. Unicode big-endian or Unicode little-endian), the BOM should not be used. In particular, whenever a data stream is declared to be UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be used.


From http://en.wikipedia.org/wiki/Byte-order_mark:

The byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.

Always using a BOM in your file will ensure that it always opens correctly in an editor which supports UTF-8 and BOM.

My real problem with the absence of BOM is the following. Suppose we've got a file which contains:

abc

Without BOM this opens as ANSI in most editors. So another user of this file opens it and appends some native characters, for example:

abg-αβγ

Oops... Now the file is still in ANSI and guess what, "αβγ" does not occupy 6 bytes, but 3. This is not UTF-8 and this causes other problems later on in the development chain.
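
The byte counts can be checked directly. The sketch assumes that "ANSI" here means the Windows Greek code page, cp1253; the answer does not say which code page was actually in use.

```python
text = "αβγ"

ansi_bytes = text.encode("cp1253")  # assumption: ANSI = Windows-1253
utf8_bytes = text.encode("utf-8")

print(len(ansi_bytes))  # 3
print(len(utf8_bytes))  # 6

# The 3-byte ANSI form is not valid UTF-8, which is what causes
# trouble further down the chain.
try:
    ansi_bytes.decode("utf-8")
    ansi_is_utf8 = True
except UnicodeDecodeError:
    ansi_is_utf8 = False
print(ansi_is_utf8)     # False
```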


Here is my experience with Visual Studio, SourceTree and Bitbucket pull requests, which has been giving me some problems:

It turns out that files saved as UTF-8 with signature (BOM) show a red dot character in each file when reviewing a pull request, which can be quite annoying.


If you hover over it, it shows a character like "ufeff". SourceTree, however, does not show these byte marks, so they will most likely end up in your pull requests. That should be OK, because that is how VS 2017 encodes new files now, so maybe Bitbucket should ignore this or show it in another way. More info here:

Red dot marker BitBucket diff view


UTF-8 with BOM is better if you use UTF-8 in HTML files and mix Serbian Cyrillic, Serbian Latin, German, Hungarian or some other exotic language on the same page. That is my opinion (30 years of computing and IT industry).

Reference URL: https://stackoverflow.com/questions/2223882/whats-the-difference-between-utf-8-and-utf-8-without-bom
