How to tell where letters begin and end in hex
Thread poster: Samuel Murray
Oct 19, 2021

Hello everyone

The hex for " ÿ " is "20 C3 BF 20". The spaces are "20" and the "ÿ" is "C3 BF". How can I tell that "20" is the first letter, and not "20 C3"? And how can I tell that "C3 BF" is the second letter, and not just "C3"?

Thanks
Samuel


 
Mikhail Zavidin
It is all about UTF-8 and BOM (Byte order mark) Oct 19, 2021

As I can understand the string in question is UTF-8 encoded. So the order depends on BOM of the file if any. If there is none, read the string from left to right, since UTF-8 is backward compatible with ASCII. UTF-8 uses one to four one-byte (8-bit) code units per character. The first 128 code points are encoded exactly like the 128 ASCII characters.
In general, I suggest reading up on UTF-8 encoding, for example on Wikipedia.

Byte order mark
If the UTF-16 Unicode byte order mark (BOM, U+FEFF) character is at the start of a UTF-8 file, the first three bytes will be 0xEF, 0xBB, 0xBF.
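
The three bytes quoted above can be checked with a few lines of Python. This is only a minimal sketch; the variable names are made up for illustration, and the sample string " ÿ " is borrowed from the question.

```python
# U+FEFF encoded as UTF-8 yields the three bytes quoted above.
bom = "\ufeff".encode("utf-8")
print(bom.hex(" ").upper())  # EF BB BF

# Python's "utf-8-sig" codec drops a leading BOM while decoding, if one is present.
data = bom + " ÿ ".encode("utf-8")
text = data.decode("utf-8-sig")
print(repr(text))  # ' ÿ '
```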


https://en.wikipedia.org/wiki/UTF-8#Codepage_layout

UTF-8 is a variable-width character encoding used for electronic communication. Defined by the Unicode Standard, the name is derived from Unicode (or Universal Coded Character Set) Transformation Format – 8-bit.[1]

UTF-8 is capable of encoding all 1,112,064[nb 1] valid character code points in Unicode using one to four one-byte (8-bit) code units. Code points with lower numerical values, which tend to occur more frequently, are encoded using fewer bytes. It was designed for backward compatibility with ASCII: the first 128 characters of Unicode, which correspond one-to-one with ASCII, are encoded using a single byte with the same binary value as ASCII, so that valid ASCII text is valid UTF-8-encoded Unicode as well.


https://en.wikipedia.org/wiki/UTF-8#Codepage_layout

Hope this helps


 
Samuel Murray
TOPIC STARTER
@Mikhail Oct 19, 2021

Mikhail Zavidin wrote:
As I can understand the string in question is UTF-8 encoded.


Correct. If it was UTF-16LE, it would be "2000 FF00 2000" instead of "20 C3BF 20".
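
For anyone who wants to reproduce the two byte sequences, here is a small Python sketch (the variable names are just for illustration). It encodes the same three-character string both ways and prints the bytes as hex.

```python
text = " ÿ "

# Encode the same string in both encodings and show the raw bytes as hex.
utf8_hex = text.encode("utf-8").hex(" ").upper()
utf16le_hex = text.encode("utf-16-le").hex(" ").upper()

print(utf8_hex)     # 20 C3 BF 20
print(utf16le_hex)  # 20 00 FF 00 20 00
```

In UTF-16LE every character here takes exactly two bytes, whereas in UTF-8 the byte count varies per character.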

So the order depends on BOM of the file if any.


UTF-8 doesn't have a byte order (or, depending on how you explain it, it has only one byte order). The "byte order mark" that is sometimes added to UTF-8 files keeps that name for historical reasons, not because it indicates a byte order. Anyway, the byte order (even if there was one) isn't really relevant to the question.

I'm trying to figure out how I can tell, just by looking at "20 C3 BF 20", that "20 C3" and "BF 20" are not characters, but that "20" is a character, "C3 BF" is a character, and "20" is a character. The problem is that sometimes a character is encoded as two hex digits and sometimes as four hex digits, and I want to know how I can tell which case applies.


 
Mikhail Zavidin
Why not review the conversion table Oct 20, 2021

Code point UTF-8 conversion

First code point   Last code point   Byte 1     Byte 2     Byte 3     Byte 4
U+0000             U+007F            0xxxxxxx
U+0080             U+07FF            110xxxxx   10xxxxxx
U+0800             U+FFFF            1110xxxx   10xxxxxx   10xxxxxx
U+10000            U+10FFFF          11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

https://en.wikipedia.org/wiki/UTF-8#Encoding
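
To see how the table turns a code point into bytes, here is a rough Python sketch that fills in the x bits by hand. The function name encode_code_point is made up for this illustration; in practice Python's built-in str.encode("utf-8") does the same job.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point by filling the x bits of the table's bit patterns."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# U+00FF (ÿ) falls in the second row: 000 1111 1111 -> 110 00011, 10 111111.
print(encode_code_point(0xFF).hex(" ").upper())  # C3 BF
print(encode_code_point(0x20).hex(" ").upper())  # 20
```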

As I understand it, if the highest bits of a byte are 110 (110xxxxx), the symbol consists of 2 bytes.
If the highest bits are 1110 (1110xxxx), the symbol consists of 3 bytes.
And so on, according to the table above.

In your example the first 20 (0010 0000) has 0 in its highest bit, so it is a single-byte symbol from the first 128 symbols (the ASCII range). The second byte is C3, i.e. 1100 0011, so it starts a two-byte symbol. The following BF (1011 1111) starts with 10, which marks it as a continuation byte and not as the start of a new symbol.
And so on.
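
The same reasoning can be sketched in Python: look at the top bits of each lead byte, take that many bytes as one character, and check that the rest are continuation bytes (10xxxxxx). The function name split_utf8_hex is hypothetical, and the sketch does not reject every invalid sequence (for example overlong encodings).

```python
def split_utf8_hex(hex_string: str) -> list:
    """Group the bytes of a UTF-8 hex dump into one hex string per character."""
    data = bytes.fromhex(hex_string)  # whitespace between byte pairs is ignored
    groups = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead >> 7 == 0b0:          # 0xxxxxxx: single byte (ASCII)
            n = 1
        elif lead >> 5 == 0b110:      # 110xxxxx: 2-byte character
            n = 2
        elif lead >> 4 == 0b1110:     # 1110xxxx: 3-byte character
            n = 3
        elif lead >> 3 == 0b11110:    # 11110xxx: 4-byte character
            n = 4
        else:                         # 10xxxxxx cannot start a character
            raise ValueError(f"byte {lead:02X} at offset {i} is not a lead byte")
        chunk = data[i:i + n]
        # Every byte after the lead byte must look like 10xxxxxx.
        if len(chunk) != n or any(b >> 6 != 0b10 for b in chunk[1:]):
            raise ValueError(f"broken sequence at offset {i}")
        groups.append(chunk.hex().upper())
        i += n
    return groups

print(split_utf8_hex("20 C3 BF 20"))  # ['20', 'C3BF', '20']
```

Because continuation bytes always start with 10 and lead bytes never do, the boundaries can be found from any position in the data, not only from the start.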

Hope this helps

[Edited at 2021-10-20 10:14 GMT]

[Edited at 2021-10-20 10:17 GMT]

[Edited at 2021-10-20 10:18 GMT]