Documentation of itemcolorstrings.dat structure

I spent some time breaking itemcolorstrings.dat down into its components to understand it and to create the current color number mod. With this topic I want to share what I learned so that nobody has to do this again.

:warning: This topic is very technical. Knowledge of binary and hexadecimal data is required to understand everything.
Always keep a backup when modifying files. The worst thing that can happen is a game crash on startup.
Feel free to ask questions if anything is unclear :slight_smile:

What you need if you want to try it yourself

  • A hex editor for viewing and editing. I used a freeware editor called HxD (from mh-nexus) to make the changes. It is available in many languages.
  • Any version of itemcolorstrings.dat from Boundless

:warning: Do not try editing with a normal text editor. It will likely break the file when saving.

General info

  • Positions in the file are given as hexadecimal positions with the decimal representation in brackets like: A10 (2576)
  • Pointers (references in the file) are in little endian. This means that a pointer to position 6789ABCD is stored as CD AB 89 67
  • Strings so far are encoded with 1 byte per character in ISO_8859_1 (Latin-1) format.
  • Quoted words like “Language definition” refer to other parts of the file structure.
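
A minimal sketch of the pointer rule in Python, using the standard struct module. The helper name read_pointer is made up for illustration and is not part of the game or of any tool mentioned in this topic.

```python
import struct

def read_pointer(buf, pos):
    """Read a 4-byte little-endian pointer that starts at byte `pos`."""
    return struct.unpack_from("<I", buf, pos)[0]

# The bytes CD AB 89 67 from the example above.
raw = bytes.fromhex("CDAB8967")
print(hex(read_pointer(raw, 0)))  # 0x6789abcd
```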

All data shown is taken from the first English “Name section”, the one for item descriptions.

Time to dive into the data!

Overall file structure

1 Header
2 Language definition
    3 Language section
	4 Name section
	    5 Combination offset data
	    6 Word combination data
	    7 String offset data
	    8 String data
	4 Name section
	...
    3 Language section
    ...

1. Header

The file starts with a header that I did not decode. So far it has not been necessary to change anything in it.

2. Language definition

This is where the interesting part starts.
The language definition is currently 58 bytes long.
Start position: DA2 (3490). If it changes in the future, it is easy to find by the words: english french german italian spanish

The structure here (5 times):

    - 1 Byte: length of the following string
    - x Bytes: String data of above length in ISO_8859_1 (Latin-1) format
    - 4 Bytes: Pointer in file to language section. Example for english: DC 0D 00 00 is position 00000DDC (3548)

The definition ends after the Spanish pointer. This is exactly the position that the English pointer refers to.
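
The structure above can be walked with a few lines of Python. This is only a sketch: the function name parse_language_definition and the default count of 5 are assumptions taken from the description in this post, and the start position has to be supplied by the caller.

```python
import struct

def parse_language_definition(buf, pos, count=5):
    """Return ({language name: pointer}, position after the last entry)."""
    languages = {}
    for _ in range(count):
        length = buf[pos]                                   # 1 byte: string length
        name = buf[pos + 1:pos + 1 + length].decode("latin-1")
        (pointer,) = struct.unpack_from("<I", buf, pos + 1 + length)
        languages[name] = pointer
        pos += 1 + length + 4                               # move to the next entry
    return languages, pos

# One entry rebuilt from the example: "english" followed by DC 0D 00 00.
entry = bytes([7]) + b"english" + bytes.fromhex("DC0D0000")
print(parse_language_definition(entry, 0, count=1))  # ({'english': 3548}, 12)
```

On the real file the call would be parse_language_definition(data, 0xDA2), assuming the start position has not moved.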

3. Language section

This is a list of the “Name section” entries of a language. There are 4 of them, but only 3 need a pointer because the first one follows this section directly.
There is 1 for every language at the position specified in the “Language definition”
This section is very small at 12 bytes
The structure here:

    - 4 Bytes: Pointer to color names
    - 4 Bytes: Pointer to metal names
    - 4 Bytes: Pointer to item names

Directly after it starts the first “Name section”, the one for item descriptions
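
As a sketch, the start positions of all 4 name sections of a language can be collected like this. The function name and the dictionary keys are chosen for illustration only, and the pointer values in the demo are made up.

```python
import struct

def name_section_starts(buf, language_pos):
    """Return the start position of each name section of one language."""
    colors, metals, items = struct.unpack_from("<3I", buf, language_pos)
    return {
        "item descriptions": language_pos + 12,  # follows the 3 pointers directly
        "color names": colors,
        "metal names": metals,
        "item names": items,
    }

# Demo with made-up pointers: a language section placed at position 4.
demo = bytes(4) + struct.pack("<3I", 100, 200, 300)
print(name_section_starts(demo, 4))
```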

4. Name section

This section contains the pointers that describe how the names are put together.
There are currently 4 of these per language. The first starts directly after the “Language section”, the others at the pointers specified in the “Language section”
This section is also very small at 12 bytes
The structure here:

    - 4 Bytes: Pointer to "Word combination data". Instructions how the words are put together
    - 4 Bytes: Pointer to "String offset data". The start offset of each word is in here
    - 4 Bytes: Pointer to "String data". All words without any separator

Directly after it starts the “Combination offset data”, which holds the start offset of each word combination

5. Combination offset data

This is a list of start offsets into the “Word combination data”. The length of a combination is the difference between two consecutive offsets.
There is 1 per “Name section”
This data varies in size depending on the number of combinations
The structure here:

    - 1 Byte: bit length of values
    - x Bytes: bit array with above bit length for each entry. Ends with start of "Word combination data"

How to decode:
09 00 06 14 48 B0 A0 01 84 09 16 32 ...
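The first byte, 09, is the bit length: every value is 9 bits long.

The following bytes need to be translated to bits and split into groups of that bit length (here: 9 bits), each of which is interpreted as a number. Within each byte the bits are consumed from right to left, and the bits left over from one byte become the low bits of the next value.
Each row shows a hex value and its binary representation -> the bits that make up one value = its decimal value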

00 00000000
06 0000011 0 -> 0 00000000 = 0
14 000101 00 -> 00 0000011 = 3
48 01001 000 -> 000 000101 = 5
B0 1011 0000 -> 0000 01001 = 9
A0 101 00000 -> 00000 1011 = 11
01 00 000001 -> 000001 101 = 13
84 1 0000100 -> 0000100 00 = 16
09  00001001 -> 00001001 1 = 19
16 00010110
32 0011001 0 -> 0 00010110 = 22
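
The table above can be reproduced with a short Python sketch. The function name unpack_bits is invented here, and the number of entries is passed in by the caller because the file does not store it next to the bit array.

```python
def unpack_bits(data, bit_length, count):
    """Split `data` into `count` numbers of `bit_length` bits each.

    Bits are consumed from the lowest bit of each byte upwards, which is
    the right-to-left reading used in the table above.
    """
    mask = (1 << bit_length) - 1
    values, acc, nbits = [], 0, 0
    for byte in data:
        acc |= byte << nbits          # put the new byte above the leftover bits
        nbits += 8
        while nbits >= bit_length:
            values.append(acc & mask)
            acc >>= bit_length
            nbits -= bit_length
            if len(values) == count:
                return values
    return values

raw = bytes.fromhex("09 00 06 14 48 B0 A0 01 84 09 16 32")
print(unpack_bits(raw[1:], raw[0], 9))  # [0, 3, 5, 9, 11, 13, 16, 19, 22]
```

The same function handles the 10-bit values of the "String offset data", since only the bit length differs.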

6. Word combination data

This contains the information how names are created from the single words
There is 1 per “Name section”
This data varies in size depending on the “Combination offset data”
The structure here:

- x Bytes: 1st combination, running from "Combination offset data" entry 1 up to (but not including) entry 2. In the example the offsets are 0 and 3, so bytes 0 to 2 (3 Bytes)
- x Bytes: 2nd combination
...
- x Bytes: last combination going from last entry to end of data

How to decode:
The data needs to be translated to bits

The general structure:
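   - a length indicator, which is one of the bit patterns 0, 01 or 11 as they appear in the binary notation below (read from right to left, so 01 means a 1 bit followed by a 0 bit). It means that 3, 6 or 10 value bits follow
   - a number with that many bits, giving the number of words
   Then for each word:
	- length indicator
	- word position in the list created with "String offset data", counting from 0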

06 49 04 06 C8 56 14 73 14 ...
Each example shows the hex values of one combination, as delimited by the "Combination offset data" -> the bytes in reversed order (because of little endian),
their binary representation, and the interpretation when reading the binary from right to left

06 49 04 -> 04 49 06, 00000100 01001001 00000110
0: 3 bits next
011: number 3 -> 3 words
0: 3 bits next
000: number 0 -> word 1 from word list: Rare
01: 6 bits next
010010: number 18 -> word 19 from word list: Crafted
0: 3 bits next
010: number 2 -> word 3 from word list: Block
= Rare Crafted Block

06 C8 -> C8 06, 11001000 00000110
0: 3 bits next
011: number 3 -> 3 words
0: 3 bits next
000: number 0 -> word 1 from word list: Rare
0: 3 bits next
100: number 4 -> word 5 from word list: Crafting
0: 3 bits next
110: number 6 -> word 7 from word list: Ingredient
= Rare Crafting Ingredient

56 14 73 14	-> 14 73 14 56, 00010100 01110011 00010100 01010110
0: 3 bits next
011: number 3 -> 3 words
01: 6 bits next
010001: number 17 -> word 18 from word list: Decorative
01: 6 bits next
001100: number 12 -> word 13 from word list: Beacon
11: 10 bits next
0001010001: number 81 -> word 82 from word list: Fitting
= Decorative Beacon Fitting
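
The three worked examples can be checked with a small Python sketch. BitReader, read_var and decode_combination are names invented for this illustration. The result is the list of zero-based word numbers, which still have to be looked up in the word list.

```python
class BitReader:
    """Reads bits from a byte string, lowest bit of each byte first."""

    def __init__(self, data):
        self.data = data
        self.pos = 0  # current position, counted in bits

    def read(self, nbits):
        value = 0
        for i in range(nbits):
            byte = self.data[self.pos >> 3]
            bit = (byte >> (self.pos & 7)) & 1
            value |= bit << i         # earlier bits are the lower bits
            self.pos += 1
        return value

def read_var(reader):
    """Read one variable-length number: 3, 6 or 10 value bits."""
    if reader.read(1) == 0:
        return reader.read(3)
    if reader.read(1) == 0:
        return reader.read(6)
    return reader.read(10)

def decode_combination(data):
    """Return the zero-based word numbers of one combination."""
    reader = BitReader(data)
    word_count = read_var(reader)
    return [read_var(reader) for _ in range(word_count)]

print(decode_combination(bytes.fromhex("064904")))    # [0, 18, 2]
print(decode_combination(bytes.fromhex("06C8")))      # [0, 4, 6]
print(decode_combination(bytes.fromhex("56147314")))  # [17, 12, 81]
```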

7. String offset data

This is a list of start offsets of the strings in “String data”. The length of a string is the difference between two consecutive offsets.
There is 1 per “Name section”
This data varies in size depending on the “String data”
The structure here:

- 1 Byte: bit length of values
- x Bytes: bit array with above bit length for each entry. Ends with start of "String data"

How to decode (Works same as "Combination offset data"):
0A 00 10 A0 C0 03 13 6C ...
The first byte, 0A, is the bit length: every value is 10 bits long.

The following bytes need to be translated to bits and split into groups of that bit length (here: 10 bits), each of which is interpreted as a number.
Each row shows a hex value and its binary representation -> the bits that make up one value = its decimal value

00 00000000
10 000100 00 -> 00 00000000 = 0
A0 1010 0000 -> 0000 000100 = 4
C0 11 000000 -> 000000 1010 = 10
03  00000011 -> 00000011 11 = 15
13 00010011 
6C 011011 00 -> 00 00010011 = 19

8. String data


This is the text of all words without any separator
There is 1 per “Name section”
This data varies in size depending on the amount of words in it
The structure here:

- x Bytes: 1 byte per character in ISO_8859_1 (Latin-1) encoding until next "Name section" starts or file ends

How to decode:
Take two consecutive offsets from "String offset data". The first is where the word starts and their difference is its length
0, 4: Offset 0. 4 - 0 = 4 Length -> Rare
4, 10: Offset 4. 10 - 4 = 6 Length -> Common
10, 15: Offset 10. 15 - 10 = 5 Length -> Block
15, 19: Offset 15. 19 - 15 = 4 Length -> Tool
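
In Python the split is a one-liner once the offsets are known. This is a sketch with an invented function name. The demo string is simply the four words from the example glued together, not a dump of the real file.

```python
def split_words(string_data, offsets):
    """Cut the separator-free string data into words at the given offsets."""
    text = string_data.decode("latin-1")   # 1 byte per character
    # Pair every offset with the next one: (0, 4), (4, 10), (10, 15), ...
    return [text[start:end] for start, end in zip(offsets, offsets[1:])]

print(split_words(b"RareCommonBlockTool", [0, 4, 10, 15, 19]))
# ['Rare', 'Common', 'Block', 'Tool']
```

Note that 5 offsets produce 4 words, because the last offset only marks where the last word ends.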

Enjoy creating great things with this knowledge :slight_smile:


@willcrutchley was working on this too. Not sure if he made any headway.


Maybe I’ll have a read again tomorrow, but this is some weird stuff to me :smiley: I wonder if it’s a completely custom solution or if it’s based on some actual spec but modified a little to be more efficient, like the msgpack format they used.


you’re crazy!

the header is:

u8: max-index for the metals color palette
u16: number of encoded ItemType's
{
    u16: ItemType
    u8  : subtitle-index
} * number-of-encoded-ItemType's [ sorted for binary search ]
u8 : number of encoded languages [ this is the 05 byte in your "Language Definition" section in OP ]
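
A sketch of reading this header in Python. Two things are assumptions here: that the u16 values are little endian like the pointers, and the function and variable names, which are invented for illustration. The demo buffer is synthetic.

```python
import struct

def parse_header(buf):
    """Return (metal max index, {ItemType: subtitle index},
    language count, position of the first language entry)."""
    metal_max_index = buf[0]
    (item_count,) = struct.unpack_from("<H", buf, 1)  # assumed little endian
    pos = 3
    subtitle_index = {}
    for _ in range(item_count):
        item_type, subtitle = struct.unpack_from("<HB", buf, pos)
        subtitle_index[item_type] = subtitle
        pos += 3                                      # u16 + u8 per entry
    language_count = buf[pos]
    return metal_max_index, subtitle_index, language_count, pos + 1

# Synthetic header: metal max index 5, two made-up items, 5 languages.
demo = (bytes([5]) + struct.pack("<H", 2)
        + struct.pack("<HB", 10, 0) + struct.pack("<HB", 20, 3)
        + bytes([5]))
print(parse_header(demo))  # (5, {10: 0, 20: 3}, 5, 10)
```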

[ I could have saved you a lot of time by copy-pasting the format :stuck_out_tongue: but cool to see you figure it out! ]

To be more specific, the “String offset data” has N+1 entries for N words. The last offset is the “end() iterator” (to use a C++ analogy), so you can always determine the length of a word from its offset and the next offset in the list.

(Also, you have a typo in the var-length description: it's 10 bits, not 9 bits, for the 11 case.)

(my format description)

u8  : metals-palette max index
u16 : number of encoded ItemType's
{
  u16 : ItemType
  u8  : subtitle-index
} * number-of-encoded-ItemType's  [ sorted by itemId for binary search ]
u8 : number of encoded languages
{
  u8 : length of language identifier
  char * length : identifier [ no null terminator ] ascii
  u32 : offset in buffer to start of language encoding
} * number-of-encoded-languages
{
  u32 : offset in buffer to start of color strings
  u32 : offset in buffer to start of metal strings
  u32 : offset in buffer to start of item-title strings
  ENCODINGS : item-subtitles [ indexed by subtitle-index of encoded ItemType ]
  ENCODINGS : colors [ indexed by BlockColorIndex - 1 ]
  ENCODINGS : metals [ indexed by BlockColorIndex - 1 ]
  ENCODINGS : item-titles [ indexed by index in encoded-ItemType's ]
} * number-of-encoded-languages
where:
ENCODINGS = {
  u32 : offset in buffer to start of "encodings"
  u32 : offset in buffer to start of words-index
  u32 : offset in buffer to start of "words"
  u8  : bit-count for encodings-index values
  uN*indices : offset in "encodings" to start of encoding for each index.
  ENCODING*indices : "encodings" for each index
  u8  : bit-count for words-index values
  uN*words+1: offset in "words" to start of that word (+1 for end() "iterator")
  char*?? : "words", no null-terminator delimiters; use index of next word to terminate; ISO8859-1
}
where:
ENCODING = {
  var : length of encoding
  var*N : word-index
}
where:
var = variable-length value encoded as:
   0  ++ u3
   10 ++ u6
   11 ++ u10

This is the part that doesn’t make sense to me

image

Are they really not sequential? Or am I just too tired to think straight. Little endian but just arranged like this?

it is sequential, first bit (value & 1) is right-most in a bit representation of the numbers, so bit-wise you read right to left within each byte

Maybe it’ll make sense when implementing it, in the example the bits are read like you say, little endian bytes. Then it’s switched up when converting to dec, the bits are strung together in big endian order, it just makes my head spin :smiley:

So it kind of goes
left to right bytes, containing
right to left bits, read as
left to right decimal numbers
?


If you want you can also switch the order of the whole block:
… 03 C0 A0 10 00
convert it to bits
00000011 11000000 10100000 00010000 00000000
and then read the bits right to left converting them to decimal as 10 bit numbers
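
In Python this "reverse the whole block" view is what int.from_bytes with "little" does, so the values can be pulled out with shifts and a mask. The function name is invented for this sketch, and the entry count is supplied by the caller.

```python
def unpack_bits_bigint(data, bit_length, count):
    """Treat the whole block as one little-endian number and slice it."""
    number = int.from_bytes(data, "little")   # same as reversing the byte order
    mask = (1 << bit_length) - 1
    return [(number >> (i * bit_length)) & mask for i in range(count)]

# The block 00 10 A0 C0 03 from the "String offset data" example.
print(unpack_bits_bigint(bytes.fromhex("0010A0C003"), 10, 4))  # [0, 4, 10, 15]
```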

little endian is also not my favorite for reading :wink:


Why even use little endian? Is it actually more efficient?

Hah, no wonder I was making slow progress… Nice one!


I’m having trouble working this out in PHP. I’ve gotten as far as the above quoted section… specifically up to “uN*indices : offset in “encodings” to start of encoding for each index.”… which is where I am now stumped.

(Disclaimer: I’ve never really done anything like this before, nor had need to until now, so most of this is new to me).

That is the part I named “5. Combination offset data”.
You can check the example I gave in that spoiler on how to decode it. Just skip the first byte, as that is the u8 : bit-count for encodings-index values

I’m a bit stuck

uN*indices
uN*words

Where do the word and index counts come from?

edit: They seem to come from the encodings-index and words-index arrays? Then I’m stuck on determining their lengths, the only way I see is to derive from the offsets?

u8 : bit-count for encodings-index values

The above determines the N, right? (in uN*indices)

It’s not explicit.

For metal names, there are ‘max-metal-index’ (first byte of data) indices (which happens to be 5)
For color names, there are 255
For item titles, the second two bytes of file as a uint16 gives the number of encoded items which is also how many titles there will be as it’s 1-1
For sub titles, you’d have to iterate the map of item-id to subtitle-index at top of file to see what the max index is and that + 1 is how many subtitles will be encoded

The game doesn't need to know counts, it’s just indexing, so no need for any explicit counts
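
The four rules above, written down as a Python sketch. The function name and the dictionary layout are invented. subtitle_index stands for the ItemType to subtitle-index map from the top of the file and is assumed to be non-empty.

```python
def entry_counts(metal_max_index, subtitle_index):
    """Derive how many entries each name section holds."""
    return {
        "metals": metal_max_index,                          # first byte of the file
        "colors": 255,
        "item titles": len(subtitle_index),                 # one title per encoded item
        "item subtitles": max(subtitle_index.values()) + 1,
    }

# Made-up map of three items sharing two subtitles.
print(entry_counts(5, {10: 0, 20: 1, 30: 1}))
# {'metals': 5, 'colors': 255, 'item titles': 3, 'item subtitles': 2}
```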


Thank you, that’s exactly what I needed!

Dat structure, tho.

(sorry, couldn’t stop myself from laughing like an idiot)


Success! I’ll link this for anyone to freely use, it’s not tested that much yet and lacks all documentation :roll_eyes: but it seems to be functional now.

Couldn’t have figured this out without @Lorgar @DreamEvil or @lucadeltodecso, many thanks to everyone who helped!

It’s pure javascript, takes the itemcolorstrings.dat file in Uint8Array form as an input and spits out an object indexed by ItemID’s containing the item name and subtitle. If the output format isn’t what you want, the return statement is on line 66, fairly easy to modify as needed.

Include the file and call

boundlessDat.decode(uint8array) // pass the itemcolorstrings.dat data to the function


I am working on a project in Python that needed this, so I made a simple standalone Python script for it. Requires Python 3.7+. It fixes a couple of issues from @Mayumichi's JS version and automatically handles all of the languages.

EDIT: Forgot to add the final dictionary mapping for the items. Give me ~10 to fix that… Fixed


Any notes on these? I’ve made some additions too like the full language support, could pair that up with the fixes and release a more robust JS version :slight_smile:

The main one was you were always missing the last word from decodeEncodings.

itemSubtitles for example was producing a list of 95 words, when it should have been 96 words.

Line 133, you are skipping the last word.

if(toLen !== undefined) { // it's not the last element

Because it is a variable length string, you do need the ending index, so I am passing in the start index for the next section.
