Documentation of itemcolorstrings.dat structure

Lorgar · May 12, 2020, 5:51pm

I spent some time breaking itemcolorstrings.dat down into its components to understand it and to create the current color number mod. With this topic I want to share what I learned so that nobody has to do this again.

This topic is very technical. Knowledge of binary and hexadecimal data is required to understand everything.
Always keep a backup when modifying files. The worst thing that can happen is a game crash on startup.
Feel free to ask questions if anything is unclear.

What you need if you want to try it yourself

A Hex Editor for viewing and editing. I used a freeware editor called HxD (from mh-nexus) to make the changes. It is available in many languages.
Any version of itemcolorstrings.dat from Boundless

Do not try editing with a normal text editor. It will likely break the file when saving.

General info

Positions in the file are given as hexadecimal positions with the decimal representation in brackets like: A10 (2576)
Pointers (references in the file) are in little endian. This means that a pointer to position 6789ABCD is stored as CD AB 89 67
Strings so far are encoded with 1 byte per character in ISO_8859_1 (Latin-1) format.
Quoted words like “Language definition” refer to other parts of the file structure.
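
As a small illustration of the pointer rule above, here is a minimal Python sketch. The helper name `read_pointer` is made up for this example and is not part of any tool mentioned in this topic. It reads a 4-byte little-endian value the same way the pointers in the file are stored.

```python
import struct

def read_pointer(buf, pos):
    """Read a 4-byte little-endian unsigned pointer at byte position pos."""
    return struct.unpack_from("<I", buf, pos)[0]

# The example from the list above: position 6789ABCD is stored as CD AB 89 67.
stored = bytes.fromhex("CD AB 89 67")
pointer = read_pointer(stored, 0)
print(f"{pointer:08X}")  # 6789ABCD
```

In a hex editor this means reading the four bytes of a pointer from right to left to get the position.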

The data shown uses the first English names for item descriptions as the example

Time to dive into the data!

Overall file structure

1 Header
2 Language definition
    3 Language section
        4 Name section
            5 Combination offset data
            6 Word combination data
            7 String offset data
            8 String data
        4 Name section
        ...
    3 Language section
    ...

1. Header

The file starts with some information which I did not decode. So far it has not been necessary to change anything in it.

2. Language definition

This is where the interesting things start with the language definition.
This is currently 58 Bytes long
Start position: DA2 (3490). If it changes in the future it is easily identifiable by the words: english french german italian spanish

The structure here (5 times):

    - 1 Byte: length of the following string
    - x Bytes: String data of above length in ISO_8859_1 (Latin-1) format
    - 4 Bytes: Pointer in file to language section. Example for english: DC 0D 00 00 is position 00000DDC (3548)

The definition ends after the Spanish pointer. That is exactly the position the English pointer refers to.
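
The structure above can be sketched in Python as follows. This is only an illustration under assumptions: the function name `parse_language_definition` is invented, and the sample bytes are built by hand. Only the English pointer (0xDDC) is taken from this post; the other four pointer values are placeholders.

```python
import struct

def parse_language_definition(buf, pos, count=5):
    """Read `count` entries of: u8 length, Latin-1 name, u32 little-endian pointer.

    Returns the name -> pointer mapping and the position right after the last entry.
    """
    languages = {}
    for _ in range(count):
        length = buf[pos]
        name = buf[pos + 1 : pos + 1 + length].decode("latin-1")
        (pointer,) = struct.unpack_from("<I", buf, pos + 1 + length)
        languages[name] = pointer
        pos += 1 + length + 4
    return languages, pos

# Hand-built sample. Only the English pointer comes from the post,
# the other pointers are placeholders invented for this example.
entries = [("english", 0xDDC), ("french", 0x10000), ("german", 0x20000),
           ("italian", 0x30000), ("spanish", 0x40000)]
sample = b"".join(
    bytes([len(name)]) + name.encode("latin-1") + struct.pack("<I", ptr)
    for name, ptr in entries
)

languages, end = parse_language_definition(sample, 0)
print(len(sample))                # 58, the size given above
print(hex(languages["english"]))  # 0xddc
```

On the real file the call would start at the position of the language definition (DA2 at the time of writing) instead of 0. The numbers also agree with each other: DA2 + 58 = DDC.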

3. Language section

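This is the list of “Name sections” for a language. There are 4 of them: 3 are reached through the pointers below and 1 follows directly after this section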
There is 1 for every language at the position specified in the “Language definition”
This section is very small with 12 Bytes
The structure here:

    - 4 Bytes: Pointer to color names
    - 4 Bytes: Pointer to metal names
    - 4 Bytes: Pointer to item names

Directly after starts the first “Name section” for item descriptions

4. Name section

This is the section which contains the pointers describing how the names are put together.
There are currently 4 of these per language. The first starts directly after the “Language section”, the others at the pointers specified in the “Language section”.
This section is very small with 12 Bytes
The structure here:

    - 4 Bytes: Pointer to "Word combination data". Instructions how the words are put together
    - 4 Bytes: Pointer to "String offset data". The length of each word is in here
    - 4 Bytes: Pointer to "String data". All words without any separator

Directly after this starts the “Combination offset data”, which gives the length of each word combination
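
A minimal Python sketch of reading these three pointers, assuming they are little-endian like the other pointers in the file. `NameSection` and `read_name_section` are names made up for this example, and the sample values are placeholders, not real file contents.

```python
import struct
from typing import NamedTuple

class NameSection(NamedTuple):
    combination_offsets: int  # "Combination offset data" starts right after the 12 bytes
    combinations: int         # pointer to "Word combination data"
    string_offsets: int       # pointer to "String offset data"
    strings: int              # pointer to "String data"

def read_name_section(buf, pos):
    """Read the three u32 pointers of a Name section located at position pos."""
    combinations, string_offsets, strings = struct.unpack_from("<III", buf, pos)
    return NameSection(pos + 12, combinations, string_offsets, strings)

# Placeholder pointers, invented for this example.
sample = struct.pack("<III", 0x40, 0x80, 0xC0)
section = read_name_section(sample, 0)
print(section.combination_offsets)  # 12
print(hex(section.strings))         # 0xc0
```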

5. Combination offset data

This is a list of offsets into the “Word combination data”. The difference between two neighbouring entries is the length of that combination
There is 1 per “Name section”
This data varies in size depending on the combinations
The structure here:

    - 1 Byte: bit length of values
    - x Bytes: bit array with above bit length for each entry. Ends with start of "Word combination data"

How to decode:
09 00 06 14 48 B0 A0 01 84 09 16 32 ...
09 is the length of 9 bits

The following data needs to be translated to bits and split into above bit length (here: 9 bits) to interpret them as a number.
Showing the first hex values, their binary representation -> the split value = its decimal value

00 00000000
06 0000011 0 -> 0 00000000 = 0
14 000101 00 -> 00 0000011 = 3
48 01001 000 -> 000 000101 = 5
B0 1011 0000 -> 0000 01001 = 9
A0 101 00000 -> 00000 1011 = 11
01 00 000001 -> 000001 101 = 13
84 1 0000100 -> 0000100 00 = 16
09  00001001 -> 00001001 1 = 19
16 00010110
32 0011001 0 -> 0 00010110 = 22
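
The manual steps above can be sketched in Python. This is an illustration only, and `unpack_bit_array` is a name invented here. The number of entries is not stored next to the bit array, so it is passed in as `count`; the 9 used below is just the number of values worked out in the example.

```python
def unpack_bit_array(data, bit_length, count):
    """Unpack `count` unsigned values of `bit_length` bits each.

    Bytes are consumed left to right and, within each byte, the lowest bit comes first.
    """
    values = []
    accumulator = 0  # bits read from the data but not yet turned into a value
    available = 0    # how many bits the accumulator currently holds
    mask = (1 << bit_length) - 1
    index = 0
    while len(values) < count:
        while available < bit_length:
            accumulator |= data[index] << available
            index += 1
            available += 8
        values.append(accumulator & mask)
        accumulator >>= bit_length
        available -= bit_length
    return values

raw = bytes.fromhex("09 00 06 14 48 B0 A0 01 84 09 16 32")
bit_length = raw[0]  # first byte: 9 bits per value
print(unpack_bit_array(raw[1:], bit_length, 9))
# [0, 3, 5, 9, 11, 13, 16, 19, 22]
```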

6. Word combination data

This contains the information how names are created from the single words
There is 1 per “Name section”
This data varies in size depending on the “Combination offset data”
The structure here:

- x Bytes: 1st combination, going from "Combination offset data" entry 1 up to (but not including) entry 2. In the example the offsets are 0 and 3, so bytes 0 to 2 (3 Bytes)
- x Bytes: 2nd combination
...
- x Bytes: last combination going from last entry to end of data

How to decode:
The data needs to be translated to bits

The general structure:
   - length indicator bits, which are either 0, 01 or 11 as they appear in the binary notation below (read bit by bit, the 01 case is a 1 followed by a 0). They mean that 3, 6 or 10 bits follow
   - a number with as many bits as the length indicator says, giving the amount of words
   Then for each word:
       - length indicator
       - word position in the list created with "String offset data"

06 49 04 06 C8 56 14 73 14 ...
Showing the first hex values split by the "Combination offset data" -> reorder due to little endian,
their binary representation and the interpretation reading the binary from right to left

06 49 04 -> 04 49 06, 00000100 01001001 00000110
0: 3 bits next
011: number 3 -> 3 words
0: 3 bits next
000: number 0 -> word 1 from word list: Rare
01: 6 bits next
010010: number 18 -> word 19 from word list: Crafted
0: 3 bits next
010: number 2 -> word 3 from word list: Block
= Rare Crafted Block

06 C8 -> C8 06, 11001000 00000110
0: 3 bits next
011: number 3 -> 3 words
0: 3 bits next
000: number 0 -> word 1 from word list: Rare
0: 3 bits next
100: number 4 -> word 5 from word list: Crafting
0: 3 bits next
110: number 6 -> word 7 from word list: Ingredient
= Rare Crafting Ingredient

56 14 73 14	-> 14 73 14 56, 00010100 01110011 00010100 01010110
0: 3 bits next
011: number 3 -> 3 words
01: 6 bits next
010001: number 17 -> word 18 from word list: Decorative
01: 6 bits next
001100: number 12 -> word 13 from word list: Beacon
11: 10 bits next
0001010001: number 81 -> word 82 from word list: Fitting
= Decorative Beacon Fitting
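
The three worked examples above can be reproduced with a short Python sketch. The names `BitReader`, `read_var` and `decode_combination` are invented for this illustration, and the small `words` table only contains the entries named in the examples, not the full word list.

```python
class BitReader:
    """Reads bits starting from the lowest bit of the first byte."""

    def __init__(self, data):
        self.value = int.from_bytes(data, "little")
        self.pos = 0

    def read(self, bit_count):
        result = (self.value >> self.pos) & ((1 << bit_count) - 1)
        self.pos += bit_count
        return result

def read_var(reader):
    """Length indicator first, then a 3, 6 or 10 bit number."""
    if reader.read(1) == 0:
        return reader.read(3)
    if reader.read(1) == 0:
        return reader.read(6)
    return reader.read(10)

def decode_combination(data):
    """Return the zero-based word positions of one combination."""
    reader = BitReader(data)
    word_count = read_var(reader)
    return [read_var(reader) for _ in range(word_count)]

# Only the words that appear in the examples above.
words = {0: "Rare", 2: "Block", 4: "Crafting", 6: "Ingredient",
         12: "Beacon", 17: "Decorative", 18: "Crafted", 81: "Fitting"}

for hex_bytes in ("06 49 04", "06 C8", "56 14 73 14"):
    positions = decode_combination(bytes.fromhex(hex_bytes))
    print(positions, " ".join(words[p] for p in positions))
```

Note that the positions returned here are zero-based, so position 18 is what the text above calls word 19.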

7. String offset data

This is a list of offsets into the “String data”. The difference between two neighbouring entries is the length of that string
There is 1 per “Name section”
This data varies in size depending on the “String data”
The structure here:

- 1 Byte: bit length of values
- x Bytes: bit array with above bit length for each entry. Ends with start of "String data"

How to decode (Works same as "Combination offset data"):
0A 00 10 A0 C0 03 13 6C ...
0A is the length of 10 bits

The following data needs to be translated to bits and split into above bit length (here: 10 bits) to interpret them as a number.
Showing the first hex values, their binary representation -> the split value = its decimal value

00 00000000
10 000100 00 -> 00 00000000 = 0
A0 1010 0000 -> 0000 000100 = 4
C0 11 000000 -> 000000 1010 = 10
03  00000011 -> 00000011 11 = 15
13 00010011 
6C 011011 00 -> 00 00010011 = 19

8. String data

This is the text of all words without any separator
There is 1 per “Name section”
This data varies in size depending on the amount of words in it
The structure here:

- x Bytes: 1 byte per character in ISO_8859_1 (Latin-1) encoding until next "Name section" starts or file ends

How to decode:
Take the offsets from "String offset data". Each word runs from its own offset up to the next one, so the difference is its length
0, 4: Offset 0. 4 - 0 = 4 Length -> Rare
4, 10: Offset 4. 10 - 4 = 6 Length -> Common
10, 15: Offset 10. 15 - 10 = 5 Length -> Block
15, 19: Offset 15. 19 - 15 = 4 Length -> Tool
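
The same slicing in Python, as a minimal sketch. `split_words` is a name invented for this example, and the sample data is just the four words from the list above written without separators.

```python
def split_words(string_data, offsets):
    """Cut the string data at each pair of neighbouring offsets."""
    return [string_data[start:end].decode("latin-1")
            for start, end in zip(offsets, offsets[1:])]

string_data = b"RareCommonBlockTool"
offsets = [0, 4, 10, 15, 19]
print(split_words(string_data, offsets))  # ['Rare', 'Common', 'Block', 'Tool']
```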

Enjoy creating great things with this knowledge

Simoyd · May 12, 2020, 6:00pm

@willcrutchley was working on this too. not sure if he made any headway

Mayumichi · May 12, 2020, 8:02pm

Maybe I’ll have a read again tomorrow, but this is some weird stuff to me. I wonder if it’s a completely custom solution or if it’s based on some actual spec but modified a little to be more efficient, like the msgpack format they used.

lucadeltodecso · May 12, 2020, 8:57pm

you’re crazy!

the header is:

u8: max-index for the metals color palette
u16: number of encoded ItemType's
{
    u16: ItemType
    u8  : subtitle-index
} * number-of-encoded-ItemType's [ sorted for binary search ]
u8 : number of encoded languages [ this is the 05 byte in your "Language Definition" section in OP ]

[ I could have saved you a lot of time by copy-pasting the format but cool to see you figure it out! ]

the “String offset data” to be more specific, has N+1 entries for N words, the last offset being the “end() iterator” (to use a c++ analogy) so that you can always determine the length of the word by using its offset + the next offset in the list

(also you have a typo in the var-length description: it’s 10 bits, not 9 bits, for the 11 case.)

(my format description)


u8  : metals-palette max index
u16 : number of encoded ItemType's
{
  u16 : ItemType
  u8  : subtitle-index
} * number-of-encoded-ItemType's  [ sorted by itemId for binary search ]
u8 : number of encoded languages
{
  u8 : length of language identifier
  char * length : identifier [ no null terminator ] ascii
  u32 : offset in buffer to start of language encoding
} * number-of-encoded-languages
{
  u32 : offset in buffer to start of color strings
  u32 : offset in buffer to start of metal strings
  u32 : offset in buffer to start of item-title strings
  ENCODINGS : item-subtitles [ indexed by subtitle-index of encoded ItemType ]
  ENCODINGS : colors [ indexed by BlockColorIndex - 1 ]
  ENCODINGS : metals [ indexed by BlockColorIndex - 1 ]
  ENCODINGS : item-titles [ indexed by index in encoded-ItemType's ]
} * number-of-encoded-languages
where:
ENCODINGS = {
  u32 : offset in buffer to start of "encodings"
  u32 : offset in buffer to start of words-index
  u32 : offset in buffer to start of "words"
  u8  : bit-count for encodings-index values
  uN*indices : offset in "encodings" to start of encoding for each index.
  ENCODING*indices : "encodings" for each index
  u8  : bit-count for words-index values
  uN*words+1: offset in "words" to start of that word (+1 for end() "iterator")
  char*?? : "words", no null-terminator delimination; use index of next word to terminate; ISO8859-1
}
where:
ENCODING = {
  var : length of encoding
  var*N : word-index
}
where:
var = variable-length value encoded as:
   0  ++ u3
   10 ++ u6
   11 ++ u10
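
A minimal Python sketch of the first part of this description (everything up to the language count). It assumes that the u16 values are little-endian like the pointers, which is not stated explicitly in the thread. `parse_header` is an invented name and the ItemType numbers in the sample are made up.

```python
import struct

def parse_header(buf):
    """Parse: u8 metals max index, u16 item count, {u16 ItemType, u8 subtitle-index} * count, u8 language count."""
    metals_max_index = buf[0]
    (item_count,) = struct.unpack_from("<H", buf, 1)
    pos = 3
    subtitle_index_by_item = {}
    for _ in range(item_count):
        item_type, subtitle_index = struct.unpack_from("<HB", buf, pos)
        subtitle_index_by_item[item_type] = subtitle_index
        pos += 3
    language_count = buf[pos]
    # pos + 1 is where the first language identifier entry begins
    return metals_max_index, subtitle_index_by_item, language_count, pos + 1

# Hand-built sample: 2 invented ItemTypes (10 and 11), 5 languages.
sample = (bytes([5]) + struct.pack("<H", 2)
          + struct.pack("<HB", 10, 0) + struct.pack("<HB", 11, 1)
          + bytes([5]))
print(parse_header(sample))  # (5, {10: 0, 11: 1}, 5, 10)
```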

Mayumichi · May 12, 2020, 9:05pm

This is the part that doesn’t make sense to me

Are they really not sequential? Or am I just too tired to think straight. Little endian but just arranged like this?

lucadeltodecso · May 12, 2020, 9:18pm

it is sequential, first bit (value & 1) is right-most in a bit representation of the numbers, so bit-wise you read right to left within each byte

Mayumichi · May 12, 2020, 9:23pm

Maybe it’ll make sense when implementing it, in the example the bits are read like you say, little endian bytes. Then it’s switched up when converting to dec, the bits are strung together in big endian order, it just makes my head spin

So it kind of goes
left to right bytes, containing
right to left bits, read as
left to right decimal numbers
?

Lorgar · May 12, 2020, 10:22pm

If you want you can also switch the order of the whole block:
… 03 C0 A0 10 00
convert it to bits
00000011 11000000 10100000 00010000 00000000
and then read the bits right to left converting them to decimal as 10 bit numbers
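
In Python the "switch the order of the whole block" idea corresponds to reading the bytes as one big little-endian integer and then taking 10 bits at a time from the low end. This is only a sketch of that idea and `unpack_via_big_integer` is an invented name.

```python
def unpack_via_big_integer(data, bit_length, count):
    """Treat the whole block as one little-endian number and slice it from the right."""
    block = int.from_bytes(data, "little")
    mask = (1 << bit_length) - 1
    return [(block >> (i * bit_length)) & mask for i in range(count)]

# The five bytes 00 10 A0 C0 03 from the "String offset data" example.
print(unpack_via_big_integer(bytes.fromhex("00 10 A0 C0 03"), 10, 4))  # [0, 4, 10, 15]
```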

little endian is also not my favorite for reading

DKPuncherello · May 12, 2020, 10:44pm

Why even use little endian? Is it actually more efficient?

willcrutchley · May 13, 2020, 1:25am

Hah, no wonder I was making slow progress… Nice one!

Stretchious · May 14, 2020, 3:59pm

lucadeltodecso:

ENCODINGS = {
  u32 : offset in buffer to start of "encodings"
  u32 : offset in buffer to start of words-index
  u32 : offset in buffer to start of "words"
  u8  : bit-count for encodings-index values
  uN*indices : offset in "encodings" to start of encoding for each index.
  ENCODING*indices : "encodings" for each index
  u8  : bit-count for words-index values
  uN*words+1: offset in "words" to start of that word (+1 for end() "iterator")
  char*?? : "words", no null-terminator delimination; use index of next word to terminate; ISO8859-1
}
where:
ENCODING = {
  var : length of encoding
  var*N : word-index
}
where:
var = variable-length value encoded as:
   0  ++ u3
   10 ++ u6
   11 ++ u10

I’m having trouble working this out in PHP. I’ve gotten as far as the above quoted section… specifically up to “uN*indices : offset in “encodings” to start of encoding for each index.”… which is where I am now stumped.

(Disclaimer: I’ve never really done anything like this before, nor had need to until now, so most of this is new to me).

Lorgar · May 14, 2020, 4:09pm

That is the part I named 5. Combination offset data
You can check the example I gave in that spoiler on how to decode it. Just skip the first byte as that is the u8 : bit-count for encodings-index values

Mayumichi · May 26, 2020, 10:20am

lucadeltodecso:

ENCODINGS = {
 u32 : offset in buffer to start of "encodings"
 u32 : offset in buffer to start of words-index
 u32 : offset in buffer to start of "words"
 u8 : bit-count for encodings-index values
 uN*indices : offset in "encodings" to start of encoding for each index.
 ENCODING*indices : "encodings" for each index
 u8 : bit-count for words-index values
 uN*words+1: offset in "words" to start of that word (+1 for end() "iterator")
 char*?? : "words", no null-terminator delimination; use index of next word to terminate; ISO8859-1
}

I’m a bit stuck

uN*indices
uN*words

Where do the word and index counts come from?

edit: They seem to come from the encodings-index and words-index arrays? Then I’m stuck on determining their lengths, the only way I see is to derive from the offsets?

u8 : bit-count for encodings-index values

The above determines the N, right? (in uN*indices)

lucadeltodecso · May 26, 2020, 12:18pm

It’s not explicit.

For metal names, there are ‘max-metal-index’ (first byte of data) indices (which happens to be 5)
For color names, there are 255
For item titles, the second two bytes of file as a uint16 gives the number of encoded items which is also how many titles there will be as it’s 1-1
For sub titles, you’d have to iterate the map of item-id to subtitle-index at top of file to see what the max index is and that + 1 is how many subtitles will be encoded

The game doesn’t need to know counts, it’s just indexing, so no need for any explicit counts
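
The four rules above, written as a small Python sketch. `entry_counts` is an invented name, and it assumes the header has already been read into the first byte and the list of subtitle-index values (one per encoded item, in file order). The sample values are made up.

```python
def entry_counts(metals_max_index, subtitle_indices):
    """Derive how many entries each ENCODINGS block holds, following the rules above."""
    return {
        "metals": metals_max_index,                  # first byte of the file
        "colors": 255,                               # fixed
        "item_titles": len(subtitle_indices),        # one title per encoded item
        "item_subtitles": max(subtitle_indices) + 1, # highest subtitle-index + 1
    }

# Invented sample: 4 items sharing subtitle indices 0, 1 and 3.
print(entry_counts(5, [0, 1, 1, 3]))
# {'metals': 5, 'colors': 255, 'item_titles': 4, 'item_subtitles': 4}
```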

Mayumichi · May 26, 2020, 12:23pm

Thank you, that’s exactly what I needed!

Goblinounours · May 26, 2020, 12:26pm

Dat structure, tho.

(sorry, couldn’t stop myself from laughing like an idiot)

Mayumichi · May 26, 2020, 8:19pm

Success! I’ll link this for anyone to freely use, it’s not tested that much yet and lacks all documentation but it seems to be functional now.

Couldn’t have figured this out without @Lorgar @DreamEvil or @lucadeltodecso, many thanks to everyone who helped!

It’s pure javascript, takes the itemcolorstrings.dat file in Uint8Array form as an input and spits out an object indexed by ItemID’s containing the item name and subtitle. If the output format isn’t what you want, the return statement is on line 66, fairly easy to modify as needed.

Include the file and call

boundlessDat.decode(uint8array) // pass the itemcolorstrings.dat data to the function


https://gist.github.com/mayumi7/ca9e58a21459ccc76ee09873cff5000f

boundlessDat.js

// Usage: boundlessDat.decode(uint8array)

boundlessDat = {
	decode : function(uint8Array) {
		let decoder = new DatDecoder();
		return decoder.decode(uint8Array);
	}
};

class DatDecoder {

This file has been truncated.

Angellus · July 20, 2020, 9:41pm

I am working on a project in Python that needed this, so I made a simple standalone script to do this in Python. Requires Python 3.7+. It fixes a couple of issues from @Mayumichi’s JS version and automatically handles all of the languages.

EDIT: Forgot to add the final dictionary mapping for the items. Give me ~10 to fix that… Fixed


https://gist.github.com/AngellusMortis/7c60ba09a7ccf8ae3b4ce1600a334f8c

parse_itemcolorstrings.py

#!/usr/bin/env python3

from __future__ import annotations

import argparse
import json
from collections import namedtuple
from dataclasses import dataclass
from struct import unpack_from
from typing import Dict, List

This file has been truncated.

Mayumichi · July 20, 2020, 9:46pm

Any notes on these? I’ve made some additions too like the full language support, could pair that up with the fixes and release a more robust JS version

Angellus · July 20, 2020, 9:49pm

The main one was you were always missing the last word from decodeEncodings.

itemSubtitles for example was producing a list of 95 words, when it should have been 96 words.

Line 133, you are skipping the last word.

if(toLen !== undefined) { // it's not the last element

Because it is a variable length string, you do need the ending index, so I am passing in the start index for the next section.
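
To illustrate the kind of off-by-one described here, a minimal Python sketch (not the code from either gist; the function names and sample data are invented). If only the start offsets are paired with each other, the last word has no end and gets dropped. Passing an explicit end offset, for example the start of the next section, restores it.

```python
def words_without_end(string_data, start_offsets):
    """Pairs each start with the next start, so the last word is lost."""
    return [string_data[s:e].decode("latin-1")
            for s, e in zip(start_offsets, start_offsets[1:])]

def words_with_end(string_data, start_offsets, end_offset):
    """Same slicing, but the last word ends at an explicitly supplied offset."""
    bounds = list(start_offsets) + [end_offset]
    return [string_data[s:e].decode("latin-1")
            for s, e in zip(bounds, bounds[1:])]

string_data = b"RareCommonBlockTool"
starts = [0, 4, 10, 15]
print(words_without_end(string_data, starts))    # ['Rare', 'Common', 'Block']
print(words_with_end(string_data, starts, 19))   # ['Rare', 'Common', 'Block', 'Tool']
```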