Unified system

In discussions about these systems, it was clear that the differences between the databases were simply a result of them being separate, and not due to any fundamental disagreements between developers. Everyone is keen to see them merged.

This spec proposes:

File format

The new format is very similar to the KDE format. However, only the tags used in this example are valid:

[MIME-Info text/html]
Encoding=UTF-8
Comment=HTML document
Comment[af]=...
[... etc. other translations ]
Patterns=*.htm;*.html
Contents=(strcmp-at 0 "<HTML")
Hidden=false

All KDE-specific tags have been removed, as well as the Icon field. Although all desktops need a way to determine the icon for a particular type, the icon used will depend on the desktop, and not only on the file type.

The type should be a standard MIME type where possible. If a special media type is required for non-file objects (directories, pipes, etc), then the media type 'inode' may be used.

The entries in Patterns are separated by semicolons. There is no trailing semicolon.

Although not part of the name-to-type mapping, the Comment field is left in for the sake of not having too many files.

The Hidden field is usually not present. It is used to indicate that this entry replaces all information for this MIME type read so far, instead of being merged with other records for the same type. The intent is to let users entirely replace existing types.
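
As a rough illustration of the format described above, the following Python sketch reads records into a dictionary. The function name parse_mimeinfo and the layout of the returned data are assumptions made for this example and are not defined by the spec. The sketch stores translated Comment keys verbatim, does not act on the Hidden field, and silently skips lines it does not recognise.

```python
def parse_mimeinfo(text):
    """Parse MIME-Info records into {mime_type: {key: value}} (illustrative only)."""
    records = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[MIME-Info ") and line.endswith("]"):
            # Section header: everything after "MIME-Info " is the type name.
            mime_type = line[len("[MIME-Info "):-1].strip()
            current = records.setdefault(mime_type, {})
        elif current is not None and "=" in line:
            # Split on the first '=' only, since Contents may itself contain '='.
            key, value = line.split("=", 1)
            if key == "Patterns":
                # Entries are separated by semicolons, with no trailing semicolon.
                current[key] = value.split(";")
            else:
                current[key] = value
    return records
```

Parsing the text/html example above with this sketch would give a Patterns list of two entries, '*.htm' and '*.html'.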

Directory layout

Unlike the KDE system, the files are not arranged in the filesystem by type. This approach is only possible for a tightly coordinated system. Consider, for example, that ROX-Filer adds a mapping from .DirIcon to 'image/png'. This cannot be specified in a file called image/png.desktop without conflicting with existing definitions for the type.

Since files are not named by type, each file may contain multiple types. The files should instead be named by the package that they come from to avoid conflicts and reduce loading times.

The directories to be used to load these files are:

  • /usr/share/mime/mime-info

  • /usr/local/share/mime/mime-info

  • ~/.mime/mime-info

Each of these directories contains a number of files with the '.mimeinfo' extension. Applications MUST NOT try to load other files. This is to allow for future extensions.

Programs modifying any of these files MUST update the modification time on the parent (mime-info) directory so that applications can easily detect the change. The rules from directories later in this list take precedence over conflicting rules from earlier directories. Thus, the user's settings take precedence over all others.
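
The loading rules can be sketched in Python as follows. The names MIME_INFO_DIRS, list_mimeinfo_files and directory_stamps are hypothetical. Skipping missing directories and sorting file names within a directory are assumptions of this sketch, as the spec does not say what to do in either case. Files are returned with the user's directory last, so that a loader processing them in order lets the user's settings take precedence.

```python
import os

# Search order from the list above; the user's directory comes last.
MIME_INFO_DIRS = [
    "/usr/share/mime/mime-info",
    "/usr/local/share/mime/mime-info",
    os.path.expanduser("~/.mime/mime-info"),
]

def list_mimeinfo_files(dirs=MIME_INFO_DIRS):
    """Return paths of .mimeinfo files, lowest-precedence directory first."""
    found = []
    for directory in dirs:
        try:
            names = sorted(os.listdir(directory))
        except OSError:
            continue  # assumption: a missing or unreadable directory is skipped
        found.extend(
            os.path.join(directory, name)
            for name in names
            if name.endswith(".mimeinfo")  # other files must not be loaded
        )
    return found

def directory_stamps(dirs=MIME_INFO_DIRS):
    """Map each existing directory to its modification time, for change detection."""
    stamps = {}
    for directory in dirs:
        try:
            stamps[directory] = os.stat(directory).st_mtime
        except OSError:
            pass
    return stamps
```

An application could keep the result of directory_stamps() and reload its database whenever a later call returns a different value.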

Pattern matching

KDE's Patterns field replaces GNOME's and ROX's ext/regex fields, since it is trivial to detect a pattern in the form '*.ext' and store it in an extension hash table internally. The full power of regular expressions was not being used by either desktop, and glob patterns are more suitable for filename matching anyway.

Applications MUST first try a case-sensitive match, then a case-insensitive one. This is so that main.C will be seen as a C++ file, but IMAGE.GIF will still use the *.gif pattern.
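
A minimal Python sketch of the two-pass match, using the standard fnmatch module as a stand-in for whatever glob matcher an implementation uses. The function name match_pattern is hypothetical, and returning the first matching pattern (rather than the best one) is a simplification.

```python
from fnmatch import fnmatchcase

def match_pattern(filename, patterns):
    """Return the first matching pattern: case-sensitive pass, then case-insensitive."""
    # First pass: exact case, so that 'main.C' can match '*.C' rather than '*.c'.
    for pattern in patterns:
        if fnmatchcase(filename, pattern):
            return pattern
    # Second pass: compare lower-cased forms, so 'IMAGE.GIF' still matches '*.gif'.
    lowered = filename.lower()
    for pattern in patterns:
        if fnmatchcase(lowered, pattern.lower()):
            return pattern
    return None
```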

Dealing with conflicts

If several patterns match then the longest pattern SHOULD be used. In particular, files with multiple extensions (such as Data.tar.gz) MUST match the longest sequence of extensions (eg '*.tar.gz' in preference to '*.gz'). Literal patterns (eg, 'Makefile') must be matched before all others. It is acceptable to match patterns of the form '*.text' before other wildcarded patterns (that is, to special-case extensions using a hash table).
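
One way to apply these preferences is sketched below in Python. The helper names is_literal and best_pattern are hypothetical, and treating '*', '?' and '[' as the wildcard characters is an assumption, since the spec only refers to glob patterns. Case handling is left out to keep the example short.

```python
from fnmatch import fnmatchcase

def is_literal(pattern):
    """A pattern with no glob metacharacters names one exact filename."""
    return not any(ch in pattern for ch in "*?[")

def best_pattern(filename, patterns):
    """Pick a literal match if there is one, otherwise the longest matching pattern."""
    matches = [p for p in patterns if fnmatchcase(filename, p)]
    if not matches:
        return None
    literals = [p for p in matches if is_literal(p)]
    if literals:
        return literals[0]
    return max(matches, key=len)
```

With the patterns '*.gz' and '*.tar.gz', the name Data.tar.gz selects '*.tar.gz' because it is the longer of the two matching patterns.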

If the same pattern is defined twice, then the definitions MUST be ordered by the directory each rule came from (this is to allow users to override the system defaults if, for example, they are using a common extension to mean something else). If they came from the same directory, either can be used.

If the same type is defined in several places, the Patterns and Comments MUST be merged. If two different comments are provided for the same MIME type in the same language, they should be ordered by directory as before.
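
The merge can be sketched in Python as follows. The function name merge_records and the record layout (a Patterns list plus a Comments dictionary keyed by language, with the empty string for the untranslated comment) are assumptions made for this example. The input is assumed to be ordered from the lowest-precedence directory to the user's directory, and the Hidden field is ignored.

```python
def merge_records(records_in_order):
    """Merge (mime_type, record) pairs given lowest-precedence directory first."""
    merged = {}
    for mime_type, record in records_in_order:
        entry = merged.setdefault(mime_type, {"Patterns": [], "Comments": {}})
        # Patterns accumulate across all definitions of the type.
        for pattern in record.get("Patterns", []):
            if pattern not in entry["Patterns"]:
                entry["Patterns"].append(pattern)
        # For the same language, the comment from the later directory replaces the earlier one.
        entry["Comments"].update(record.get("Comments", {}))
    return merged
```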

Common types (such as MS Word Documents) will be provided in the X Desktop Group's package, which SHOULD be required by all applications using this specification. Since each application will then only be providing information about its own types, conflicts should be rare.

Contents matching

The value of the Contents attribute is a Scheme-like expression. If the expression evaluates to a true value then the file is assumed to be of this type. Since scanning a file's contents can be very slow, applications may choose to do pattern matching first and only fall back to content matching if that fails, or not perform content matching at all.

An expression is a list of space-separated items surrounded by parentheses, eg:

(strcmp-at 0 "<?xml ")

The first element of the list (strcmp-at in this example) is the name of a function. The remaining elements are its arguments. The result of evaluating the expression is the result of applying the function to the arguments. Each argument may be:

An integer

A 64-bit signed integer, such as 32.

A string

A string of characters with C-style escaping. This string contains the sequence of bytes <0, 8, 9, 10>: "\0\010\t\xa".

A symbol

A symbol is a constant for the file being tested. For example, size evaluates to the file's size.

A list

Lists may be nested. Each sub-list is evaluated in the same way as the top-level list, eg (+ (* 3 2) (* 4 3)) is 18.
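
The expression syntax can be read with a small recursive structure, sketched here in Python. The names parse and evaluate are hypothetical. This sketch keeps string arguments as raw text without decoding C-style escapes, does not enforce the 64-bit integer range, and only evaluates the '+' and '*' functions, which is enough to reproduce the nested example above.

```python
import re

# One token: '(' or ')' or a double-quoted string or a bare atom.
TOKEN = re.compile(r'\s*(?:(\()|(\))|"((?:[^"\\]|\\.)*)"|([^\s()"]+))')

def parse(text):
    """Parse one expression into nested lists; strings become ('str', raw_text)."""
    stack = [[]]
    pos = 0
    text = text.strip()
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if m is None:
            raise ValueError("bad token at offset %d" % pos)
        pos = m.end()
        open_p, close_p, string, atom = m.groups()
        if open_p:
            stack.append([])
        elif close_p:
            if len(stack) < 2:
                raise ValueError("unbalanced ')'")
            done = stack.pop()
            stack[-1].append(done)
        elif string is not None:
            stack[-1].append(("str", string))
        else:
            # Integers become int; anything else is a symbol or function name.
            stack[-1].append(int(atom) if re.fullmatch(r"-?\d+", atom) else atom)
    if len(stack) != 1 or len(stack[0]) != 1:
        raise ValueError("expected exactly one complete expression")
    return stack[0][0]

def evaluate(node):
    """Evaluate integer-only expressions built from '+' and '*'."""
    if isinstance(node, int):
        return node
    if isinstance(node, list):
        name, *args = node
        values = [evaluate(arg) for arg in args]  # sub-lists are evaluated recursively
        if name == "+":
            return sum(values)
        if name == "*":
            product = 1
            for value in values:
                product *= value
            return product
    raise ValueError("unsupported in this sketch: %r" % (node,))

print(evaluate(parse("(+ (* 3 2) (* 4 3))")))  # prints 18
```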

Functions may return integers or strings. True is represented by the integer 1, and false by 0. The following functions and symbols are provided; the file-based examples assume a 10-byte file containing the text HelloWorld:

Function example        Result               Description
(+ 1 2 3)               6                    The sum of the arguments
(- 10 6 6)              -2                   The first argument minus the sum of the remaining arguments
(* 2 2 3)               12                   The product of the arguments
(/ 20 2 2)              5                    The first argument divided by the product of the remaining arguments
(> 1 2)                 0                    True iff the first argument is greater than the second
(< 1 2)                 1                    True iff the first argument is less than the second
(= 1 2)                 0                    True iff the first argument is equal to the second
(not size)              0                    True iff the argument is false (0 or "")
(and "one" 2 3)         3                    The first false argument, or the last argument if none are false
(or 0 "" 2 0)           2                    The first true argument, or the last argument if none are true
(& 3 6)                 2                    Bit-wise AND of the arguments
(| 3 6)                 7                    Bit-wise OR of the arguments
(^ 3 6)                 5                    Bit-wise XOR of the arguments
size                    10                   The size of the file in bytes
(strcmp-at 0 "Hello")   1                    True iff the string starting at the file offset given by the first argument matches the second argument
(byte-at 0)             72                   The signed byte at the given file offset
(big-16 4)              28503                The big-endian 16-bit signed integer starting at the given file offset
(little-16 4)           22383                The little-endian 16-bit signed integer starting at the given file offset
(big-32 0)              1214606444           As above, but for a 32-bit big-endian integer
(little-32 0)           1819043144           As above, but for a 32-bit little-endian integer
(big-64 0)              5216694956358856562  As above, but for a 64-bit big-endian integer
(little-64 0)           8245905578810697032  As above, but for a 64-bit little-endian integer
(string-at 4 6)         "oWorld"             The string of bytes starting at the offset given by the first argument and of length given by the second argument

The 'and' and 'or' functions should only evaluate as many arguments as are necessary to determine the result.
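
The file-access functions and the short-circuit behaviour can be sketched in Python over an in-memory bytes object. All function names here are hypothetical, and raising an error on a read past the end of the file is an assumption, as the spec does not define that case. To keep evaluation lazy, lazy_and and lazy_or take zero-argument callables and call only as many as they need.

```python
def strcmp_at(data, offset, expected):
    """1 if the bytes at offset equal expected, else 0."""
    return 1 if data[offset:offset + len(expected)] == expected else 0

def int_at(data, offset, bits, byteorder):
    """Signed integer of the given width; byteorder is 'big' or 'little'."""
    width = bits // 8
    chunk = data[offset:offset + width]
    if len(chunk) != width:
        raise ValueError("read past end of file")  # assumption: not defined by the spec
    return int.from_bytes(chunk, byteorder, signed=True)

def byte_at(data, offset):
    """The signed byte at offset."""
    return int_at(data, offset, 8, "big")

def string_at(data, offset, length):
    """The length bytes starting at offset."""
    return data[offset:offset + length]

def is_true(value):
    """False values are the integer 0 and the empty string."""
    return not (value == 0 or value == b"")

def lazy_and(thunks):
    """Return the first false result, or the last result if none are false."""
    result = 1
    for thunk in thunks:
        result = thunk()
        if not is_true(result):
            return result  # later arguments are never evaluated
    return result

def lazy_or(thunks):
    """Return the first true result, or the last result if none are true."""
    result = 0
    for thunk in thunks:
        result = thunk()
        if is_true(result):
            return result  # later arguments are never evaluated
    return result
```

With data set to b"HelloWorld", int_at(data, 4, 16, "big") gives 28503 and string_at(data, 4, 6) gives b"oWorld", in line with the table above.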

Security implications

The system described in this document is intended to allow different programs to see the same file as having the same type. This is to help interoperability. The type determined in this way is only a guess, and an application MUST NOT trust a file based simply on its MIME type. For example, a downloader should not pass a file directly to a launcher application without confirmation simply because the type looks 'harmless' (eg, text/plain).

Do not rely on two applications getting the same type for the same file, even if they both use this system. The spec allows some leeway in implementation, and in any case the programs may be following different versions of the spec.