What is a file type?
Let’s not go as deep as “What is a file?”, but before we start, let’s take a look at file types. File types are determined by the contents of the files themselves, and are used to choose the right program to open each file. In Microsoft Windows, file extension globbing is the sole method of identifying file types. Users must give each file name a conventional ending, and the system looks up that ending to provide the correct icon and program.
Things are a little different in Ubuntu. Of course, globbing (the most basic method) is present for file types, but Ubuntu has a few other tricks up its sleeve. One of these is magic numbers. The magic number of a binary file is the first few bytes, which identify the file type. The definition of a “magic” number has somewhat loosened in recent years; it can now mean any piece of data, generally near the beginning of a file, that can be used to uniquely identify the type.
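As a rough illustration of magic-number detection, here’s a minimal sketch that compares the first bytes of some data against a small table. The table and the name sniff are my own assumptions for this example; the real database holds far more rules than this.

```python
from typing import Optional

# A tiny, hand-written signature table (an assumption for illustration only;
# the real MIME database has many more rules, with offsets and masks).
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"%PDF-", "application/pdf"),
    (b"\x1f\x8b", "application/gzip"),
]


def sniff(data: bytes) -> Optional[str]:
    """Return the MIME type whose signature starts the data, or None."""
    for signature, mime_type in SIGNATURES:
        if data.startswith(signature):
            return mime_type
    return None


print(sniff(b"%PDF-1.4\n"))   # application/pdf
print(sniff(b"plain text"))   # None
```

Note that the file name never enters into it: only the leading bytes are inspected.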
Another, more powerful, but too rarely used feature is XML namespace matching. Without this feature, XML files couldn’t be identified any more specifically, except through extension globbing, of course. Namespace matching allows for quick detection of an XML-based format based on not only the namespace, but also the root element. For example, XHTML files (application/xhtml+xml) can be matched not only by an xhtml file extension, but also by their namespace URI (http://www.w3.org/1999/xhtml) and their root element (html).
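To make the idea concrete, here’s a small sketch that pulls the namespace URI and root element name out of a document using Python’s standard ElementTree module. The helper name root_identity is hypothetical, and this is only an approximation of what a real detector does.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def root_identity(xml_text):
    """Return (namespace URI, local name) of the document's root element."""
    root = ET.fromstring(xml_text)
    # ElementTree writes namespaced tags in the form "{uri}local".
    if root.tag.startswith("{"):
        uri, local = root.tag[1:].split("}", 1)
        return uri, local
    return None, root.tag


doc = '<html xmlns="http://www.w3.org/1999/xhtml"><head/><body/></html>'
print(root_identity(doc) == (XHTML_NS, "html"))  # True
```

A file named anything at all could be recognised as XHTML this way, as long as its root element and namespace match.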
How are file types detected?
In Ubuntu, programs such as Nautilus use the shared MIME database (shared-mime-info) as the sole location for file type information. Unfortunately, other Gnome facilities such as file open and save dialogs only use extension globbing, and are independent of the MIME database. The databases are arranged in tiers, much as programs can live in /bin, /usr/bin, /usr/local/bin and ~/bin. They can be found in the following directories:
- /usr/share/mime
- /usr/local/share/mime
- ~/.local/share/mime
As with the program tiers, the convention is that only MIME types installed from Ubuntu packages belong in the first tier. System-wide changes made by the user, or by programs installed via make install, go in the second tier, while changes local to the user go in the third.
The directories inside these MIME databases represent MIME groups, for example ./video for video/* MIME types, and ./application for application/* types. Not all of these directories may exist; they’ll be created on demand for file types. In these directories, there are multiple XML files, each named after its MIME subtype (the part after the slash). They contain nodes with information about magic numbers, extension globs, parent types, child and alias types, and the file type description (often in multiple languages).
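Here’s a rough sketch of how a MIME type maps onto those XML files. The directory list is an assumption based on the usual locations (/usr/share/mime for packages, /usr/local/share/mime for system-wide overrides, ~/.local/share/mime for the user), and source_paths is a hypothetical helper.

```python
from pathlib import Path

# Usual database locations (an assumption; adjust for your system).
MIME_DIRS = [
    Path("/usr/share/mime"),
    Path("/usr/local/share/mime"),
    Path.home() / ".local/share/mime",
]


def source_paths(mime_type, bases=MIME_DIRS):
    """Return the candidate XML file for a MIME type in each database."""
    group, subtype = mime_type.split("/", 1)
    return [base / group / (subtype + ".xml") for base in bases]


for path in source_paths("video/x-matroska"):
    print(path, "(exists)" if path.exists() else "(missing)")
```

Running it on your own machine shows which tiers actually know about a given type.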
The update-mime-database command, invoked manually or by a trigger that runs when packages are changed, draws upon the information in these files and turns it into fast-seeking formats that aren’t as friendly as XML. These real databases are in the following files:
- aliases: alternate names for MIME types
- generic-icons: system icons to be used for files
- globs: extension globbing without priority values (deprecated)
- globs2: extension globbing with priority values (current)
- icons: custom icons for odd file types
- magic: magic number database
- mime.cache: master cache with the entire database
- subclasses: child file types
- treemagic: detection of directory structures
- types: a list of MIME types
- XMLnamespaces: detection through XML namespaces and elements
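To give a feel for these generated files, here’s a minimal sketch that reads globs2-style lines of the form weight:type:pattern and picks the best match for a file name. The sample lines and their weights are made up for the example, and the tie-break on pattern length is my reading of the usual rule, so treat it as an assumption.

```python
from fnmatch import fnmatchcase

# Made-up sample in the globs2 layout: weight:MIME type:pattern
SAMPLE_GLOBS2 = """\
# comment lines start with a hash
50:text/plain:*.txt
55:video/x-matroska:*.mkv
60:text/x-readme:README*
"""


def parse_globs2(text):
    """Turn globs2-style text into (weight, mime_type, pattern) tuples."""
    rules = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        rules.append((int(fields[0]), fields[1], fields[2]))
    return rules


def guess_by_name(filename, rules):
    """Pick the matching rule with the highest weight, then longest pattern."""
    matches = [rule for rule in rules if fnmatchcase(filename, rule[2])]
    if not matches:
        return None
    best = max(matches, key=lambda rule: (rule[0], len(rule[2])))
    return best[1]


rules = parse_globs2(SAMPLE_GLOBS2)
print(guess_by_name("README.txt", rules))  # text/x-readme
print(guess_by_name("film.mkv", rules))    # video/x-matroska
```

README.txt matches two patterns here, and the weights are what settle the argument.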
A long time ago, when I was learning about the MIME database, I used Bless to directly edit these files to create changes, but was always confused by my changes immediately disappearing. This is because the information is converted one way from the XML files to the cache files.
The structure of the XML files
Before we use programs to modify the MIME database for convenience, here’s a quick breakdown of the format of the XML files in the database. The root element is mime-info, in the shared MIME info namespace, http://www.freedesktop.org/standards/shared-mime-info.
This root element contains any number of mime-type nodes, providing detection information about a file type. You could even have an empty mime-info node, but that isn’t productive at all.
The following are a selection of the most important elements that can be found in mime-type nodes:
- glob nodes with a simple wildcard glob in a pattern attribute. A weight attribute from 0 to 100 is optional, and defaults to 50:
<glob pattern="*.mkv" weight="55"/>
- glob-deleteall and magic-deleteall nodes, which clear any globs or magic numbers cascaded from previously parsed files and start afresh
- magic nodes with an optional priority attribute from 0 to 100 (again defaulting to 50). These contain match nodes, which define rules for matching using magic numbers. These are the attributes to be used with match elements:
- type: one of string, host16, host32, big16, big32, little16, little32 or byte
- offset: where to check for the magic, using a single numeric offset or a range notated start:end
- value: the value to match with (numeric for any type other than string)
- mask: an optional attribute that allows more detailed matches by running a bitwise AND on the potential match before testing. For numeric types the mask is a number of the type specified; for strings it is a hexadecimal value starting with 0x
<match type="string" offset="0" value="DVDVIDEO"/>
- alias nodes, with a type attribute specifying alternate or deprecated MIME types that are equivalent
- sub-class-of nodes, with a type attribute specifying the parent MIME type
- comment, acronym and expanded-acronym nodes that help describe the file type to people; xml:lang attributes can be used to distinguish language
- root-XML nodes, which determine types using XML namespaces, with namespaceURI and localName (root element) attributes
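The match attributes are easier to follow with a worked example. Here’s a rough sketch of how a string match with an offset range and an optional mask could be evaluated. The function names are hypothetical, the range is treated as inclusive (an assumption), and the mask is passed as raw bytes rather than in its 0x notation.

```python
def parse_offset(offset):
    """Split "N" or "start:end" into an inclusive (first, last) pair."""
    start, _, end = offset.partition(":")
    first = int(start)
    last = int(end) if end else first
    return first, last


def match_string(data, value, offset="0", mask=None):
    """Test a string-type match rule against the bytes of a file."""
    first, last = parse_offset(offset)
    for pos in range(first, last + 1):
        window = data[pos:pos + len(value)]
        if len(window) < len(value):
            break  # ran off the end of the data
        if mask is not None:
            # Bitwise AND on the candidate bytes before comparing.
            window = bytes(b & m for b, m in zip(window, mask))
        if window == value:
            return True
    return False


print(match_string(b"DVDVIDEO-VMG", b"DVDVIDEO"))               # True
print(match_string(b"\x00\x00DVDVIDEO", b"DVDVIDEO", "0"))      # False
print(match_string(b"\x00\x00DVDVIDEO", b"DVDVIDEO", "0:4"))    # True
# A mask of 0xDF per byte clears the ASCII lowercase bit.
print(match_string(b"dvdvideo", b"DVDVIDEO", "0", b"\xdf" * 8)) # True
```

The last call shows why masks are handy: a single rule can ignore bits that don’t matter, such as letter case.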
Here’s an excerpt from an example XML source file, showing the children of a mime-type node, that uses a couple of these features (this file type is bogus; I just created it for the example):
<comment xml:lang="en-AU">DML source document</comment>
<expanded-acronym xml:lang="en-AU">Delan's Markup Language</expanded-acronym>
<root-XML namespaceURI="http://azabani.com/dml" localName="dml"/>
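If you’d like to check a source file like this before installing it, here’s a small sketch that reads one with Python’s ElementTree. The wrapping mime-info and mime-type elements and the type name application/x-dml+xml are assumptions added so the example is complete, and describe is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

SMI = "{http://www.freedesktop.org/standards/shared-mime-info}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# The type name below is invented for this example.
SOURCE = """
<mime-info xmlns="http://www.freedesktop.org/standards/shared-mime-info">
  <mime-type type="application/x-dml+xml">
    <comment xml:lang="en-AU">DML source document</comment>
    <expanded-acronym xml:lang="en-AU">Delan's Markup Language</expanded-acronym>
    <root-XML namespaceURI="http://azabani.com/dml" localName="dml"/>
  </mime-type>
</mime-info>
"""


def describe(source):
    """Summarise each mime-type node in a shared MIME info source file."""
    summaries = []
    for node in ET.fromstring(source).findall(SMI + "mime-type"):
        comment = node.find(SMI + "comment")
        root_xml = node.find(SMI + "root-XML")
        summaries.append({
            "type": node.get("type"),
            "lang": comment.get(XML_LANG),
            "comment": comment.text,
            "namespace": root_xml.get("namespaceURI"),
            "root": root_xml.get("localName"),
        })
    return summaries


for summary in describe(SOURCE):
    print(summary)
```

Because the file declares a default namespace, every element has to be looked up with that namespace attached, which is what the SMI prefix is for.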
Assogiate: a GUI editor for the Gnome MIME database
Assogiate is a neat little program that lets you create and modify file types, changing the database in a quick and very user-friendly way. It can access the user database, ~/.local/share/mime, or the system override database, /usr/local/share/mime. Changes are not, however, placed in XML files named after the MIME type; instead, they are placed in ./packages/Override.xml, which keeps a record of the file types the user has changed.
Assogiate can be found in the Ubuntu universe repository:
sudo apt-get install assogiate
If it is not, you can download and compile it:
wget -O - http://azabani.com/files/apps/assogiate-0.2.1.tar.gz | tar xvz
cd assogiate-0.2.1 && ./configure && make && sudo make install
You can’t change the system override database unless the program runs as a privileged user, so start it as root whenever you need to edit that database.
In the Assogiate window, you can use the toolbar buttons to add and modify selected file types, remove and revert changes, or search for file types. The left pane allows you to narrow your view to groups of MIME types, or user modified types.
Adding and editing file types
The process for these two actions is very similar. When you are in the Edit Type dialog, you can edit canonical information, alias and parent types, globbing, magic numbers and XML namespace matching each in its own tab.