Preparation for the future XCF
Having a brand new XCF format has been a long-time project of the team (present in roadmap). This report is to gather needs, requirements and implementation ideas. It is definitely not planned for GIMP 3, but for after. I'm adding a %3.2 milestone as meaning "something we want to do at some point after GIMP 3 is released".
I am writing this report for myself and for other developers who may want to comment on the technical choices. It won't change GIMP much from the user's point of view, except that it will unlock a lot of possibilities and make GIMP easier to improve by removing the limits on what XCF can store (many issues are stuck on the complexity of implementing them while keeping the current format).
Some of the key points for a new format are:
- Built around GEGL Buffers. Since our internal buffers are GEGL buffers and it's meant to stay that way, it just makes sense.
- Partial load: files are getting bigger, not only because we want to store more in them, but also because source images keep growing (FullHD was a big deal 10 years ago, now it's 4K or more; for print it's far higher still; any camera nowadays produces photos of 5 or 10k pixels per dimension). Computer memory is bigger too, so very large project files are now common. Also, since GIMP is able to load bigger-than-memory images, more users are getting interested in GIMP when they need to process huge images. Nevertheless, this doesn't mean we want to load everything at once. Sometimes we can get away with partial loads until specific actions occur: e.g. loading previews only, or, with upcoming animation support, loading the buffers for some frames only rather than the whole project. This means better performance as well as a sensation of faster loads.
- Incremental save and autosave: with huge files, saving can take seconds or even minutes. This was a main blocker for autosave (we can't randomly block editing). Incremental saves are also a huge time-saver, but they are harder to implement with a custom file format because we have to create the whole API infrastructure to manage data offsets, sizes and whatnot. It would be much easier to depend on a stable and generic "container format" which implements this part for us and just provides us with a stable API.
- Embedding data: as we are talking more and more about the possibility of embedding data (used fonts, palettes or brushes associated with a project, etc.), this would be easier to do by moving, once again, to a well-known container format rather than appending the data to a custom file format and managing offsets ourselves.
Image data
This one is easy. The new format was always planned to be based on GEGL buffers. It also means that we need to make sure that the on-disk GEGL buffer format is stable from now on (I think it already is, but this needs to be confirmed).
Having GEGL buffers would give us autosave (and/or backup saves) nearly for free, as GEGL buffers can be synced to disc easily and very quickly. Nevertheless, with the additional container layer (see below), this might not be as true, unless we do things in 2 steps: uncompress data to disc, then load from the uncompressed data (and conversely: sync buffers directly to disc, then compress them into the container archive).
Container format
For auto-save and other features to be as easy as possible, the structure would follow the common pattern of a container format holding a single metadata file (indicating how the file is structured, such as which buffer corresponds to which layer, etc.), the image data itself (GEGL buffers) and other embedded data. This raises the question of which container format to use. Historically, a lot of formats just use zip (OpenDocument, OpenRaster, EPUB, even my recent work on GEX for GIMP extensions).
The advantages of zip are:
- It's very common and won't go away. Implementations are also many.
- A common convention is that the first file added to the zip is stored uncompressed, is called "mimetype" and contains only the registered MIME type. This way, all zip-based file formats have a very easy magic number at offset 38, which is the chosen MIME type. E.g. a .odt file has "mimetype" at offset 30 and "application/vnd.oasis.opendocument.text" at offset 38.
- While not offering extraordinary compression, zip is OK and relatively fast.
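The "mimetype" convention above can be sketched with Python's standard zipfile module. This is only an illustration: the MIME type used here and the "metadata.xml" entry name are placeholders, not a decision about the future format.

```python
import io
import zipfile

# Placeholder MIME type for illustration only; nothing is decided for the new format.
MIMETYPE = b"image/x-xcf"


def build_container(mimetype: bytes = MIMETYPE) -> bytes:
    """Build a small zip whose first entry is an uncompressed "mimetype" file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # The first entry is stored (not deflated) so its bytes appear verbatim.
        archive.writestr(zipfile.ZipInfo("mimetype"), mimetype,
                         compress_type=zipfile.ZIP_STORED)
        # Hypothetical metadata entry, compressed with the archive default.
        archive.writestr("metadata.xml", "<image/>")
    return buf.getvalue()


def sniff_mimetype(data: bytes, expected: bytes = MIMETYPE) -> bool:
    """Check the magic: the entry name at offset 30, its content at offset 38."""
    return (data[30:38] == b"mimetype"
            and data[38:38 + len(expected)] == expected)
```

The offsets follow from the zip local file header, which is 30 bytes long and is followed by the 8-byte entry name. This only holds when the first entry has no extra field, which is why the entry is written first and with default settings.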
Now very recently I read this article by the sqlite project about using sqlite as a complex format container and I found it quite interesting: https://www.sqlite.org/affcase1.html
As for the points above:
- I think that sqlite has quite a large userbase and is probably not going to disappear anytime soon, though I'm not sure if it has other implementations than the upstream one (we would likely use the upstream one anyway, the point is about format sustainability as we don't want to get stuck with an abandoned format).
- Not sure if sqlite has similar properties where we could store some data to be always found at a given offset. This being said, I'm wondering if we could not simply keep our XCF header ("gimp xcf v[0-9][0-9][0-9]" at offset 0) and store the sqlite db after that. This way, all existing readers would know it's a GIMP project file, simply with a bumped version, so they would not try to read it (then it doesn't matter that the format after this is in fact completely different). Since XCF has always been a versioned format anyway, we are able to do this.
- The article linked above says that the compression rate is similar, though we'd want to verify whether this is also true for image data. I also expect the speed to be similar, though this should be tested.
Now, here are the additional advantages I see in using sqlite:
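As a rough sketch of the header idea, here is how a reader could classify a file from the "gimp xcf v[0-9][0-9][0-9]" pattern alone and refuse versions it does not understand, without looking at what follows. The threshold value and the returned labels are hypothetical, chosen only for the example.

```python
import re

# Header pattern quoted from the text above, expected at offset 0.
HEADER_RE = re.compile(rb"gimp xcf v([0-9]{3})")

# Hypothetical: the highest XCF version this particular reader understands.
MAX_SUPPORTED_VERSION = 99


def classify(header: bytes) -> str:
    """Return "not-xcf", "supported" or "too-new" based on the header only."""
    match = HEADER_RE.match(header)
    if match is None:
        return "not-xcf"
    version = int(match.group(1))
    return "supported" if version <= MAX_SUPPORTED_VERSION else "too-new"
```

A reader written this way never inspects the bytes after the header for a "too-new" file, so it does not matter that they would belong to a completely different format.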
- Data safety: we take great care not to break XCF files, and there are even several safety measures in GIO to not overwrite an XCF file until the file descriptor is closed. Nevertheless, the atomic update abilities of sqlite would clearly be a step further in making sure that we can't break an XCF even if the software/OS crashes or the electricity goes down at just the wrong moment during an update.
- While zip also allows extracting or inserting/updating only part of the archive contents, it looks like sqlite would perform better. This really requires testing and a closer understanding of how both technologies operate, though.
Anyway, this mostly remains to be studied. Note that the idea has been floated that it would be nice to also allow folder-projects, which is basically about storing the XCF content directly on the file system (no container format). For instance, Ardour does this. It should not be the default, because single files are just more practical to share for most people, and also because without any compression, an XCF project would take a lot of space on disk, but it has advantages:
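To illustrate the atomic update point, here is a minimal sketch using Python's standard sqlite3 module. The "entries" table and its columns are assumptions made up for the example, not a proposed schema.

```python
import sqlite3
from typing import Iterable, Optional, Tuple


def open_project(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a project database with a hypothetical "entries" table."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS entries ("
        "path TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    conn.commit()
    return conn


def save_entries(conn: sqlite3.Connection,
                 entries: Iterable[Tuple[str, bytes]]) -> None:
    """Write several entries as one transaction: all of them or none."""
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        for path, data in entries:
            conn.execute(
                "INSERT OR REPLACE INTO entries (path, data) VALUES (?, ?)",
                (path, data),
            )


def read_entry(conn: sqlite3.Connection, path: str) -> Optional[bytes]:
    """Return the stored bytes for one entry, or None if it does not exist."""
    row = conn.execute(
        "SELECT data FROM entries WHERE path = ?", (path,)
    ).fetchone()
    return None if row is None else row[0]
```

If the iterable passed to save_entries raises halfway through (standing in for a failure in the middle of a save), the entries already written in that call are rolled back and the previous contents stay readable.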
- Without the container layer, load/save should be much faster. Even more for the near-instant sync discussed above.
- By breaking the data in sub-files, it is easier to sync (partial syncs for backups, etc.).
- Similar to the previous point, it makes XCF projects simpler to version (e.g. in git) because if you edit a single buffer, you don't have to re-version the whole project.
- It also makes the text parts (metadata) diffable.
Of course, zip looks like a better fit here because it is already file-system-like, but nothing prevents us from using path-like names as object names to query in a SQL db.
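The path-like naming idea could look like the following sketch, again with Python's standard sqlite3 module. The table layout and the entry names ("layers/0.buffer", "fonts/example.ttf", etc.) are made up for the example.

```python
import sqlite3
from typing import List, Optional


def make_demo_project() -> sqlite3.Connection:
    """Create an in-memory database whose keys look like file paths."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entries (path TEXT PRIMARY KEY, data BLOB NOT NULL)"
    )
    with conn:
        conn.executemany(
            "INSERT INTO entries (path, data) VALUES (?, ?)",
            [
                ("metadata.xml", b"<image/>"),
                ("layers/0.buffer", b"\x00" * 16),
                ("layers/1.buffer", b"\xff" * 16),
                ("fonts/example.ttf", b"not a real font"),
            ],
        )
    return conn


def list_entries(conn: sqlite3.Connection, prefix: str) -> List[str]:
    """List entry names under a "directory" prefix without reading any data."""
    rows = conn.execute(
        "SELECT path FROM entries "
        "WHERE substr(path, 1, length(?)) = ? ORDER BY path",
        (prefix, prefix),
    )
    return [row[0] for row in rows]


def load_entry(conn: sqlite3.Connection, path: str) -> Optional[bytes]:
    """Load a single entry; the other entries are never touched."""
    row = conn.execute(
        "SELECT data FROM entries WHERE path = ?", (path,)
    ).fetchone()
    return None if row is None else row[0]
```

Listing "layers/" and then loading a single buffer by name gives the same kind of access as a folder or a zip archive, and it is also one way partial loads could be expressed.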
Metadata format
XML still feels like the best fit for extensibility, backward and forward compatibility.
Typically, it makes it easy to add new tags or new attributes while staying very semantic. And when optional new features are added in between others, it is much easier to keep older versions of GIMP from breaking (we would just make them ignore unknown tags/attributes and display a warning that some features might be missing).
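A minimal sketch of this "ignore what you don't know, but warn" behavior, using Python's standard xml.etree.ElementTree. The element and attribute names are hypothetical and do not describe an actual metadata schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# Hypothetical attributes understood by this (older) reader.
KNOWN_LAYER_ATTRS = {"name", "buffer", "opacity"}


def load_layers(xml_text: str) -> Tuple[List[Dict[str, str]], List[str]]:
    """Parse layer metadata, skipping unknown tags/attributes with a warning."""
    root = ET.fromstring(xml_text)
    layers: List[Dict[str, str]] = []
    warnings: List[str] = []
    for child in root:
        if child.tag != "layer":
            # Unknown element: probably written by a newer version.
            warnings.append(f"ignoring unknown element <{child.tag}>")
            continue
        for attr in sorted(set(child.attrib) - KNOWN_LAYER_ATTRS):
            warnings.append(f"ignoring unknown attribute '{attr}' on <layer>")
        layers.append(
            {k: v for k, v in child.attrib.items() if k in KNOWN_LAYER_ATTRS}
        )
    return layers, warnings
```

The returned warnings are what an older GIMP could show to tell the user that some features might be missing, while still loading everything it understands.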
libxcf
It might be the opportunity to move XCF loading and saving to an external library libxcf.
Backward compatibility
Obviously all this discussion about a new format doesn't mean we will break loading old XCF files. We will keep code loading old XCF files forever. Being able to load any XCF files, even from the very early GIMP versions, is a key feature in GIMP. This whole report is only about a vision for a future format which will be much more easily extended, easier to save, and so on.