Experimenting with epub - Creation
By Joseph Gray
Moderator’s note: We welcome Joseph Gray, a super-helpful TeleBlog commenter and a standards-loving IT guy, as our latest contributor. This is Part I of a two-part series. - DR
With the recent finalization by the IDPF of the three specifications that comprise an epub, I thought I would see exactly what this new ebook format was capable of. For testing purposes, I created an epub using the information in the IDPF specifications. To the best of my knowledge, the only commercial software currently available for creating an epub is Adobe InDesign. I took the low-tech approach and used a text editor.
In this article, I will detail the steps necessary to create an epub using a text editor and a program like WinZip or 7Zip. In the second installment, I will describe some of my experiences with the epub reading software currently available. These are FBReader, Adobe Digital Editions and the Openberg Lector plugin for Firefox.
I initially found information about the process of creating an epub detailed on a few other web sites. Although very helpful, some of the information on these other sites was written before the IDPF specifications were finalized, so it was no longer completely accurate. I will provide an updated example here. Note that any information provided in this article is my interpretation of the specifications and may also be in error. You should check the specification documents yourself to ensure accuracy.
1 - epub Specifications
There are three specifications that make up an epub ebook. All three can be found at the IDPF web site. Open Publication Structure (OPS) 2.0 details the structure of the files that make up the content of your ebook. In a nutshell, this content consists of XHTML 1.1 and CSS 2.0. The OPS specification lists a required subset of each of these, along with a few differences. Open Packaging Format (OPF) 2.0 details two special files that provide information about your ebook (author, publisher, etc.) and also list all of the files that make up your ebook. OEBPS Container Format (OCF) 1.0 describes the container, which is just an archive file that uses Zip compression, and what goes into it. The complete epub ebook that you distribute is in fact this one Zip file, with a file extension of “.epub” instead of “.zip”.
At the present time, there is no DRM standard incorporated into the epub specifications. That will be something that will have to be decided as part of a future specification. Personally, I hope that the need for DRM will eventually disappear.
2 - The Parts of an epub
The files used to create an epub are almost all based on XML, so each file has a logical structure. Because of this, software tools could be written to automate much of the process of creating an epub. To create an epub manually, you can simply use the files that I am going to describe as templates for creating other epub ebooks.
IMPORTANT NOTE: In the following files and in the names of the folders described later, practically everything is case sensitive. If you want to use your own names for things, make sure you are consistent with upper and lower case. One other thing to watch for is that for valid XHTML 1.1, all attribute values must be surrounded by quotes. Things may not work correctly if you forget the quotes.
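A quick way to catch missing quotes or mismatched case before packaging is to run each file through an XML parser. The following is only a minimal sketch in Python using the standard library; the function name and the sample strings are my own, and a successful parse only shows that a file is well-formed XML, not that it is valid XHTML 1.1 or a valid epub.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses as XML, else (False, message)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None


# Three illustrative snippets (not taken from the sample ebook).
good = '<p class="center">Quoted attribute value</p>'
bad = '<p class=center>Unquoted attribute value</p>'
mixed_case = '<Body><p>Tag names differ only in case</p></body>'

for label, text in (("quoted", good), ("unquoted", bad), ("mixed case", mixed_case)):
    ok, message = is_well_formed(text)
    print(label, "->", "well-formed" if ok else "rejected: " + message)
```

The first snippet is accepted and the other two are rejected, which is the kind of mistake the note above warns about.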
Let’s begin the creation of our epub with some simple content. We’ll create a skeleton ebook, consisting of a title page and one chapter. You can expand on this, of course. You can use almost any valid XHTML 1.1 and CSS 2.0 to create your content. See the OPS specification for details. You could use in-line styles, but I’m going to use a separate stylesheet. You could also use different stylesheets for each HTML file, as long as they are listed in the manifest (we’ll get to this later).
To save space, I’m going to omit extra blank lines and indentation that would otherwise make things more readable. The completed epub file that is attached to this article is formatted better. Note that you may need to right-click and “save as” to download the attached epub. I’m not sure how various web servers and web browsers will interpret this file type, even though it is a Zip file.
EXTRA NOTE: Some lines may break and wrap, due to the formatting requirements of this blog. Double check things against the downloaded epub.
Save the following file as “title_page.html”.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Title Page</title>
<link href="stylesheet.css" type="text/css" rel="stylesheet" />
</head>
<body>
<h1 class="center">The Great American Novel</h1>
<h2 class="center">by John Q. Public</h2>
<div class="center">
<object data="images/author-pic.svg" type="image/svg+xml" width="150" height="150">
</object>
</div>
<!-- Put anything else you want here. -->
</body>
</html>
You will note that I did not use a bitmapped image, but an SVG (Scalable Vector Graphics) image. This is just one of the required image types supported by the epub standard. I will have more to say about this in the next article.
Save the following file as “chapter01.html”.
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Chapter 1</title>
<link href="stylesheet.css" type="text/css" rel="stylesheet" />
</head>
<body>
<h1 class="center">Chapter01</h1>
<p>It was a dark and stormy night. Suddenly a shot rang out.</p>
<!-- Put anything else you want here. -->
</body>
</html>
Now for our stylesheet. I’m going to make it a simple one. Save it as “stylesheet.css”.
body
{
font-family: "Georgia", "Times New Roman", "Times", serif;
margin-left: 3%;
margin-right: 3%;
text-align: justify; /* I’ll talk about this in the next article */
}
p
{
text-indent: 1.5em;
margin-top: 0;
margin-bottom: 0.2em;
}
.center
{
text-align: center;
}
Now we need to create a file that describes our ebook and what files it is composed of.
Save this file as “content.opf”.
<?xml version="1.0"?>
<package version="2.0" xmlns="http://www.idpf.org
<metadata xmlns:dc="http://purl.org/dc/elements/1
xmlns:dcterms="http://purl.org/dc/terms/"
xmlns:opf="http://www.idpf.org/2007/opf"
xmlns:xsi="http://www.w3.org/2001/XMLSchem
<dc:title>The Great American Novel</dc:title>
<dc:language xsi:type="dcterms:RFC3066">en
<dc:identifier id="BookID">urn:uuid:b0425ef2
<dc:date xsi:type="dcterms:W3CDTF">2007
<dc:creator opf:file-as="Public, John Q.">John Q. Public</dc:creator>
<dc:publisher>Acme Publishing</dc:publisher>
<dc:type>Fiction</dc:type>
<dc:rights>Copyright 2007 by John Q. Public</dc:rights>
</metadata>
<manifest>
<item id="ncx" href="toc.ncx" media-type="application/x
<item id="style" href="stylesheet.css" media-type="text/css" />
<item id="titlepage" href="title_page.html" media-type="application/xhtml
<item id="chapter01" href="chapter01.html" media-type="application/xhtml
<item id="img0l" href="images/author-pic.svg" media-type="image/svg+xml" />
</manifest>
<spine toc="ncx">
<itemref idref="titlepage" />
<itemref idref="chapter01" />
</spine>
</package>
This file needs a bit of commentary. In the second line, note that “unique-identifier” is set to “BookID”. You can see that this matches up with “id” in line nine. The value doesn’t have to be “BookID”. It can be anything, as long as the two places match.
You don’t have to fully understand all of the “xmlns” stuff in lines 3-6. These point to XML namespaces that are defined elsewhere and provide a standardized vocabulary for use in our “metadata” section (for example “dc:title”, which is our book title). Although we have several metadata items that describe our book, only three are required (dc:title, dc:language and dc:identifier). Our ebook title goes in “dc:title”, of course.
Note that in line eight, the language that our book uses is “en-US” (US English). We could have simply specified “en” if we wanted to be more generic. For British English we would use “en-GB”. For other languages, use the appropriate language code.
Take a closer look at the “dc:identifier” in line nine. The text after “uuid:” is “b0425ef2-ebb9-405d-b2e5
If you don’t have an ISBN for your ebook, any method that creates a unique ID that does not collide with someone else’s ID could be used, as I understand the specification. Perhaps someone from IDPF could clarify this for us. One other method that might be used is simply a combination of your organization’s name (or registered domain name) and a Julian date that includes the hour, minutes and seconds. Such a combination is highly unlikely to duplicate anyone else’s ebook ID. There are online Julian date generators available if you want to use this method.
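If you would rather not rely on an online generator, both kinds of identifier can be produced locally. This is a minimal Python sketch, not anything prescribed by the specification: the function names are hypothetical, and the second function uses a plain UTC timestamp in place of a Julian date, which is my own substitution.

```python
import uuid
from datetime import datetime, timezone


def make_uuid_identifier():
    """Random (version 4) UUID in the urn:uuid: form used by dc:identifier."""
    return uuid.uuid4().urn


def make_domain_identifier(domain, when=None):
    """Hypothetical alternative: a domain name plus a UTC timestamp.

    Assumption: a timestamp down to the second stands in for the
    Julian date mentioned in the text.
    """
    when = when or datetime.now(timezone.utc)
    return f"{domain}:{when.strftime('%Y%m%dT%H%M%SZ')}"


print(make_uuid_identifier())
print(make_domain_identifier("example.com"))
```

Whichever form you pick, generate it once per ebook and then reuse the same value everywhere the identifier is needed.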
In line ten, we have “dc:date”. This is usually the date of publication or copyright. Note the form: YYYY-MM-DD. You can also shorten this to YYYY-MM or even YYYY.
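The three date forms can be checked with a short pattern. This is a rough Python sketch with a hypothetical function name; it only checks the shape of the value (it would accept 2007-02-31, for example) and ignores the longer W3CDTF forms that include a time.

```python
import re

# YYYY, optionally followed by -MM, optionally followed by -DD.
DC_DATE = re.compile(r"\d{4}(-(0[1-9]|1[0-2])(-(0[1-9]|[12]\d|3[01]))?)?")


def is_valid_dc_date(value):
    """True if value looks like YYYY, YYYY-MM or YYYY-MM-DD."""
    return DC_DATE.fullmatch(value) is not None


for candidate in ("2007", "2007-10", "2007-10-31", "31-10-2007", "2007-13"):
    print(candidate, is_valid_dc_date(candidate))
```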
In line eleven, “dc:creator” is usually the author. Note the use of the optional “opf:file-as”, which allows us to specify how the name would be filed in an index or catalog.
The use of “dc:publisher”, “dc:type” and “dc:rights” should be obvious. As I stated above, there are several other metadata items we can use. They are described in the OPF specification.
After the metadata section, we have our “manifest”, which lists every file that is contained in our epub. If it is a part of the ebook, it has to be listed here in the manifest. The first item is a special file called “toc.ncx”, which I’ll discuss in a moment. The next four items are our stylesheet, our two XHTML files (title page and chapter 1) and the image file that we used in our title page. The “id” given to the “toc.ncx” item (“ncx” here) has to match the “toc” attribute of the “spine” element described below. Beyond that, as far as I can tell from the OPF specification, the “id” values, including “style” for the stylesheet, can be named however you like, as long as each is unique within the file. If anyone knows otherwise, please correct me. Note that each item also has a “media-type”. The supported media-types are listed in the OPF specification.
The last section is the “spine”. This section lists each part of our ebook, in the proper order. This order is only for presentation in the reader software. The actual linking together of files for navigating an ebook is described in “toc.ncx”. Note that the “idref” used in the spine section is the same as the “id” in the manifest section.
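Because the spine refers back to the manifest by “id”, a mismatch between the two is an easy mistake to make by hand. The check can be sketched in Python as below. The function names are my own, the sample string is a cut-down stand-in for “content.opf” (it leaves out the namespaces and media-types), and the code ignores namespaces so that it works either way. Treat it as an illustration rather than a validator.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip any '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def check_spine(opf_text):
    """Return a list of mismatches between the manifest and the spine."""
    root = ET.fromstring(opf_text)
    manifest_ids = set()
    spine = None
    for element in root.iter():
        name = local_name(element.tag)
        if name == "item":
            manifest_ids.add(element.get("id"))
        elif name == "spine":
            spine = element
    if spine is None:
        return ["no spine element found"]

    problems = []
    toc = spine.get("toc")
    if toc not in manifest_ids:
        problems.append(f"spine toc={toc!r} has no matching manifest item")
    for child in spine:
        if local_name(child.tag) != "itemref":
            continue
        idref = child.get("idref")
        if idref not in manifest_ids:
            problems.append(f"itemref idref={idref!r} has no matching manifest item")
    return problems


# Simplified stand-in for content.opf (namespaces and media-types omitted).
SAMPLE_OPF = """<package version="2.0" unique-identifier="BookID">
<manifest>
<item id="ncx" href="toc.ncx" />
<item id="style" href="stylesheet.css" />
<item id="titlepage" href="title_page.html" />
<item id="chapter01" href="chapter01.html" />
</manifest>
<spine toc="ncx">
<itemref idref="titlepage" />
<itemref idref="chapter01" />
</spine>
</package>"""

print(check_spine(SAMPLE_OPF))
```

An empty list means every “idref” and the “toc” attribute point at an item that exists in the manifest.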
Now we have the “toc.ncx” mentioned above. This file describes how to navigate from one part of our ebook to another. Since we only broke our ebook into two files (title page and chapter 1), this example is relatively simple. The OPF specification describes some optional and more complicated aspects of this file. I’m going to skip a lot of the details of what is in this file.
<?xml version="1.0"?>
<!DOCTYPE ncx PUBLIC "-//NISO//DTD ncx 2005-1//EN"
"http://www.daisy.org/z3986
<ncx xmlns="http://www.daisy.org/z3986
<head>
<meta name="dtb:uid" content="b0425ef2-ebb9-405d
<meta name="dtb:depth" content="1" />
<meta name="dtb:totalPageCount" content="0" />
<meta name="dtb:maxPageNumber" content="0" />
</head>
<docTitle>
<text>The Great American Novel</text>
</docTitle>
<navMap>
<navPoint id="navpoint-1" playOrder="1">
<navLabel>
<text>Title Page</text>
</navLabel>
<content src="title_page.html" />
</navPoint>
<navPoint id="navpoint-2" playOrder="2">
<navLabel>
<text>Chapter 1</text>
</navLabel>
<content src="chapter01.html" />
</navPoint>
</navMap>
</ncx>
The “dtb:uid” value in line six of this file must match the identifier used in “content.opf”. We have two “navpoints” defined. The first points to our Title Page and the second to our Chapter One. Note that in most simple, linear ebooks, the “navpoint” and “playOrder” can simply increment by one for each file used in our ebook.
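Since the “navPoint” numbering is so regular for a linear ebook, the “navMap” section is a natural candidate for generation. Here is a small Python sketch, with a function name of my own choosing, that builds it from a list of labels and file names. It handles only the flat, one-level case shown above and does not produce the rest of “toc.ncx”.

```python
from xml.sax.saxutils import escape, quoteattr


def build_nav_map(entries):
    """Build the <navMap> section from (label, src) pairs in reading order."""
    lines = ["<navMap>"]
    for order, (label, src) in enumerate(entries, start=1):
        # id and playOrder simply count up from 1, as described in the text.
        lines.append(f'<navPoint id="navpoint-{order}" playOrder="{order}">')
        lines.append("<navLabel>")
        lines.append(f"<text>{escape(label)}</text>")
        lines.append("</navLabel>")
        lines.append(f"<content src={quoteattr(src)} />")
        lines.append("</navPoint>")
    lines.append("</navMap>")
    return "\n".join(lines)


nav_map = build_nav_map([
    ("Title Page", "title_page.html"),
    ("Chapter 1", "chapter01.html"),
])
print(nav_map)
```

For the two entries given, the output has the same structure as the “navMap” in the listing above.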
We need two additional files to complete our epub. The first is named “container.xml” and looks like this:
<?xml version="1.0"?>
<container version="1.0" xmlns="urn:oasis:names:tc
<rootfiles>
<rootfile full-path="OEBPS/content.opf" media-type="application/oebps
</rootfiles>
</container>
If you want to, you can use this exact same file for every epub that you create, as long as you keep the same path and file name referenced on line four (OEBPS/content.opf).
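To see why the path matters, it helps to look at the file from the reading software’s side: the “full-path” attribute is how the “content.opf” file gets located. The following Python sketch shows that lookup. The function name is hypothetical, and the sample string deliberately leaves out the namespace and media-type rather than guess at the parts cut off in the listing above.

```python
import xml.etree.ElementTree as ET


def find_rootfile_path(container_text):
    """Return the full-path of the first rootfile element, or None."""
    root = ET.fromstring(container_text)
    for element in root.iter():
        # Compare on the local tag name so a namespace does not matter.
        if element.tag.rsplit("}", 1)[-1] == "rootfile":
            return element.get("full-path")
    return None


# Simplified stand-in for container.xml (namespace and media-type omitted).
SAMPLE_CONTAINER = (
    '<container version="1.0"><rootfiles>'
    '<rootfile full-path="OEBPS/content.opf" />'
    "</rootfiles></container>"
)

print(find_rootfile_path(SAMPLE_CONTAINER))  # prints OEBPS/content.opf
```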
The last file we need is named “mimetype” and contains only one line.
application/epub+zip
Make sure that this file only contains this one line and there is no CR, LF or other whitespace at the end of the line. Like the previous file, this file can be reused as-is in the creation of any epub.
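Many text editors add a line ending when they save, so it can be safer to write this file with a few lines of code. This is a minimal Python sketch; the function name is my own. Writing the content as bytes avoids any newline being added.

```python
import tempfile
from pathlib import Path

# Exactly 20 bytes, with no trailing CR, LF or other whitespace.
MIMETYPE = b"application/epub+zip"


def write_mimetype(folder):
    """Write the mimetype file into folder and return its path."""
    path = Path(folder) / "mimetype"
    path.write_bytes(MIMETYPE)
    return path


# Demonstration in a throwaway folder.
with tempfile.TemporaryDirectory() as folder:
    written = write_mimetype(folder)
    print(written.stat().st_size)  # prints 20
```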
3 - The epub Container
Now that we have all of these files, how do we put it all together into an epub? First, create a folder on your computer to hold these files. Name that folder anything you want. Copy the file “mimetype” into this folder. Inside this folder, create one folder called “META-INF” and another called “OEBPS”.
Into the “META-INF” folder, copy the file “container.xml”. Into the “OEBPS” folder, copy the files “title_page.html”, “chapter01.html”, “stylesheet.css”, “content.opf” and “toc.ncx”. Also in the “OEBPS” folder, create a folder called “images”. Copy “author-pic.svg” into the “images” folder. If we had used more image files, they would also be copied here.
Using your favorite Zip creation program (WinZip, 7Zip, etc.), create a new, empty Zip archive file. Add the file “mimetype” to the archive first. Note that this file must not be compressed when it is added to the archive. With such a small file (exactly 20 bytes), it probably won’t get compressed anyway, but check to make sure. Now, add the “META-INF” and “OEBPS” folders and their contents to the Zip archive. All of these files should normally be compressed. After adding all of these files, close the Zip archive. Now, rename the file, using an extension of “.epub” instead of “.zip”.
Congratulations. You have just created an epub ebook. In the next article, we’ll explore ways to read this ebook.
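The packaging steps above can also be scripted. What follows is a sketch in Python using the standard “zipfile” module, under the assumption that the folder layout is exactly the one described above. The function names are my own, and the demonstration uses placeholder file contents, so the resulting archive is not a readable ebook. The point is only the order of the entries and the compression settings.

```python
import tempfile
import zipfile
from pathlib import Path


def build_epub(source_folder, epub_path):
    """Zip source_folder into epub_path with mimetype first and uncompressed."""
    source = Path(source_folder)
    with zipfile.ZipFile(epub_path, "w") as archive:
        # The mimetype file goes in first and is stored, not compressed.
        archive.write(source / "mimetype", "mimetype",
                      compress_type=zipfile.ZIP_STORED)
        # Everything under META-INF and OEBPS follows, compressed.
        for top in ("META-INF", "OEBPS"):
            for path in sorted((source / top).rglob("*")):
                if path.is_file():
                    archive.write(path, path.relative_to(source).as_posix(),
                                  compress_type=zipfile.ZIP_DEFLATED)


def demo():
    """Build a tiny placeholder tree and report each entry's storage method."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "book"
        (source / "META-INF").mkdir(parents=True)
        (source / "OEBPS" / "images").mkdir(parents=True)
        (source / "mimetype").write_bytes(b"application/epub+zip")
        # Placeholder contents only; real files would be the ones shown above.
        (source / "META-INF" / "container.xml").write_text("<container />")
        (source / "OEBPS" / "content.opf").write_text("<package />")
        epub = Path(tmp) / "book.epub"
        build_epub(source, epub)
        with zipfile.ZipFile(epub) as archive:
            return [(info.filename, info.compress_type == zipfile.ZIP_STORED)
                    for info in archive.infolist()]


for name, stored in demo():
    print(name, "stored" if stored else "compressed")
```

Because the output file is created directly with the “.epub” extension, the rename step is not needed when packaging this way.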
4 - Summary
There are several aspects of the epub standard that I did not cover in this article. In particular, you should consult the standards regarding the required and optional parts of XHTML 1.1 and CSS 2.0 that are supported. The specifications also cover the use of the DTBook vocabulary to support reading systems for the visually impaired. The use of fallbacks for non-supported image and document types is also discussed.
Although I would not recommend creating any great quantity of epub ebooks using a text editor, it is certainly possible, given the relatively simple and consistent layout of the required files. I hope to soon see more tools developed to automate epub creation. Given that the epub standard is open and freely available to all, there should be no hindrance to the creation of Open Source tools, as well as commercial tools.

October 31st, 2007 at 9:56 am
Thanks for this write-up, Joseph. Is there some means, in WinZip, to prevent compression of particular files as they’re being added to a Zip archive?
I look forward to being able to add ePub to my list of supported formats (I wish I could look forward to having only one supported format, but I think there’ll be a long wait for this).
Rob Preece
Publisher, http://www.BooksForABuck.com
October 31st, 2007 at 10:06 am
Excellent article. Thank you very much for putting this together. I’ve already bookmarked it.
However, I did want to alert you to a broken link. The link in this sentence “For details on what this is and how to generate one, a Wikipedia article, which ponts to some online UUID generators.” is missing the “h” from “http”.
October 31st, 2007 at 10:39 am
Rob - I use 7Zip and not WinZip, although I’m sure that WinZip has some method of specifying no compression when you add a file. In 7Zip, you select “store” as the method of compression. As I mentioned in the article, the “mimetype” file is only 20 bytes and will most likely not compress anyway. Give it a try.
Preston - Thanks for pointing out the link error. I just fixed it, along with some spelling and formatting errors that crept in.
October 31st, 2007 at 10:46 am
There is one other important fact about the “mimetype” file that I hope I made clear in the article. Just to make sure, I’ll mention it here again. The “mimetype” file must be added to the Zip archive first. This, along with the file being uncompressed, places the mimetype information in a fixed location within the epub file, allowing easier determination that this is an epub file and not just a normal Zip file.
October 31st, 2007 at 1:47 pm
Thanks from me, too.
One thing to be aware of is that the quoted examples of files all contain 8-bit curly quotes instead of the plain vanilla straight quotes XML requires. No doubt the content creation tool ‘smartened’ your quotes here.
October 31st, 2007 at 2:55 pm
Pond,
Yes, the WordPress software put the curl in the quotes.
If you download the sample file, you will find regular quotes, along with better formatting.
January 7th, 2008 at 4:45 am
I was reading the dev reference pages on the mobipocket format and found some interesting considerations there. The mobi format can be built from OPF+HTML+IMG files, but they are optimized for access by low-power devices.
Conversely, I’d be concerned that simply zipped OPF+HTML+IMG files are probably going to be resource hogs…