TeleRead: Bring the E-Books Home

News & views on e-books, libraries, publishing and related topics
August 30th, 2006

‘What Does Google Want Us to Do With All These Free PDF eBooks?’: Problematic downloads

By David Rothman

Planet PDF and the TeleBlog obviously see PDF in different ways, but I know their hearts are in the right place. Here are concerns that Planet PDF has about Google’s treatment of PDF:

If you haven’t heard yet, Google has just announced that a bunch of the books it has been scanning for its Google Books Library Program are now available for free download as PDFs. There’s no doubt Google needs to be applauded for the idea, but the execution (i.e. the books they’ve produced) could definitely do with some work. The PDF books are difficult to download, large in size, of such low resolution they’re difficult to read, unsearchable, and do not allow the user to copy text from them. It’s left me wondering what Google expects people to do with the books.

Delighted to see Planet PDF on the job. But look, guys, what do you expect from Google? So far Google seems to be of the E-Book Museum mindset.


12 Responses to “‘What Does Google Want Us to Do With All These Free PDF eBooks?’: Problematic downloads”

  1. I haven’t tried it yet, but there’s at least one report at Distributed Proofreaders (http://www.pgdp.net) that the PDF books are of much better resolution than the page-at-a-time images, and that they are getting much better results OCRing from the PDF. It’s possible that Planet PDF was looking at early images, too, as it’s my opinion that scans from the last few months are much better than the early scans. They still seem to have a policy of skipping illustrations, though.

    Google’s also been busy. I know there are at least 136,000 books available, up from about 50,000 in February or March.

  2. ryanramseyer Says:
    August 30th, 2006 at 3:37 pm

    Google is to be applauded. Now that they have released the pdf’s, we can store these things, read them in any pdf reader, etc.

    When Sony finally releases the reader, this google trove should provide a textual answer to itunes (at least as far as out of copyright books go.)

  3. As much as I dislike pdf’s, I think that Google’s action is a step forward. The books are scanned only, without OCR, so there is not much choice of formats other than pure pictures. I have not yet run one of my 4 Google downloads through the ABBYY OCR engine to check the resolution, but for jpg page-to-screen reading on the Nokia 770 they are OK; I extracted one page and checked.

    The sizes are as expected for picture pdf books (about 7-10 MB per 300-page book), but they downloaded almost instantaneously on my cable connection.

    I have heard that they are negotiating with publishers to sell the pdf’s of copyrighted books; I have yet to see prices and DRM, but it seems a step forward there too.

    Liviu

  4. Michael Dillon Says:
    August 30th, 2006 at 6:21 pm

    One wonders why Google did not use the DJVU format for their books. It was designed for scanned images, works with and without OCR text, and compresses neatly into small files.

    There is an open source reader called Evince that supports the DJVU format, and there is also LizardTech’s free browser plug-in.

    The main downside is that you have to think about how to set up your scanning workflow and assemble the bits and pieces yourself rather than buying some commercial book-scanning workflow system. But for a project on Google’s scale, it shouldn’t have mattered because you only need to figure these details out once.

    I suspect that the real issue is that Google never thinks much about how to do things. They just throw cash and warm bodies at a problem and hope it will work out. It often doesn’t, as in the case of Google’s ORKUT compared to MySpace.

  5. I used to like djvu better than pdf in the past, but there is considerably more free software to manipulate pdf’s than djvu’s, for which you basically need to use the Linux tools.

    Also many ocr engines do not (as of yet) recognize djvu, so while I agree that djvu is a better format than pdf from many points of view, it is less practical for now.

    Anyway you can convert pdf’s to djvu freely and easily in Linux with djvudigital (and it works with Colinux too so you can use it on an xp box also).

    Liviu

  6. DjVu, while more efficient (in the past) than other available formats, is still a niche format in terms of support and use. I’d use PDF, instead, myself. I notice that the Google PDF book downloads use the JBIG2 image compression scheme, where similar letter shapes are represented by one canonical glyph, instead of providing the exact (unavoidably distorted) scanned image.

    Many PDF viewers cannot understand this image compression scheme yet; I tried it with Apple’s Mac OS X 10.3.9 Preview, and with pdftotext; the older Preview showed the pages as blank, and the newer pdftotext (part of the xpdf package) gave me a bus-error when trying to de-compress the JBIG2-compressed pages. OS X 0.4.7 Preview seems to work OK with the PDF, though.

  7. That’s 10.4.7 Preview, of course.

  8. Howard Bernier Says:
    August 31st, 2006 at 9:53 am

    It seems (with limited experience) that when downloading the pdf’s, if the size of the file is under 10-12 MB, the download proceeds just fine. For larger files the download stops, and reloading the file results in problems with the browser. (Explorer, Firefox, and Safari all yield the same problem.)

    Anyone have a workaround ? Thanks.

  9. Now you can OCR those PDF files with Tesseract, which has been out for several months, but was just Slashdotted today.

  10. There’s an important post on all this on the Book-People list, pointing out that it’s trivial to extract the hi-res page image scans from the PDF files, and discard all the extra Google-added stuff:

    Subject: [BP] Re: Google Books PDFs
    From: “Stewart C. Russell” <scruss@gmail.com>
    Date: Fri, 1 Sep 2006 14:00:00 PDT

    I’ve had reasonable success extracting page images using pdfimages
    (from the xpdf suite: http://www.foolabs.com/xpdf/). It’s rather slow,
    but extracts the pages exactly as stored.

    Each page seems to comprise the main page image (scanned at a high
    resolution) plus a separate “Digitized by Google” watermark image.
    These smaller images can easily be discarded.

    cheers,
    Stewart


    http://scruss.com/blog/
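
The extraction described in the message above can be sketched in a few lines of Python. This is only an illustration, not the poster's own script: it assumes the `pdfimages` tool from the xpdf suite is installed and on the PATH, the function names and the `page` prefix are made up for the example, and the 50,000-byte cutoff for telling page scans from the small watermark images is a guess that would need tuning against real files.

```python
import subprocess
from pathlib import Path


def extract_images(pdf_path, out_dir, prefix="page"):
    """Run pdfimages (xpdf suite) on a PDF and return the files it wrote.

    pdfimages is invoked as: pdfimages <PDF-file> <image-root>
    and writes files named <image-root>-NNN.pbm / .ppm.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(["pdfimages", str(pdf_path), str(out / prefix)], check=True)
    return sorted(out.glob(f"{prefix}-*"))


def split_by_size(files, min_bytes=50_000):
    """Split files into (likely page scans, likely watermark images).

    The size threshold is an assumption, not something stated in the post.
    Nothing is deleted; the caller decides what to do with the small files.
    """
    pages, small = [], []
    for f in files:
        f = Path(f)
        if f.stat().st_size >= min_bytes:
            pages.append(f)
        else:
            small.append(f)
    return pages, small


# Example use (hypothetical file names):
#   files = extract_images("book.pdf", "book_images")
#   pages, watermarks = split_by_size(files)
```

Sorting by size rather than by position in the output keeps the sketch independent of the order in which the page image and the watermark are stored on each page, which the message does not specify.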

  11. Winfried Helge Pelz Says:
    September 8th, 2006 at 12:06 pm

    Just look what Digitized by Google really means:

    Download as example this book as pdf-file: Eugenio Cappelletti Vocabolario Milanese-Italiano-Francese published 1848.

    –carefully scanned by Google–
    as the first page in the pdf tells you:
    “This is a digital copy of a book that was preserved for generations on library shelves before it was carefully scanned by Google as part of a project
    to make the world’s books discoverable online.”

    I have sampled around 100 other books, all abundant with the same types of defects, although not at such extremely high rates.

    Not even minimal care for any sort of quality control.
    Very sad.

  12. I think DjVu is hands down the better solution. It produces smaller files and better quality, and has many more advantages for this purpose. Just as one example, if you enlarge a PDF document which contains pictures, moving around the page causes flickering (the pictures are refreshed and re-shaped with each move), while in a DjVu browser moving around an enlarged page is smooth. Some say DjVu as a format is not so popular. Who cares? Document viewers are not brands like Mercedes or BMW; they are background programs: the more invisible to the user, the better.
