<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: &#8216;What Does Google Want Us to Do With All These Free PDF eBooks?&#8217;: Problematic downloads</title>
	<atom:link href="http://www.teleread.org/2006/08/30/what-does-google-want-us-to-do-with-all-these-free-pdf-ebooks/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.teleread.org/2006/08/30/what-does-google-want-us-to-do-with-all-these-free-pdf-ebooks/</link>
	<description>News &#38; views on e-books, libraries, publishing and related topics</description>
	<lastBuildDate>Sun, 21 Mar 2010 10:51:10 -0500</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: Dagg</title>
		<link>http://www.teleread.org/2006/08/30/what-does-google-want-us-to-do-with-all-these-free-pdf-ebooks/comment-page-1/#comment-127704</link>
		<dc:creator>Dagg</dc:creator>
		<pubDate>Thu, 23 Nov 2006 20:40:58 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/blog/?p=5413#comment-127704</guid>
		<description>I think DjVu is hands down the better solution. It produces smaller files at better quality and has many more advantages for this purpose. Just as one example, if you enlarge a PDF document which contains pictures, moving around the page causes flickering (the pictures are refreshed and re-shaped with each move), while in a DjVu browser moving around an enlarged page is smooth. Some say DjVu as a format is not so popular. Who cares? Document viewers are not brands like Mercedes or BMW; they are background programs: the more invisible to the user, the better.</description>
		<content:encoded><![CDATA[<p>I think DjVu is hands down the better solution. It produces smaller files at better quality and has many more advantages for this purpose. Just as one example, if you enlarge a PDF document which contains pictures, moving around the page causes flickering (the pictures are refreshed and re-shaped with each move), while in a DjVu browser moving around an enlarged page is smooth. Some say DjVu as a format is not so popular. Who cares? Document viewers are not brands like Mercedes or BMW; they are background programs: the more invisible to the user, the better.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Winfried Helge Pelz</title>
		<link>http://www.teleread.org/2006/08/30/what-does-google-want-us-to-do-with-all-these-free-pdf-ebooks/comment-page-1/#comment-82738</link>
		<dc:creator>Winfried Helge Pelz</dc:creator>
		<pubDate>Fri, 08 Sep 2006 16:06:49 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/blog/?p=5413#comment-82738</guid>
		<description>Just look at what &quot;Digitized by Google&quot; really means:

Download, as an example, this book as a PDF file: Eugenio Cappelletti, Vocabolario Milanese-Italiano-Francese, published 1848.

--carefully scanned by Google--
as the first page in the pdf tells you:
&quot;This is a digital copy of a book that was preserved for generations on library shelves before it was carefully scanned by Google as part of a project
to make the world’s books discoverable online.&quot;

I have sampled around 100 other books, all abounding in the same types of defect, although not at such extremely high rates.

Not the slightest care for any sort of quality control.
Very sad.</description>
		<content:encoded><![CDATA[<p>Just look at what &#8220;Digitized by Google&#8221; really means:</p>
<p>Download, as an example, this book as a PDF file: Eugenio Cappelletti, Vocabolario Milanese-Italiano-Francese, published 1848.</p>
<p>&#8211;carefully scanned by Google&#8211;<br />
as the first page in the pdf tells you:<br />
&#8220;This is a digital copy of a book that was preserved for generations on library shelves before it was carefully scanned by Google as part of a project<br />
to make the world’s books discoverable online.&#8221;</p>
<p>I have sampled around 100 other books, all abounding in the same types of defect, although not at such extremely high rates.</p>
<p>Not the slightest care for any sort of quality control.<br />
Very sad.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bill Janssen</title>
		<link>http://www.teleread.org/2006/08/30/what-does-google-want-us-to-do-with-all-these-free-pdf-ebooks/comment-page-1/#comment-82209</link>
		<dc:creator>Bill Janssen</dc:creator>
		<pubDate>Wed, 06 Sep 2006 19:00:35 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/blog/?p=5413#comment-82209</guid>
		<description>There&#039;s an important post on all this on the Book-People list, pointing out that it&#039;s trivial to extract the hi-res page image scans from the PDF files, and discard all the extra Google-added stuff:


Subject: [BP] Re: Google Books PDFs
From: &quot;Stewart C. Russell&quot; &lt;scruss@gmail.com&gt;
Date: Fri, 1 Sep 2006 14:00:00 PDT

I&#039;ve had reasonable success extracting page images using pdfimages
(from the xpdf suite: &lt;a href=&quot;http://www.foolabs.com/xpdf/&quot; rel=&quot;nofollow&quot;&gt;http://www.foolabs.com/xpdf/&lt;/a&gt;). It&#039;s rather slow,
but extracts the pages exactly as stored.

Each page seems to comprise the main page image (scanned at a high 
resolution) plus a separate &quot;Digitized by Google&quot; watermark image.
These smaller images can easily be discarded.

cheers,
  Stewart

-- 
  http://scruss.com/blog/
</description>
		<content:encoded><![CDATA[<p>There&#8217;s an important post on all this on the Book-People list, pointing out that it&#8217;s trivial to extract the hi-res page image scans from the PDF files, and discard all the extra Google-added stuff:</p>
<p>Subject: [BP] Re: Google Books PDFs<br />
From: &#8220;Stewart C. Russell&#8221; &lt;scruss@gmail.com&gt;<br />
Date: Fri, 1 Sep 2006 14:00:00 PDT</p>
<p>I&#8217;ve had reasonable success extracting page images using pdfimages<br />
(from the xpdf suite: <a href="http://www.foolabs.com/xpdf/" rel="nofollow">http://www.foolabs.com/xpdf/</a>). It&#8217;s rather slow,<br />
but extracts the pages exactly as stored.</p>
<p>Each page seems to comprise the main page image (scanned at a high<br />
resolution) plus a separate &#8220;Digitized by Google&#8221; watermark image.<br />
These smaller images can easily be discarded.</p>
<p>cheers,<br />
  Stewart</p>
<p>&#8211;<br />
  <a href="http://scruss.com/blog/" rel="nofollow">http://scruss.com/blog/</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bill Janssen</title>
		<link>http://www.teleread.org/2006/08/30/what-does-google-want-us-to-do-with-all-these-free-pdf-ebooks/comment-page-1/#comment-81615</link>
		<dc:creator>Bill Janssen</dc:creator>
		<pubDate>Tue, 05 Sep 2006 15:43:04 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/blog/?p=5413#comment-81615</guid>
		<description>Now you can OCR those PDF files with &lt;a href=&quot;http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.html&quot; rel=&quot;nofollow&quot;&gt;Tesseract&lt;/a&gt;, which has been out for several months, but was just Slashdotted today.</description>
		<content:encoded><![CDATA[<p>Now you can OCR those PDF files with <a href="http://google-code-updates.blogspot.com/2006/08/announcing-tesseract-ocr.html" rel="nofollow">Tesseract</a>, which has been out for several months, but was just Slashdotted today.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Howard Bernier</title>
		<link>http://www.teleread.org/2006/08/30/what-does-google-want-us-to-do-with-all-these-free-pdf-ebooks/comment-page-1/#comment-79257</link>
		<dc:creator>Howard Bernier</dc:creator>
		<pubDate>Thu, 31 Aug 2006 13:53:09 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/blog/?p=5413#comment-79257</guid>
		<description>It seems (from limited experience) that when downloading the PDFs, if the file is under 10-12 MB, the download proceeds just fine. With anything larger, the download stops, and reloading the file results in problems with the browser (Explorer, Firefox, and Safari all show the same problem).

Anyone have a workaround? Thanks.</description>
		<content:encoded><![CDATA[<p>It seems (from limited experience) that when downloading the PDFs, if the file is under 10-12 MB, the download proceeds just fine. With anything larger, the download stops, and reloading the file results in problems with the browser (Explorer, Firefox, and Safari all show the same problem).</p>
<p>Anyone have a workaround? Thanks.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bill Janssen</title>
		<link>http://www.teleread.org/2006/08/30/what-does-google-want-us-to-do-with-all-these-free-pdf-ebooks/comment-page-1/#comment-79134</link>
		<dc:creator>Bill Janssen</dc:creator>
		<pubDate>Thu, 31 Aug 2006 00:56:02 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/blog/?p=5413#comment-79134</guid>
		<description>That&#039;s 10.4.7 Preview, of course.</description>
		<content:encoded><![CDATA[<p>That&#8217;s 10.4.7 Preview, of course.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bill Janssen</title>
		<link>http://www.teleread.org/2006/08/30/what-does-google-want-us-to-do-with-all-these-free-pdf-ebooks/comment-page-1/#comment-79133</link>
		<dc:creator>Bill Janssen</dc:creator>
		<pubDate>Thu, 31 Aug 2006 00:55:05 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/blog/?p=5413#comment-79133</guid>
		<description>DjVu, while more efficient (in the past) than other available formats, is still a niche format in terms of support and use.  I&#039;d use PDF, instead, myself.  I notice that the Google PDF book downloads use the JBIG2 image compression scheme, where similar letter shapes are represented by one canonical glyph, instead of providing the exact (unavoidably distorted) scanned image.

Many PDF viewers cannot understand this image compression scheme yet; I tried it with Apple&#039;s Mac OS X 10.3.9 Preview, and with pdftotext; the older Preview showed the pages as blank, and the newer pdftotext (part of the xpdf package) gave me a bus-error when trying to de-compress the JBIG2-compressed pages.  OS X 0.4.7 Preview seems to work OK with the PDF, though.</description>
		<content:encoded><![CDATA[<p>DjVu, while more efficient (in the past) than other available formats, is still a niche format in terms of support and use.  I&#8217;d use PDF, instead, myself.  I notice that the Google PDF book downloads use the JBIG2 image compression scheme, where similar letter shapes are represented by one canonical glyph, instead of providing the exact (unavoidably distorted) scanned image.</p>
<p>Many PDF viewers cannot understand this image compression scheme yet; I tried it with Apple&#8217;s Mac OS X 10.3.9 Preview, and with pdftotext; the older Preview showed the pages as blank, and the newer pdftotext (part of the xpdf package) gave me a bus-error when trying to de-compress the JBIG2-compressed pages.  OS X 0.4.7 Preview seems to work OK with the PDF, though.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Liviu</title>
		<link>http://www.teleread.org/2006/08/30/what-does-google-want-us-to-do-with-all-these-free-pdf-ebooks/comment-page-1/#comment-79108</link>
		<dc:creator>Liviu</dc:creator>
		<pubDate>Wed, 30 Aug 2006 23:11:35 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/blog/?p=5413#comment-79108</guid>
		<description>I used to like DjVu better than PDF, but there is considerably more free software for manipulating PDFs than for DjVu files, where you basically need to use the Linux tools.

Also, many OCR engines do not (as yet) recognize DjVu, so while I agree that DjVu is a better format than PDF from many points of view, it is less practical for now.

Anyway, you can convert PDFs to DjVu freely and easily on Linux with djvudigital (and it works with coLinux too, so you can use it on an XP box as well).

 Liviu</description>
		<content:encoded><![CDATA[<p>I used to like DjVu better than PDF, but there is considerably more free software for manipulating PDFs than for DjVu files, where you basically need to use the Linux tools.</p>
<p>Also, many OCR engines do not (as yet) recognize DjVu, so while I agree that DjVu is a better format than PDF from many points of view, it is less practical for now.</p>
<p>Anyway, you can convert PDFs to DjVu freely and easily on Linux with djvudigital (and it works with coLinux too, so you can use it on an XP box as well).</p>
<p> Liviu</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Michael Dillon</title>
		<link>http://www.teleread.org/2006/08/30/what-does-google-want-us-to-do-with-all-these-free-pdf-ebooks/comment-page-1/#comment-79099</link>
		<dc:creator>Michael Dillon</dc:creator>
		<pubDate>Wed, 30 Aug 2006 22:21:44 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/blog/?p=5413#comment-79099</guid>
		<description>One wonders why Google did not use the DJVU format for
their books. It was designed for scanned images, works with
and without OCR text, and compresses neatly into small files.

The DJVU format is supported by an open source reader
called Evince, as well as by LizardTech&#039;s free browser
plug-in.

The main downside is that you have to think about how to 
set up your scanning workflow and assemble the bits and
pieces yourself rather than buying some commercial
book-scanning workflow system. But for a project on Google&#039;s
scale, it shouldn&#039;t have mattered because you only need to
figure these details out once.

I suspect that the real issue is that Google never thinks much
about how to do things. They just throw cash and warm bodies
at a problem and hope it will work out. It often doesn&#039;t as in
the case of Google&#039;s ORKUT compared to MySpace.</description>
		<content:encoded><![CDATA[<p>One wonders why Google did not use the DJVU format for<br />
their books. It was designed for scanned images, works with<br />
and without OCR text, and compresses neatly into small files.</p>
<p>The DJVU format is supported by an open source reader<br />
called Evince, as well as by LizardTech&#8217;s free browser<br />
plug-in.</p>
<p>The main downside is that you have to think about how to<br />
set up your scanning workflow and assemble the bits and<br />
pieces yourself rather than buying some commercial<br />
book-scanning workflow system. But for a project on Google&#8217;s<br />
scale, it shouldn&#8217;t have mattered because you only need to<br />
figure these details out once.</p>
<p>I suspect that the real issue is that Google never thinks much<br />
about how to do things. They just throw cash and warm bodies<br />
at a problem and hope it will work out. It often doesn&#8217;t as in<br />
the case of Google&#8217;s ORKUT compared to MySpace.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Liviu</title>
		<link>http://www.teleread.org/2006/08/30/what-does-google-want-us-to-do-with-all-these-free-pdf-ebooks/comment-page-1/#comment-79084</link>
		<dc:creator>Liviu</dc:creator>
		<pubDate>Wed, 30 Aug 2006 21:42:44 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/blog/?p=5413#comment-79084</guid>
		<description>As much as I dislike PDFs, I think that Google&#039;s action is a step forward. The books are only scanned, without OCR, so there is not much choice of formats other than pure page images. I have not yet run any of my 4 Google downloads through the ABBYY OCR engine to check the resolution, but for reading JPEG pages on the screen of a Nokia 770 they are OK; I extracted one page and checked.

 The sizes are as expected for picture PDF books (about 7-10 MB per 300-page book), but they downloaded almost instantaneously on my cable connection.

 I have heard that they are negotiating with publishers to sell PDFs of copyrighted books; I will have to see the prices and DRM, but it seems a step forward there too.

 Liviu</description>
		<content:encoded><![CDATA[<p>As much as I dislike PDFs, I think that Google&#8217;s action is a step forward. The books are only scanned, without OCR, so there is not much choice of formats other than pure page images. I have not yet run any of my 4 Google downloads through the ABBYY OCR engine to check the resolution, but for reading JPEG pages on the screen of a Nokia 770 they are OK; I extracted one page and checked.</p>
<p> The sizes are as expected for picture PDF books (about 7-10 MB per 300-page book), but they downloaded almost instantaneously on my cable connection.</p>
<p> I have heard that they are negotiating with publishers to sell PDFs of copyrighted books; I will have to see the prices and DRM, but it seems a step forward there too.</p>
<p> Liviu</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: ryanramseyer</title>
		<link>http://www.teleread.org/2006/08/30/what-does-google-want-us-to-do-with-all-these-free-pdf-ebooks/comment-page-1/#comment-79069</link>
		<dc:creator>ryanramseyer</dc:creator>
		<pubDate>Wed, 30 Aug 2006 19:37:18 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/blog/?p=5413#comment-79069</guid>
		<description>Google is to be applauded.  Now that they have released the PDFs, we can store these things, read them in any PDF reader, etc.

When Sony finally releases the Reader, this Google trove should provide a textual answer to iTunes (at least as far as out-of-copyright books go).</description>
		<content:encoded><![CDATA[<p>Google is to be applauded.  Now that they have released the PDFs, we can store these things, read them in any PDF reader, etc.</p>
<p>When Sony finally releases the Reader, this Google trove should provide a textual answer to iTunes (at least as far as out-of-copyright books go).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Bruce Albrecht</title>
		<link>http://www.teleread.org/2006/08/30/what-does-google-want-us-to-do-with-all-these-free-pdf-ebooks/comment-page-1/#comment-79068</link>
		<dc:creator>Bruce Albrecht</dc:creator>
		<pubDate>Wed, 30 Aug 2006 19:36:03 +0000</pubDate>
		<guid isPermaLink="false">http://www.teleread.org/blog/?p=5413#comment-79068</guid>
		<description>I haven&#039;t tried it yet, but there&#039;s at least one report at Distributed Proofreaders (http://www.pgdp.net) that the PDF books are at a much higher resolution than the page-at-a-time images, and that they are getting much better results OCRing from the PDF.  It&#039;s possible that Planet PDF was looking at early images, too, as it&#039;s my opinion that scans from the last few months are much better than the early scans.  They still seem to have a policy of skipping illustrations, though.

Google&#039;s also been busy.  I know there are at least 136,000 books available, up from about 50,000 in February or March.</description>
		<content:encoded><![CDATA[<p>I haven&#8217;t tried it yet, but there&#8217;s at least one report at Distributed Proofreaders (<a href="http://www.pgdp.net" rel="nofollow">http://www.pgdp.net</a>) that the PDF books are at a much higher resolution than the page-at-a-time images, and that they are getting much better results OCRing from the PDF.  It&#8217;s possible that Planet PDF was looking at early images, too, as it&#8217;s my opinion that scans from the last few months are much better than the early scans.  They still seem to have a policy of skipping illustrations, though.</p>
<p>Google&#8217;s also been busy.  I know there are at least 136,000 books available, up from about 50,000 in February or March.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
