Part one: what do I want?
I hinted at this topic in a previous post. I have a big collection of (mostly) paperback Science Fiction books – some hardcover books too. I used to read a lot more in the pre-Internet days; nowadays it’s only during my holidays that I get enough time to read whole books… so many of those old paperbacks are 20-30 years old and yellowed.
In this digital age it would be appropriate to have digital versions of my books and save them from crumbling to dust. I am eagerly awaiting Sony’s new e-reader, the PRS-T1, which I want to buy once it is out.
This is a very nice device. It is also a lot cheaper than the previous generation Sony e-reader (the PRS-650) while at the same time adding wireless connectivity. This device needs content once I have it in my possession.
A lot of the “newer” books, and those written by contemporary authors, can be purchased online, or downloaded from fan sites where people scan their own collections into EPUB or MOBI e-books. That is all well and good, but on my bookshelves I have many dozens of good books that will probably never see a new life as an e-book. That is very unfortunate… I had a lot of fun reading them and do not want to see them go into oblivion.
I decided to do something about this. I am going to try and describe (and hopefully implement) how I am going to digitize my book library. Note: at the moment this is all just ideas, “dreams” if you wish, although I have already found quite a bit of information on the Internet that I will be sharing with you. I want it to be more than just a dream.
What does one need to get a paper book converted into an e-book?
- the book’s pages need to be scanned
- the scanned bitmaps may have to be cleaned up digitally (enhancing the contrast between characters and background, de-skewing or rotating the text blocks, …)
- I need an Optical Character Recognition (OCR) program to convert the bitmap images into character text
- I need an e-book editor to lay out the bare text that I got from the OCR program – the e-book has to look largely like the original paper version.
- I want to use a library program to make my book catalogue available, to myself of course, to my e-reader device, and possibly to other interested parties.
And I want this to be as low-cost as possible. Any software that I am going to use for this should preferably be Open Source and run on Slackware.
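To make the OCR step a bit more concrete, here is a rough sketch of how a batch of cleaned-up page scans might be fed to an OCR program. It assumes Tesseract as the OCR engine (its basic command line is `tesseract <image> <output-base> -l <language>`), and it only builds the command lines without running anything. The function names, the directory names and the list of image extensions are made up for illustration; this is an idea on paper, not a tested workflow.

```python
from pathlib import PurePosixPath

# Hypothetical set of scan formats; adjust to whatever the scanner produces.
IMAGE_SUFFIXES = {".tif", ".tiff", ".png"}


def ocr_command(image, out_dir, lang="eng"):
    """Build (but do not run) a tesseract command line for one page image."""
    image = PurePosixPath(image)
    # tesseract appends the ".txt" extension to the output base by itself
    out_base = PurePosixPath(out_dir) / image.stem
    return ["tesseract", str(image), str(out_base), "-l", lang]


def plan_ocr(scan_files, out_dir, lang="eng"):
    """Return one command per page image, in sorted (page) order."""
    pages = sorted(
        f for f in scan_files
        if PurePosixPath(f).suffix.lower() in IMAGE_SUFFIXES
    )
    return [ocr_command(page, out_dir, lang) for page in pages]


if __name__ == "__main__":
    for cmd in plan_ocr(["scans/page_002.tif", "scans/page_001.tif"], "text"):
        print(" ".join(cmd))
```

Each resulting list could be handed to `subprocess.run()` once Tesseract is installed. Zero-padded page numbers in the file names keep the plain alphabetical sort in reading order.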
Those are the main topics I will write about. Each of these topics deserves its own separate article. Why is that?
I can already see how this project will confront me with interesting challenges. I am going to write a multi-post story with interlinked articles (this being the first) in order to preserve this hobby project of mine for posterity. Having separate topic articles allows me to split up your feedback as well (heh… I hope I do get some feedback!), so that discussions about, say, scanning techniques will not interfere with talk about what is the best OCR program for Linux.
The articles are not going to be “static” per se. I value your feedback and important new insights will find their way back into the main text.
Let’s see where this ends. It is probably going to take days, or weeks, to write. It depends a bit on Slackware development – if that picks up speed again, I will have less time for this e-book side show. But for the moment, there is silence in the ChangeLog.txt and I have time to spare.
Thank the universe when Eric is bored!
Hey, great to see this initiative!
I have a large SciFi + IT collection of books as well, and married a teacher who loves to do research. Result: lots of books in the house. My current office has all four walls filled with shelves, and there is more in the living-room and bedroom…
Some books I found on-line, “official” or not. But many I would like to see digitized.
I’m eager to read your ideas about:
– resolution when scanning
– combining illustrations with text (scientific books, not so important for Sci-Fi)
– etc… etc…
I’ll definitely follow your posts closely!
Eric, I believe the most boring and tiresome part of this project will be scanning the books and editing the images (where needed).
That DOES take time 🙁
I am also interested in this project.
Yes, scanning will be the most time-consuming, and good scan robots cost many thousands of euros… so that is out of the question. Every second you can scrape off a page scan is a big gain. I do have some thoughts on good hardware solutions.
Also I think I have a package for semi-automated “cleaning” of scans (but I need to create some scans in order to test that…)
Check out http://diybookscanner.org/
@hub, indeed that is a site I can hopefully get some inspiration from.
Eric, it’s so nice to see you getting involved in DIY book scanning – I can’t help but think some big advancements in the state of the art are going to come out of this.
DIY Book Scanner linked by @hub is the de-facto resource for these things and you’ll find plenty of inspiration and information there.
Re. software, check out the awesomely great Scan Tailor, and also Unpaper.
If you want to see my (stalled, for now) scanner project, it’s here: http://www.diybookscanner.org/forum/viewtopic.php?f=1&t=821
Really look forward to seeing what you come up with!
Some more (hopefully) useful info…
Post-Scantailor images can be OCR’d and converted to djvu with another excellent app, djvubind. The advantage of this approach is that it uses hidden text layers behind highly-compressed images of the page, to preserve the book’s layout exactly while allowing searching, copy-paste, etc.
Of course, not many e-readers support DJVU, so here’s my technique for creating hidden-text-layer PDFs:
All software I’ve mentioned is open-source and proudly used by me on Slack 🙂
Yes, scantailor is a nice tool, but djvubind is new to me.
I have a set of packages that I should upload, but I held back because I wanted to try them out myself first. I have built:
* tesseract with all language data files in separate packages
* a whole lot of supporting packages for the above (leptonica, iulib, openfst, ocroswig, ocropy)
I will try to upload those tonight regardless of whether I was able to test them… I will leave the testing to the blog audience.
I like to read your posts and thought about scanning your books; here are my thoughts on it.
I think how to scan and which type of file or files (pdf, jpg, png, raw) to use is one of the most difficult parts.
Solution 1 (fast and cheap)
1. cut the spine off the books to get single sheets; a cutter is not expensive, or perhaps a copy shop has one
2. scan the whole sheets in the scanner's duplex mode, perhaps at work (the resolution depends on the OCR program)
3. run the OCR program; try one or two pages first to find the best settings
Solution 2 (not fast, but cheap)
1. scan your books with a mobile scanner or normal scanner
2. run the OCR program
Solution 3 (fast and cheap but need time to prepare)
1. develop your own scanner; you need one or two good cameras to capture both pages, and somebody to turn the pages
2. run the OCR program
Solution 4 (depends on help)
1. ask somebody who can do a fast and good scan; perhaps a university has a solution
2. run the OCR program
Your solution depends on how much you care about your books and what resources you have.
Hope that helps.
Me too, I’m in your exact situation. I have lots of books and academic articles in PDF and want to digitize them and create a digital library, working only with open-source tools and on Slackware of course. I was thinking of using tesseract as the OCR program and Greenstone (http://www.greenstone.org/) to build my library.
I’m working a lot on this project and can’t tell you how happy I was when I discovered that you are working on it too.
Keep in touch!