stripping logos from scanned PDF files
Moderator: kcleung
-
- active poster
- Posts: 293
- Joined: Sun Apr 23, 2006 5:08 am
- notabot: YES
- notabot2: Bot
- Location: Phoenix, AZ
With Gimp, I usually use the rectangle tool (press 'r'), select the region, then press Ctrl-'.' to blank the rectangle. It's faster than the eraser.
One tool that would be useful to make this a snap would be to give a sample of what to erase. Then it would automatically look for things that look like the sample picture and erase them. I don't think we need rotation or scaling so that should make it fairly simple. It seems very doable.
First of all, thanks for your hint on Gimp.
horndude77 wrote: One tool that would be useful to make this a snap would be to give a sample of what to erase. Then it would automatically look for things that look like the sample picture and erase them. I don't think we need rotation or scaling so that should make it fairly simple. It seems very doable.
That tool would be great! However, this would require some sort of computer vision algorithm (at least shape or object recognition), which would in turn either require huge computational resources or result in slow recognition, whereas if we do it manually, we can process an image within 5 seconds. Although I may be wrong...
Anyway, have you seen such a tool?
-
- active poster
- Posts: 293
- Joined: Sun Apr 23, 2006 5:08 am
- notabot: YES
- notabot2: Bot
- Location: Phoenix, AZ
No, I haven't seen such a tool. Yes it would be processor intensive, but if it worked I wouldn't mind letting it work over a weekend to get a full cd done while I'm out at the park.
In any case I started a stupid script to work on this problem which just blanks rectangles. It only works with one of the files I have. I don't think it could be made more general, but perhaps it could be a good starting point.
(Looking at it now I noticed that I forgot to set the dpi in the output images.)
Code:
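#!/bin/sh
# Blank the logo rectangle on every page image of a scanned PDF, then rebuild the PDF.
# Coordinates are "x0,y0 x1,y1" in pixels and only fit one particular file.
RIGHT_COORDINATES="1890,240 2400,370"
LEFT_COORDINATES="310,270 815,400"
CENTER_COORDINATES="1115,140 1725,370"
PDF=$1
FILENAME=testing
TITLE=$FILENAME
PREFIX=prefix

# Extract every page image from the PDF as ${PREFIX}-NNN.ppm or ${PREFIX}-NNN.pbm.
pdfimages "$PDF" "$PREFIX"

LAST=right
for i in "${PREFIX}"*
do
    # PPM pages use the center rectangle. After that, alternate between
    # left and right, starting with left.
    EXT=$(echo "$i" | sed -e 's_^[^.]*__')
    if [ "$EXT" = '.ppm' ]
    then
        COORD=$CENTER_COORDINATES
        LAST=right
    elif [ "$LAST" = 'right' ]
    then
        COORD=$LEFT_COORDINATES
        LAST=left
    else
        COORD=$RIGHT_COORDINATES
        LAST=right
    fi
    OUT_FILE=$(echo "$i" | sed -e 's_\.[^.]*_.tiff_')
    # Note: the output resolution (dpi) is not set here.
    convert "$i" -fill white -draw "rectangle $COORD" -monochrome -compress group4 "$OUT_FILE"
done

# Merge the pages into one multi-page TIFF, wrap it in a PDF, then clean up.
tiffcp "${PREFIX}"*.tiff out.tiff
tiff2pdf -t "$TITLE" -z -o "$FILENAME.pdf" out.tiff
rm "${PREFIX}"*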
-
- active poster
- Posts: 293
- Joined: Sun Apr 23, 2006 5:08 am
- notabot: YES
- notabot2: Bot
- Location: Phoenix, AZ
http://github.com/horndude77/image-scripts/tree/master
I spent a half hour putting together a simple program which searches a bi-level image for a smaller image and removes that section of the target image (i.e. logo removal). Yes, it's slow, but on the images I've tried it only takes a minute or two.
Shortcomings:
- It only works on PBMs right now.
- Search space is hard coded.
- Search is not directed (just try everything).
- There is no concept of 'good enough' in the search. For example, if only 30 pixels differ in a section then it is probably the desired section (which contains thousands of pixels).
I'll hopefully clean it up a bit more soon, but an approach like this is much better than removing them by hand.
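For illustration, the brute-force search described above, plus the missing 'good enough' test, could look roughly like the following. This is not the Java program from the repository; it is a minimal Python sketch that assumes the page and the logo are already loaded as lists of rows of 0/1 pixels (1 = black, as in PBM). The names find_logo, blank_region and max_diff are made up for this example.

```python
def count_diff(page, logo, top, left, limit):
    """Count pixels that differ between the logo and the page window at (top, left).

    Stops early once the count exceeds limit, because the window is rejected anyway.
    """
    diff = 0
    for r, logo_row in enumerate(logo):
        page_row = page[top + r]
        for c, pixel in enumerate(logo_row):
            if page_row[left + c] != pixel:
                diff += 1
                if diff > limit:
                    return diff
    return diff


def find_logo(page, logo, max_diff=30):
    """Try every position; return (top, left) of the closest window, or None.

    A window only counts as a match if at most max_diff pixels differ.
    """
    logo_h, logo_w = len(logo), len(logo[0])
    best, best_diff = None, max_diff + 1
    for top in range(len(page) - logo_h + 1):
        for left in range(len(page[0]) - logo_w + 1):
            diff = count_diff(page, logo, top, left, best_diff - 1)
            if diff < best_diff:
                best, best_diff = (top, left), diff
                if diff == 0:
                    return best  # exact match, nothing better to find
    return best


def blank_region(page, top, left, height, width):
    """Set the rectangle to 0 (white) in place."""
    for r in range(top, top + height):
        for c in range(left, left + width):
            page[r][c] = 0


# Tiny made-up page: the 2x2 "logo" sits at row 1, column 1 with one damaged pixel.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
]
logo = [
    [1, 1],
    [1, 1],
]
pos = find_logo(page, logo, max_diff=1)
if pos is not None:
    blank_region(page, pos[0], pos[1], len(logo), len(logo[0]))
```

The early exit in count_diff is what keeps the "just try everything" search tolerable: most windows are rejected after a handful of mismatching pixels instead of being compared in full.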
@ horndude:
Awesome! The program is written in Java, but is there any way of getting a precompiled version of it? Would it be possible to port it to Gimp or make it a pdfsam plugin?
I also just realized that it might be possible to remove the logos using VirtualDub, after turning all the images into a single movie. For VirtualDub there are scripts like logoaway (http://www.voidon.republika.pl/virtualdub/) that could be used.
Another addition: I found these plugins to be more promising:
http://www.compression.ru/video/image_r ... ex_en.html and
http://www.compression.ru/video/logo_re ... ex_en.html
For the latter, one would have to separate the even- and odd-numbered pages so that the logo appears in the same area on every image.
greetings
tilmaen
-
- active poster
- Posts: 293
- Joined: Sun Apr 23, 2006 5:08 am
- notabot: YES
- notabot2: Bot
- Location: Phoenix, AZ
Removing a logo from a bi-level static image is very different from removing a logo from video. Also I believe the programs you linked to rely on the logo being in the same place from one frame to the next. This isn't always the case for images.
(I wonder how a program like this would work with watermark removal? Often these tv logos look similar to watermarks to me.)
As for making a plugin... yes, it is possible, but my goal was a command-line solution with few dependencies. The fewer buttons I have to press the better.
For now I don't have a precompiled version. Are you on Windows, Linux or OS X? I could put together some build instructions, though they will amount to something like: install the JDK, install ant, type 'ant'.
hi!
No hurry! I don't have my hands on the CDs yet, so I couldn't work with it.
I will get the CDs some time next year, I guess. I'm running Windows, but I also have Kubuntu Linux installed.
The second link (image restoration) might work. I don't know for sure, but their approach of using masks could actually work for a logo.
Alternatively, maybe one could run OCR software over the PDFs, detect "CD-Rom Library" and delete that.
Thanks for the program though. Once I get started with converting the CDs, that will totally increase orchestra part availability.
greetings
Some people uploaded logo-infested files to this protected IMSLP server. I would "donate" some computing power to get the files done and uploaded. If you don't mind, I'd love to use your program to do so.
No hurry, but it'd be great to get some assistance as soon as you have the time.
I'm running Kubuntu 8.10 Linux.
greetings
tilmaen
-
- active poster
- Posts: 385
- Joined: Mon Apr 16, 2007 11:09 pm
- notabot: 42
- notabot2: Human
- Location: Melbourne, Australia
I guess this is sort of on a similar subject: removing logos/watermarks from files.
There is a wealth of material from the Mendelssohn Complete Works appearing on the website of the Münchener Digitalisierungszentrum.
http://www.muenchener-digitalisierungszentrum.de/
They also have a whole mountain of interesting scores already. I don't have the tools or experience to try to strip the logos/watermarks, but this could be an assignment for someone who is interested.
aldona
“all great composers wrote music that could be described as ‘heavenly’; but others have to take you there. In Schubert’s music you hear the very first notes, and you know that you’re there already.” - Steven Isserlis
-
- Groundskeeper
- Posts: 553
- Joined: Fri Feb 16, 2007 8:55 am
-
- active poster
- Posts: 385
- Joined: Mon Apr 16, 2007 11:09 pm
- notabot: 42
- notabot2: Human
- Location: Melbourne, Australia
As far as I can see, they are only available as separate images. If you click on "Miniaturansicht", you can get 5 images to a screen view, which you can then right-click and do other things to.
As for your other questions, my educated guess would be that yes, the 150% image looks like the best available quality, and no, I can't see anything apart from the "BSB" symbol that needed removing.
I tried to save all of the individual images for one piece (T. Boehm, Rondo a la mazurka for flute & piano, Op.36), then combine them and convert to black-&-white PDF, but the quality of the finished images was very poor (almost to the point of being unreadable). I'm sure there are tools and techniques available to get around this, but I don't have them (or the skills).
Good luck!
Aldona
-
- active poster
- Posts: 504
- Joined: Fri Dec 19, 2008 8:36 pm
- notabot: YES
- notabot2: Bot
- Location: Berlin, Germany
Bavarian State Library (BSB)
Hello Aldona,
I had a look at this collection several weeks ago. Too bad they don't follow the good example of the Danish National Library. Saving a large work as single jpeg pages is a pain in the ***
(And there are only two titles featuring the oboe (Onslow), both already available at IMSLP in better resolution...)
A few comments: many works also offer a 200% view option, and the resulting jpeg has twice the resolution (2000 x 2500 pixels) of the 100% view (1000 x 1250 pixels). In miniature view your downloads are also 1000 x 1250 pixels.
The larger size works out to about 250 dpi on an A4 page. Apparently their master scans (not available on the web) are only at 300dpi.
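The arithmetic behind that estimate can be checked quickly. The sketch below assumes the image is stretched to fill a full A4 sheet (210 x 297 mm) with no margins; the function name effective_dpi is made up for this example.

```python
MM_PER_INCH = 25.4
A4_MM = (210, 297)  # width, height in millimetres


def effective_dpi(pixels_wide, pixels_high, paper_mm=A4_MM):
    """Return (horizontal, vertical) dpi when the image fills the whole sheet."""
    width_in = paper_mm[0] / MM_PER_INCH
    height_in = paper_mm[1] / MM_PER_INCH
    return pixels_wide / width_in, pixels_high / height_in


for label, size in (("100% view", (1000, 1250)), ("200% view", (2000, 2500))):
    horizontal, vertical = effective_dpi(*size)
    print(f"{label}: {horizontal:.0f} x {vertical:.0f} dpi")
```

Under these assumptions the 200% view comes out at roughly 242 dpi across the width but only about 214 dpi down the height, because a 2000 x 2500 image is squarer than A4. Fitting the whole image on the sheet without distortion is limited by the lower figure.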
With converting to greyscale and adjusting contrast in photoshop (try contrast 60, brightness 20) I got quite decent prints on my laserprinter, but converting to B/W with a threshold of 50% usually removes too much from the image.
You can convert greyscale to pdf...
But maybe the best solution would be to request a different setup for the printed music (after all, the interests of a performing musician are quite different from those of someone interested in a rare book with over 1000 pages).
I'll try my luck suggesting some changes...
Kalliwoda
-
- active poster
- Posts: 702
- Joined: Wed Mar 14, 2007 3:21 pm
- notabot: 42
- notabot2: Human
- Location: Delaware, USA
- Contact:
Re: Bavarian State Library (BSB)
kalliwoda wrote: [...] Saving a large work as single jpeg pages is a pain in the *** [...]
("***" was César Cui's pseudonym in the Russian press for many years at the beginning of his side-career as a music critic.)
"A libretto, a libretto, my kingdom for a libretto!" -- Cesar Cui (letter to Stasov, Feb. 20, 1877)
-
- Groundskeeper
- Posts: 553
- Joined: Fri Feb 16, 2007 8:55 am
Re: Bavarian State Library (BSB)
kalliwoda wrote: With converting to greyscale and adjusting contrast in photoshop (try contrast 60, brightness 20) I got quite decent prints on my laserprinter, but converting to B/W with a threshold of 50% usually removes too much from the image.
You can convert greyscale to pdf...
If a threshold of 50% removes too much, you should try moving the threshold as close to white as possible, without adding too much noise to the background. That may well be a value around 80-90% (assuming that 100% corresponds to pure white).
I'm not saying that this will necessarily work for all files; but if it does, B/W has the advantage of smaller PDFs (with group4 or better compression), meaning less storage space, faster downloads, and potentially faster printing.
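To make the effect of the threshold concrete, here is a minimal Python sketch. It assumes greyscale values from 0 (black) to 255 (white) held as nested lists, and the function name to_bilevel and the sample values are made up; in practice the same step would be done in an image editor or a conversion tool.

```python
def to_bilevel(grey, threshold_pct):
    """Map 0..255 grey values to 1 (black) or 0 (white), PBM-style.

    A pixel becomes black when it is darker than threshold_pct percent of pure white.
    """
    cutoff = 255 * threshold_pct / 100
    return [[1 if value < cutoff else 0 for value in row] for row in grey]


# One row of a hypothetical scan: solid ink, a faint thin line, paper tint, pure white.
scan = [[20, 150, 235, 255]]
print(to_bilevel(scan, 50))  # the faint line (150) is lost
print(to_bilevel(scan, 85))  # the faint line survives, the paper tint (235) stays white
```

Pushing the threshold higher still (towards 95% in this made-up row) would start turning the paper tint black, which is the background noise mentioned above.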