FTP Server
Moderator: kcleung
-
- Groundskeeper
- Posts: 553
- Joined: Fri Feb 16, 2007 8:55 am
Only logos were removed (I'm sure horndude will confirm this). All page headers (every single page has a header, not only the title pages) remained unchanged in the process. All of these headers were evidently digitally inserted into the image files, i.e. they are not part of the original scan.
Horndude is basically using a "search and replace" algorithm, which searches each page for a copy of the logo image and replaces it with white pixels (his software repository is linked in one of his earlier posts). Clearly you can't do this for the headers, as they are different for each file.
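For illustration, the core of that idea is tiny - a minimal sketch in Ruby, assuming the page has already been binarized into a 2D array of 0/1 pixels (this is just the concept, not horndude's actual code):

def white_out_logo!(page, logo)
  lh, lw = logo.size, logo.first.size
  (0..page.size - lh).each do |y|
    (0..page.first.size - lw).each do |x|
      # exact match: every template pixel must equal the page pixel
      next unless (0...lh).all? { |dy|
        (0...lw).all? { |dx| page[y + dy][x + dx] == logo[dy][dx] }
      }
      # found a copy of the logo: overwrite it with white (0) pixels
      (0...lh).each { |dy| (0...lw).each { |dx| page[y + dy][x + dx] = 0 } }
    end
  end
end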
Here's an idea:
All titles added by the company are in the same font (and presumably at the same size). Would it be plausible to have an algorithm search for every letter of that font (lowercase and uppercase) separately and replace them with white pixels, which would ultimately delete the whole title? It might be too much work, but a lot of the programming would probably be copy/paste-able.
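If the letter shapes could be captured as templates, the search itself would just be the same matcher run 52 times - roughly like this (hypothetical: load_template is a made-up helper that would return the 2D pixel array for one letter, and white_out_logo! is the exact-match search-and-replace sketched above):

glyphs = ('a'..'z').to_a + ('A'..'Z').to_a
glyphs.each do |ch|
  template = load_template(ch)     # made-up helper, see note above
  white_out_logo!(page, template)  # reuse the exact-match search-and-replace
end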
-
- Groundskeeper
- Posts: 553
- Joined: Fri Feb 16, 2007 8:55 am
I don't think that ras1's proposal is realistically feasible. Horndude mentioned a processing time of 1-2 minutes per search. Thus, searching for all letters of the alphabet, both upper and lower case (52 separate searches, one after the other), would take roughly 1-2 hours per page, corresponding to a throughput of maybe 10-20 pages per day.
And this is without taking into account that in fact the font size isn't always the same (not sure about the font family), and that there could possibly be diacritical marks and letters from non-English alphabets, etc. And there might well be other problems, like slightly different pixel renderings of the same letter depending on its alignment with respect to the pixel grid, which would call for some kind of fuzzy matching algorithm...
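To make "fuzzy" concrete: the simplest variant would accept a match when at most a few percent of the template pixels disagree. A sketch, assuming the same 0/1 pixel arrays as above (note that this kind of tolerance also defeats the early-exit tricks that make exact search fast):

def fuzzy_match?(page, glyph, x, y, tolerance = 0.05)
  limit = (glyph.size * glyph.first.size * tolerance).floor
  mismatches = 0
  glyph.each_with_index do |row, dy|
    row.each_with_index do |px, dx|
      mismatches += 1 if page[y + dy][x + dx] != px
      return false if mismatches > limit  # bail out as soon as the budget is blown
    end
  end
  true
end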
In short, I don't believe that horndude's program is adaptable for this kind of task. Plus, lastly, even if it was, it would leave us with files with no headings at all, and I'm not sure if we want that.
I tend to agree with tilmaen...
-
- active poster
- Posts: 293
- Joined: Sun Apr 23, 2006 5:08 am
- Location: Phoenix, AZ
Removing the top matter automatically isn't infeasible, just annoying. You'd have to do the same thing as for the logos for every separate PDF. At that point you may as well edit it by hand (almost). I don't think removing by font would be easy because it's essentially an OCR problem. It might be possible to pick out similar components that are near the top of each page and remove them. This, however, would require multiple pages in the PDF and could possibly remove actual content.
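Something like this is what I mean by picking out components (a rough sketch on a 0/1 pixel array; a real implementation would use a proper image library, and the 150-pixel margin is a pure guess):

require 'set'

# flood-fill the black component containing (x0, y0) and return its pixels
def collect_component(page, x0, y0)
  seen  = Set[[x0, y0]]
  stack = [[x0, y0]]
  until stack.empty?
    x, y = stack.pop
    [[1, 0], [-1, 0], [0, 1], [0, -1]].each do |dx, dy|
      nx, ny = x + dx, y + dy
      next if ny < 0 || ny >= page.size || nx < 0 || nx >= page[ny].size
      next if page[ny][nx].zero? || seen.include?([nx, ny])
      seen << [nx, ny]
      stack << [nx, ny]
    end
  end
  seen
end

def remove_top_components!(page, margin = 150)
  (0...[margin, page.size].min).each do |y|
    page[y].each_index do |x|
      next if page[y][x].zero?
      component = collect_component(page, x, y)
      # only erase components that sit entirely inside the top margin;
      # anything reaching below it is more likely to be real content
      next unless component.all? { |_, cy| cy < margin }
      component.each { |cx, cy| page[cy][cx] = 0 }
    end
  end
end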
Also I sped up the search somewhat by skipping initial blank lines (duh!). It takes 10-15 seconds per page I'd guess.
-
- Groundskeeper
- Posts: 553
- Joined: Fri Feb 16, 2007 8:55 am
Carolus wrote: It's not what I'd term a high priority, aesthetically desirable as it may be.
Well, I agree that in other collections I have seen some pretty ugly additions with utterly unsuited fonts, but as far as the OM scores are concerned, I thought they were rather decent (though certainly not perfect). What exactly is it that you dislike about the OM scores?
-
- Groundskeeper
- Posts: 553
- Joined: Fri Feb 16, 2007 8:55 am
Carolus wrote: Actually, I was thinking of other collections rather than the OM scores, which I am not as familiar with.
Sorry for the misunderstanding then - from the context I had assumed this was about the OM scores.
It seems that some of the other collections will be very challenging, even if we settle for trademark removal only. For example, there are currently some Beethoven scores on the server which have a logo scaled to different sizes and even different aspect ratios - not sure if we can ever automatically clean them up. It would certainly require more sophisticated tools than what we have available right now...
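Just to make the difficulty concrete: the naive extension would be to rescale the logo template and run the search once per size, something like the sketch below (nearest-neighbour rescale on a 0/1 pixel array, reusing the white_out_logo! matcher sketched earlier in the thread; the scale list is invented, and each extra size multiplies the already slow search):

def rescale(img, sx, sy)
  h, w = img.size, img.first.size
  Array.new((h * sy).round) do |y|
    Array.new((w * sx).round) { |x| img[(y / sy).floor][(x / sx).floor] }
  end
end

[0.5, 0.75, 1.0, 1.25].each do |sx|
  [0.5, 0.75, 1.0, 1.25].each do |sy|  # unequal sx/sy covers the distorted aspect ratios
    white_out_logo!(page, rescale(logo, sx, sy))
  end
end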
-
- active poster
- Posts: 293
- Joined: Sun Apr 23, 2006 5:08 am
- Location: Phoenix, AZ
Re: FTP Server
Thanks Horndude77!
I also didn't see anything slip through. You are a real hero of this project.
I have slightly amended your script to allow automated batch processing of OM files volume by volume.
ftp://imslp.org/OP Project/batch_script/
I've included the pre-compiled liblept.so.1.60 for both x86 and amd64 for Ubuntu 8.10 onwards and updated documentation (README) explaining how to set up and run the script to batch-process OM stuff.
Please have a good look at my README and tell me what you think. The script is designed to process each full volume in one go. As explained in the README, I tried to standardize instrument names and avoid spaces and non-ASCII characters in names, to keep different filesystems (and my script) happy.
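The renaming amounts to something like this (a sketch of the idea only, not the exact code in the batch script):

def sanitize(name)
  name.unicode_normalize(:nfkd)  # "é" becomes "e" plus a combining accent
      .encode('ASCII', invalid: :replace, undef: :replace, replace: '')
      .gsub(/\s+/, '_')             # no spaces
      .gsub(/[^A-Za-z0-9_.-]/, '')  # keep a conservative character set
end

sanitize('Violoncello é Basso')  # => "Violoncello_e_Basso"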
On a C2Q (Core 2 Quad) computer, it should take around 1.5 hours for this script to process a volume without user intervention.
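In outline, the batch driver just loops over the volume directories and hands each one to the cleaning script, roughly like this (the directory pattern and invocation are illustrative, not the real script - see the README for the actual usage):

Dir.glob('OM_vol_*').sort.each do |volume|
  puts "Processing #{volume}"
  system('ruby', 'clean_pdf.rb', volume) or abort("failed on #{volume}")
end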
-
- active poster
- Posts: 293
- Joined: Sun Apr 23, 2006 5:08 am
- Location: Phoenix, AZ
Re: FTP Server
Great! I'm glad you built leptonica to make it easier for others to use. The README looks good. I'm sorry I don't have much time to review in depth right now.
A couple other observations from working with the script:
- Look at line 63 of clean_pdf.rb. I added a pause so that the files can be checked/edited before the images are compiled into a PDF (roughly what's sketched after this list). I found it useful on scores where removal didn't work well. (musicprog has been asking me to take a look for a while now. It's slower from what I understand, but I'm wondering if it will fix these problem files. I still need to do this.)
- I found the 'harp and others' volumes somewhat problematic with this approach. Manual work is required in renaming files.
- The UTF-8 characters don't seem to work well in the PDF titles. I haven't investigated this much.
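The pause mentioned in the first point is nothing fancy - roughly this (a reconstruction, not the literal line 63; the path is a placeholder):

work_dir = 'pages'  # placeholder: wherever the cleaned page images were written
puts "Check/edit the images in #{work_dir}, then press Enter to build the PDF"
$stdin.gets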
Good luck!
Re: FTP Server
horndude77 wrote: Great! I'm glad you built leptonica to make it easier for others to use. The README looks good. I'm sorry I don't have much time to review in depth right now.
A couple other observations from working with the script:
- Look at line 63 of clean_pdf.rb. I added a pause so that the files can be checked/edited before the images are compiled into a PDF. I found it useful on scores where removal didn't work well. (musicprog has been asking me to take a look for a while now. It's slower from what I understand, but I'm wondering if it will fix these problem files. I still need to do this.)
- I found the 'harp and others' volumes somewhat problematic with this approach. Manual work is required in renaming files.
- The UTF-8 characters don't seem to work well in the PDF titles. I haven't investigated this much.
Good luck!
- The UTF-8 problem caused me to ditch file_mapper.rb and just use the original name of the file (minus .pdf) as the name of the work (KISS principle). In the batch script, I re-group the music by work; each PDF file has its instrument name appended to the original stem (see the sketch at the end of this post). Yagan said that the copyright reviewers should be clever enough to work out the identities of the scores.
- Yes, there is some renaming of the instrument folder names to keep the filesystem and the bash scripts happy (removing spaces and non-ASCII characters), but folders only need to be renamed once per volume, so even though this step is manual, there is little work on the user's part.
- I noticed that at later stages you added a couple of extra patterns; they are really useful, and so far I haven't seen any wrong files. But if users are paranoid, they can uncomment line 63 of clean_pdf.rb to allow manual checking.
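For clarity, the appended-stem naming mentioned above is just this (illustrative values):

orig, instrument = 'Symphony_No_5.pdf', 'Violin_I'
new_name = "#{File.basename(orig, '.pdf')}_#{instrument}.pdf"
# => "Symphony_No_5_Violin_I.pdf"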