First off I'd like to say that your programme does far and away the best job at cleanly converting PDFs that I've seen to date.
I was wondering if you would consider adding the ability to output the HTML markup into individual files, one per page. I realise that it's technically possible by ignoring all lines except those on the desired page, one page at a time, but that makes it difficult to use the ignore functionality for its original purpose, and in our case, where we have to convert documents hundreds of pages long, it massively increases the time taken to do the job.
It occurs to me that you've written PDFMasher concentrating mainly on the use case of eBook creation, but some of your potential user base might have different uses.
Anything is always possible, but your request has me curious: what would be the purpose of having one HTML file per page?
In our case, we provide digital publications as rich turning page presentations, but many of our clients have a need for a ‘text only’ view of the page content, basically a button that presents the textual content with structural markup such as headings, paras and lists but with no layout or style.
I imagine another use would be if someone wanted to populate pages of a website with the contents of a PDF.
The feature would be a bit too specific for my own taste, but a little research has shown a program to split HTML based on header tags:
One thing that PdfMasher could do is to give the option to automatically insert “Page <page number>” elements at the beginning of each page, which would allow you to have easy markers for your page splitting.
Fair enough… If it wouldn't benefit enough of your users then I understand that it's not worth developing the feature.
When you mention inserting ‘Page <page number>’, do you mean that there's the ability to do this already, or is this a feature you're looking at adding? We could definitely post-process based on that, although the marker would probably need to be differentiated somehow from ‘Page x’ appearing in the text of the PDF; maybe it could take the form of an HTML comment.
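For illustration, post-processing on such a marker could look roughly like the sketch below. It is only a sketch under assumptions: the marker format `<!-- page N -->` is hypothetical (nothing in this thread confirms that the program emits it), and the names `split_pages` and `write_pages` are made up for the example. Because the marker is an HTML comment, a phrase like ‘Page 3’ in the body text of the PDF would not be mistaken for a page break.

```python
import re
from pathlib import Path

# Hypothetical marker format: an HTML comment such as <!-- page 12 -->.
# This is an assumption for illustration, not a confirmed output format.
PAGE_MARKER = re.compile(r"<!--\s*page\s+(\d+)\s*-->", re.IGNORECASE)


def split_pages(html):
    """Return a dict mapping page number to the markup following its marker."""
    # With one capture group, re.split returns
    # [text_before_first_marker, num1, body1, num2, body2, ...]
    parts = PAGE_MARKER.split(html)
    pages = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = body.strip()
    return pages


def write_pages(html, out_dir):
    """Write one file per page (page_0001.html, ...) and return the paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for number, body in sorted(split_pages(html).items()):
        path = out / f"page_{number:04d}.html"
        path.write_text(body, encoding="utf-8")
        written.append(path)
    return written
```

Anything that appears before the first marker is dropped in this sketch; whether that is acceptable would depend on what the converted document puts there.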