Research Hell, um, I mean Research Query

By Holly Tucker (W&M Editor)

Hi everyone…we interrupt our regular programming for an urgent call for help.

I’m in the thick of writing and research for my next book, which is under contract with W.W. Norton.  I’m not allowed to say much about it right now…other than that I may genuinely be more excited about this one than I was for my last one  (Blood Work:  A Tale of Medicine and Murder in the Scientific Revolution).

So here’s the challenge:

The current book requires me to consult  thousands upon thousands of pages of court records and interrogation reports.  Many of these reports exist only in manuscript scribble–which brings its own special form of suffering, err, delight.  But for the moment, I’m focusing on about 5 hefty volumes of transcribed texts.

The documents are in PDF format and–another wrinkle–all in French. They contain testimonies from hundreds of people, each of whom was either ostensibly up to some fabulously wicked things OR accused of similarly fabulously wicked things.

To make sense of the avalanche of accounts, I need to put together an index or concordance of each witness and each person who is named in the accounts. This is critical for me to be able to figure out who was up to what, when, and why.

Really, I just want something simple like this.

Curie, Marie.   5, 29, 523, 1502.  (ideally with the page numbers hot-linked to their locations in the PDF itself)

Curie, Pierre.  16, 99, 504, 1412.
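
For readers who do script a little, here is a rough Python sketch of that target format. It assumes the text of each PDF page has already been pulled out into a list of strings, one per page (a PDF text-extraction library could do that step, which is not shown). The function names, the sample sentences, and the idea of listing spelling variants under each heading are all illustrative assumptions, not a tested workflow for these volumes, and the sketch does not produce hot-links.

```python
import re

def build_index(pages, people):
    """Return {heading: [page numbers]} for every heading found in the pages.

    pages  -- list of page texts; page numbers are counted from 1
    people -- {index heading: [spellings to search for]} (hypothetical input)
    """
    index = {}
    for page_number, text in enumerate(pages, start=1):
        for heading, spellings in people.items():
            # Whole-word, case-insensitive match on any listed spelling.
            found = any(
                re.search(r"\b" + re.escape(s) + r"\b", text, flags=re.IGNORECASE)
                for s in spellings
            )
            if found:
                index.setdefault(heading, []).append(page_number)
    return index

def format_index(index):
    """Render one 'Last, First.  p1, p2.' line per heading, alphabetically."""
    lines = []
    for heading in sorted(index):
        page_list = ", ".join(str(p) for p in index[heading])
        lines.append(f"{heading}.  {page_list}.")
    return "\n".join(lines)

# Made-up sample pages standing in for the extracted PDF text.
pages = [
    "Marie Curie parut devant le tribunal.",
    "Le témoin nomme Pierre Curie.",
    "Marie Curie et Pierre Curie furent interrogés.",
]
people = {
    "Curie, Marie": ["Marie Curie"],
    "Curie, Pierre": ["Pierre Curie"],
}

index = build_index(pages, people)
print(format_index(index))
```

The catch is that the list of people and their spellings still has to come from somewhere, so this only automates the page lookup, not the reading.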

I could create a hand-coded index, a process that would take me well into my twilight years.  Or I could extract each letter/report/record contained in the PDF and import each into my database (Devonthink Pro Office) and tap the “see also” AI functions of DTPO.  But, call me crazy, I’d like to finish this book at some point in the foreseeable future.  (And so would my editor…who is expecting the manuscript in the next 15 months.  Gulp.)

I’ve looked at book indexing programs like Cindex ($500+!), PDF Index Generator (which could work, if the computerized voice on the tutorials didn’t bug me so much),  as well as many articles on indexing strategies (see, for example, this ProfHacker article).

Adobe Acrobat Pro will let me save individual searches into nifty files.  But it doesn’t seem to be able to organize multiple searches into a single hot-linked document.  So I could end up with hundreds of different files for the hundreds of different people/historical figures I’m trying to track right now. (Cue sound of head pounding on wall.)

As those of you who know me can attest, I usually get excited about finding just the right tool for just the right research need.  For the record, the programs I could not live without are:  Dropbox, Devonthink Pro Office (database), Scrivener (writing program), Sente (citation manager), and Aeon Timeline, which has been amazing as I plot chronology for the book.


Alas, as much as I wish that I were a true Research Sensei, I admit only to being a clueless dolt. This indexing question is truly kicking me in the pants.

So help this poor soul?  Share Tips?  Strategies?  (NB: No deep programming skills required, please.  The whole point is to simplify the process, rather than complicate it with a steep learning curve.)

And if all else fails, offers to buy me a drink and join me in commiseration?



  • Heinrich C. Kuhn

    Are these PDFs of yours text files (i.e. the manuscripts have already been transcribed), or just images?

    If they are text files, my approach would be to convert them into MS Word files.

    Then write a macro which goes through your text and automatically indexes every string that starts with a capital letter and whose second word also starts with a capital letter (as your texts are in French, only names of persons and geographical items should be capitalised, and restricting the search to two-word strings should eliminate most of the geographical terms).
    Then have Word generate your index.

    You would then have to reconcile manually the different spellings of the name(s) of one and the same person etc., but that should not take too much time.
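
The capitalised two-word heuristic described above can also be tried outside Word. What follows is a minimal Python sketch of the same idea, not the Word macro itself: it assumes the transcribed text is available as a plain string, and the pattern names and sample sentence are made up for illustration. It will also pick up false positives, such as a capitalised sentence-opening word followed by a name, so the output is a list of candidates to review by hand.

```python
import re
from collections import Counter

# One capitalised word: an uppercase letter (including accented French
# capitals) followed by one or more lowercase letters.
CAPITALISED = r"[A-ZÀ-ÖØ-Þ][a-zà-öø-ÿ]+"

# Two capitalised words in a row, separated only by whitespace.
NAME_PAIR = re.compile(rf"\b{CAPITALISED}\s+{CAPITALISED}\b")

def candidate_names(text):
    """Count every two-word capitalised string found in the text."""
    return Counter(NAME_PAIR.findall(text))

# Made-up sample text standing in for a transcribed record.
sample = (
    "Le témoin accuse Marie Curie. Plus tard, Pierre Curie confirme "
    "que Marie Curie était à Paris."
)

candidates = candidate_names(sample)
for name, count in candidates.most_common():
    print(name, count)
```

Single-word place names such as "Paris" are skipped, as the comment predicts, but names that do not follow a two-word pattern would be missed too.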



    • Holly Tucker

      As I read through the primary docs, it doesn’t look like the person call-outs are in a predictable (First, Last Name) pattern. :( Will see if I can find other patterns to search for.

    • Holly Tucker

      Hmmm…food for thought. I’ll read through the text and get a much better sense of how the names are presented. (I’m not sure they’re always First Last). Once I see a pattern, I’ll give this a try.

  • Musebrarian

    Are you familiar with the Old Bailey project? It sounds like they’ve done similar things using TAPoR; see “Data-mining with Criminal Intent”.

    • Holly Tucker

      OH MY! What a fantastic site! I’ll need to dig around and see how they approached the same kind of issue. But I have a feeling it is soooo much more high-tech than I have time for.

    • Holly Tucker

      No, I hadn’t seen this. Wow! Wow. This looks fascinating.

  • mona everett

    My head is spinning. Where do I send the drink money? LOL! No, wait. I’ll just drink it. You won’t have time. Sorry.

    • Holly Tucker

      How was that drink, Mona?


  • Audra (Unabridged Chick)

    Holly, I laughed at this subject line at first until I realized this was a serious query. My brain exploded a little. I’m asking my friends who do research if they have any suggestions.

    • Holly Tucker

      Thanks, Audra!

    • Holly Tucker

      Hey, thanks for your willingness to ask around (and to share exploded-brain sympathies)!

  • Pamela Toler

    Yikes! I have no ideas, but will ply you with whiskey from the good bottle the next time you’re in Chicago.

    • Holly Tucker

      Whisky? Sounds good: I need it.

    • Holly Tucker

      Good whiskey? Count me in! I’m now trying to use Adobe Acrobat’s annotation tool. The trouble is getting the file into a format that the OCR likes. The thought is that I’ll read for names and annotate each one. Then I can search for specific names in the annotations. It’s still not as automated as I’d like though…