pdf to data conversion

Intro

So….you have lots of pdfs that you want to extract some data from? I’ll tell you my story first, then give you the files and finally some hints on how to get your own data.

My Story

For a number of years I’ve set my own evaluation/feedback questionnaires for the classes that I run in Uni. I’ve always found the student responses useful, but because the questions were my own it was impossible to correlate my results with other courses, because – like me – the coordinators would have done their own surveys.

This year the Uni rolled out a standard evaluation questionnaire which was nicely integrated into our VLE (Virtual Learning Environment). Class coordinators were sent the results automatically. However, I still wanted to compare the scores from my classes to others in the same subject field and then the rest of the department…and that’s where the challenge started!

One of our admin teams had downloaded all the survey results, but they came as either a CSV file with all the original survey responses for all the students in all the classes in our department, or as summary pdf file for each class. Neither of these formats gave me (and my colleagues) the information we wanted in the format we needed. I did some ‘manual harvesting’ for a few classes I was curious about, but it was quite a faff.

My recent forages into Data Carpentry with my colleague Leighton Pritchard made me wonder if there was a better way…all it needed was a slow Friday evening and a couple of hours recycled from a Saturday! There were python libraries available online for converting pdfs to data, but the coding wasn’t clear to me and you needed to ‘register’ with someone before installation (which I’ve never come across for python before so I was a bit suspicious!). I thought of using the text mining functions in Unix too. However, python, unix and R would all require non-techy users to download additional software. If I could code something up in Visual Basic I could use MS Office which all my colleagues have installed.

Building the macros

Step one was to convert the pdf files to text files. Looking around online it wasn’t clear how I’d do this. Adobe wanted a monthly subscription for the privilege of doing the conversion by batches. Eventually I realised that Word will open pdf files, and – of course – save those files as text-only (or .txt) files. I adapted some Google’ed code and came up with a macros that does the conversation. (See the instructions and links below).

Step two was to extract the relevant values from the text file and display them in a table. This was a bit trickier, for three reasons: firstly, the text files generated in step 1 didn’t always have the data I wanted in the same place in every text file; secondly the same data and the key phrases I could use to search for that data were repeated at different places in the same text file; and finally the file length wasn’t consistent, some files were huge (running to several thousand lines) while some were really short! Again, ‘Google is my friend’ and I managed to get code that can cope with all three problems and spit the numbers out into a table.

The files

WordPress won’t let macro-enabled files be stored on the server, so I’m hosting them on my personal uni webpages. The Word document is here, and the Excel file is here.

Instructions

  1. Download the Word doument from the link above. The document – called Doc_with_macro.docm) has a macro in it so you will get a ‘warning’ when you open it up, but you need to have the macro enabled for the code to work. The document contains the instructions on how to use it, but they are repeated below.
  2. Put all the pdf’s you want to change into one folder
  3. In the folder you created in step 2, create a new folder to contain the text files and save the Word document to that new (text files) folder
  4. Click the white circle near the upper left hand corner of this window (the so-called Quick Access toolbar), select the folder that the pdfs are stored in and click OK
  5. On my computer each file conversion takes about 10 seconds so it might take a while for large file numbers.
  6. The text files are now in the new text files folder.
  7. Download the Excel file from the link above. (The Excel files has no space for instructions in it – unlike the Word file – so I’ve written them here.)
  8. Save the Excel file to a folder of your choice. When you open the file Excel will come up with a warning that the file has a macro: you have to have the macro enabled for the code to work.
  9. Clear any data in rows 2 onward (if you’re using the downloaded Excel file it won’t have any data in it). In fact you can keep data, but be aware that new values are added to the bottom of any existing data table.
  10. Click one of the two ‘macro buttons’ and select the folder that your text files are in. The ‘Collect Data (CORE)’ button is the best one to use, but some module evaluation reports don’t use the phrase ‘CORE – Teaching & Learning’, but just ‘Teaching & Learning’. If you use the ‘Collect Data (CORE)’ and there are lots of gaps in your table, use the ‘Collect Data (no CORE)’ button instead.
  11. The data will be added to the worksheet. (This normally happens very quickly.)
  12. Optional – copy the data to another blank workbook so that: firstly, you aren’t sending/storing Excel sheets with macros in them; secondly, your boss won’t know you’ve use VBA to complete the task and you can grab a cuppa before you send the table onto them!!
  13. Optional – I’d recommend counting the number of rows matches the number of original pdf files, and checking a few of random rows with the original numbers in the pdfs.

Final thoughts

A big thanks to all those authors who’s code I’ve adapted (I’ve tried to add weblinks and references in the macro comments where I have used their code).

I’m happy to try and help folks if I can (but parsing data in different formats can be a bit tricky and I might not be able to help either time-wise or technically). Please use the comments box if you want to suggest any improvements. I’ve run these macros on a couple of pdf batches relevant to the department in which I work and they worked OK, but beyond that I’ve not done any ‘testing’. I’m not a professional coder: the macros work fine on my computer, but I can’t vouch for anyone else system so ‘use at your own risk’.

If you want to use this for your pdfs it’ll need a bit of coding knowledge to modify the Excel macro. I’ve tried as much as I can to explain the code in the comments box. You need to open the Excel macro in the VBA editor and match your code to the structure of your text files.

For each text file, each line of text is loaded one by one, the key is identifying which lines you need to extract that data from. By opening the macro you can see that I’m taking data from the filename, from two points in long text ‘string’ made by concatenating the first 20 lines of text, and four points by taking the first ‘non-blank’ line after a ‘key phrase’. The macro only looks at the first 200 or so lines (where all my relevant data should be). Perhaps you could modify, copy and delete these wee bits if code to get your data into a table. Good luck. I’m happy to help if I can, but keep an eye out for the excellent Data Carpentries, Software Carpentries and Library Carpentries course – they are well worth the time.

Categories:

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s