Large Document Classification with BERT

By Karl Davidson

Some background

I recently had to tackle a deep learning task involving Natural Language Processing/Understanding (NLP/NLU), a field in which I have maybe four months of experience, on top of only about a year’s worth of experience in data science. The task was large document classification, where the document can be anything from a one-pager to, well, whatever you want.

If you’re a novice data scientist like me, this is rather daunting as a lot of the frameworks in the field of NLP/NLU are particularly advanced and can be difficult to follow, even with their generally extensive documentation.

(As an aside, if you plan to do any production level work with NLU, consider having a GPU (graphics processing unit), and try not to order them during a pandemic when everyone wants to be a gamer, and there’s a silicon shortage.)

Back to the task at hand: when you’re doing something you’ve never done before, usually the best first step is to boot up your favourite search engine, ask “how to do __insert_new_thing_here__” and hope you get a good, recent article. This seems especially true in software and data, because who writes their own code anymore? Anyway, there are many tools to use for such a problem, but I chose BERT and Huggingface Transformers. I’ve linked them both to some level of documentation you can look into yourself, as describing them here isn’t the purpose of the article. The nice thing about these tools is that they are either integrated into or work seamlessly with the Python libraries that I used to code out this problem. I say “libraries”, plural, although I only ended up using one in the final product; I’ll explain why I started with the other.

TensorFlow (TF) and PyTorch (PT) are very similar libraries in that they do a lot of the same things: both are designed for deep learning with tensor objects. Before I show my process with PyTorch, I’ll discuss where I got a lot of my starting ideas, and why I stopped using TF for this problem. I had ended my first string of research with this article (which I’ll just denote as [1]), which discusses large document classification with TF. I used a good portion of its main ideas, and because the article is a few years old, I had to make a few edits here and there to account for versioning issues. By the end, I had a fully working TF model on Colab, but therein lay the issue. As it turns out, there exists a bug where a model trained on a Colab GPU is particularly difficult to get working on any other processing unit (an article discusses this in some depth here). After a couple of hours trying to sort through it, I decided to switch to PT, both to avoid sinking more time into the bug and to stay consistent with another member of my team who was already using it.

Getting started

You can refer to the TF article I mentioned above for more detail on some of the steps I describe below. Beyond that, I followed along fairly closely with this PyTorch article (denoted [2]). In what follows, I’ll note the few changes I made, but I see no sense in copying and pasting someone else’s code when you can find it in their article!

The data you’ll need is a collection of documents from which you can extract text. These documents will need labels, meaning you’ll have to know whether they are applications, resumes, forms, etc. Once you have the text extracted (by whatever means you want; I used a combination of PyPDF2, textract and pytesseract), make sure you clean it up a little. This can be done easily enough with any regex library. You probably don’t care about much except alphanumeric characters; just note that underscores count as “word” characters in regex (the \w class matches them), so you’ll have to remove those explicitly (they gave me some grief before I figured this out).
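As a rough illustration of that clean-up step, here is a minimal sketch using Python's built-in re module. The function name clean_text and the sample string are made up for this example, and the character class assumes English text (accented letters would be stripped as well), so adjust it to your own documents.

```python
import re


def clean_text(raw):
    """Keep ASCII letters, digits and single spaces; drop everything else."""
    # [^A-Za-z0-9\s] removes punctuation and also underscores,
    # which the shorthand \w would have kept.
    text = re.sub(r"[^A-Za-z0-9\s]", " ", raw)
    # Collapse runs of whitespace (including newlines) into one space.
    return re.sub(r"\s+", " ", text).strip()


# Hypothetical snippet of extracted text with form-style underscores.
sample = "Name: ______ Jane_Doe\n\nPhone #: 555-0100"
cleaned = clean_text(sample)
print(cleaned)
```

Spelling out the allowed characters, rather than relying on \w, is what keeps the underscores from slipping through.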

Once all the data was extracted, I found it easiest to put it in a pandas dataframe, where I could easily map all the labels to numeric values (any label encoder will do for this; I used Scikit-Learn’s). Now is when the fun begins.

As you might know, BERT does its work with tokens, not text. Furthermore, the longest sequence of tokens you can put into a standard BERT model is 512, which is probably about the length of a high schooler’s first essay. As you might expect, there isn’t an obvious way to deal with large documents, since they can be many, many pages long. The method I used comes from [1], and simply splits the text on whitespace into overlapping chunks of words. I modified the code to produce 150-word chunks (rather than 200), and kept the overlap of 50 words between subsequent chunks. Once all the text was split up, I put it back in a dataframe. Make sure that each 150-word chunk retains its original label! Here’s the function to make this happen.

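def get_split(text):
    """Split text into 150-word chunks that overlap by 50 words."""
    words = text.split()
    total = []
    # A new chunk starts every 100 words; always produce at least one chunk.
    # Note: because n is a floor division, up to 49 trailing words can
    # fall outside the last chunk.
    if len(words) // 100 > 0:
        n = len(words) // 100
    else:
        n = 1
    for w in range(n):
        if w == 0:
            partial_str = words[:150]
            total.append(" ".join(partial_str))
        else:
            partial_str = words[w*100:w*100 + 150]
            total.append(" ".join(partial_str))
    return total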
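get_split derives the number of chunks from an integer division, so a document whose length is not a multiple of 100 words can lose a few words at the end. The sketch below is one possible variant, not the code from [1]: it keeps stepping until the tail is covered, and it carries each document's label along with every chunk. The names chunk_words and chunk_with_labels and the toy documents are assumptions made for illustration.

```python
def chunk_words(text, size=150, overlap=50):
    """Return overlapping word chunks that cover the whole text."""
    words = text.split()
    stride = size - overlap  # a new chunk starts every 100 words
    chunks = []
    start = 0
    while True:
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this chunk reached the end of the document
        start += stride
    return chunks


def chunk_with_labels(documents):
    """Expand (text, label) pairs into (chunk, label) pairs."""
    rows = []
    for text, label in documents:
        for chunk in chunk_words(text):
            rows.append((chunk, label))
    return rows


# Hypothetical toy documents of 299 and 30 words.
docs = [
    (" ".join(f"w{i}" for i in range(299)), "resume"),
    ("short form text " * 10, "form"),
]
rows = chunk_with_labels(docs)
print(len(rows), [label for _, label in rows])
```

A list of (chunk, label) tuples like this can be passed straight to pandas.DataFrame with a columns argument to get back to a dataframe.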

Tokenizing and training set up

Next up we’ll need to actually start getting the BERT model all set up. For getting the model working, I suggest just using the ‘bert-base-uncased’ model, and then consider switching to something bigger for production. To be nice and clear, BERT is a pre-trained model, so when I say “training” I really mean “fine-tuning”, since we are just modifying the already built models to suit our needs. 

Tokenizing the data is particularly easy, but we want to make sure we get more than one bit of information from the process. Here’s some more code.

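from transformers import BertTokenizer

# Load the lowercase tokenizer that matches the 'bert-base-uncased' model.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

def super_encode(x, tokenizer):
    # Encode one text chunk into input ids plus an attention mask.
    encoded_dict = tokenizer.encode_plus(
        x,
        add_special_tokens=True,     # add the [CLS] and [SEP] tokens
        max_length=512,              # BERT's maximum sequence length
        padding='max_length',        # pad shorter chunks up to max_length
        truncation=True,             # cut anything longer than max_length
        return_attention_mask=True,  # 1 for real tokens, 0 for padding
        return_tensors='pt',         # return PyTorch tensors
    )
    return encoded_dict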

This function can be applied directly to a pandas dataframe column (specifically the one that contains your chunked text) and returns an encoded object containing the tokenized text as a tensor of integers (the ids of the individual tokens) and the attention mask (a tensor of ones and zeros: ones for real tokens, zeros for padding). Setting return_tensors to ‘pt’ makes these PyTorch tensors.

The next step is to split your data into training, validation and test sets (I did an 80/10/10 split). You’ll need both the training and validation sets for the actual training, while keeping the test set for the very end. Following that, you’ll need to prepare your data for PyTorch with the DataLoader object. This step is outlined well enough in article [2], so just make sure you do it for your train, validation and test sets separately. Don’t forget that in addition to the input ids and attention masks, which are already in tensor form, you’ll need to turn your dataframe’s label column into a tensor so that the DataLoader can read it. Now that you have these things, you’re all ready to go.
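For the 80/10/10 split itself, a minimal sketch with only the standard library might look like the following. The function name split_indices, the seed and the row count are assumptions for illustration; Scikit-Learn's train_test_split applied twice would do the same job.

```python
import random


def split_indices(n_rows, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle row indices and cut them into train/validation/test lists."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_train = int(n_rows * train_frac)
    n_val = int(n_rows * val_frac)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # whatever is left over
    return train, val, test


# Hypothetical dataframe with 1000 chunk rows.
train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))
```

With a pandas dataframe, each index list can be passed to .iloc to pull out the matching rows before building the three DataLoader objects.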


The first step here is to pick your favourite model parameters. I had to mess around with these a bit, but you can use the ones from articles [1] and [2] as starting points. The one note I will make here is that depending on your system (including if you use Colab), you may have to shrink the batch size a bunch. I needed to use a batch size of 8, as opposed to 16 or 32, otherwise I’d get an out-of-memory (OOM) error (the worst). This may change for you, but I expect that since we’re in the business of large documents, this is just what it has to be.

You may have noticed that we are using a BertForSequenceClassification model, which was used in [1], but is not in [2]. Fortunately, this doesn’t actually matter for training, and we can use the training method used in [2] (almost) verbatim. There is one change that needs to happen due to the age of [2]: recent versions of Transformers return an output object rather than a tuple, so

loss, logits = model(b_input_ids,

must become:

model_obj = model(b_input_ids,

loss = model_obj.loss
logits = model_obj.logits

The chunking of all my documents meant I had about 30,000 training points and about 3,000 validation. The training ended up taking about 3 hours (1 hour per epoch), which isn’t terrible but it’s something to keep in mind. To my surprise, the accuracy of my model ended up being >98%, which is pretty substantial, but suspicious. I looked into it more (and tested it with the test data, which I’ll discuss soon) and my suspicions turned out to be unwarranted, which was great. Now that the model is fine-tuned, make sure to save both your tokenizer and your model for later use! 
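To get a feel for what those numbers mean per training step, here is a back-of-the-envelope calculation. It uses the approximate figures quoted above plus the batch size of 8, and it assumes three epochs (inferred from three hours at one hour per epoch), so treat the result as a rough estimate only.

```python
import math

train_examples = 30_000   # approximate number of training chunks
batch_size = 8            # batch size that avoided the OOM error
epochs = 3                # assumption: 3 hours at about 1 hour per epoch
seconds_per_epoch = 3600  # approximate

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
seconds_per_step = seconds_per_epoch / steps_per_epoch

print(steps_per_epoch, total_steps, round(seconds_per_step, 2))
```

That works out to roughly one second per batch, which is a handy baseline when deciding whether a different batch size or GPU is worth trying.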


Testing your model requires code very similar to the training code, and it can be found in [2]. Remember to make the same change for the logits (there is no loss here). Once you run the model on your test data, you’ll have to decipher the outputs. BERT outputs are pretty confusing, but it really boils down to this: for each input, the position of the largest number in the returned array corresponds to the predicted label. Just do a numpy argmax here to pick those positions out, and match each one to the corresponding label. From there, you can add this new list of predicted labels to your dataframe of test data.
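The argmax step can be sketched as follows. The logits and class names here are invented for illustration; in practice the logits come from the model and the class order comes from whatever label encoder you fitted earlier.

```python
import numpy as np

# Hypothetical logits for 4 chunks and 3 classes (one row per chunk).
logits = np.array([
    [2.1, -0.3, 0.4],
    [-1.0, 0.2, 3.3],
    [0.1, 1.7, -0.5],
    [4.0, 0.0, -2.0],
])
# Hypothetical class names, in the order the label encoder assigned ids.
class_names = ["application", "form", "resume"]

# For each row, take the column index of the largest logit.
pred_ids = np.argmax(logits, axis=1)
pred_labels = [class_names[i] for i in pred_ids]
print(pred_labels)
```

If you used Scikit-Learn's LabelEncoder, its inverse_transform method does the same id-to-name mapping as the list lookup here.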

Of course, you can just use some basic filtering to see how many your model got wrong, but it’s better to be a little more robust. Run the true and predicted label columns through Scikit-Learn’s classification_report for a nice table of precision, recall and F1 scores for each of your individual labels.
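A minimal sketch of that report, using made-up true and predicted labels rather than anything from my actual test set:

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted labels for six test chunks.
y_true = ["form", "resume", "resume", "application", "form", "resume"]
y_pred = ["form", "resume", "form", "application", "form", "resume"]

# Text table with precision, recall, F1 and support for each label.
print(classification_report(y_true, y_pred, zero_division=0))

# The same numbers as a nested dict, handy for further filtering.
report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)
```

With a dataframe, the two lists would simply be the true-label column and the predicted-label column.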

With that, you’re all done! I intended for this article to act simply as an “update” and brief guide to a problem that I found a little frustrating to research. Hopefully it will speed things up for you a little!


I referenced these articles as my main sources of information for my project, and mentioned them a lot during the article, but only hyperlinked them a couple times. Here they are in full:

[1] Using BERT For Classifying Documents with Long Texts

[2] BERT Fine-Tuning Tutorial with PyTorch