I did not realise how long it has been since I last blogged. Having recently been given the opportunity to work on some NLP (Natural Language Processing) at work, with no prior experience in the area, I resorted to my first option of learning: an e-book I purchased some time ago through a Humble Bundle (reference to the book at the end of the article).

Humble Bundle is usually an impulse buy for me when I see the books on offer and, truth be told, I rarely get through most of what I buy. But the important thing is that I stumbled upon this gem of a book in one of those purchases, so I guess it is a win.

As an initial assessment at work, I needed to do some basic work with POS (Part of Speech) tagging, noun chunking and dependency parsing on text we take from different sources. We had been using the CoreNLP Java library for some NLP tasks before and wanted to explore which libraries out there better suit our needs.

With a strong urge to up my Python fu, I wanted to try out how best we could use some of the Python libraries out there for the activity at hand. It would probably be better if I did a separate post on the different Python libraries I tried out, so here I will only focus on how I used Spacy, in order to maintain brevity and, more importantly, with the intention of not putting you, my lovely readers, to sleep on a Monday.

Spacy is one of the top contenders for NLP with Python, and I wanted to see how best we could use it for our use case(s) at hand. Before we get into the nitty-gritty details of what was done, let us have a quick look at my naive attempt at drawing a diagram to explain my work with Spacy on Lambda.

**Disclaimer before we begin**: This is a POC and is not to be misconstrued as a production-ready deliverable.

My current use case is that we drop files we need to run a few NLP tasks on into S3, which triggers a CloudWatch event that invokes the Lambda that runs the task on the given S3 object. Everything is logged to CloudWatch Logs with a log group configured.
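To make that flow concrete, here is a minimal sketch of what the Lambda entry point could look like; the bucket/key extraction assumes the standard S3 event payload, the handler name simply matches the one configured in serverless.yml later in this post, and error handling is omitted. It is illustrative, not the full handler from the repo.

import boto3
import spacy

s3 = boto3.client('s3')

def nlp_with_spacy(event, context):
    # Pull the bucket and key out of the S3 event that invoked the Lambda
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = record['object']['key']

    # Fetch the dropped file and read its text
    data = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
    print(f'Processing s3://{bucket}/{key}')  # ends up in CloudWatch Logs
    # ...run the Spacy tasks described below on `data`...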

I used the serverless framework to build this as I found it the best way to get up and running in the shortest amount of time while allowing me to declaratively configure the resources I need for my application. Probably another post on the different serverless frameworks out there later on.

My end goal for this initial assessment is to have a file generated with the following template;

<<BEGIN>>
<Sentence>
<Constituency parser output>
<<SNP>>
<Noun chunking>
<<ENP>>
<<SDEP>>
<Dependency tree>
<<EDEP>>
<<END>>

Let us take a look at the Python code and explain bits and pieces that I think warrants an explanation as we go on. The complete code can be found on my Github repo.

nlp = spacy.load('en_core_web_sm')
tokens = nlp(data)

Before we tokenize things, we need to use a spacy model. Usually this model is installed using the following command;

python -m spacy download en_core_web_sm

In the serverless world however, things work a tad differently. More on that later on.

So once the model is loaded, Spacy gives us the annotated tokens;

data_map = {}
for token in tokens:
    if data_map.get(token.sent) is None:
        internal_list = []
        internal_list.append((token.text, token.tag_))
        data_map[token.sent] = internal_list
    else:
        data_map[token.sent].append((token.text, token.tag_))

In this section, we keep a map where the key is the actual sentence (not the best approach if we have repeated sentences, but again, this is just an initial POC) and the value is a list of (token text, tag) tuples. Example output would look something like the following;

('But', 'CC'), ('despite', 'IN'), ('longstanding', 'JJ'), ('efforts', 'NNS'), ('to', 'TO'), ('standardise', 'VB')
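Just to make the shape of the map concrete, here is a small sketch (not part of the original handler) of how one could walk it and print each sentence followed by its tagged tokens:

for sentence, tagged_tokens in data_map.items():
    # `sentence` is a Spacy Span (the key); `tagged_tokens` is the list built above
    print(sentence.text)
    print(tagged_tokens)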

Note that in this initial implementation we do not do any text preprocessing such as cleaning up the text, removing stop words, or stemming/lemmatization.
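If we did want a rough version of that, Spacy exposes enough on each token to do it inline; a minimal sketch (not something this POC does) would be along these lines:

# Drop stop words and punctuation, and keep lemmas instead of surface forms
cleaned = [token.lemma_ for token in tokens if not token.is_stop and not token.is_punct]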

Next comes the noun chunking. After parsing the initial text with Spacy, the noun chunks are available via the Doc data structure (the object returned by nlp(), which I named tokens above).

# `doc` here is the Doc object returned by nlp(data)
for chunk in doc.noun_chunks:
    write_to_file(chunk.text, key)

In my case, I just write the noun chunks found into a text file with the template format I defined earlier.
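The write_to_file helper is not shown above; a minimal sketch of what it could look like, assuming the output is assembled in /tmp (the only writable path on Lambda) and the S3 key contains no path separators, is as follows:

# Hypothetical helper: append a line to a per-object output file under /tmp
def write_to_file(text, key):
    with open(f'/tmp/{key}.out', 'a', encoding='utf-8') as out:
        out.write(text + '\n')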

Finally, I needed to do the dependency parsing, and while searching for a solution I stumbled upon one that worked for my case right now.
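I will not reproduce that solution here, but for reference, a bare-bones way of dumping a dependency tree with Spacy (not necessarily what that solution does) is to walk each token's arc to its head:

# Print each token with its dependency label and the head it attaches to
for token in tokens:
    print(f'{token.text} --{token.dep_}--> {token.head.text}')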


When Serverless meets Spacy

Ok, now let us get into the serverless aspects of it. Usually, if I was using Spacy in a stand-alone script, I would download the model I wanted to use with the following command;

python -m spacy download en_core_web_sm

But in the serverless world, things are a bit different. To start off, I used a plugin that uses the Pipfile in my project to install the dependencies needed for my script to run in a Lambda environment. The Pipfile gets converted to a requirements file at build time, which is then used to download the dependencies.

plugins:
- serverless-python-requirements

My Pipfile has the following defined;

[packages]
spacy = "*"
nltk = "*"
boto3 = "*"

[requires]
python_version = "3.8"

[scripts]
runspacy = "python handler.py"

The missing part was the Spacy model I wanted to download. I first tried defining the GitHub link to download it as a tarball in my Pipfile, as follows;

"spacymodel" = {file = "https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.1.0/en_core_web_sm-2.1.0.tar.gz"}

Although this worked initially, it started giving me some weird issues afterwards. So I had to resort to downloading and extracting the Spacy model distribution into my serverless project.

When I ran the serverless deploy command, I saw that the serverless Python plugin was downloading the dependencies I stated in my Pipfile into the root directory as part of the final zip that was being uploaded to S3. So placing the Spacy model in the root directory worked out in the end.
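With the model bundled like that, it can be loaded straight from its path instead of the installed package name; a sketch, assuming the extracted model data directory (the folder containing meta.json) was copied to the project root as en_core_web_sm:

# Load the bundled model by path; the directory name here is an assumption
nlp = spacy.load('./en_core_web_sm')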

One thing to note was that with Spacy and the dependencies I was using, the size of the zip was in excess of 60MB, which meant that when it was extracted it was in the 200+MB range. At the time of writing, Lambda has a limit of 250MB for the final extracted version. I could probably take out some dependencies to reduce it further, but this is an important thing to take note of because you do not want to implement something that, in the end, cannot run on Lambda due to its size limitations.

Another gotcha with using serverless was that after the initial deployment, my function was timing out after 6 seconds. Googling around, I found that the serverless framework sets your function's timeout to 6 seconds by default if you do not specify your own. This was fixed by adding a timeout (in seconds) to my serverless.yml;

functions:
  spacy:
    handler: handler.nlp_with_spacy
    timeout: 300

When building the Python Lambda module, there is a setting you can put under the custom section that uses Docker to compile non-pure Python modules, as stated here. In my case, I used the non-linux value, which means pip is only dockerized when the build runs in a non-Linux environment.

custom:
  pythonRequirements:
    dockerizePip: non-linux

That is about it for this post. The code for this post can be found on my Github repo.

Thank you for reading, and do leave a comment with any suggestions or improvements if you feel like it.

Have a great week ahead!

References

[1] Python Machine Learning By Example — Yuxi (Hayden) Liu
