How We Used ML While Building (Yet Another) Remote Job Board

It was a nice day at the end of 2020 when we suddenly decided to create another aggregator for remote vacancies, exclusively for IT positions. It would be logical to ask why build another one when there are already plenty on the market. The answer is straightforward: we saw how to improve on existing solutions in at least five ways:

  • Quantity: to aggregate more remote vacancies than anyone else in the world;
  • “Really” remote vacancies: not just “remote until COVID-19”;
  • Relevance: similar sites often list a large number of irrelevant vacancies;
  • Power of the search engine (in my opinion, search on the current remote job sites is stuck at the level of 2005);
  • Filter by citizenship.

In fact, it is the last of these that I want to talk about today.

Problem

Anyone who has ever searched for a remote job knows that companies often offer remote work, but only to citizens of certain countries.

Most of the time, there is no separate field on job description pages where such restrictions are displayed, and there is no search or filter for them. The applicant therefore has to read the text of each vacancy carefully to understand whether it makes sense to apply or whether they would definitely be rejected based on citizenship.

We decided to solve this problem: show users only those vacancies they can actually apply for, given their citizenship.

Analysis

At first, we thought we could solve this problem with simple algorithmic methods. The basic idea was:

Step #1

We look for certain keywords in the text, for example: “only”, “remote in”, “authorized to work in”, and so on.

Step #2

We look for a “location” next to the keyword, which, as a rule, is a capitalized word. If such a location is found, we treat it as a restriction.
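
Roughly, this heuristic can be sketched in Python like so (the keyword list and the regex are illustrative simplifications, not the production logic):

```python
import re

# Illustrative keyword list; the real set was larger.
KEYWORDS = ["only", "remote in", "authorized to work in"]

def find_restriction(text):
    """Return a capitalized word found right next to a restriction keyword, if any."""
    for kw in KEYWORDS:
        kw_re = re.escape(kw)
        # A capitalized word just before the keyword ("USA only")
        # or just after it ("remote in Canada").
        match = re.search(
            rf"([A-Z][A-Za-z]+)\s+{kw_re}\b|\b{kw_re}\s+(?:the\s+)?([A-Z][A-Za-z]+)",
            text,
        )
        if match:
            return match.group(1) or match.group(2)
    return None

print(find_restriction("This position is USA only."))  # -> USA
print(find_restriction("100% remote in Canada."))      # -> Canada
```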

If the vacancy says “USA only”, this logic works perfectly. However, after analyzing only about 500 vacancies, it became clear that restrictions can be worded quite differently, for example:

  • This role is remote, and you can be based anywhere across the UK.
  • Living in Europe is a must.
  • This opportunity is only open to candidates within Canada at this time.
  • Location: Argentina (any part of the country it’s great for us!)
  • And hundreds of other variations.

It became clear that simple algorithms could not handle the problem, so we decided to try the power of ML.

Task

Just in case, let me state the problem again. The input is a text describing the vacancy, which usually contains a company description, a technology stack, requirements, conditions, benefits, etc. The output should contain the following parameters:

restriction: 0 (no) / 1 (yes)

if restriction = 1, we also need to extract the country the restriction applies to
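
To pin the contract down, the output can be thought of as a small record; here it is as a Python dataclass (the type and field names are mine, not from the project):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RestrictionResult:
    """Per-vacancy output of the pipeline (hypothetical names)."""
    restriction: bool              # True if the vacancy is limited to some location
    country: Optional[str] = None  # filled in only when restriction is True
```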

Solution Structure

As I wrote above, the input is a large text that usually contains a bit of everything, so the task was somewhat harder than just writing a regular classifier. First, we had to find what exactly to classify.

Since we were looking for location restrictions, we decided to first find all locations in the text, then select all sentences containing those locations, and write a classifier for those sentences.

Finding Locations

Here, too, we first tried to solve the problem “head-on”: compile a list of all countries and cities and simply search for their occurrences in the text. But again, it was not that easy.

First, the restrictions applied not only to countries and world capitals but also to small cities and states (for example, “Can work full time in Eugene, OR / Hammond, IN”), and compiling a list of every city in the world was hard enough.

Second, locations in vacancies were often written in non-standard ways (for example, “100% Remote in LATAM”).

Therefore, we decided to use NER (named entity recognition) to extract locations. We tried several existing tools, and the choice fell on spaCy: its EntityRecognizer showed the best result among the ready-made, free options.
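
The post doesn't show the extraction code itself; a minimal sketch with a pretrained spaCy pipeline might look like this (the model name en_core_web_sm is my assumption; any English pipeline with an NER component would do):

```python
import spacy

# Any pretrained English pipeline with an NER component works here;
# en_core_web_sm is the smallest.
nlp = spacy.load("en_core_web_sm")

def extract_locations(text):
    """Return location-like entities: GPE covers countries, cities, and states."""
    doc = nlp(text)
    return [ent.text for ent in doc.ents if ent.label_ in ("GPE", "LOC")]

print(extract_locations("This role is remote, and you can be based anywhere across the UK."))
# e.g. ['UK']
```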

Bottom line: we were able to extract locations from the text.

Splitting Into Sentences

We also used spaCy to split the vacancy text into sentences and keep the sentences with locations inside them; the output was a list of such sentences (a sketch follows the examples below). Here are some examples:

  • The position is remote, so the only thing is they have to be in the US and work Eastern or Central time.
  • This job is located out of our Chicago office, but remote, US-based applicants are still encouraged to apply.
  • This is a remote role, but we’re looking for candidates based in Montreal, Canada.
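
Under the same assumptions as the NER sketch above (spaCy's default sentence segmenter and entity labels), this step might look like:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def sentences_with_locations(text):
    """Split the text into sentences and keep those that mention a location entity."""
    doc = nlp(text)
    return [sent.text for sent in doc.sents
            if any(ent.label_ in ("GPE", "LOC") for ent in sent.ents)]

vacancy = ("We are a small fintech startup. This is a remote role, "
           "but we're looking for candidates based in Montreal, Canada.")
print(sentences_with_locations(vacancy))
# e.g. ["This is a remote role, but we're looking for candidates based in Montreal, Canada."]
```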

Classifier

The model's job was to label these sentences. One important constraint: we did not have the resources to build a dataset with tens of thousands of such sentences (that takes a lot of time), so we had to take this limitation into account when selecting a model.

We decided to try several models, from simpler CNNs and LSTMs to more modern transformers. The latter, predictably, turned out better, and their training essentially came down to fine-tuning, which suited us well because, as I said above, the dataset was not large.

Among the transformers, the RoBERTa architecture (roberta-base) showed the best result, with 94% accuracy on our dataset.
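
The post doesn't share the training setup; here is a minimal fine-tuning sketch with Hugging Face transformers (my assumption about the tooling; the two inline examples exist only to make the snippet self-contained, whereas the real input was the labeled sentences from the previous step):

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

# Toy data just to make the sketch runnable.
ds = Dataset.from_dict({
    "text": ["This opportunity is only open to candidates within Canada.",
             "You can work from anywhere in the world."],
    "label": [1, 0],  # 1 = restriction, 0 = no restriction
})
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True,
                                    padding="max_length", max_length=64),
            batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="restriction-clf",
                           num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=ds,
)
trainer.train()
```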

Normalizing Locations

From the classifier and NER, we obtained the following additional fields for each vacancy:

restriction: 1 (yes); location: London

The classifier gave us Restriction, and NER gave us Location. Since the Location field could contain different spellings of cities and countries, we added a normalization step through the Google API, and we decided to express restrictions at the country level.
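
The post only says “the Google API”; assuming the Google Geocoding API, the normalization step could look roughly like this (the function name and the country-picking logic are my sketch):

```python
import requests

GEOCODE_URL = "https://maps.googleapis.com/maps/api/geocode/json"

def normalize_to_country(location, api_key):
    """Resolve a raw location string such as 'London' to a country name."""
    resp = requests.get(GEOCODE_URL, params={"address": location, "key": api_key})
    resp.raise_for_status()
    results = resp.json().get("results")
    if not results:
        return None
    # The geocoder returns structured address components; pick the country one.
    for component in results[0]["address_components"]:
        if "country" in component["types"]:
            return component["long_name"]
    return None

# normalize_to_country("London", api_key="...")  -> "United Kingdom"
```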

So, the final output looked like this:

restriction: 1 (yes); location: United Kingdom

Summary

As a result, we now know how to do this, and candidates can filter out vacancies that are not suitable for them.

P.S. I didn’t want to promote the aggregator here, so I’ll just leave it as a reference.
