We were approached by the owner of Skoosh.com with the following problem: a very large travel booking company receives feeds of available hotel rooms, discounts, and travel opportunities from multiple international data sources. The incoming data is highly unstructured and arrives in many different formats, which requires every entry to be reviewed. To automate this process, we created a system for data disambiguation using machine learning algorithms. In this case study we examine in detail the problems our client was facing and how our solution dramatically improved the productivity of the Skoosh system.

Project overview

The major difficulty of working with such feeds was that the incoming items contained numerous errors: incorrectly entered hotel addresses, wrong telephone contact numbers, wrongly transliterated hotel names, and missing features such as web site addresses, email addresses, hotel amenities, post codes, etc. Since each item from each travel agent has to be linked to a real physical entity in order to match the search criteria of online customers, every single item in the database had to be reviewed manually and linked to a real-world hotel. When we were approached, the database contained about 70 000 different entities. The trained manual operators were spending about one year to review roughly 50% of this data, which meant missing business opportunities by not offering results from the other 50% of the purchased data, and the observed human error rate was about 10%. A number of professionals had been attempting to design a solution for more than a year, and the gains they achieved were disproportionately small compared to the time and funds needed to reach them. The project we undertook aimed to completely automate the clustering of the available hotels while keeping the error rate at or below the observed human error rate.

The results

Within two weeks we had a demo that already surpassed the results achieved by the specialists who had previously worked on the task for more than a year. Within 30 days we had designed a solution that reached an accuracy of more than 98% and an exhaustiveness of 100%, and that required only about 72 hours of fully automated computation to process the entire data set. Thanks to this transformation of the review process, the company has been able to contract many additional data sources over the past four years and currently operates with almost 1 000 000 items in its database. The product pool grew more than ten times, while the automation reduced the manpower needed to maintain the data from one year for every 70 000 items to about three days for every 100 000 items. The preprocessing phase stopped being the bottleneck limiting the growth of the company, which can now expand its product portfolio by contracting new data sources at a virtually unlimited rate.

Technological overview

A number of machine learning algorithms were used during the different stages of the product development. The automated disambiguation and clustering process was divided into three phases: data normalization, item clustering, and quality post-processing of the data.
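To make that structure concrete, here is a minimal, purely illustrative sketch of how the three phases fit together as a pipeline, assuming a Python implementation; the function names, the toy record format, and the trivial phase bodies are placeholders for this example, not the production code.

```python
# Illustrative three-phase pipeline: each phase consumes the output of the
# previous one. The phase bodies below are deliberately trivial stand-ins.

def normalize(records):
    # Phase 1: bring every vendor record into one canonical shape.
    return [{k: str(v).strip().lower() for k, v in r.items()} for r in records]

def cluster(records):
    # Phase 2: group records that appear to describe the same physical hotel
    # (here naively keyed on the normalized name).
    groups = {}
    for r in records:
        groups.setdefault(r.get("name", ""), []).append(r)
    return list(groups.values())

def post_process(groups):
    # Phase 3: pass the formed groups on for quality review.
    return groups

def run_pipeline(raw_records):
    return post_process(cluster(normalize(raw_records)))

if __name__ == "__main__":
    feed = [{"name": " Hotel Astoria ", "city": "Paris"},
            {"name": "hotel astoria", "city": "Paris"}]
    print(run_pipeline(feed))
```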

Data normalization:

Because the data came from multiple sources, each vendor supplied it in a different format. We had to design a system that would normalize the data so that every item could be represented with an identical encoding. The identical encoding was needed so that we could then implement comparison and distance measures between different items in the data. A simple compression algorithm was designed to minimize the size of the data when loaded into operational memory. The textual data was converted to UTF-8 encoded streams regardless of the incoming encoding, and all nominal values were further compressed by removing features that were major sources of error. A pattern-detecting algorithm was run against the data to find patterns that led to data corruption and to remove them. One such pattern was the digit 0 in phone numbers: preliminary results showed that removing all zeroes from the phone numbers produced clusters with smaller intra-cluster distances, making the resulting groups more compact and their borders more distinct. The result of the data normalization phase was that we could accept any source of data in any format and convert it to a well-formatted record. The entire data set was then vectorized and used as input for the item clustering phase.
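As an illustration of this phase, here is a minimal sketch, assuming a Python implementation, of two such normalization steps: converting incoming text into a single accent-stripped, UTF-8 friendly representation, and stripping the error-prone zeroes from phone numbers. The function names and the fallback encoding are assumptions made for the example, not the production code.

```python
import unicodedata

def normalize_phone(raw: str) -> str:
    # Keep digits only and, as the pattern analysis suggested, drop the zeroes
    # that turned out to be a major source of mismatches (e.g. optional trunk
    # prefixes written inconsistently across vendors).
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits.replace("0", "")

def normalize_text(raw: bytes, encoding: str = "latin-1") -> str:
    # Decode an incoming byte stream into an accent-stripped, lower-cased
    # string so that every vendor's records share one representation.
    text = raw.decode(encoding, errors="replace")
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(text.lower().split())

print(normalize_phone("+44 (0)20 7946 0018"))                   # -> "442794618"
print(normalize_text("H\xf4tel \xc9toile".encode("latin-1")))   # -> "hotel etoile"
```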

Item clustering:

A combination of distance measures was used to form a secondary vector of features for each record. The first distance measure, calculated for some of the item features, was a heavily customized version of the Levenshtein distance. Because of the numerous typing errors, wrong transliterations, inconsistent word order, and international hotel name and address formats, this customized measure allowed us to examine the similarity of seemingly unconnected items. Once the distance measure had been calculated for the textual features, we designed a feedforward neural network to find efficient distance threshold values, which would later be needed by the actual clustering algorithms. Another benefit of applying thresholds to the distances was that they had to be estimated only once and could then be hard-coded where needed, reducing the computational complexity of the task by a factor of two. We were also able to separate all outliers that fell outside the ranges defined by the upper and lower thresholds. Having the outliers separated meant that we could use another learning model to work only with them. Our initial assumption that the default classifier would not handle them well proved correct, and this led to further accuracy gains.
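The exact customizations of the distance measure were specific to the project, but the following sketch, assuming a Python implementation, shows the general idea: a textbook Levenshtein distance combined with token sorting, so that differences in word order stop registering as edits. A tuned threshold on the resulting similarity would then play the role of the hard-coded thresholds mentioned above.

```python
def levenshtein(a: str, b: str) -> int:
    # Plain edit distance; the production measure was a heavily customized
    # variant, so treat this as the textbook baseline only.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def name_similarity(a: str, b: str) -> float:
    # Word-order-insensitive similarity in [0, 1]: sorting the tokens first
    # lets "Hotel Astoria Paris" and "Paris Astoria Hotel" compare as equal.
    a_key = " ".join(sorted(a.lower().split()))
    b_key = " ".join(sorted(b.lower().split()))
    dist = levenshtein(a_key, b_key)
    return 1.0 - dist / max(len(a_key), len(b_key), 1)

print(name_similarity("Hotel Astoria Paris", "Paris Astoria Hotel"))  # -> 1.0
print(name_similarity("Astorea Htel", "Hotel Astoria"))               # high, despite typos
```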

We used a variation of the general nearest neighbour search (NNS) algorithm. Our preliminary tests showed that, after proper pattern extraction, the k-nearest neighbours (k-NN) algorithm achieved the results least sensitive to noise in the data. Using the k-NN algorithm, we were able to form well-defined clusters of items with distinct borders. Since each candidate object is classified by a majority vote of its neighbours, we were also able to locate features, shared by all members of a group, that produce strong data separation, such as the city in which the item is recorded. After further consultation with the booking company, we were informed that, in their experience, the city is a feature that is never wrong, so we were able to further reduce the computational complexity of the task by using it as a hard-coded divisor of the data.
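The sketch below illustrates the city-as-divisor idea, assuming a Python implementation; the stand-in similarity function, the threshold value, and the simple nearest-group assignment are simplifications of the actual k-NN majority vote, chosen only to keep the example self-contained.

```python
from collections import defaultdict
from difflib import SequenceMatcher

def sim(a: str, b: str) -> float:
    # Stand-in for the customized distance from the previous phase:
    # token-sorted similarity in [0, 1].
    key = lambda s: " ".join(sorted(s.lower().split()))
    return SequenceMatcher(None, key(a), key(b)).ratio()

def cluster_by_city(records, threshold=0.85):
    # The city acts as a hard divisor (per the client, it is effectively never
    # wrong); within each city a record joins the group containing its most
    # similar existing member if that similarity clears the threshold, and
    # otherwise starts a new group.
    by_city = defaultdict(list)
    for rec in records:
        by_city[rec["city"]].append(rec)

    clusters = []
    for city_records in by_city.values():
        city_clusters = []
        for rec in city_records:
            best, best_sim = None, 0.0
            for group in city_clusters:
                s = max(sim(rec["name"], other["name"]) for other in group)
                if s > best_sim:
                    best, best_sim = group, s
            if best is not None and best_sim >= threshold:
                best.append(rec)
            else:
                city_clusters.append([rec])
        clusters.extend(city_clusters)
    return clusters

records = [
    {"name": "Astoria Hotel", "city": "Paris"},
    {"name": "Hotel Astoria", "city": "Paris"},
    {"name": "Astoria Hotel", "city": "Rome"},
]
print(cluster_by_city(records))  # the two Paris records group together; Rome stays apart
```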

We spotted the formation of a considerable percentage of loose clusters, and our analysis showed that these clusters should actually be merged with other existing groups. The algorithms were able to detect and form groups based on patterns we called “similarity leaks”. One such pattern forms when a hotel is sold from one hotel chain to another, or when a stand-alone hotel changes its official name. Here is an example of one such real-world event which the algorithms were able to pick up perfectly: in 2005, Blackstone Real Estate Partners sold the AmeriSuites chain of hotels to Hyatt. Since not all data vendors maintain their in-house records well, there were records still listing the hotels under the AmeriSuites name four years after the change of ownership. The hotel chain is often included in the name of the hotel, and these events formed patterns which even the human operators could not account for without extensive research into the history of each hotel. Because the manual review was already a cumbersome process and performing further checks against external sources was not feasible for every hotel, business opportunities were missed and a considerable percentage of the available discounts and offers could not be shown to the customers. Since our system detected these “similarity leaks” automatically, we were able to design a solution that formed additional closures under a new set of distance functions accounting for the leaks. The closures were formed through a recursive process, which increased the computational complexity, but the long-term business value far outweighed the added processing cost of a few extra hours of computation.
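One way to realize such closures, sketched here under the assumption of a Python implementation, is to treat every pairwise link produced by any of the distance functions, ordinary or leak-aware, as an edge and take the transitive closure with a union-find structure; this is an illustration of the closure idea, not the recursive procedure actually used.

```python
class DisjointSet:
    # Union-find: path-halving find plus naive union is enough for a sketch.
    def __init__(self, n):
        self.parent = list(range(n))

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def close_over_links(n_items, links):
    # 'links' holds (i, j) index pairs produced by any of the distance
    # functions; the closure merges everything they connect, directly or
    # transitively.
    ds = DisjointSet(n_items)
    for i, j in links:
        ds.union(i, j)
    groups = {}
    for i in range(n_items):
        groups.setdefault(ds.find(i), []).append(i)
    return list(groups.values())

# Items 0 and 1 are linked by the ordinary name distance; 1 and 2 only by a
# "similarity leak" rule (e.g. an old AmeriSuites branding vs. the current
# Hyatt name). The closure still puts 0, 1 and 2 into one group.
print(close_over_links(4, [(0, 1), (1, 2)]))  # -> [[0, 1, 2], [3]]
```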

The designed system was able to detect many different “similarity leaks” present in the data. Changes of street names, hotel name changes, alternative contact details, and missing or overly generalized descriptions of some hotel features were only a few of the problems we were able to account for successfully. This resulted in an accuracy and exhaustiveness of the results that went beyond what we had initially believed possible.

Quality post-processing of the data:

The short time now needed to process a new data source, and the fact that it was a completely automated procedure, meant that the freed-up human operators could perform a secondary pass over the already grouped data. We designed an interface that streamlined the process so much that a full review of 100 000 grouped items could be performed by a few operators within two to three weeks. This stage allowed the human operators to fix groups formed incorrectly by the automated algorithms. Thanks to the high accuracy of the automated system, these operators acted on less than 2% of the formed groups.

We came up with an idea that would allow us to automate this process as well. We started recording the actions of the human operators during the post-processing review of the groups, and after a few cycles we were able to design a system that learned from their activities. The purpose of the new system was to learn from the actions of the human operators so that it could replicate their decision-making process. The major difference between the previous two tasks and this one was that the latter was data independent: the new system was based on the actions of the people working with the interface and on the environment within which those actions were taken. That environment comprised features such as the size of the groups, the proximity measures extracted during the earlier phases, and the chronological changes each group had experienced since we started logging them.
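A minimal sketch of what one such logged observation might look like, assuming a Python implementation; the field names and the example action labels are placeholders rather than the production schema.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GroupState:
    # Snapshot of the "environment" seen for one group: its size, proximity
    # measures carried over from the clustering phase, and how the group has
    # changed since logging began.
    group_id: str
    size: int
    min_pairwise_similarity: float
    mean_pairwise_similarity: float
    members_added_since_logging: int
    members_removed_since_logging: int

@dataclass
class OperatorAction:
    # One logged decision taken by a human operator on a group.
    state: GroupState
    action: str  # e.g. "approve", "split", "merge"

training_log: List[OperatorAction] = [
    OperatorAction(GroupState("g-17", 4, 0.91, 0.96, 1, 0), "approve"),
    OperatorAction(GroupState("g-42", 9, 0.42, 0.71, 3, 1), "split"),
]
```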

We created a method that allowed the automated generation of a family of 12 000 tree-based learning agents. Each agent was allowed to make simple logical decisions and basic calculations: comparisons for equality, summations, divisions, multiplications, less-than or greater-than checks, etc. The agents were also allowed to accept random numerical values at some of their nodes, introducing fuzziness through noise in the learning data. The genetic programming approach consisted of the following steps (a condensed sketch of the loop is shown after the list):

  • Step 1: Initialization. Creation of the entire population of learning agents. The agents were given completely random decision nodes.
  • Step 2: Fitness function. We designed a fitness function for the learning agents based on a combination of weighted precision and recall metrics and on each agent's performance against a subset of groups dynamically selected after each generation.
  • Step 3: Selection and reproduction. We implemented probability-governed survival rates for the learning agents: the fitter an agent was, the higher its probability of survival. The same probabilities governed the rate at which each agent was allowed to reproduce and the selection of agents that became parents of the next generation. The new population was generated through random mutation and crossover (recombination of decision nodes between the parent learning agents). Since the best agents were selected for breeding more often, the overall fitness of the population tended to increase over time.
  • Step 4: Termination. We optimistically set the termination criterion to zero errors and allowed the system to run for close to one million generations, monitoring the convergence of the accuracy achieved by the top 1% of the learning agents.
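The following condensed sketch, assuming a Python implementation, shows the shape of this loop: random expression trees over the allowed primitives, fitness-proportional survival and breeding, crossover, and mutation. The toy features, labels, population size, and fitness function are placeholders for illustration, not the production setup of 12 000 agents scored on weighted precision and recall.

```python
import random

OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if abs(b) > 1e-9 else 0.0,
    ">": lambda a, b: 1.0 if a > b else 0.0,
    "<": lambda a, b: 1.0 if a < b else 0.0,
}
FEATURES = ["size", "min_sim", "mean_sim"]  # toy group features

def random_tree(depth=3):
    # A terminal is either a feature name or a random constant (the injected
    # noise); an internal node applies one of the allowed primitives.
    if depth == 0 or random.random() < 0.3:
        return random.choice(FEATURES) if random.random() < 0.6 else random.uniform(-1.0, 1.0)
    return (random.choice(list(OPS)), random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, features):
    if isinstance(tree, str):
        return features[tree]
    if isinstance(tree, float):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, features), evaluate(right, features))

def fitness(tree, dataset):
    # Fraction of groups whose approve/split label the agent reproduces
    # (an output greater than zero is read as "approve").
    return sum((evaluate(tree, x) > 0) == y for x, y in dataset) / len(dataset)

def crossover(a, b):
    # Graft the second parent into a random position inside the first.
    if isinstance(a, tuple) and random.random() < 0.7:
        op, left, right = a
        if random.random() < 0.5:
            return (op, crossover(left, b), right)
        return (op, left, crossover(right, b))
    return b

def mutate(tree, rate=0.1):
    if random.random() < rate:
        return random_tree(2)
    if isinstance(tree, tuple):
        op, left, right = tree
        return (op, mutate(left, rate), mutate(right, rate))
    return tree

def evolve(dataset, pop_size=200, generations=50):
    population = [random_tree() for _ in range(pop_size)]
    for _ in range(generations):
        scored = [(fitness(t, dataset), t) for t in population]
        scored.sort(key=lambda ft: ft[0], reverse=True)
        weights = [f + 1e-6 for f, _ in scored]      # fitter agents breed more often
        parents = [t for _, t in scored]
        population = [mutate(crossover(*random.choices(parents, weights, k=2)))
                      for _ in range(pop_size)]
        population[0] = scored[0][1]                 # elitism: keep the current best agent
    return max(population, key=lambda t: fitness(t, dataset))

# Toy labeled data: "approve" (True) when the group looks tight, "split" otherwise.
data = [({"size": float(s), "min_sim": m, "mean_sim": (m + 1.0) / 2.0}, m > 0.6)
        for s in range(2, 8) for m in (0.3, 0.5, 0.7, 0.9)]
best = evolve(data)
print("best agent fitness:", fitness(best, data))
```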

The top-performing learning agent selected for the automated review system was a tree 47 layers deep. When we applied its reasoning to the automatically generated groups, it was able to process about 87% of them within a few seconds, with an accuracy of 99.98%. To test it further, we obtained 6 000 groups from a company that sells manually processed records; these records are regularly checked and updated and are widely used by the industry, which means that any errors noticed by users are reported back to this authority and fixed on their end. Our algorithms flagged only 6 of these groups as ambiguous, and a manual review showed that those six groups had been mistakenly combined by the data vendor.

The results achieved by the automated review system within a few seconds lowered the human post-processing review times by about 87%, and the achieved accuracy of 99.98% meant that it outperformed the decision-making power of well-trained human operators.