“With so much data available, how can busy users find the right bits of information? For example, a typical search on an Internet search engine returns thousands of results. A user has to sieve through articles after articles until he finds what he needs,” explains Lim Chee Kiam (pictured), Senior Solution Architect at NLB. “Instead of having users repeat the tedious search and sieve process, we should push the most relevant information packages to them. To do this, we must connect our content.”
Connecting Structured Data
The group of 25 public libraries and one National Library houses over a million physical titles, which generate over 30 million loans a year. Using data mining techniques on past loan transactions and bibliographic records, the library connected its titles and in 2009 launched a title recommendation service on its websites and portals.
“Besides showing the book that you’ve searched for, a section on the side shows you a list of books that other patrons who have borrowed this book also borrowed. Collaborative filtering mines the reading patterns within the loan records of the last three years to make recommendations,” he elaborated.
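The co-occurrence idea behind “patrons who borrowed this book also borrowed” can be sketched in a few lines. This is a minimal illustration, not NLB’s implementation; the titles and loan histories are invented:

```python
from collections import Counter
from itertools import combinations

# Illustrative loan history: each entry is the set of titles one patron borrowed.
patron_loans = [
    ["Moby Dick", "Treasure Island", "Kidnapped"],
    ["Moby Dick", "Treasure Island"],
    ["Treasure Island", "Kidnapped"],
    ["Moby Dick", "Dracula"],
]

# Count how often each pair of titles appears in the same patron's history.
co_borrowed = Counter()
for titles in patron_loans:
    for a, b in combinations(sorted(set(titles)), 2):
        co_borrowed[(a, b)] += 1
        co_borrowed[(b, a)] += 1

def also_borrowed(title, top_n=3):
    """Titles most often borrowed by patrons who also borrowed `title`."""
    scores = {b: n for (a, b), n in co_borrowed.items() if a == title}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

A production system would work at the scale of tens of millions of loan records and normalise the raw counts, but the ranking principle is the same.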
The system also relies on content-based filtering of bibliographic records to generate another list of recommended books, under ‘similar titles you may also enjoy’.
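Content-based filtering compares the items themselves rather than borrowing behaviour. One simple way to do this with bibliographic records is to score the overlap of subject headings; the records and the Jaccard measure below are illustrative assumptions, not NLB’s actual metadata or scoring:

```python
# Illustrative bibliographic records: title -> set of subject headings.
records = {
    "Moby Dick": {"whaling", "sea stories", "fiction"},
    "Treasure Island": {"pirates", "sea stories", "fiction"},
    "A History of Whaling": {"whaling", "maritime history"},
}

def jaccard(a, b):
    """Overlap between two sets of subject headings (0 to 1)."""
    return len(a & b) / len(a | b)

def similar_titles(title, top_n=2):
    """Rank the other titles by metadata similarity to `title`."""
    target = records[title]
    scores = {t: jaccard(target, s) for t, s in records.items() if t != title}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Because this approach needs no loan history, it can recommend titles that have never been borrowed, which complements the collaborative-filtering list.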
Because fiction titles are more frequently loaned, the system can generate recommendations for 89 per cent of the fiction titles, compared to only 53 per cent of non-fiction titles.
NLB is currently working on title recommendations on new arrivals. Once rolled out, when a patron is looking at a particular title, he or she will be able to see if there are related new arrivals of interest.
Connecting Unstructured Data
Unstructured data makes up a huge and growing portion of the content that NLB holds. The board has successfully applied text analytics to Infopedia (a microsite with fewer than 2,000 articles), the Singapore Memory portal, and 58,000 newspaper articles on NewspaperSG.
“The results were very promising. Interestingly, when we organise the recommended articles in a chronological order, we can discover the progression of an event and see how the story unfolds,” said Lim.
Lim’s team is now applying text analytics to the above collections and 6 million newspaper articles. “This gave us our first real scalability issue. The processing ran for more than a week before we ran out of disk storage,” he recounted.
Processing the older issues of the newspapers surfaced another challenge. Newspaper microfilms were digitised using Optical Character Recognition (OCR) software, but errors were common. These errors introduced ‘noise’ into the data set and significantly increased the complexity of the computation, leading to lengthy processing and the need for huge amounts of intermediate disk storage.
“To address this issue, we tuned the parameters for the text analytics algorithm to ignore infrequent word tokens. We also set up a full Apache Hadoop cluster with 13 virtual servers on three virtual machine hosts so that we have a reliable, scalable and distributed computing platform,” continued Lim.
While the team has successfully reduced the time needed to process the data, they are still working towards processing the 6 million articles.
Looking ahead, Lim hopes to enrich the content with semantic information so that content becomes connected semantically instead of just textually. He also wants to enrich the content with language translations to explore the possibilities of connecting content in different languages, particularly useful given that Singapore has four official languages.