Dr. Hua Yan is an expert in search techniques, specifically as a means of extracting useful information from very large groups of text sources.
Yan has dedicated a great deal of time and study to a specific search technique known as latent semantic indexing, which is often shortened to LSI.
Yan has authored and published a number of research papers on LSI and related topics, and we will be covering two of these papers later.
Advanced search techniques have been in use for years, with different techniques being applied across various search engines and systems depending on a variety of parameters.
For the layperson, understanding of text-based searches begins and ends with internet search engines such as Google and Bing, but there are many more applications for text-based searches, especially in academic study and scientific research.
While the techniques utilized by the major internet search engines are highly effective for sorting through extremely large amounts of data, these engines also receive assistance from search engine optimization, or SEO, which has become standard practice in online publishing.
These engines work well for many searches, but the techniques being used behind the scenes are far from perfect, and this imperfection can easily result in poor search results, as Yan describes here:
“Very often, we put a few keywords into a search engine such as Google and may find that the top five or ten search results returned are not exactly relevant or helpful. So we scroll down and try out the next ten or twenty entries, or even more. This process of a ‘search after the search’ is quite frustrating.”
This represents a fundamental failure of the well-known search engines: they need to be reliable all the time, not just most of the time.
Yan explained that the core issue here is the intelligence of the search being performed, not speed or efficiency. Yan’s work in LSI aims to improve the intelligence of searches, and there’s potential for these more intelligent techniques to be integrated into internet search engines in the future.
“The current search engines, although fast enough, are not intelligent enough to return the most relevant results. LSI techniques in general, and SVR [singular value rescaling] in particular, are aiming to improve in this area. With LSI advancement incorporated into internet search engines, the average person can expect a better, more helpful, and more accurate search engine in the future.”
Beyond internet search engines, these techniques could be highly useful for more specialized, professional searches for the purposes of research and study.
The remainder of the article will act as a guide to the relative value of LSI in comparison to other common search techniques, as well as an overview of Yan’s own modification to LSI that has the potential to improve LSI even further.
What is LSI and why is it important?
First, a breakdown of LSI. Perhaps the best way to do that is to explain each individual term.
‘Indexing’ refers to the use of matrices and vector computing. This computing is used to extract numerical values that represent previously hidden (latent) relevance or meaning (semantics), based on the user’s query words.
“Subsequently, these values are used to produce the most relevant matches among the provided pool of texts. It’s similar to an internet search, with the major difference lying in pool sizes and accuracy.”
Elaborating on that difference, internet searches typically draw on a much larger pool, while LSI searches frequently produce more relevant and more satisfying results for the user.
This makes LSI the superior choice when searching smaller pools of texts, but it has not yet been successfully adapted to work with larger pools of sources.
However, Yan has researched and proposed a modification to LSI that improves its effectiveness even further, and this modification is called singular value rescaling, or SVR.
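The indexing and matching steps described in this section can be sketched in a few lines of Python. This is a minimal illustration of conventional LSI, not Yan's implementation: the raw word-count weighting, the tiny five-document pool, the choice of two latent dimensions, and the helper names `build_term_document_matrix` and `lsi_rank` are all assumptions made for the example.

```python
import numpy as np

def build_term_document_matrix(docs):
    """Rows are terms, columns are documents, entries are raw word counts."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.lower().split():
            A[index[w], j] += 1.0
    return A, index

def lsi_rank(docs, query, k=2):
    """Rank documents against a query in a k-dimensional latent space."""
    A, index = build_term_document_matrix(docs)
    # Singular value decomposition, truncated to the k largest singular values.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]

    # Build the query as a term-count vector, ignoring unknown words.
    q = np.zeros(A.shape[0])
    for w in query.lower().split():
        if w in index:
            q[index[w]] += 1.0

    # Fold the query into the latent space, then compare by cosine similarity.
    q_hat = (U_k.T @ q) / s_k
    doc_vecs = Vt_k.T  # one row per document
    denom = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_hat)
    scores = doc_vecs @ q_hat / np.where(denom == 0, 1.0, denom)
    order = np.argsort(-scores)
    return [(int(i), float(scores[i])) for i in order]

# Hypothetical document pool, used only for illustration.
docs = [
    "car engine repair",
    "automobile engine repair",
    "engine repair manual",
    "apple banana fruit",
    "banana fruit smoothie",
]

for doc_id, score in lsi_rank(docs, "car", k=2):
    print(doc_id, round(score, 3), docs[doc_id])
```

In this toy pool, the query "car" ranks the "automobile" document alongside the document that actually contains the word "car", because both share the latent "engine repair" topic. That is the hidden relevance a plain keyword match would miss.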
Singular value rescaling
In Yan’s dissertation, “Techniques For Improved LSI Text Retrieval,” he proposed and studied the novel technique of singular value rescaling (SVR).
Yan tested SVR in experimental environments and with standardized data sets. These tests showed that SVR outperforms conventional LSI query methods.
Yan continued his study of SVR in the journal paper “Augmenting the power of LSI in text retrieval: Singular value rescaling.”
In this paper, Yan found that SVR had an improvement ratio of 5.9% over the leading conventional LSI query method. He also compared SVR to another scaling technique called iterative residual rescaling, or IRR. SVR performed better than IRR as well.
But what is SVR and why is it so effective compared to the traditional usage of LSI?
Other LSI techniques treat the computed set of singular values as fixed. But in SVR, these computed values then undergo a transformative process in which they are subject to rescaling.
Each instance of this rescaling produces a new set of transformed values, which are then subject to re-evaluation.
From there, the re-evaluation identifies the set of rescaled values with the highest score. In practice, this method can be used to identify a small group of especially valuable finds within a large pool of sources.
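The rescale, re-evaluate, and select loop described above can be sketched as follows. The article does not give the exact rescaling transform or the scoring measure, so both are assumptions here: the sketch raises each singular value to a candidate exponent `alpha` and scores each candidate by average precision against a known set of relevant documents. The function names and the small matrix are hypothetical as well.

```python
import numpy as np

def rescaled_scores(U_k, s_k, Vt_k, q, alpha):
    """Cosine scores after rescaling the singular values.

    Assumption: rescaling is a power law, s -> s ** alpha.
    alpha = 1.0 leaves the singular values unchanged.
    """
    s_scaled = s_k ** alpha
    doc_vecs = Vt_k.T * s_scaled  # one row per document, weighted per dimension
    q_hat = U_k.T @ q             # query projected onto the latent dimensions
    denom = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_hat)
    return doc_vecs @ q_hat / np.where(denom == 0, 1.0, denom)

def average_precision(scores, relevant):
    """Assumed evaluation measure: average precision over the ranked list."""
    order = np.argsort(-scores)
    hits, total = 0, 0.0
    for rank, doc in enumerate(order, start=1):
        if int(doc) in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def pick_rescaling(A, q, relevant, k, alphas):
    """Try each candidate rescaling and keep the best-scoring one."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # computed once
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    best_alpha, best_ap = None, -1.0
    for alpha in alphas:
        ap = average_precision(rescaled_scores(U_k, s_k, Vt_k, q, alpha), relevant)
        if ap > best_ap:
            best_alpha, best_ap = alpha, ap
    return best_alpha, best_ap

# Hypothetical 5-term x 4-document count matrix and query.
A = np.array([
    [1, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 0, 0, 1],
], dtype=float)
q = np.array([1, 1, 0, 0, 0], dtype=float)

alpha, ap = pick_rescaling(A, q, relevant={0, 1}, k=2, alphas=[0.5, 1.0, 1.5, 2.0])
print("best alpha:", alpha, "average precision:", round(ap, 3))
```

In this sketch the decomposition is computed once, and each candidate only rescales `k` numbers, so a single query costs about the same with or without rescaling.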
The cherry on top is that, compared to other LSI methods, SVR takes the same amount of computing time but produces far better results.
“This finding bears the practical significance that the current information retrieval techniques may be significantly improved by simply adopting a novel query method that’s computationally as efficient as the best standard query method but with an improved result. In other words, it’s the same cost with better output.”
In the past, computational ‘cost’ has been a major limiter for specific search techniques. A high computational cost, whether of resources or computing time, could easily make a particular search technique either impractical or impossible for certain use cases.
This is why SVR represents a groundbreaking shift in search techniques. It makes LSI a far more practical option for many more use cases, and it also points to a potential path by which the power and efficiency of LSI could be made to work well within internet search engines.
But in the meantime, it’s not as if LSI isn’t being used at all. It definitely is, but it’s being used for applications where it makes the most sense.
SVR adds to that list of applications in a big way.
Search engines, whether they’re popular internet search engines with millions of users or bespoke search engines for professionals and specialists, offer a range of utility and efficiency.
Current search techniques in use by major search engines are practical and effective for many use cases, but they fail in specific instances, especially in terms of accuracy, which highlights the need for substantial improvement.
Historically, LSI has proven to be superior to other search techniques in several key areas, but the computational cost associated with LSI also limits its number of practical applications.
Dr. Hua Yan has proposed a modification to LSI known as SVR, and during experiments, SVR has outperformed other LSI search methods. These findings have been published in two works: the dissertation “Techniques For Improved LSI Text Retrieval” and the journal paper “Augmenting the power of LSI in text retrieval: Singular value rescaling.”
SVR has the potential to bridge the gap between hyper-efficient LSI and the large-capacity but flawed techniques currently being used by major search engines.
Special thanks to Dr. Hua Yan for leading this fascinating look at LSI, SVR, and search techniques in a more general sense as well.