The short story:
In Distributional Semantics, word embeddings represent the meaning of a word through (observing) its contexts. Such embeddings have been extensively and successfully applied in machine learning models to address various phenomena and tasks in computational linguistics and NLP. However, there is a general assumption that embeddings encode relatively coarse-grained concept-level information (dogs and cats are more similar than dogs and beetles).
The topic of my ‘ongoing’ PhD thesis (with Prof. Dr. Sebastian Padó and Dr. Gemma Boleda) is analyzing to what extent fine-grained semantic properties and relations are represented in embeddings. The focus is on Named Entities which are generally associated with specific properties and relations (Italy has a population of about 60 million, Italy borders Croatia).
In the preliminary work, we extracted numeric relations from word embeddings to a reasonable degree of accuracy [Gupta et al. 2015]. As a natural extension, in the next step we focussed on predicting categorical relations and analysed various aspects which make these relations easy or difficult to predict [Gupta et al. 2017]. Thereafter, we carried out a contrastive analysis of named entities and their concepts (represented by common nouns) to see if there are any substantial differences in their distributional behaviour [Boleda et al. 2017].
To summarize, my work centers around an in-depth analysis of distributed word embeddings of Named Entities and their related concepts, to see how different and how similar they are in the distributional space they share. The findings of this work can potentially aid NLP applications that deal with structured information sources, such as question answering, knowledge-base completion, relation extraction, co-reference resolution and many others.
During my MS (with Prof. Rajeev Sangal), my primary area of interest was Dialogue Systems. I specialized in developing Natural Language Interfaces to Databases (NLIDB systems), which is a sub-domain of Dialogue Systems.
The (much) longer version:
Ongoing PhD work:
Sentences in corpora contain words of different types/categories which may include common nouns and, at times, named entities. While common nouns are typically observed within linguistic expressions which convey conceptual information, named entities on the other hand frequently occur within expressions which ground them to real-world information specific to such entities.
For example: Germany’s chancellor is Angela Merkel → Chancellor(Germany, Angela Merkel)
Germany has a population of 82 million and the average life-expectancy is 81 years → Population(Germany, 82 million) & Life_expectancy(Germany, 81 years)
Consequently, the word embeddings of common nouns would tend to contain more coarse-grained concept-like information and the embeddings of named entities would encode more fine-grained information. Thus, these embeddings would tend to differ in their interactions with other embeddings in various NLP tasks. For effective use of word embeddings, the machine learning models should be general enough to take such varied interactions into account. To this end, two research questions were raised: (i) do word embeddings of named entities encode fine-grained information? (ii) is there a difference between the embeddings of named entities and the concepts (referenced by a common noun) they are related to?
As a preliminary study, we conducted a proof-of-concept experiment to test our hypothesis that embeddings of named entities encode fine-grained information. We broadly categorize this fine-grained information as numeric relations, e.g. GDP(Germany, 2.9 trillion euros), and categorical relations, e.g. Capital(Germany, Berlin). Such relational information about entities can typically be found in structured information sources such as knowledge bases. Thus, our experiment involved predicting entities’ knowledge-base vectors (extracted from Freebase) from their distributed word embeddings through a supervised regression model. The results supported our hypothesis and showed that numeric relations of entities can be predicted quite accurately [Gupta et al. 2015].
- A. Gupta, G. Boleda, M. Baroni, S. Padó. Distributional vectors encode referential attributes.
In Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP 2015), Lisbon, Portugal.
- A. Gupta, G. Boleda, M. Baroni, S. Padó. Mapping conceptual features to referential properties.
Talk (by G. Boleda) at the 3rd International ESSENCE Workshop: Algorithms for Processing Meaning, May 2015, Barcelona, Spain.
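The regression setup described above can be sketched roughly as follows. This is only an illustration under stated assumptions: a plain linear least-squares map trained by gradient descent stands in for the supervised regression model, the data are synthetic rather than Freebase attributes, and all names (`predict`, `fit_linear_map`, `embeddings`, `attributes`) are hypothetical.

```python
import random

def predict(W, x):
    """Apply the linear map W (d_in x d_out) to one embedding x."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def fit_linear_map(X, Y, lr=0.1, epochs=2000):
    """Fit W by batch gradient descent on the mean squared error.

    Assumption: a linear map is used purely for illustration; it is not
    claimed to be the regression model used in the experiments.
    """
    n, d_in, d_out = len(X), len(X[0]), len(Y[0])
    W = [[0.0] * d_out for _ in range(d_in)]
    for _ in range(epochs):
        grad = [[0.0] * d_out for _ in range(d_in)]
        for x, y in zip(X, Y):
            pred = predict(W, x)
            for j in range(d_out):
                err = pred[j] - y[j]
                for i in range(d_in):
                    grad[i][j] += err * x[i] / n
        for i in range(d_in):
            for j in range(d_out):
                W[i][j] -= lr * grad[i][j]
    return W

# Synthetic stand-ins: 4-dimensional "entity embeddings" and 2 numeric
# "knowledge-base attributes" generated from a hidden linear map.
random.seed(0)
d_in, d_out = 4, 2
true_W = [[random.uniform(-1, 1) for _ in range(d_out)] for _ in range(d_in)]
embeddings = [[random.uniform(-1, 1) for _ in range(d_in)] for _ in range(40)]
attributes = [predict(true_W, e) for e in embeddings]

# Train on 30 entities, evaluate on the 10 held-out ones.
train_X, test_X = embeddings[:30], embeddings[30:]
train_Y, test_Y = attributes[:30], attributes[30:]
W = fit_linear_map(train_X, train_Y)
max_error = max(
    abs(p - t)
    for x, y in zip(test_X, test_Y)
    for p, t in zip(predict(W, x), y)
)
print(f"largest held-out error: {max_error:.2e}")
```

Because the synthetic attributes are an exact linear function of the embeddings, the held-out error is tiny here; real attribute data would of course be noisier.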
This pilot study also dealt with predicting categorical relations. However, since their attributes are string-based representations, the regression models encountered massive data sparsity during training, and we could predict these relations only to a reasonable degree of accuracy. As a solution to this problem, we refined our approach from predicting string representations of categorical attributes to predicting their word embeddings. Since the exact embedding of an attribute cannot be reproduced, we decode the predicted embedding through nearest-neighbour computation. Not only did the new approach give us significantly better results but, compared to the previous approach, it also let us address a much larger variety of categorical relations, helping us generalize our models to diverse data. Additionally, we conducted a detailed analysis of the factors which affect the difficulty of relation prediction in a distributional setup [Gupta et al. 2017].
- A. Gupta, G. Boleda, S. Padó. Distributed Prediction of Relations for Entities: The Easy, The Difficult, and The Impossible.
In Proceedings of 6th Joint Conference on Lexical and Computational Semantics (*SEM 2017), Vancouver, Canada.
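The nearest-neighbour decoding step mentioned above can be illustrated with a minimal sketch. The assumptions are mine: cosine similarity is used as the similarity measure (the text does not say which one was used), and the three-dimensional toy vectors and the names `cosine`, `decode_nearest` and `candidate_embeddings` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def decode_nearest(predicted, candidates):
    """Return the candidate word whose embedding is closest to `predicted`."""
    return max(candidates, key=lambda word: cosine(predicted, candidates[word]))

# Toy embeddings of possible attribute values (assumed, not real data).
candidate_embeddings = {
    "Berlin": [0.9, 0.1, 0.0],
    "Paris": [0.1, 0.9, 0.0],
    "Rome": [0.0, 0.1, 0.9],
}

# Suppose a model predicted this vector for Capital(Germany, ?).
predicted_capital = [0.8, 0.3, 0.1]
decoded = decode_nearest(predicted_capital, candidate_embeddings)
print(decoded)
```

The predicted vector never has to match an attribute embedding exactly; it only needs to be closer to the correct one than to any other candidate.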
With the first research question successfully addressed, the next step was to carry out a contrastive analysis of named entities and their concepts (represented by common nouns) to see if there are any substantial differences in their distributional behaviour. For this, we used supervised logistic regression models to compare instantiation detection (Abraham Lincoln - Lawyer) and hypernymy detection (Lawyer - Professional). The results showed that detecting instantiation is easier than detecting hypernymy [Boleda et al. 2017]. The concluding work for these experiments would be an in-depth analysis of the properties which contribute to the differences between common nouns and named entities within the same geometric space, and of effective ways of dealing with such semantically diverse embeddings in compositional and other NLP tasks.
- G. Boleda, A. Gupta, S. Padó. Entities and Concepts in Distributional Space.
In Proceedings of 15th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2017), València, Spain.
Cross-lingual Compositionality Modeling and Evaluation. In the space of compositional distributional semantic models, the practical lexical function model (PLF) [Paperno et al. 2014] strikes a promising balance. On the one hand, it provides a lexicalized (i.e. specialized by word) encoding of semantic dependencies, as best exemplified by full tensor-based approaches. On the other hand, it is designed to keep the number of required model parameters low (a x d x v, as opposed to d^a, where a = number of arguments, d = number of dimensions and v = vocabulary size). In PLF, each word is represented by its semantic vector, and words that can function as a predicate (i.e. the head of a dependency relation) carry one matrix per relation type. At test time, each argument vector is multiplied by the corresponding matrix of the applied predicate to obtain the representation of that combination. Paperno et al. (2014) showed that PLF performed comparably to or better than state-of-the-art fully lexicalized and component-wise combination models. Starting from their description of the model, we determined that there was an inconsistency between the training and test setups of PLF. In this work, we experimented with two approaches to resolving this inconsistency in a monolingual setting: modifying (a) the training-time and (b) the test-time configuration. Testing on the same evaluation datasets (nvn, anvan1 & anvan2), we showed [Gupta et al. 2015] that significant improvements can be gained with the test-phase modification.
This study still left syntax standard and non-incremental. In a first approach to determine the effect of word order on the quality of composed representations, we applied PLF sentence modeling to Croatian, which is known to have freer word order than English and German [Medić et al. 2017]. In addition to evaluating component-wise composition, standard PLF and the modified PLF models [Gupta et al. 2015], this study yielded a valuable insight into where PLF improves most over the simpler models. This was made possible by the design of the evaluation dataset, which allows composition models to be evaluated in terms of a lexical substitution task. To sum up the main insight of the paper: higher-arity words in sentences are more accurately substituted by the PLF models, which explicitly account for the complex interactions between predicates and arguments, whereas for the remaining words the noise and sparsity inherent in the complex PLF models mean that the simpler, and thus more robust, models produce better substitutions.
- A. Gupta, J. Utt, S. Padó. Dissecting the Practical Lexical Function Model for Compositional Distributional Semantics.
In Proceedings of the Fourth Joint Conference on Lexical and Computational Semantics (*SEM 2015), Denver, Colorado.
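As a rough illustration of the PLF composition step described above (one vector per word, one matrix per relation type for predicates, argument vectors multiplied by the matching predicate matrix), here is a minimal sketch. Adding the predicate's own vector to the matrix-vector products is an assumption not spelled out in the text above, and the two-dimensional toy values and names (`plf_compose`, `chase_matrices`, and so on) are hypothetical.

```python
def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def vec_add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def plf_compose(pred_vector, pred_matrices, arguments):
    """Compose a predicate with its arguments, PLF-style.

    pred_matrices maps a relation type (e.g. "subj") to the predicate's
    matrix for that relation; arguments maps the same relation types to
    argument vectors. Assumption: the result is the predicate vector plus
    the sum of all matrix-argument products.
    """
    result = list(pred_vector)
    for relation, arg_vector in arguments.items():
        result = vec_add(result, matvec(pred_matrices[relation], arg_vector))
    return result

# Toy 2-dimensional lexicon (assumed values, not trained parameters).
chase_vector = [1.0, 0.0]
chase_matrices = {
    "subj": [[1.0, 0.0], [0.0, 1.0]],  # identity, for readability
    "obj": [[0.0, 1.0], [1.0, 0.0]],   # swaps the two dimensions
}
dogs = [0.5, 0.2]
cats = [0.1, 0.4]

sentence_vector = plf_compose(chase_vector, chase_matrices, {"subj": dogs, "obj": cats})
print(sentence_vector)  # approximately [1.9, 0.3]
```

Each predicate stores only one d x d matrix per relation type rather than a full higher-order tensor, which is where the parameter savings come from.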
MS by Research:
I finished my MS by Research under the guidance of Prof. Rajeev Sangal at IIIT-Hyderabad, India. During my master's, my primary area of interest was Dialogue Systems. I specialized in developing Natural Language Interfaces to Databases (NLIDB systems), which is a sub-domain of Dialogue Systems. The focus was on developing an NLIDB system which is robust, scalable and easy to customize, for different languages and domains, by people with limited knowledge of programming or database systems. I also worked on processing the aggregations implied in natural language queries in an NLIDB system. I developed an aggregation processing framework that can handle different types of aggregation operations in a natural language query, including direct quantitative as well as indirect qualitative aggregations, and those which combine quantifiers or relational operators with aggregations. The main contribution of the work is providing NLIDB systems with the ability to process qualitative queries on the data stored in a knowledge base.
- A. Gupta. Complex Aggregates In Natural Language Interface To Databases.
Thesis for MS by Research in Computer Science and Engineering at International Institute of Information Technology - Hyderabad (IIIT-H). Telangana, India.
- A. Gupta, R. Sangal. A Novel Approach to Aggregation Processing in Natural Language Interfaces to Databases.
In Proceedings of 10th International Conference on Natural Language Processing (ICON 2013), C-DAC Noida, U.P., India.
- A. Gupta, A. Akula, D. Malladi, P. Kukkadapu, V. Ainavolu, R. Sangal. A Novel Approach Towards Building a Portable NLIDB System Using the Computational Paninian Grammar Framework.
In Proceedings of International Conference on Asian Language Processing (IALP 2012), pp. 93-96, Hanoi, Vietnam.
Web Technologies is another interest of mine. During my research at IIIT-Hyderabad, I was involved in developing and managing several key websites and portals of the Language Technologies Research Centre (LTRC). I was also a project member of the Indian Language Machine Translation project, where I was in charge of web-related development.