Notes from SIGIR2020
Notes taken while attending SIGIR 2020.
We talked about:
- Emotion lexica:
- WEES
- NRC/EmoLex
- WNA
- Types of sarcasm:
- Co-existence of positive and negative emotions in the text (a rough lexicon-based sketch follows this list).
- This is the type that was explored in Ameeta Agrawal's work, Leveraging Transitions of Emotions for Sarcasm Detection.
- I mentioned that the presence of transition words for opposition/contradiction may indicate genuine, impartial attempts to cover both sides of the argument (consider IAC).
- Pointing out issues that should be common sense.
- I talked about a real-life example that happened at a car rental office that was closed.
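Picking up the first type above, here is a minimal sketch of what a lexicon-based co-existence check could look like. The tiny in-line word lists are placeholders standing in for a real resource such as NRC EmoLex; this is just an illustration of the idea, not the method from Agrawal's paper.

```python
# Minimal sketch: flag a sentence as potentially sarcastic when words from
# both positive and negative emotion categories co-occur. The tiny in-line
# lexicon below is a stand-in for a real resource such as NRC EmoLex.
POSITIVE = {"love", "great", "wonderful", "perfect"}
NEGATIVE = {"stuck", "terrible", "awful", "delayed"}

def coexisting_emotions(sentence: str) -> bool:
    tokens = {t.strip(".,!?").lower() for t in sentence.split()}
    return bool(tokens & POSITIVE) and bool(tokens & NEGATIVE)

print(coexisting_emotions("I love being stuck in traffic for three hours."))  # True
print(coexisting_emotions("The keynote was wonderful."))                       # False
```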
- Zhang is the author of the MUSE model. Three types of document relationships:
- semantic relevance,
- textual entailment, and
- textual similarity.
- On the change of the city name Bangalore to Bengaluru:
- Bangalore was the British spelling. In the local official language (Kannada), it is spelled Bengaluru.
- This is similar to how Calcutta was renamed to Kolkata.
- Resources for studying Indian language processing:
- Datasets:
- ArabicWeb16: A New Crawl for Today’s Arabic Web (note to self: see Table 2 for country-level dialect breakdown) (webpage)
- Arabic-specific search engines: Yamli, Eiktub, and Yoolki
- Clayton copula -- easier to get the probability distribution function
- Frank copula
- Gumbel copula
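As a reminder to myself on the Clayton note above: the Clayton copula has a simple closed-form CDF, C(u, v) = (u^(-θ) + v^(-θ) - 1)^(-1/θ) for θ > 0, which is part of why it is convenient to work with directly. A minimal sketch (θ = 2 is an arbitrary choice, not from the talk):

```python
# Minimal sketch: evaluate the Clayton copula CDF,
# C(u, v) = (u^(-theta) + v^(-theta) - 1)^(-1/theta) for theta > 0.
def clayton_cdf(u: float, v: float, theta: float = 2.0) -> float:
    assert 0.0 < u <= 1.0 and 0.0 < v <= 1.0 and theta > 0.0
    return (u ** -theta + v ** -theta - 1.0) ** (-1.0 / theta)

print(clayton_cdf(0.3, 0.7))  # joint CDF value for the two uniform marginals
```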
- SPot: A Tool for Identifying Operating Segments in Financial Tables: similar to my prior work at WRDS, with these differences:
- 8-K instead of 10-K
- parsing XML/HTML instead of plain text
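On the XML/HTML point, this is roughly the shape of the parsing step I have in mind; a hedged sketch using pandas.read_html, with a made-up file name and a crude header heuristic, not SPot's actual pipeline.

```python
# Minimal sketch: pull tables out of an HTML filing instead of regexing plain
# text. The file name and the "segment" header heuristic are made up for
# illustration.
import pandas as pd

tables = pd.read_html("example_8k_filing.html")  # list of DataFrames, one per <table>
for df in tables:
    # Crude stand-in for segment detection: look for a header cell that
    # mentions "segment".
    if any("segment" in str(col).lower() for col in df.columns):
        print(df.head())
```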
- JASSjr: The Minimalistic BM25 Search Engine for Teaching and Learning Information Retrieval: written by the original author of JASS, JASSjr is only 400 lines of C++ code.
- Ranking:
- OpenNIR: the framework upon which the code for Expansion via Prediction of Importance with Contextualization was built. Also written single-handedly by Sean MacAvaney.
- Works by Omar Khattab:
- ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT
- My follow-up question was: Are the queries padded by appending [MASK] tokens to the original tokens only? I wonder what happens if you insert [MASK] tokens randomly between the original tokens. Intuitively, it would probably enhance the robustness of ColBERT to variations of the same query (a small sketch of both variants follows this list).
- Finding the Best of Both Worlds: Faster and More Robust Top-k Document Retrieval
- Efficient Document Re-Ranking for Transformers by Precomputing Term Representations
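The sketch below contrasts the two query-augmentation variants I was wondering about. It works on plain token strings rather than BERT WordPiece ids, and max_len=8 is an arbitrary illustration value; ColBERT's actual implementation differs.

```python
# Minimal sketch of two query-augmentation variants: "append" mimics padding
# the query with trailing [MASK] tokens, while "interleave" scatters the
# [MASK] tokens between the original tokens instead.
import random

def augment(tokens, max_len=8, mode="append"):
    n_pad = max(0, max_len - len(tokens))
    if mode == "append":
        return tokens + ["[MASK]"] * n_pad
    padded = list(tokens)
    for _ in range(n_pad):
        padded.insert(random.randrange(len(padded) + 1), "[MASK]")
    return padded

query = ["cheap", "flights", "to", "bengaluru"]
print(augment(query, mode="append"))
print(augment(query, mode="interleave"))
```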
- Crowdsourcing platforms in Japan:
- Lancers.jp
- Crowdsourcing.yahoo.jp
- Curriculum Learning
- The Cranfield Paradigm
- Package for extracting topics/topic modeling:
- Biterm Topic Model (BTM): word co-occurrence based topic model
- Gensim is also designed for topic modeling; I have been using it solely for training word embedding models, though.
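For my own reference, a tiny sketch of topic modeling with Gensim's LdaModel on a toy corpus (the documents and num_topics=2 are made up for illustration):

```python
# Tiny sketch: LDA topic modeling with Gensim on a toy, pre-tokenized corpus.
from gensim import corpora, models

texts = [
    ["retrieval", "ranking", "bm25", "query"],
    ["emotion", "lexicon", "sarcasm", "sentiment"],
    ["retrieval", "query", "document", "ranking"],
]
dictionary = corpora.Dictionary(texts)                 # token <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in texts]    # bag-of-words vectors
lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```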
- Text Retrieval Conference (TREC): A program of NIST. De facto standard for benchmarking IR work.
- Some metrics:
- BM25: bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document. (from Wikipedia)
- BM = "best matching"
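A minimal sketch of BM25 scoring over a toy corpus, following the common Okapi formulation; k1 = 1.5 and b = 0.75 are typical defaults, not values from any particular talk.

```python
# Minimal sketch: score each document against a query with BM25.
import math

def bm25_scores(query, docs, k1=1.5, b=0.75):
    docs = [d.lower().split() for d in docs]
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    scores = []
    for d in docs:
        score = 0.0
        for term in query.lower().split():
            f = d.count(term)                                 # term frequency in d
            n = sum(1 for doc in docs if term in doc)         # document frequency
            idf = math.log((N - n + 0.5) / (n + 0.5) + 1.0)
            score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = ["bm25 ranks documents by query terms",
        "colbert uses contextualized late interaction",
        "bm25 is a bag of words retrieval function"]
print(bm25_scores("bm25 retrieval", docs))
```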