List of Publications
For a complete and updated list, see my Google Scholar profile.
2026
- PoeTone: A Framework for Constrained Generation of Structured Chinese Songci with LLMs. In AAAI, Singapore, 2026. For insiders: We build a framework for generating classical Chinese Songci that satisfies strict tone and rhyme constraints. For everyone: Songci poetry has rules like sheet music, and we teach an AI to write it while staying on the right tones and rhymes, with a strict checker that keeps it honest (a toy checker sketch follows below).
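
To make the "strict checker" idea concrete, here is a minimal sketch of a tone-pattern validator, not the PoeTone implementation: it assumes a hypothetical toy lexicon `TONE` mapping characters to level ("L") or oblique ("O") tone classes and simply filters generated candidate lines against a required pattern.

```python
# Minimal sketch of a tone-pattern checker for generated lines (illustration only).
# TONE is a toy, hypothetical lexicon: "L" = level tone (ping), "O" = oblique tone (ze).
TONE = {"春": "L", "风": "L", "又": "O", "绿": "O", "江": "L", "南": "L", "岸": "O"}

def matches_pattern(line: str, pattern: str) -> bool:
    """Return True if every character's tone class matches the required pattern."""
    if len(line) != len(pattern):
        return False
    return all(TONE.get(ch) == want for ch, want in zip(line, pattern))

# A generator would propose candidate lines; the checker keeps only the valid ones.
candidates = ["春风又绿江南岸", "春风江南又绿岸"]
valid = [c for c in candidates if matches_pattern(c, "LLOOLLO")]
print(valid)
```

The actual framework also enforces rhyme and the fixed line structure of each form; the point here is only that generation and checking can be decoupled.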
2025
- Incorporating Content-based Features into Quantum Knowledge Graph Embeddings. In AIQxQIA@ECAI, 2025. For insiders: We introduce the first quantum link prediction framework that fuses knowledge-graph structure with text features, using amplitude and angle encoding to inject reduced text embeddings into quantum circuits and improve prediction accuracy. For everyone: We teach a quantum model to “read the labels” in a knowledge graph, not just follow the links. By turning short texts into simple signals for a quantum circuit, it predicts missing connections more accurately.
- From Monolingual to Bilingual: Investigating Language Conditioning in Large Language Models for Psycholinguistic Tasks. In AACL, Mumbai, India, 2025. For insiders: We show that prompting an LLM with a specific language identity changes its behavior on psycholinguistic tasks and shifts what its internal layers encode across languages. For everyone: We show that an AI can “put on” another language like a costume and then answer differently, and that even its inner signals shift when it is prompted in a different language.
- LLM in the Loop: Creating the ParaDeHate Dataset for Hate Speech Detoxification. In AACL, 2025. For insiders: We create ParaDeHate, an LLM-in-the-loop dataset for hate-speech detoxification, enabling systematic evaluation of meaning-preserving toxic-text rewriting. For everyone: We build a large training set that rewrites toxic text into cleaner language without changing the meaning, like running dirty water through a filter and checking the result.
- Decomposed Prompting: Probing Multilingual Linguistic Structure Knowledge in Large Language Models. In AACL, 2025. For insiders: We introduce token-wise decomposed prompting for multilingual sequence labeling, improving accuracy and efficiency for POS tagging across 38 languages (a toy prompting sketch follows this year's list). For everyone: Instead of asking an AI to label a whole sentence in one go, we ask about each word step by step, like a teacher calling roll, and this makes the labels more reliable across languages.
- Context-Based URL Classification for Open Access Datasets and Software in Scholarly Documents. In JCDL, 2025. For insiders: We classify URLs in scholarly documents to identify open datasets and software links, enabling large-scale tracking of research artifacts and openness. For everyone: We build a “link bouncer” for research papers that checks whether a URL really leads to open data or usable software, so open resources are easier to find and count.
- Revisiting Projection-based Data Transfer for Cross-Lingual Named Entity Recognition in Low-Resource Languages. In NoDaLiDa, Tallinn, Estonia, 2025. For insiders: We improve projection-based cross-lingual NER using back-translation and a formal matching step, outperforming baselines across 57 low-resource languages. For everyone: We improve name detection in languages where training material is scarce by transferring labels from other languages and checking again via translation, like carbon paper plus a second copy to catch errors.
- Claim2Source at CheckThat! 2025: Zero-Shot Style Transfer for Scientific Claim-Source Retrieval. In CLEF, 2025. For insiders: We test whether rewriting informal scientific claims in a more formal style improves claim-to-paper retrieval, and analyze when hybrid retrieval helps most. For everyone: We show that rewriting informal claims into a cleaner, paper-like wording helps find the right scientific source, like using the correct catalog terms when searching a library.
- Optical-access networks for smart sustainable cities: from network architecture to fiber deployment. J. Opt. Commun. Netw., Mar 2025. For insiders: We outline optical access-network designs for smart cities and discuss AI-based monitoring and cost-aware fiber deployment for sustainable connectivity. For everyone: This work sketches the “roads and water pipes” of smart cities, explaining how fiber networks and AI monitoring can keep future connectivity fast, stable, and energy-aware.
- SQuAI: Scientific Question-Answering with Multi-Agent Retrieval-Augmented Generation. In CIKM, Seoul, South Korea, Mar 2025. For insiders: SQuAI answers scientific questions via multi-agent retrieval over millions of papers and produces claim-level citations with supporting evidence sentences. For everyone: SQuAI works like a team of librarians by splitting a hard question into smaller ones, pulling evidence from millions of papers, and answering with sources that readers can check.
- Real-E: A Foundation Benchmark for Advancing Robust and Generalizable Electricity Forecasting. In CIKM, Seoul, South Korea, Mar 2025. For insiders: We release Real-E, a large multi-country electricity-forecasting benchmark, and show current methods struggle under non-stationary correlation shifts. For everyone: We build a “weather station” for electricity, release a large real-world testbed, and show why models stumble when the grid’s patterns change over time.
- Hateful Person or Hateful Model? Investigating the Role of Personas in Hate Speech Detection by Large Language Models. In PALS@EMNLP, Mar 2025. For insiders: We show that persona prompts can substantially change LLM hate-speech labels, raising fairness and reliability concerns for LLM-based moderation and annotation. For everyone: We show that switching a chatbot’s “persona mask” can change its hate-speech judgments, like getting different verdicts from the same judge in different costumes.
- ComplexTempQA: A 100m Dataset for Complex Temporal Question Answering. In EMNLP, Suzhou, China, Mar 2025. For insiders: ComplexTempQA provides 100 million temporal QA pairs that require multi-hop reasoning, enabling realistic evaluation of temporal reasoning in language models. For everyone: We create a giant time-travel quiz for AI with 100 million questions about “before”, “after”, and “during”, to see whether models truly keep track of time.
- Extractive Fact Decomposition for Interpretable Natural Language Inference in One Forward Pass. In EMNLP, Suzhou, China, Mar 2025. For insiders: JEDI performs atomic fact decomposition and interpretable NLI in a single encoder-only forward pass, reducing reliance on expensive generative inference at test time. For everyone: We break long statements into small fact “building blocks” and check them efficiently, like snapping sentences into LEGO pieces so the system can justify true or false.
- Graph-Guided Textual Explanation Generation Framework. In EMNLP, Mar 2025. For insiders: We generate free-text explanations guided by faithful highlight cues encoded with a graph layer, improving explanation faithfulness across reasoning datasets. For everyone: We make explanations more like a guided tour by having the model point to evidence first and then connect it step by step, instead of telling a nice story after the fact.
- In-Context Learning for Information Extraction using Fully Synthetic Demonstrations. In XLLM@ACL, Vienna, Austria, Mar 2025. For insiders: We generate synthetic demonstrations and retrieve them at inference to improve document-level entity and relation extraction in low- or zero-shot settings. For everyone: We let the AI write its own practice questions and then pull the most useful ones at the right moment, like learning with flashcards you made yourself and picking the cards that really help.
- Natural Language Inference Fine-tuning for Scientific Hallucination Detection. In SDP@ACL, Vienna, Austria, Mar 2025. For insiders: We fine-tune NLI models to detect scientific hallucinations by checking claims against references, achieving top performance in a shared task. For everyone: We build a science fact-checker that compares a model’s statements with the sources it cites, like a receipt check that flags items that do not match.
- Paths to Causality: Finding Informative Subgraphs within Knowledge Graphs for Knowledge-Based Causal Discovery. In KDD, Mar 2025. For insiders: We propose a neurosymbolic method for knowledge-based causal discovery that selects relevant knowledge graph subgraphs to ground LLM prompting. For everyone: We help AI reason about cause and effect by highlighting the most informative routes through a knowledge map, like handing a detective the best trail of clues.
- Predicting Company ESG Ratings from News Articles Using Multivariate Time-series Analysis. In TempWeb@WWW, Sydney, Australia, Mar 2025. For insiders: We predict company ESG ratings from large news streams using multivariate time-series models, enabling more data-driven sustainability signals at scale. For everyone: We predict ESG ratings from large streams of news, like building a sustainability dashboard that learns from what companies do and what is reported about them over time.
- SimplifyMyText: An LLM-Based System for Inclusive Plain Language Text Simplification. In ECIR, Lucca, Italy, Mar 2025. For insiders: We build and evaluate an LLM system that turns complex documents into inclusive plain language, improving accessibility for readers with comprehension barriers. For everyone: SimplifyMyText turns dense documents into plain language, like clearing bureaucratic fog so more people can understand what matters.
- Explainable LiDAR 3D Point Cloud Segmentation and Clustering for Detecting Airplane-Generated Wind Turbulence. In KDD, Toronto, Canada, Mar 2025. For insiders: We detect aircraft wake turbulence from LiDAR point clouds with explainable ML, providing transparent evidence for aviation safety decisions. For everyone: We detect dangerous wake turbulence with LiDAR and explain the model’s decision, like an alarm that not only rings but also points to the evidence in the air.
- CoDy: Counterfactual Explainers for Dynamic Graphs. In ICML, Vancouver, Canada, Mar 2025. For insiders: CoDy generates counterfactual explanations for dynamic graph models, identifying minimal changes that flip predictions and improving interpretability of temporal GNNs. For everyone: CoDy explains changing network predictions with “what if” examples, like showing the smallest tweak that would flip a decision so humans can follow the logic.
- Hallucinations Can Improve Large Language Models in Drug Discovery. Mar 2025. For insiders: We show that some hallucinated LLM descriptions of molecules can improve downstream property prediction, suggesting ways to leverage noisy text in drug discovery. For everyone: We show that even some of an AI’s made-up descriptions can still carry useful signals for drug discovery, like rough sketches that are not perfect but still reveal patterns.
- Optimizing Small Transformer-Based Language Models for Multi-Label Sentiment Analysis in Short Texts. CoRR, Mar 2025. For insiders: We study how to optimize small transformer models for short-text multi-label sentiment and show when augmentation helps and when it adds harmful noise. For everyone: We study how small language models can handle short, messy texts well, like making a compact tool do a big job without losing accuracy.
- Large Physics Models: Towards a Collaborative Approach with Large Language Models and Foundation Models. Mar 2025. For insiders: We propose a roadmap for physics-specific foundation models, emphasizing collaborative development, rigorous evaluation, and implications for scientific discovery. For everyone: We propose a roadmap for “Large Physics Models”, like planning a new AI lab assistant that can read physics papers, handle equations, and be tested properly before it is trusted.
- Implementation of an Open Chemistry Knowledge Base with a Semantic Wiki. J. Cheminformatics, Mar 2025. For insiders: We build an open chemistry knowledge base with a semantic wiki and structured forms, enabling collaborative capture and comparison of chemical research results. For everyone: We build an open chemistry knowledge base like a shared lab notebook on the web, so results can be entered in a structured way and reused by others.
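
As a concrete illustration of the token-wise decomposed prompting mentioned in the Decomposed Prompting entry above, here is a minimal sketch, not the paper's prompts: `ask_llm` is a hypothetical stand-in for whatever LLM client you use, and the prompt wording is invented for illustration.

```python
# Minimal sketch of token-wise decomposed prompting for POS tagging (illustration only).

def ask_llm(prompt: str) -> str:
    # Hypothetical stand-in: replace with a call to your own LLM client.
    return "NOUN"

def tag_sentence(tokens: list[str], labels: list[str]) -> list[str]:
    """Query the model once per token instead of labeling the whole sentence at once."""
    sentence = " ".join(tokens)
    tags = []
    for token in tokens:
        prompt = (
            f"Sentence: {sentence}\n"
            f"Choose the part-of-speech tag of the word '{token}' from {labels}. "
            "Answer with the tag only."
        )
        tags.append(ask_llm(prompt).strip())
    return tags

print(tag_sentence(["Dogs", "bark"], ["NOUN", "VERB", "ADJ"]))
```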
2024
- Assessing Privacy Policies with AI: Ethical, Legal, and Technical Challenges. In AISyS, Mar 2024. For insiders: We analyze technical feasibility plus ethical and legal risks of using LLMs to assess privacy policies. For everyone: We ask whether AI can read privacy policies like a magnifying glass for fine print, and we lay out the technical, ethical, and legal pitfalls.
- ML approaches for OTDR diagnoses in passive optical networks - event detection and classification: ways for ODN branch assignment. J. Opt. Commun. Netw., Mar 2024. For insiders: We improve ML-based OTDR diagnostics using measured and synthetic traces and map events to network branches for practical fiber monitoring. For everyone: We improve fiber fault diagnosis by training on both real and simulated signals, like training a mechanic on real cars and realistic simulators.
- Benefits of international collaboration in computer science: a case study of China, the European Union, and the United States. Scientometrics, Mar 2024. For insiders: We quantify how international collaboration in computer science relates to productivity and citation impact across China, the EU, and the United States. For everyone: We map scientific collaboration like flight routes and study how cross-border co-authoring relates to productivity and impact over decades.
- Future Timelines: Extraction and Visualization of Future-related Content From News Articles. In WSDM, Mar 2024. For insiders: We extract future-oriented statements from news and visualize them on timelines, helping users anticipate developments under daily information overload. For everyone: We pull future-looking statements from news and place them on a timeline, like turning daily headlines into a simple roadmap of what might come next.
- Embedded Named Entity Recognition using Probing Classifiers. In EMNLP, Miami, FL, USA, Mar 2024. For insiders: EMBER enables fast NER in decoder-only language models via probing, adding minimal overhead and avoiding destructive fine-tuning. For everyone: We add “live labels” to a chatbot as it writes, like sticky notes appearing while you type instead of only after the text is finished.
- The Effects of Hallucinations in Synthetic Training Data for Relation Extraction. In KBC-LM@ISWC, Baltimore, USA, Mar 2024. For insiders: We show that hallucinated synthetic training data can seriously reduce relation extraction quality and propose detectors to filter hallucinations. For everyone: We show that synthetic training data can be polluted by made-up details, like feeding a model mislabeled food, and we propose ways to detect and clean it.
- Knowledge Graph Structure as Prompt: Improving Small Language Models Capabilities for Knowledge-Based Causal Discovery. In ISWC, Baltimore, MD, USA, Mar 2024. For insiders: We inject knowledge graph structure into prompts to improve small language models for causal discovery, showing strong few-shot gains over baselines. For everyone: We help smaller language models think about causes by giving them a structured knowledge map, like signposts that reduce guessing in the dark.
- AutoRDF2GML: Facilitating RDF Integration in Graph Machine Learning. In ISWC, Baltimore, MD, USA, Mar 2024. For insiders: AutoRDF2GML turns RDF knowledge graphs into graph-ML-ready datasets with content and topology features, bridging semantic web data and graph learning. For everyone: AutoRDF2GML converts knowledge graphs into machine learning datasets, like turning a city map into a format a navigation system can train on.
- GNNavi: Navigating the Information Flow in Large Language Models by Graph Neural Network. In ACL, Bangkok, Thailand, Mar 2024. For insiders: GNNavi guides information flow in prompt-based fine-tuning via a graph neural layer, improving few-shot learning while updating only a small fraction of parameters. For everyone: GNNavi steers how information flows during prompting, like a traffic controller that routes signals to the right places so the model learns better.
- GraSAME: Injecting Token-Level Structural Information to Pretrained Language Models via Graph-guided Self-Attention Mechanism. In NAACL, Mexico City, Mexico, Mar 2024. For insiders: GraSAME injects token-level graph structure into pretrained language models via graph-guided self-attention for efficient graph-to-text generation. For everyone: GraSAME weaves relationship structure into how a language model reads, like stitching a network into its attention so it can describe connected data more accurately.
- KITspotlight: A System for Spotlighting Researchers in the Media. In ICWE, Tampere, Finland, Mar 2024. For insiders: KITspotlight identifies and aggregates researchers mentioned in news articles, enabling scalable media monitoring and visibility analytics for institutions. For everyone: KITspotlight scans news for mentions of researchers, like a radar that turns scattered articles into a clear visibility dashboard.
- ToPro: Token-Level Prompt Decomposition for Cross-Lingual Sequence Labeling Tasks. In EACL, Malta, Mar 2024. For insiders: ToPro decomposes prompts token-by-token for sequence labeling, improving cross-lingual transfer for tasks like NER and POS tagging. For everyone: ToPro breaks prompts into tiny word-level questions, like asking about each word in turn, which makes labeling across languages less brittle.
- HyperPIE: Hyperparameter Information Extraction from Scientific Publications. In ECIR, Glasgow, UK, Mar 2024. For insiders: HyperPIE extracts hyperparameters from scientific papers, enabling machine-readable experiment settings for reproducibility, search, and meta-analysis. For everyone: HyperPIE reads papers to extract exact experiment settings, like pulling oven temperature and timing from thousands of recipes so results can be compared.
- A Novel Machine Learning-based Equalizer for a Downstream 100G PAM-4 PON. In OFC, San Diego, CA, USA, Mar 2024. For insiders: We propose an ML-based equalizer for downstream 100G PAM-4 PON and show large BER improvements with lower complexity than baselines. For everyone: We use AI to clean up fiber signals, like noise-cancelling headphones for optical networks, so high-speed links work with fewer errors.
- Advanced Equalization in 112 Gb/s Upstream PON Using a Novel Fourier Convolution-based Network. In ECOC, Frankfurt, Germany, Mar 2024. For everyone: We design a smarter signal-cleaning model for fiber links, like tuning a radio with a better filter so high-speed internet arrives with fewer glitches.
- Incorporating Quantum Advantage in Quantum Circuit Generation through Genetic Programming. Mar 2024. For insiders: We use quantum advantage in genetic programming for circuit generation, improving convergence and producing circuits competitive with expert designs on standard tasks. For everyone: We evolve quantum circuits by trial and reward, like selective breeding for better designs that solve test problems faster.
- Why Lift So Heavy? Slimming Large Language Models by Cutting Off the Layers. In IJCNN, Rome, Italy, Mar 2024. For insiders: We show that truncating layers can shrink large language models while retaining strong performance on some tasks (a toy layer-truncation sketch follows below). For everyone: We show that some large language models can go on a diet by removing layers without collapsing on certain tasks, pointing to more efficient and greener AI.
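
The layer-cutting idea from "Why Lift So Heavy?" can be illustrated with a toy model; this is a minimal sketch of the general recipe (keep only the bottom k blocks), not the paper's exact procedure or models.

```python
import torch
import torch.nn as nn

# Minimal sketch of "cutting off the layers": keep only the first k blocks of a
# deep model. Toy encoder for illustration, not a real LLM.
class ToyEncoder(nn.Module):
    def __init__(self, dim: int = 64, n_layers: int = 12):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = torch.relu(layer(x))
        return x

def truncate(model: ToyEncoder, keep: int) -> ToyEncoder:
    """Drop the top layers and keep only the bottom `keep` ones."""
    model.layers = nn.ModuleList(list(model.layers)[:keep])
    return model

model = truncate(ToyEncoder(), keep=6)                 # 12 -> 6 layers
print(sum(p.numel() for p in model.parameters()))      # fewer parameters after slimming
```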
2023
- Linked Papers With Code: The Latest in Machine Learning as an RDF Knowledge Graph. In ISWC, Athens, Greece, Mar 2023. For insiders: LPWC turns the ML landscape into a machine-queryable knowledge graph linking papers to tasks, datasets, methods, and results for semantic search and analysis. For everyone: LPWC turns machine-learning research into a structured “periodic table” of papers, tasks, datasets, and results that can be searched and connected.
- SemOpenAlex: The Scientific Landscape in 26 Billion RDF Triples. In ISWC, Mar 2023. For insiders: We release SemOpenAlex, a scholarly knowledge graph with 26 billion triples, dumps, SPARQL access, and embeddings, enabling large-scale semantic science analytics and search. For everyone: SemOpenAlex is an open “Google Maps for science”, built as a connected map of papers and authors so others can navigate research at web scale.
- Biases in Scholarly Recommender Systems: Impact, Prevalence, and Mitigation. Scientometrics, Mar 2023. For insiders: We measure biases in scholarly recommender systems and discuss mitigation strategies to support fairer literature discovery under severe paper overload. For everyone: We show that literature recommenders can have blind spots, like a librarian who keeps pointing to the same shelves, and we outline how to measure and reduce this effect.
- AI-Based OTDR Event Detection, Classification and Assignment to ODN Branches in Passive Optical Networks. Journal of Optical Communications and Networking, Mar 2023. For insiders: We detect and classify events in optical networks from OTDR traces and assign faults to network branches, improving reliability of passive optical networks. For everyone: We listen to echoes in fiber cables like sonar and use machine learning to spot and locate faults, improving how networks can be monitored.
- Analyzing the Impact of Companies on AI Research Based on Publications. Scientometrics, Mar 2023. For insiders: We measure how companies influence AI research via publishing and show differences in citation impact and online attention between industry and academia. For everyone: We measure how strongly companies shape AI research through publishing, like tracking who is rowing and who is steering in a shared boat.
- A Full-Fledged Framework for Combining Entity Linking Systems and Components. In K-CAP, Mar 2023. For insiders: We present a full framework for modular entity-linking pipelines, enabling easier combination of components and fairer comparison across systems. For everyone: We offer a framework to combine entity-linking systems, like a mixing console where you can plug in different components and compare results fairly.
- A Full-fledged Commit Message Quality Checker Based on Machine Learning. In COMPSAC, Torino, Italy, Mar 2023. For insiders: We build an ML-based tool that checks commit-message quality using established rules, supporting maintainability and better software engineering practice. For everyone: We build a quality checker for commit messages, like a grammar checker for software teams so code changes stay understandable.
- Ablesbarkeitsmesser: A System for Assessing the Readability of German Text. In ECIR, Dublin, Ireland, Mar 2023. For insiders: We provide an online service to estimate readability of German texts, supporting clearer writing for education and public communication. For everyone: We build a “readability thermometer” for German, so people can quickly see whether a text is easy or hard to read and adjust it for the audience.
- CoCon: A Data Set on Combined Contextualized Research Artifact Use. In JCDL, Mar 2023. For insiders: We release CoCon, a dataset capturing combined use of research artifacts in papers, enabling new prediction and recommendation tasks beyond paper-level analysis. For everyone: CoCon shows which tools, datasets, and methods tend to appear together in papers, like revealing which ingredients often show up in the same scientific recipes.
- Evaluating Generative Models for Graph-to-Text Generation. In RANLP, Varna, Bulgaria, Mar 2023. For insiders: We benchmark GPT-style models for graph-to-text generation in zero-shot settings and analyze when they hallucinate versus fine-tuned baselines. For everyone: We test whether language models can describe a network diagram faithfully, like asking a narrator to tell a story from a map and checking where it invents details.
- Automatic Hint Generation. In ICTIR, Taipei, Taiwan, Mar 2023. For insiders: We generate hints from Wikipedia for question-answer pairs so learners can reach answers themselves, supporting reasoning without giving solutions away. For everyone: We generate helpful hints instead of full answers, like giving a nudge in the right direction so people learn and reason rather than being handed the solution.
- Impact, Attention, Influence: Early Assessment of Autonomous Driving Datasets. In ICCRE, Mar 2023. For insiders: We analyze early impact signals of autonomous-driving datasets and propose an influence score to estimate dataset value before citations mature. For everyone: We estimate early which self-driving datasets will matter later, like spotting which seedlings are likely to become the strongest plants.
- Measuring Variety, Balance, and Disparity: An Analysis of Media Coverage of the 2021 German Federal Election. Mar 2023. For insiders: We measure news diversity in German election coverage using variety, balance, and disparity metrics to support transparent analysis of media pluralism. For everyone: We measure how diverse news coverage is, like checking whether the media menu offers many perspectives or keeps serving the same dish.
- Theories About World Representations for the Internet of Things. In CARLA, Mar 2023. For insiders: We connect theories of meaning and knowledge representation to IoT practice, highlighting challenges of grounding, intersubjectivity, and modeling change over time. For everyone: We compare how devices and AI systems build pictures of the world, like comparing different mental models to see which ones support shared understanding.
- Vocab-Expander: A System for Creating Domain-Specific Vocabularies Based on Word Embeddings. In RANLP, Mar 2023. For insiders: Vocab-Expander builds domain vocabularies from embeddings and knowledge bases, helping users discover terms for search, innovation, and interdisciplinary work (a toy expansion sketch follows below). For everyone: Vocab-Expander grows a domain vocabulary like expanding a map legend, suggesting useful terms and links so people can search and communicate more easily.
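
As a rough illustration of embedding-based vocabulary expansion, here is a minimal sketch; the tiny random vectors stand in for real word embeddings, the term list is invented, and the actual system also draws on knowledge bases.

```python
import numpy as np

# Minimal sketch of embedding-based vocabulary expansion (illustration only):
# rank candidate terms by cosine similarity to the centroid of the seed terms.
rng = np.random.default_rng(0)
vocab = ["photovoltaics", "solar cell", "wind turbine", "battery", "espresso"]
emb = {w: rng.normal(size=50) for w in vocab}   # placeholders for real word vectors

def expand(seeds: list[str], top_k: int = 3) -> list[str]:
    """Suggest the non-seed terms closest to the mean seed vector."""
    centroid = np.mean([emb[s] for s in seeds], axis=0)

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    candidates = [w for w in vocab if w not in seeds]
    return sorted(candidates, key=lambda w: cos(emb[w], centroid), reverse=True)[:top_k]

# With real embeddings, related terms (e.g., other energy terms) would rank first.
print(expand(["photovoltaics", "solar cell"]))
```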
2022
- The Microsoft Academic Knowledge Graph enhanced: Author name disambiguation, publication classification, and embeddings. Quant. Science Studies, Mar 2022. For insiders: We enhance a massive scholarly knowledge graph with improved author disambiguation, publication classification, and embeddings to improve search and analytics at scale. For everyone: We clean and enrich a huge catalog of scientific papers, like fixing name tags and adding smart labels so large-scale search works more reliably.
- The Green AI Ontology: An Ontology for Modeling the Energy Consumption of AI Models. In ISWC, Virtual Conference, Hangzhou, China, Mar 2022. For insiders: We introduce the Green AI Ontology to represent energy use and sustainability properties of AI models, enabling transparent reporting and comparison. For everyone: We build a “nutrition label” for AI models that records energy use and sustainability, so systems can be compared with clear facts.
- A Blocking-Based Approach to Enhance Large-Scale Reference Linking. In LITE@JCDL, Mar 2022. For insiders: We improve large-scale reference linking using a blocking strategy over reference fields (a toy blocking sketch follows this year's list). For everyone: We speed up linking references by sorting them into smart bins first, like organizing mail before matching addresses so more links are made correctly.
- How Does Author Affiliation Affect Preprint Citation Count? Analyzing Citation Bias at the Institution and Country Level. In JCDL, Cologne, Germany, Mar 2022. For insiders: We quantify affiliation-based citation inequality for bioRxiv preprints and show stronger bias signals than in publisher versions. For everyone: We show that an author’s institution and country can influence citation outcomes, like a nameplate effect that raises fairness questions in how science is evaluated.
- Few-Shot Document-Level Relation Extraction. In NAACL, Seattle, WA, USA, Mar 2022. For insiders: We introduce a few-shot benchmark for document-level relation extraction and reveal challenges beyond sentence-level settings, including realistic NOTA behavior. For everyone: We create a benchmark for extracting relationships from whole documents with only a few examples, like testing whether a student understood the full story and not just one line.
- AIFB-WebScience at SemEval-2022 Task 12: Relation Extraction First - Using Relation Extraction to Identify Entities. In SemEval@NAACL, Mar 2022. For insiders: We build an end-to-end pipeline where relation extraction supports entity identification, improving extraction from LaTeX documents in a shared-task setting. For everyone: We help find names in technical text by learning relationships first.
- Which Publications’ Metadata Are in Which Bibliographic Databases? A System for Exploration. In BIR@ECIR, Stavanger, Norway, Mar 2022. For insiders: RefBee helps researchers check which publications are missing from major bibliographic databases, improving visibility and metadata completeness. For everyone: RefBee is a checklist for your publication record, like spotting missing books on a shelf before others do.
- Safety Aware Reinforcement Learning by Identifying Comprehensible Constraints in Expert Demonstrations. In SafeAI@AAAI, Mar 2022. For insiders: We learn human-readable safety constraints from expert demonstrations to monitor RL agents, improving safety without losing interpretability. For everyone: We turn expert demonstrations into clear safety rules, like writing down a skilled driver’s habits so an AI can avoid dangerous moves.
- When to Use Which Neural Network? Finding the Right Neural Network Architecture for a Research Problem. In SDU@AAAI, Mar 2022. For insiders: We recommend neural-network architectures from textual problem descriptions and show that simpler models can be strong baselines for this task. For everyone: We recommend neural-network designs from a short problem description, like a tool clerk suggesting the right instrument once you describe the job.
- Explaining Convolutional Neural Networks by Tagging Filters. In AIMLAI@CIKM, Mar 2022. For insiders: We explain CNN decisions by tagging filters with human-readable labels, producing intuitive explanations and supporting error analysis. For everyone: We explain image classifiers by naming what their hidden filters react to, like labeling the internal sensors so people can understand why a decision happened.
- Sequence Labeling for Citation Field Extraction from Cyrillic Script References. In SDU@AAAI, Mar 2022. For insiders: We create a large multilingual dataset of Cyrillic references and train accurate sequence labelers for citation-field extraction beyond Latin scripts. For everyone: We teach models to parse Cyrillic references, like sorting foreign-language mail into the right slots so citation data can be extracted reliably.
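
The blocking idea from the reference-linking entry above can be sketched in a few lines; the blocking key (first author plus year) and the toy records are illustrative assumptions, not the paper's exact setup.

```python
from collections import defaultdict

# Minimal sketch of blocking for reference linking (illustration only): group
# reference records by a cheap key so expensive matching only runs within a block.
def block_key(ref: dict) -> str:
    # Toy key: first author's surname (lowercased) plus publication year.
    return f"{ref['first_author'].lower()}:{ref['year']}"

def build_blocks(references: list[dict]) -> dict[str, list[dict]]:
    blocks = defaultdict(list)
    for ref in references:
        blocks[block_key(ref)].append(ref)
    return blocks

refs = [
    {"first_author": "Smith", "year": 2019, "title": "A study of X"},
    {"first_author": "Smith", "year": 2019, "title": "A Study of X."},
    {"first_author": "Lee", "year": 2020, "title": "On Y"},
]
for key, group in build_blocks(refs).items():
    # Pairwise matching would now only compare items inside each block.
    print(key, len(group))
```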
2021
- The Data Set Knowledge Graph: Creating a Linked Open Data Source for Data Sets. Quant. Science Studies, Mar 2021. For insiders: We introduce the Data Set Knowledge Graph (DSKG), the first large-scale linked open dataset that connects datasets to the scholarly papers that mention them, enabling better dataset discovery and making data contributions transparent and measurable. For everyone: We link datasets to the papers that mention them, creating a map that helps people find data faster and gives data creators clearer credit.
- Cross-Lingual Citations in English Papers: A Large-Scale Analysis of Prevalence, Usage, and Impact. International Journal on Digital Libraries, Mar 2021. For insiders: We analyze cross-lingual citations at scale to quantify prevalence and impact, showing how language affects scholarly visibility and citation behavior. For everyone: We trace citations across languages like trade routes on a map, showing which languages get noticed and cited in English scientific writing.
- Improving Question Answering for Event-focused Questions in Temporal Collections of News Articles. Information Retrieval Journal, Mar 2021. For insiders: We improve question answering over long-term news archives with temporal reranking, helping users retrieve relevant evidence for event-centric questions. For everyone: We make question answering over news archives time-aware, like a historian who knows which year to search when the question forgets to say it.
- Towards Full-Fledged Argument Search: A Framework for Extracting and Clustering Arguments from Unstructured Text. CoRR, Mar 2021. For insiders: We propose a unified argument-search framework that extracts and clusters arguments from text for more comprehensive debate analysis. For everyone: We build a machine that mines arguments from messy text and sorts them into clusters, like turning a chaotic debate into neat stacks of note cards.
- Are Investors Biased Against Women? Analyzing How Gender Affects Startup Funding in Europe. CoRR, Mar 2021. For insiders: We analyze European startup funding data and quantify how team gender composition relates to funding outcomes, providing evidence of structural bias risks. For everyone: We analyze European startup funding like checking a pipeline, asking whether team gender composition relates to how money flows and where bias may appear.
- Recommending Datasets for Scientific Problem Descriptions. In CIKM, Virtual Event, Queensland, Australia, Mar 2021. For insiders: We propose dataset recommendation from natural-language scientific problem descriptions, cutting dataset search time and improving dataset reuse at scale (a toy matching sketch follows this year's list). For everyone: You describe your research problem in plain text and we suggest suitable datasets, like a matchmaking service that connects questions with the data that can answer them.
- DataHunter: A System for Finding Datasets Based on Scientific Problem Descriptions. In RecSys, Amsterdam, The Netherlands, Mar 2021. For insiders: DataHunter finds datasets from problem descriptions using scholarly text and the Data Set Knowledge Graph, supporting FAIR data discovery workflows. For everyone: DataHunter searches for datasets from a problem description, like a treasure map that points to the right data by reading what you want to study.
- Quantifying Explanations of Neural Networks in E-Commerce Based on LRP. In ECML-PKDD, Bilbao, Spain, Mar 2021. For insiders: We quantify explanation quality for neural networks in e-commerce settings, helping practitioners assess trust and compliance for recommendation models. For everyone: We measure how good an AI explanation in online shopping really is, like using a ruler to check whether it stays stable, useful, and trustworthy.
- Combining Linking Techniques for Everyone. In ESWC, Mar 2021. For insiders: We provide a framework for combining and comparing entity-linking components, improving reproducibility and fair evaluation across pipelines. For everyone: We make entity linking plug-and-play, like LEGO bricks that can be swapped and combined so users can compare systems without rebuilding everything.
- Media Bias Everywhere? A Vision for Dealing with the Manipulation of Public Opinion. In BIAS@ECIR, Lucca, Italy, Mar 2021. For insiders: We outline how AI can detect and communicate media bias, supporting transparency, media literacy, and more resilient public discourse. For everyone: We sketch a blueprint for fighting manipulation in the news, showing how AI can detect bias and present it in ways people can actually understand.
- C-Rex: A Comprehensive System for Recommending In-Text Citations with Explanations. In Sci-K@WWW, Virtual Event, Ljubljana, Slovenia, Mar 2021. For insiders: C-Rex is an online system that recommends in-text citations and provides explanations, helping authors write more efficiently and responsibly. For everyone: C-Rex is an online citation assistant, like a helpful co-author that suggests supporting papers and briefly explains why they fit.
- Exploding TV Sets and Disappointing Laptops: Suggesting Interesting Content in News Archives Based on Surprise Estimation. In ECIR, Mar 2021. For insiders: We recommend surprisingly interesting content from news archives by contrasting past and present, making digital heritage collections more engaging. For everyone: We find “surprising gems” in news archives by comparing past statements with today, like opening a time capsule and highlighting what changed.
- Right for the Right Reasons: Making Image Classification Intuitively Explainable. In ECIR, Virtual Event, Mar 2021. For insiders: We propose a metric for object-aligned explanations in image classification and show training can make models rely on the right evidence. For everyone: We check whether image AIs look at the right evidence, like asking a student to point to the exact part of a picture that supports the answer.
- Identifying Used Methods and Datasets in Scientific Publications. In SDU@AAAI, Mar 2021. For insiders: We automatically identify which methods and datasets were used in papers via extraction and context classification. For everyone: We extract which methods and datasets papers really used.
- Bootstrapping Multilingual Metadata Extraction: A Showcase in Cyrillic. In SDP@NAACL, Mar 2021. For insiders: We bootstrap multilingual metadata extraction for Cyrillic scholarly data, reducing barriers for non-Latin research communities and digital libraries. For everyone: We help computers read Cyrillic scholarly metadata, like teaching a librarian to catalog books in another script so they do not stay invisible.
- AWARE: An Ontology for Situational Awareness of Autonomous Vehicles in Manufacturing. In CSKG@AAAI, Mar 2021. For insiders: We develop an ontology for situational awareness of autonomous factory vehicles, enabling structured reasoning about perception and safer operation. For everyone: We build a shared “mental map” for factory robots, like a common language that helps vehicles describe what they sense and choose safe actions.
- Theories of Meaning for the Internet of Things. In Language, Cognition, and Mind, Mar 2021. For insiders: We compare semantic theories for representing meaning in IoT and show why grounding, aligned perspectives, and dynamics must be addressed together. For everyone: We compare ways machines can connect words to the real world.
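
As a toy illustration of matching problem descriptions to datasets (see the Recommending Datasets entry above), here is a minimal sketch using plain TF-IDF similarity; the dataset descriptions and the example problem are invented, and the actual systems go well beyond this baseline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Minimal sketch of dataset recommendation from a problem description
# (illustration only): rank dataset descriptions by textual similarity.
datasets = {
    "SQuAD": "reading comprehension questions and answers over Wikipedia paragraphs",
    "ImageNet": "millions of labeled images for object recognition",
    "LibriSpeech": "read English speech for automatic speech recognition",
}
problem = "I want to train a model that answers questions about short text passages"

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([problem] + list(datasets.values()))
scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
ranking = sorted(zip(datasets, scores), key=lambda kv: kv[1], reverse=True)
print(ranking)   # the QA dataset should come out on top
```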
2020
- Citation Recommendation: Approaches and Datasets. Int. J. Digit. Libr., Mar 2020. For insiders: We survey citation recommendation methods and datasets and highlight evaluation pitfalls and open challenges for assisting scientific writing. For everyone: This survey explains citation recommendation, like a GPS for references, and summarizes which datasets and tests are needed before such tools deserve trust.
- unarXive: A Large Scholarly Dataset With Publications’ Full Text, Annotated In-Text Citations, and Links to Metadata. Scientometrics, Mar 2020. For insiders: We present unarXive, a large-scale arXiv full-text corpus with precisely linked in-text citations and rich metadata, empowering citation-aware, transparent, and reproducible scholarly NLP. For everyone: We build a large “paper warehouse” where citations inside arXiv texts are tagged and linked, so citation analyses need far less manual cleanup.
- Analyzing the GitHub Repositories of Research Papers. In JCDL, Virtual Event, China, Mar 2020. For insiders: We analyze GitHub repositories linked from research papers to study maintenance and documentation patterns, providing signals for research reproducibility. For everyone: We analyze the GitHub code behind research papers, like checking the health of the software that scientific results depend on in practice.
- HybridCite: A Hybrid Model for Context-Aware Citation Recommendation. In JCDL, Virtual Event, China, Mar 2020. For insiders: HybridCite combines embedding and IR signals to improve context-aware citation recommendation beyond single-method baselines (a toy score-fusion sketch follows this year's list). For everyone: HybridCite recommends citations by combining several signals, like using both a compass and a map so suggestions fit the writing context better.
- Who’s Behind That Website? Classifying Websites by the Degree of Commercial Intent. In ICWE, Helsinki, Finland, Mar 2020. For insiders: We classify websites by commercial intent with ML, helping distinguish individuals, companies, non-profits, and public institutions from sparse web content. For everyone: We sort websites by whether they are commercial or not, like a quick “smell test” that works even when there is little text.
- Annotating and Analyzing Biased Sentences in News Articles using Crowdsourcing. In LREC, Marseille, France, Mar 2020. For insiders: We create and analyze a crowdsourced dataset of biased sentence annotations, enabling fine-grained study of subtle news bias. For everyone: We crowdsource labels of subtle bias in news, like using many small magnifying glasses to spot slanted wording sentence by sentence.
- KORE 50^DYWC: An Evaluation Data Set for Entity Linking Based on DBpedia, YAGO, Wikidata, and Crunchbase. In LREC, Marseille, France, Mar 2020. For insiders: We extend the KORE 50 entity-linking benchmark to multiple knowledge graphs, enabling broader evaluation across DBpedia, Wikidata, YAGO, and Crunchbase. For everyone: We take a well-known test set for linking names to real-world entities and expand it to several different knowledge bases, like checking whether a navigation system still works when you switch from one city map to another.
- Answering Event-Related Questions over Long-Term News Article Archives. In ECIR, Lisbon, Portugal, Mar 2020. For insiders: We answer event-related questions over long-term news archives using temporal reranking, improving retrieval when questions have implicit time scopes. For everyone: We answer event questions in huge news archives using time clues, like flipping to the right chapter of history even when the question forgets to name the year.
- Semantic Modelling of Citation Contexts for Context-Aware Citation Recommendation. In ECIR, Lisbon, Portugal, Mar 2020. For insiders: We model citation contexts using entities and claim structure and show improved context-aware citation recommendation at scale. For everyone: We turn citation contexts into structured signals, like extracting the ingredients and purpose of a reference so models understand more than nearby words.
- Neural Citation Recommendation: A Reproducibility Study. In BIR@ECIR, Lisbon, Portugal, Mar 2020. For insiders: We reimplement a neural citation recommender and analyze hyperparameter effects, improving reproducibility and providing practical tuning guidance. For everyone: We rebuild a well-known citation recommender from scratch, like re-running a recipe to check whether others can truly reproduce the same result.
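
The hybrid idea behind HybridCite, combining an embedding-based ranker with an IR-style ranker, can be illustrated with simple score fusion; this sketch assumes toy score dictionaries and a weighted sum, which is one common fusion choice rather than the paper's exact scheme.

```python
# Minimal sketch of hybrid score fusion for citation recommendation (illustration only):
# combine min-max-normalized scores from two rankers into one ranking.
def fuse(embedding_scores: dict[str, float],
         ir_scores: dict[str, float],
         alpha: float = 0.5) -> list[tuple[str, float]]:
    """Weighted sum of normalized scores; higher is better."""
    def normalize(scores):
        lo, hi = min(scores.values()), max(scores.values())
        return {k: (v - lo) / (hi - lo or 1.0) for k, v in scores.items()}

    e, i = normalize(embedding_scores), normalize(ir_scores)
    papers = set(e) | set(i)
    fused = {p: alpha * e.get(p, 0.0) + (1 - alpha) * i.get(p, 0.0) for p in papers}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Toy scores: the embedding ranker and the IR ranker each saw different candidates.
print(fuse({"paperA": 0.9, "paperB": 0.4}, {"paperB": 12.0, "paperC": 7.0}))
```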
2019
- Linked Crunchbase: A Linked Data API and RDF Data Set About Innovative Companies. CoRR, Mar 2019. For insiders: We provide a Linked Data API and RDF dataset for Crunchbase, enabling machine-readable analytics of startups, organizations, people, and investments. For everyone: We publish Crunchbase as a web-friendly data stream, like adding a spout so machines can “drink” structured startup information.
- The Microsoft Academic Knowledge Graph: A Linked Data Source with 8 Billion Triples of Scholarly Data. In ISWC, Auckland, New Zealand, Mar 2019. For insiders: We release the Microsoft Academic Knowledge Graph in RDF with billions of triples and embeddings, enabling large-scale scholarly integration and analytics. For everyone: We publish a huge linked dataset of scholarly metadata, like turning a library card catalog into a connected web map that machines can navigate.
- Identifying Twitter Bots Using a Convolutional Neural Network. In CLEF, Lugano, Switzerland, Mar 2019. For insiders: We detect Twitter bots from tweet text with CNNs and achieve strong accuracy on a shared evaluation task for automated bot profiling. For everyone: We detect Twitter bots from their writing style, like using a metal detector to separate automated noise from real people.
- Determining How Citations Are Used in Citation Contexts. In TPDL, Oslo, Norway, Mar 2019. For insiders: We define fine-grained categories of citation usage and train models to classify them, improving citation analysis and recommendation. For everyone: We label how citations are used inside sentences, like tagging whether a reference backs a claim or just explains a concept.
- Evaluating the Availability of Open Citation Data. In BIRNDL@SIGIR, Paris, France, Mar 2019. For insiders: We quantify how much citation data is openly available and how openness evolves over time, informing open-science policy and research evaluation. For everyone: We take a census of open citation data and measure how much of the scientific reference network is visible to the public and how it changes over time.
- Relational schemata for distributed SPARQL query processing. In SBD@SIGMOD, Amsterdam, The Netherlands, Mar 2019. For insiders: We compare relational schema designs for distributed SPARQL on Spark and show how schema choice impacts performance on standard RDF benchmarks. For everyone: We compare different storage “blueprints” for linked data on big clusters, like testing how shelf layouts change how fast you find the right book.
- ScholarSight: Visualizing Temporal Trends of Scientific Concepts. In JCDL, Champaign, IL, USA, Mar 2019. For insiders: ScholarSight visualizes temporal trends of scientific concepts extracted from papers, helping researchers explore emerging and fading topics. For everyone: ScholarSight works like a trend telescope for science, showing which ideas rise or fade so researchers can see what is heating up and what is cooling down.
- Team Peter Brinkmann at SemEval-2019 Task 4: Detecting Biased News Articles Using Convolutional Neural Networks. In SemEval@NAACL-HLT 2019, Minneapolis, MN, USA, Mar 2019. For insiders: We detect hyperpartisan biased news with neural text models and analyze trade-offs in precision and recall on a large shared-task benchmark. For everyone: We build systems that flag extremely one-sided news, like an early warning label for strongly biased reporting, and we study where these systems fail.
- Bibliometric-Enhanced arXiv: A Data Set for Paper-Based and Citation-Based Tasks. In BIR@ECIR, Cologne, Germany, Mar 2019. For insiders: We provide a bibliometric-enhanced arXiv dataset for paper- and citation-based tasks. For everyone: We add citation links and metadata to arXiv papers, like attaching extra labels to millions of documents so recommendation tools can be tested fairly.
- Finding Temporal Trends of Scientific Concepts. In BIR@ECIR, Cologne, Germany, Mar 2019. For insiders: We track fine-grained scientific concept trends in arXiv full texts using statistical trend tests, helping understand how research topics evolve (a toy trend-test sketch follows this year's list). For everyone: We track fine-grained scientific ideas over time, like a seismograph for concepts that shows which ones steadily grow and which ones fade.
- PaperHunter: A System for Exploring Papers and Citation Contexts. In ECIR, Cologne, Germany, Mar 2019. For insiders: PaperHunter lets users search citation contexts and polarity, enabling qualitative exploration of how papers are cited and discussed. For everyone: PaperHunter lets you search the exact sentences where a paper is cited, like finding every time a name is mentioned and reading the surrounding context.
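
As a small illustration of testing a concept's frequency series for a trend (see the Finding Temporal Trends entry above), here is a sketch using Kendall's tau against time, which is closely related to the Mann-Kendall trend test; the yearly counts are invented and the paper's exact tests may differ.

```python
from scipy.stats import kendalltau

# Minimal sketch of detecting a monotonic trend in a concept's yearly frequency
# (illustration only, with invented numbers).
years = [2012, 2013, 2014, 2015, 2016, 2017, 2018]
mentions_per_1k_papers = [0.4, 0.5, 0.9, 1.4, 2.1, 3.0, 4.2]

tau, p_value = kendalltau(years, mentions_per_1k_papers)
if p_value < 0.05 and tau > 0:
    print(f"rising trend (tau={tau:.2f}, p={p_value:.3f})")
else:
    print(f"no clear trend (tau={tau:.2f}, p={p_value:.3f})")
```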
2018
- A Linked Data Wrapper for CrunchBase. Semantic Web, Mar 2018. For insiders: We provide Crunchbase as Linked Data, enabling richer integration and querying of startup, people, and investment data with web standards. For everyone: We publish Crunchbase as Linked Data, turning startup and investment information into a connected map that is easier to combine, query, and analyze.
- Linked Data Quality of DBpedia, Freebase, OpenCyc, Wikidata, and YAGO. Semantic Web, Mar 2018. For insiders: We compare major knowledge graphs using a systematic data-quality framework and help practitioners choose the right graph for their application needs. For everyone: We create a consumer-style test report for major knowledge graphs, comparing data quality so developers can choose the right “map of the world” for their use case.
- Which Knowledge Graph Is Best for Me? CoRR, Mar 2018. For insiders: We translate knowledge graph quality analysis into practical selection rules so users can pick a knowledge graph that matches their data quality requirements. For everyone: We offer a practical “which map should I buy?” guide for knowledge graphs, turning quality checks into simple rules that help people choose.
- A High-Quality Gold Standard for Citation-based Tasks. In LREC, Miyazaki, Japan, Mar 2018. For insiders: We create a high-quality arXiv-based gold standard with correctly linked citations, enabling reliable evaluation of citation-centric NLP tasks. For everyone: We build a carefully checked reference set where citations in arXiv full texts are linked correctly.
- PRoST: Distributed Execution of SPARQL Queries Using Mixed Partitioning Strategies. In EDBT, Vienna, Austria, Mar 2018. For insiders: PRoST is a Spark-based RDF store that speeds up SPARQL by combining vertical partitioning and property tables for scalable graph querying (a toy property-table sketch follows this year's list). For everyone: PRoST speeds up queries on linked data by changing the “warehouse layout”, so answers can be found faster on big computing clusters.
- CITEWERTs: A System Combining Cite-Worthiness with Citation Recommendation. In ECIR, Grenoble, France, Mar 2018. For insiders: CITEWERTs combines cite-worthiness detection with citation recommendation to guide writers to sentences that need citations and suggest references. For everyone: CITEWERTs first checks whether a sentence really needs a citation, like a spell-checker for missing references, and then suggests suitable papers.
- To Cite, or Not to Cite? Detecting Citation Contexts in Text. In ECIR, Grenoble, France, Mar 2018. For insiders: We detect which sentences truly require citations, reducing unnecessary recommendations and better matching real scientific writing behavior. For everyone: We teach a system to spot which sentences should cite sources.
- Annotating Domain-Specific Texts with Babelfy: A Case Study. In EYRE, Mar 2018. For insiders: We evaluate entity linking on German customer feedback using Babelfy and show strong annotation accuracy for practical domain text workflows. For everyone: We test how well a knowledge tool tags messy, specialized text, like putting accurate name tags on customer feedback so it becomes searchable.
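
The property-table layout used by stores like PRoST (see the PRoST entry above) can be illustrated without Spark: pivot subject-predicate-object triples into one wide row per subject so a star-shaped query becomes a single-row lookup. The triples below are invented, and this sketch ignores multi-valued properties and the vertical-partitioning side of the mix.

```python
# Minimal sketch of the "property table" layout for RDF data (illustration only).
triples = [
    ("paper1", "title", "Deep Learning for X"),
    ("paper1", "year", "2017"),
    ("paper1", "author", "alice"),
    ("paper2", "title", "Graphs for Y"),
    ("paper2", "year", "2018"),
]

def property_table(triples):
    """Pivot (subject, predicate, object) triples into one row of properties per subject."""
    table = {}
    for s, p, o in triples:
        table.setdefault(s, {})[p] = o
    return table

# A star query like  SELECT ?s WHERE { ?s :year "2018" . ?s :title ?t }
# now needs no self-joins, just a per-row check:
pt = property_table(triples)
print([s for s, row in pt.items() if row.get("year") == "2018" and "title" in row])
```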
2017
- The xLiMe system: Cross-lingual and cross-modal semantic annotation, search and recommendation over live-TV, news and social media streams. Web Semantics, Mar 2017. For insiders: We enable cross-lingual and cross-media semantic annotation, search, and recommendation over live TV, news, and social media streams. For everyone: We build a kind of “universal remote” for information streams, linking TV, news, and social media across languages so topics can be followed beyond language barriers.
- Semantic Search for Novel Information. Mar 2017. For insiders: This thesis develops semantic search methods to detect and extract novel information from text streams, supporting monitoring, knowledge graphs, and timely decision-making. For everyone: This thesis is about finding truly new facts in a flood of text, like building a semantic metal detector that spots novel information worth saving.
2016
- XKnowSearch! Exploiting Knowledge Bases for Entity-based Cross-lingual Information Retrieval. In CIKM, Indianapolis, Indiana, USA, Mar 2016. For insiders: XKnowSearch uses knowledge-base entities to disambiguate queries and retrieve documents across languages for entity-based cross-lingual information retrieval. For everyone: We build a cross-language search tool that uses structured background knowledge like a bilingual map, so you can find relevant texts even when the query and documents use different languages.
- Exploiting Knowledge Bases for Multilingual and Cross-lingual Semantic Annotation and Search. In ISWC, Bethlehem, PA, USA, Mar 2016. For insiders: We build infrastructure for multilingual semantic annotation and demonstrate cross-lingual semantic search that links text to knowledge base entities. For everyone: We build a system that highlights the same people and places across different languages, like a multilingual marker that lets readers spot and connect information even when texts are written in different languages.
2015
- Using a semantic wiki for technology forecast and technology monitoring. Program, Mar 2015. For insiders: We extend Semantic MediaWiki with forecasting and monitoring features so organizations can track emerging technologies using structured knowledge and practical workflows. For everyone: We turn a semantic wiki into a radar for emerging technologies, so teams can collect and compare new signals instead of losing them in scattered notes.
2014
- Kuphi - an Investigation Tool for Searching for and via Semantic Relations. In ESWC, Anissaras, Crete, Greece, Mar 2014. For insiders: Kuphi enables entity-centric search and relation-based exploration, letting users investigate documents by following semantic relations, not only keywords. For everyone: Kuphi helps investigate texts by following who is connected to whom, like pulling a thread from one name to the next until a hidden story appears.
- xLiD-Lexica: Cross-lingual Linked Data Lexica. In LREC, Reykjavik, Iceland, Mar 2014. For insiders: We release cross-lingual linked-data lexica that map surface forms to entities across languages, supporting multilingual annotation and information access. For everyone: We build a cross-language lexicon that links words to the same real-world things, like a multilingual dictionary where “Paris” always means the same place.
- Exploiting Semantic Annotations for Entity-based Information Retrieval. In ISWC, Riva del Garda, Italy, Mar 2014. For insiders: We show how semantic annotations support entity-based information retrieval and interactive query refinement to reduce ambiguity and improve relevance. For everyone: We show that tagging text with real entities improves search by meaning, like swapping fuzzy keyword hunting for clear labels you can click and refine.
2013
- A Comparative Evaluation of Cross-Lingual Text Annotation Techniques. In CLEF, Mar 2013. For insiders: We compare cross-lingual text annotation techniques and introduce an evaluation framework for systematic multilingual knowledge extraction. For everyone: We compare ways to connect texts with background knowledge across languages, like labeling the same story in different tongues, and we set up a fair “test track” to compare systems.
- Ontology-Supported Document Ranking for Novelty Search. In ESWC, Montpellier, France, Mar 2013. For insiders: We rank documents for ontology-supported novelty search to surface novel, relevant information for knowledge-base population and monitoring. For everyone: We use a knowledge-based compass to rank documents by what is genuinely new, like a treasure hunter looking for fresh finds rather than repeated facts.
- A Semantic Wiki for Novelty Search on Documents. In BIR, Delft, The Netherlands, Mar 2013. For insiders: We present a semantic-wiki workflow for novelty search that continuously ingests and structures new web information for ongoing monitoring tasks. For everyone: We build a semantic wiki as a watchtower that keeps pulling fresh web information into a structured store, so new signals do not get lost in the noise.