CSC PhD Studentship in Electronic Engineering and Computer Science, Queen Mary University of London

Project title: The shape of words

The ability of Large Language Models (LLMs), also known as pretrained foundation models, to capture essential linguistic features, whether syntactic, semantic, or other, remains largely mysterious. This is partly because the tools currently used to investigate LLMs are too basic to analyze the intricate geometry of the embeddings produced by the models’ huge number of parameters. We therefore propose to rethink the way we analyze embedding spaces and to develop tools better suited to the task.

Topological Data Analysis (TDA) is a collection of data-driven methods based on algebraic topology. Persistent Homology (PH), the most popular TDA method, captures structural information about the embedding space, such as connected components, holes, and bubbles (voids), and is commonly used to extract the topological features underlying point clouds.
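To give a rough feel for what PH measures, here is a minimal sketch of its simplest case: 0-dimensional persistence (connected components) of a point cloud, computed with a union-find over pairwise distances. It is an illustration only and not the method the project will use; the function name `h0_persistence` and the toy points standing in for embeddings are assumptions made for this example. Higher-dimensional features such as holes need a dedicated TDA library.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Return the scales at which connected components merge.

    Every point starts as its own component (born at scale 0). As the
    distance threshold grows, components merge; each merge "kills" one
    component. One component survives forever and is not listed.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise Euclidean distances, shortest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    deaths = []
    for dist, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # two components merge at this scale
            deaths.append(dist)
    return deaths


# Hypothetical toy "embeddings": two well-separated clusters in the plane.
toy_embeddings = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0)]
deaths = h0_persistence(toy_embeddings)
print(deaths)
```

In this toy case the last merge happens at a much larger scale than the others, which reflects the gap between the two clusters; long-lived features of this kind are what PH treats as meaningful structure.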

We plan to analyze embedding spaces using PH and other TDA techniques, and to develop new methods and measures to better describe embedding spaces and extract the information they encode.

This endeavour is likely to shed light on the inner workings of LLMs, their training regimes, and the type of information, linguistic or otherwise, that they encode in their topological structures, and will provide a novel topological approach that describes the “shape of words”.

This project will be co-funded by the China Scholarship Council (CSC). CSC is offering a monthly stipend to cover living expenses, and QMUL is waiving fees and hosting the student. These scholarships are available only for Chinese candidates.

For more information, please contact us:

Haim Dubossarsky h.dubossarsky@qmul.ac.uk

Omer Bobrowski o.bobrowski@qmul.ac.uk
