Title: Jets as a Model of Data
Abstract: Natural datasets, including images, video, and language, are richly structured yet opaque, and we lack a first-principles model of the data-generating process. Most theoretical work to date has focused on toy models of Gaussian data, which are analytically tractable but far too simple to describe the real world. I will argue that physics data, specifically simulated jet data, is “just right” as a model of data: complex enough to capture features of natural data such as hierarchical structure and fractal dimensionality, but simple enough to be calculable. I will outline a research program through which studying simulated jets can shed light on numerous questions in machine learning theory: How do generative models beat the curse of dimensionality? How are latent variables encoded, and can models generalize under distribution shifts? Can we predict scaling exponents for neural scaling laws? Are generative models truly learning the “rules” of the data-generating process? Answering these questions can also help us better use machine learning to understand the physics of real high-energy experimental data, in a virtuous cycle of “physics for AI for physics”.
https://lbnl.zoom.us/j/94928022788?pwd=emVQWG1mTnhSbHVqekVuenk0VEVQZz09