I am a computational statistician, with a background in causal inference and machine learning. Currently, I develop Generative AI software at Google. Previously, I also lectured a sequence of graduate-level scientific-computing courses at Stanford University. With my unique blend of academic, teaching, and industry experience, I aspire to contribute to the academic community by publishing innovative work and mentoring the next generation of researchers. With over a decade of impactful research experience at world-renowned institutions such as Stanford, Google, and Berkeley, I am confident in my ability not only to excel in a PhD program but also to make pioneering contributions to the field while giving back to the academic and broader community in meaningful ways.
My research journey began with my B.A. thesis in Economics at U.C. Berkeley, advised by Gregory Duncan, where I designed a quasi-experimental framework to answer the question, “What is the causal effect of sports participation on GPA?” (thesis writing sample). This was a practical question of interest to me as a student-athlete. I used Stata to estimate Average Treatment Effects. The quasi-experiment focused on student-athletes who either walked onto a team midway through their collegiate career or were injured midway through it and forced to stop playing; in either case, we observe the scholastic performance of the same individual with and without participation in sport. I found that football players with low incoming SAT scores were likely to see their grades increase by 0.8 grade points as a result of being student-athletes, whereas for women’s crew the effect was reversed: participating in sports decreased GPAs by almost half a grade point. The intuition is partly regression to the (global) mean, but the results can also be explained by the minimum grade requirements for NCAA eligibility and by access to tutors. Answering a meaningful and practically significant question is what inspired my research career.
For several years after graduation, I worked with Daniel McFadden, Nobel laureate and pioneer of discrete choice modeling, on estimating causal damages from the 2010 Gulf oil spill. In particular, I was the core contributor of the R code supporting Dan’s expert testimony on how the spill affected tourism in the Gulf. We constructed a discrete choice model to infer the value of lost visitation to the Gulf, based in part on proprietary survey data. A national-level survey covering over 26,000 households traveling to almost 40,000 unique destinations was collected in order to estimate the value of taking a trip to a particular location, conditional on the demographic characteristics of the trip participants as well as destination-specific features. I used Python to create two giant travel cost matrices, one for “time” and one for “distance”, each containing the cost, in minutes or miles, for all combinations of travelers and trips in the survey. From these, it was possible to estimate, for example, the per-mile cost of driving and of flying. Combining travel costs with lost visitation from beach surveys, we arrived at a bottom-line number for the impact of the spill on visitation and the financial implications for surrounding counties. I used ArcGIS to map the fiscal impacts by geographic location across all 3,000+ counties in the US.
At Stanford, I earned an M.S. in Computational Mathematics (GPA 3.96). After earning an A+ in a project-based Distributed Algorithms course, I worked with Professor Reza Zadeh on a distributed min-cut algorithm for the remainder of the summer. Although we didn’t publish, it was an excellent exercise in research (draft paper). I also held a research internship at Lawrence Livermore National Lab, where I worked with Kaiser Research on sepsis prognosis using machine-learned models and Bayesian methods (example publication).
For my M.S. thesis, I worked under the supervision of esteemed economist Guido Imbens (now a Nobel laureate). My work applied causal inference with machine learning to sports analytics, receiving an A+ for its creativity and rigor (thesis publication). Specifically, we analyzed bookmakers’ odds of meeting the spread, i.e., the expected score differential between two teams, and used this as a conditioning variable against which we could observe unanticipated effects (rationale). We measured a nightlife index for each city using Bureau of Labor Statistics census data on the number of musical recording studios and entertainment clubs within the locale (distribution of nightlife). We then restricted our attention to cases where games were played back-to-back within 24 hours of each other; to the extent that the next-day opponent is uncorrelated with exposure to treatment, the effect of interest is identified (heatmap). Teams visiting cities with higher nightlife indices consistently underperformed against bookmaker expectations, revealing a novel pattern in sports analytics. We replicated our analysis and found consistent results in two different sports: NBA and MLB. As a robustness check, we also verified that the hangover effect dissipates when players rest more than 24 hours after visiting a city with an active nightlife index. Assessing the strategy in the online marketplace validated the predictive power of our model (online performance).
After graduating, I was asked by the Director of ICME, Margot Gerritsen, to serve as a Lecturer at my alma mater; for the next seven years, alongside my full-time roles in industry, I taught graduate courses in Python and C++ to nearly 200 students annually. I managed a staff of at least 4 TAs each quarter, and I was awarded “Best Lecturer” in 3 separate years, as determined by students and fellow faculty. I also led programming workshops and interview preparation sessions, honing my ability to distill complex ideas and fostering a passion for teaching that I hope to continue in a faculty role after completing my PhD. I’m confident that the skills I developed as a lecturer will be useful in presenting my research to large audiences.
At YouTube, I developed recommendation algorithms that transitioned from heuristic to machine-learned strategies, contributing significantly as a founding engineer on YouTube Shorts. Concretely, we started with a deterministic triggering and ranking strategy (i.e., always placing the Shorts “shelf” in the same position within the feed for all users). Over the course of several years, I developed and iteratively refined a reward model to capture how much value the shelf delivers on a per-user basis, and then created a probabilistic model that shows the shelf with a probability proportional to the estimated longer-term value. My foundations in distributed algorithms allowed me to personally implement my own ideas in production using C++.
The Gemini 1.5 Technical Report captured the attention of the scientific community (garnering over 2,000 citations in the past year). As part of the core team, I created a “self-critique” framework for LLM evaluation that was consistent with human raters, and was recognized on the 1.5 technical report accordingly. The mechanism for evaluation was a side-by-side framework comparing two different model responses to a given prompt. The most effective way to obtain reliable and interpretable results was to embed explicit rubrics within the instruction set given to the agent. Using few-shot learning, we were able to engineer a machine-learned rater with inter-rater reliability at least as good as that of the average human rater. After creating a satisfactory evaluation framework, I became interested in combining the results from human raters, which comprise ground-truth labels, with the agent's predictions of model quality in a way that sensibly augments our evaluation dataset. Specifically, I proposed using model-assisted estimation within the LLM self-critique framework: this allows us to refine a model-based estimate with an offset factor that captures the expected bias, yielding an overall unbiased estimator with reduced variance. In other words, instead of using just an LLM score for model quality, or just a human-rater score, I came up with a way to combine the two sets of data meaningfully and in a statistically rigorous way.
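To make the idea concrete, here is a minimal sketch of a model-assisted (difference) estimator of mean quality. It is an illustration only, not the production implementation: the function name `model_assisted_mean`, the data layout, and the assumption that the human-labeled items are a simple random sample of the full set are all hypothetical choices made for this example.

```python
from statistics import fmean


def model_assisted_mean(llm_scores, human_labels):
    """Difference estimator: model-based mean plus an estimated bias offset.

    llm_scores   -- LLM rater scores for ALL items (list of floats).
    human_labels -- dict mapping item index -> human score for the labeled
                    subset (assumed here to be a simple random sample).
    """
    if not human_labels:
        raise ValueError("need at least one human-labeled item")

    # Model-based estimate over the full (cheap to score) evaluation set.
    model_mean = fmean(llm_scores)

    # Offset: average human-minus-LLM residual on the labeled subset.
    offset = fmean([human_labels[i] - llm_scores[i] for i in human_labels])

    return model_mean + offset


# Hypothetical toy data: the LLM rater runs 0.1 too high on labeled items.
scores = [0.9, 0.8, 0.7, 0.6]
labels = {0: 0.8, 2: 0.6}
estimate = model_assisted_mean(scores, labels)
```

Under the random-sampling assumption, the offset corrects the LLM rater's average bias, and the variance of the combined estimate shrinks as the LLM scores track the human scores more closely.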
Most recently, my work in Generative AI has focused on media algorithms and spatial Super Resolution. In order to tackle a new domain, I started with a literature review focusing on seminal papers such as Attention is All You Need, Vision Transformer, and Video Vision Transformer. Additionally, I found Denoising Diffusion Probabilistic Models to be quite relevant. I then personally implemented a Diffusion Transformer for video enhancement, i.e. a Content-Aware Degradation-Driven Transformer (inspired by the latest research). As the sole author of the prototype algorithm, I acquired significant experience architecting neural networks using JAX primitives.
Research on Training Compute-Optimal Large Language Models indicates that large Vision and Language Models (VLMs) are often undertrained and could benefit from more diverse and informative data. A corollary is that, given finite resource constraints, the composition of the training data matters greatly for final model quality. This motivates a central question: can a VLM autonomously construct its own learning curriculum? This is an important research question in the age of LLMs, which are notoriously expensive to train on internet-scale datasets. Instead of treating each training observation equally after each passing epoch, what if we focused more attention on "harder" examples? This might improve model quality, or it might make training more efficient for a fixed quality outcome. The idea is reminiscent of, and inspired by, the literature on active learning (wherein we might choose to train on examples with higher prediction uncertainty as a way to take larger step sizes for a fixed learning rate within a gradient descent update).
I envision a framework where a VLM, after each training epoch, engages in a form of "self-reflection" to identify areas of weakness. Specifically, the model could analyze its performance on the training corpus, using metrics like prediction uncertainty or entropy (in the style of Reference-Free Confidence-Based Truthfulness Estimation) to pinpoint examples or types of examples where it struggles. Inspired by Robotic Control via Embodied Chain of Thought Reasoning, the model would then "think through" which of these challenging examples are most informative for improving its performance, perhaps based on an evaluation benchmark that contextualizes the model's knowledge in an applied scenario. During the next epoch, each observation could be randomly selected for inclusion (via a Bernoulli draw parameterized by the model's assessment of the example's informativeness). With each passing epoch, the model has a chance to focus on different types of examples based on its weaknesses. Put another way: since foundation models are undertrained, we can explore interweaving lightweight evaluation within the training loop without excessive worry about overfitting.
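The per-epoch selection step could be sketched as follows. This is only an illustration under stated assumptions: normalized predictive entropy stands in for the model's assessment of informativeness, and the names `normalized_entropy`, `sample_next_epoch`, and the `floor` parameter (a minimum inclusion probability so that confident examples are not dropped forever) are hypothetical choices made for this example.

```python
import math
import random


def normalized_entropy(probs):
    """Shannon entropy of a predictive distribution, scaled to [0, 1]."""
    k = len(probs)
    if k < 2:
        return 0.0
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(k)


def sample_next_epoch(predictions, floor=0.05, rng=None):
    """Pick example indices for the next epoch via independent Bernoulli draws.

    predictions -- one predictive distribution (list of class probabilities)
                   per training example, from the end of the current epoch.
    floor       -- assumed minimum inclusion probability for any example.
    """
    rng = rng or random.Random()
    selected = []
    for idx, probs in enumerate(predictions):
        # Inclusion probability grows with the model's uncertainty.
        p_include = max(floor, normalized_entropy(probs))
        if rng.random() < p_include:
            selected.append(idx)
    return selected


# Hypothetical toy run: one confident example, one maximally uncertain one.
preds = [[1.0, 0.0], [0.5, 0.5]]
chosen = sample_next_epoch(preds, floor=0.0, rng=random.Random(0))
```

Each example is kept or dropped independently, so the selected subset is drawn without replacement, and uncertain examples appear in more epochs than confident ones.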
The "bootstrapped" dataset (technically sampled without replacement) can be viewed as a distribution over tasks, where each "task" is loosely defined as a skill or capability implicitly learned from a subset of the data (e.g., understanding specific visual concepts, performing a type of reasoning, or handling a particular linguistic structure). This framing lends itself naturally to the application of Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks, enabling the model to learn efficiently while remaining useful for downstream fine-tuning. The approach is also reminiscent of the self-improvement loop that characterizes human learning (humans implicitly focus on "harder and harder examples" when learning new skills).