Martijn de Vries

San Francisco Bay Area, CA · martijndevries91@gmail.com · Resume

As a passionate data scientist with a background in astrophysics, I bring an academic and analytical perspective to the practice of modeling and interpreting extensive data sets. I have an innate curiosity and drive to find unique solutions to complex problems, and to translate my analyses into actionable insights that can make meaningful change and impact in the world.

Skills

Coding

Bash
Git
HTML
Jupyter
Markdown
Python
R
SQL

Python Libraries

Matplotlib
Numpy
OpenCV
Pandas
Plotly
Pytorch
Scikit-learn
Scipy
Seaborn
Tensorflow

Machine Learning

Classification
Computer Vision
Clustering
Natural Language Processing
Neural Networks
Recommender Systems
Regression
Time Series

Competencies

Analytics
Algorithms
APIs
A/B Testing
Bayesian Inference
Critical Thinking
Communication Skills
Data Cleaning
Hypothesis Testing
Predictive Modeling
Problem Solving
Public Speaking
Segmentation
Quantitative Research
Tableau
Web Scraping

Projects

Rendering Handwritten Equations

As a student at university, I regularly wished I could easily take a picture of my notes and equations and have them show up digitally! With this in mind, I built a tool that can recognize individual symbols in an equation, and render the equation in LaTeX format.

Github Repo

Project Overview and Outcomes:

The tool uses OpenCV's findContours function to detect symbols in the image. I then built a custom preprocessing pipeline that infers which contours belong together to form symbols, and what the order of the symbols should be. The symbols are then fed to a retrained EfficientNetB0 Convolutional Neural Network, which makes a prediction for each symbol. Finally, the postprocessing pipeline stitches the per-symbol predictions into an equation in LaTeX math mode format, using the information from the previous steps.
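The grouping and ordering step can be illustrated with a minimal sketch. This is not the project's actual pipeline: it assumes each contour has already been reduced to a bounding box (x, y, width, height), the format returned by cv2.boundingRect, and it uses a hypothetical rule in which boxes that share most of their horizontal extent are treated as one symbol (for example the two bars of an equals sign). The function names and the 0.5 overlap threshold are illustrative assumptions.

```python
from typing import List, Tuple

# (x, y, width, height), the format returned by cv2.boundingRect
Box = Tuple[int, int, int, int]


def overlaps_horizontally(a: Box, b: Box, min_overlap: float = 0.5) -> bool:
    """True if the x-ranges overlap by at least min_overlap of the narrower box."""
    left = max(a[0], b[0])
    right = min(a[0] + a[2], b[0] + b[2])
    narrower = min(a[2], b[2])
    return narrower > 0 and (right - left) / narrower >= min_overlap


def merge(a: Box, b: Box) -> Box:
    """Smallest box that contains both a and b."""
    x0, y0 = min(a[0], b[0]), min(a[1], b[1])
    x1 = max(a[0] + a[2], b[0] + b[2])
    y1 = max(a[1] + a[3], b[1] + b[3])
    return (x0, y0, x1 - x0, y1 - y0)


def group_and_order(boxes: List[Box]) -> List[Box]:
    """Merge vertically stacked boxes (assumed rule), ordered left to right."""
    symbols: List[Box] = []
    for box in sorted(boxes):  # tuples sort by x first
        if symbols and overlaps_horizontally(symbols[-1], box):
            symbols[-1] = merge(symbols[-1], box)
        else:
            symbols.append(box)
    return symbols


# Made-up boxes: a symbol, the two bars of an "=" sign, and another symbol
boxes = [(60, 10, 20, 30), (30, 18, 20, 4), (30, 28, 20, 4), (0, 10, 20, 30)]
ordered = group_and_order(boxes)
print(ordered)
```

A rule this simple would also merge the numerator and denominator of a fraction, so a real pipeline needs additional checks on top of it.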

At present, the tool predicts at least 50% of the equation correctly 70% of the time, and makes a perfect prediction about 16% of the time. A more robust version of the tool is currently in development.

Solar Intensity Forecasting

As renewable energy sources start to make up an increasingly large share of electricity production in the United States, the challenge of how to integrate these (often variable and irregular) sources into the grid is becoming more and more relevant. In a small team, we constructed a model that can forecast the solar intensity in Los Angeles a few days ahead, so that the energy grid can be optimally managed.

Github Repo

Project Overview and Outcomes:

In order to train and test our model, we used freely available weather data from the National Solar Radiation Database, which collects weather information at kilometer-resolution throughout the entire US using satellite data. We then used three different neural-net based approaches to forecast the 'Global Horizontal Irradiance', or GHI, several days in the future.

Although the Neural Prophet model performs best, with the lowest Root Mean Squared Error, it only works with univariate time series, and thus requires current GHI measurements to forecast GHI. In contrast, a Recurrent Neural Network can forecast the GHI using weather data, with an average error of 15% at solar noon.
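For reference, the two kinds of error quoted above can be computed as in the sketch below. It assumes that the "average error" is a mean absolute percentage error, which the text does not specify, and the GHI values are made up for illustration rather than taken from the project.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error, in the units of the data (W/m^2 for GHI)."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mean_abs_pct_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error; zero actuals are skipped to avoid dividing by zero."""
    ratios = [abs(a - p) / a for a, p in zip(actual, predicted) if a != 0]
    return 100 * sum(ratios) / len(ratios)


# Made-up GHI values at solar noon (W/m^2) for four days
actual_noon = [800.0, 900.0, 1000.0, 500.0]
predicted_noon = [720.0, 990.0, 900.0, 550.0]

noon_rmse = rmse(actual_noon, predicted_noon)
noon_pct = mean_abs_pct_error(actual_noon, predicted_noon)
print(f"RMSE: {noon_rmse:.1f} W/m^2, mean error: {noon_pct:.1f}%")
```

RMSE penalizes large misses more heavily, while a percentage error at solar noon is easier to interpret for grid planning, which is one reason to report both.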

NLP Analysis and Classification of Reddit posts

For this project, I wanted to investigate the way that political polarization in the United States is reflected in language use. In order to gauge this, I built a classification model that can classify posts and comments from two politically-themed subreddits: r/politics and r/conservative.

Github Repo

Project Overview and Outcomes:

I collected data from October 2022, the month leading up to the midterms, from both subreddits using the Pushshift API. After cleaning and EDA, I built separate models for both posts and comments. I used a TF-IDF vectorizer to vectorize the language, as well as a custom scikit-learn transformer to customize the number of bigrams (sets of two words) that are retained in the model. Posts and comments are classified based on the language, as well as some secondary identifying features.
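The idea of limiting how many bigrams are retained can be sketched with the standard library alone. This is an illustration, not the custom scikit-learn transformer from the project: the function names, the whitespace tokenization, and the choice to rank bigrams by document frequency are all assumptions, and the post titles are made up.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Bigram = Tuple[str, str]


def bigrams(text: str) -> List[Bigram]:
    """Lowercase, split on whitespace, and pair each word with its successor."""
    words = text.lower().split()
    return list(zip(words, words[1:]))


def top_bigrams(docs: Iterable[str], max_bigrams: int) -> List[Bigram]:
    """Keep the max_bigrams bigrams that appear in the most documents."""
    doc_freq: Counter = Counter()
    for doc in docs:
        doc_freq.update(set(bigrams(doc)))  # count each bigram once per document
    # Sort by frequency (descending), then alphabetically so ties are deterministic
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [bigram for bigram, _ in ranked[:max_bigrams]]


# Made-up post titles
titles = [
    "Supreme Court hears new case",
    "Supreme Court ruling expected today",
    "New case before the Supreme Court",
]
kept = top_bigrams(titles, max_bigrams=2)
print(kept)
```

Capping the number of bigrams keeps the feature matrix small and makes the retained bigrams easier to inspect when interpreting the model.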

The best-performing model is a stacked model consisting of logistic regression, random forest, and Naive Bayes. The model is able to classify 88.6% of posts correctly, and 65.4% of comments. The gap in performance is mainly due to the fact that comments are typically shorter, which makes them harder to categorize. By studying feature importances of the various models, we can gain interesting insights into the difference in language use in these two subreddits. An example is shown in the image to the right, which shows the most predictive bigrams in the Naive Bayes classifier. If one of these bigrams occurs in a post title, the model is much more likely to classify the post as belonging to r/politics (blue) or r/conservative (red).

Predicting Housing Prices in Ames, Iowa

I used the famous Ames, Iowa housing dataset to build a model that predicts house prices using linear regression techniques. In order to find the most predictive features, I performed extensive EDA and built several models of increasing complexity. The plot on the right shows the predicted vs actual sale price for the best-performing model.

Github Repo

Project Overview and Outcomes:

Because this dataset has a relatively large number of features (81), extensive EDA and feature engineering were necessary to find the best model. The best-performing model includes a total of 22 features, of which 11 are categorical (and therefore have to be one-hot encoded). Additionally, I found that log-transforming the sale price improves the performance of the model, as the relationship between the predictor variables and the sale price is not strictly linear.
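Why a log transform can help is easiest to see on a toy example. The sketch below uses made-up data in which the price grows by a fixed percentage per unit of a single hypothetical "quality" feature, and compares a straight-line fit on the raw price with a fit on log(price) that is transformed back to dollars. It is a one-feature illustration under those assumptions, not the 22-feature model described above.

```python
import math
from typing import List, Sequence, Tuple


def fit_line(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    variance = sum((xi - mean_x) ** 2 for xi in x)
    slope = covariance / variance
    return mean_y - slope * mean_x, slope


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean squared error, in the units of the data (dollars here)."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Made-up data: price grows by 25% per unit of a hypothetical quality score
quality: List[float] = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
price: List[float] = [100_000 * 1.25 ** q for q in quality]

# Model A: regress price directly on quality
a0, a1 = fit_line(quality, price)
pred_raw = [a0 + a1 * q for q in quality]

# Model B: regress log(price) on quality, then transform predictions back to dollars
b0, b1 = fit_line(quality, [math.log(p) for p in price])
pred_log = [math.exp(b0 + b1 * q) for q in quality]

rmse_raw = rmse(price, pred_raw)
rmse_log = rmse(price, pred_log)
print(f"RMSE raw: ${rmse_raw:,.0f}, RMSE log: ${rmse_log:,.2f}")
```

Because the toy relationship is exactly multiplicative, the log model fits it almost perfectly; on real data the improvement is smaller, but the same mechanism applies.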

The root mean squared error of the model on the testing data is about $22,100. Adding regularization (Ridge Regression) increased model performance by a negligible amount, indicating that the model might still be somewhat underfit.

Work Experience

Data Science Immersive Fellow

General Assembly

Honed critical data science techniques on data analysis, visualization, and machine learning across 480+ hours of expert-led instruction, finishing in the top 20% of the class. Individually and in small teams, completed 6 data science projects in two-week timeframes, succinctly communicating results to technical and non-technical audiences.

Computer Vision: Designed a stand-alone tool that accepts handwritten math equations as input, then leverages a Convolutional Neural Network to render the equation digitally, correctly predicting at least 50% of the equation in 70% of cases.
Time Series Forecasting: Predicted solar power available in Los Angeles using a Recurrent Neural Network and weather data from the National Solar Radiation Database, with an average error of 15% at solar noon.
NLP and classification: Developed a binary classification model which classifies Reddit posts between r/politics and r/conservative with 88.6% accuracy (65.4% for comments), using logistic regression, random forest, and stacking.

March 2023 - June 2023

Postdoctoral Researcher

Kavli Institute for Particle Astrophysics and Cosmology | Stanford University

Analyzed and modeled observational data of astrophysical objects in partnership with distinguished faculty researchers. Coordinated international research amongst cross-functional teams of 10+ scientists, and authored peer-reviewed papers in leading journals.

Communication: Presented results of peer-reviewed studies at US conferences; collaborated with the Chandra press office to craft an accessible, scientifically meaningful, and visually appealing press release, whose video is in the top 10% most watched videos on the Chandra X-ray Observatory YouTube channel.
Mentorship: Supervised 3 new scientists/researchers in department operating procedures, instrumentation software, and best practices. Instilled data stewardship and rigorous application of statistical models.
Proposals: Crafted compelling proposals with a success rate ~3x above the average, securing time at astrophysical observatories. As a member of the Chandra Time Allocation Committee in 2022, assessed and evaluated observing proposals based on merit, cost, risk, urgency, and suitability for the observatory.

Select Peer-Reviewed Publications

Chandra Measurements of Gas Homogeneity and Turbulence in the Perseus Cluster
Developed a custom statistical method to measure galaxy cluster turbulence, using Bayesian Inference and Markov Chain Monte Carlo Sampling. Reduced error margins by 20%, and redefined benchmarks for simulations, theoretical work, and future observations.
The Long Filament of PSR J2030+4415
Discovered new pulsar filament, mapping its full extent with fresh observations, feeding new data-driven insights into the origin and physical nature of these filaments. Developed a new 'filament-finding' algorithm based on this discovery, which will be used to search for filaments in upcoming observations.
A Quarter Century of Guitar Nebula/Filament Evolution
Spearheaded first long-term study of a pulsar filament, comparing 4 epochs of observations (data from 25+ years). Developed generalizable and re-usable Python code to fit bow shock models to observational data.

October 2019 - December 2022

Education

General Assembly Data Science Immersive

12-Week Online Program

March 2023 - June 2023

University of Amsterdam

Ph.D. in Astrophysics

September 2015 - August 2019

University of Amsterdam

M.Sc. in Astrophysics

February 2013 - August 2015

University of Groningen

B.Sc. in Astronomy

September 2009 - January 2013

Press Release

In 2022, I led a study titled 'The Long Filament of PSR J2030+4415', which described the discovery of a new pulsar filament. Pulsar filaments are beginning to garner wider scientific interest, as they could explain the excess of positrons in the cosmic rays detected here on Earth. The paper was highlighted in a NASA press release (link here), and gained attention on popular science websites like space.com and astronomy.com. Below, you can find a YouTube video which explains more about the system.

Publications

See my Google Scholar for a complete list of Publications

2023

Martijn de Vries, Adam B. Mantz, Steven W. Allen, R. Glenn Morris, Irina Zhuravleva, Rebecca E. A. Canning, Steven R. Ehlert, Anna Ogorzalek, Aurora Simionescu and Norbert Werner
Chandra measurements of gas homogeneity and turbulence at intermediate radii in the Perseus Cluster
Published in: Monthly Notices of the Royal Astronomical Society
Online | Github

2022

Martijn de Vries, Roger W. Romani, Oleg Kargaltsev, George Pavlov, Bettina Posselt, Patrick Slane, Niccolò Bucciantini, C.-Y. Ng and Noel Klingler
A Quarter Century of Guitar Nebula/Filament Evolution
Published in: The Astrophysical Journal
Online | Github

Martijn de Vries and Roger W. Romani
The Long Filament of PSR J2030+4415
Published in: The Astrophysical Journal
Press Release | Online | Github

2021

Martijn de Vries, Roger W. Romani, Oleg Kargaltsev, George Pavlov, Bettina Posselt, Patrick Slane, Niccolò Bucciantini, C.-Y. Ng and Noel Klingler
PSR J1709-4429’s Proper Motion and Its Relationship to SNR G343.1-2.3
Published in: The Astrophysical Journal
Online

2020

Martijn de Vries and Roger W. Romani
PSR J2030+4415's Remarkable Bow Shock, PWN, and Filament
Published in: The Astrophysical Journal Letters
Online

Bradford Snios, Amalya C. Johnson, Paul E.J. Nulsen, Ralph P. Kraft, Martijn de Vries, Richard A. Perley, Lerato Sebokolodi and Michael W. Wise
The X-Ray Cavity Around Hotspot E in Cygnus A: Tunneled by a Deflected Jet
Published in: The Astrophysical Journal
Online

2019

Martijn de Vries, Michael W. Wise, Paul E.J. Nulsen, Aneta Siemiginowska, Antonia Rowlinson and Christopher S. Reynolds
Evidence for a TDE origin of the radio transient Cygnus A-2
Published in: Monthly Notices of the Royal Astronomical Society
Online

2018

Martijn de Vries, Michael Wise, Daniela Huppenkothen, Paul E.J. Nulsen, Bradford Snios, Martin J. Hardcastle, Mark Birkinshaw, Diana M. Worrall, Ryan T. Duffy and Brian R. McNamara
Detection of non-thermal X-ray emission in the lobes and jets of Cygnus A
Published in: Monthly Notices of the Royal Astronomical Society
Online