Editor's Picks
- The Scientific Virtues
Science education usually starts with teaching students different tools and techniques, methods for conducting research...This is wrong. Science education should begin with the scientific virtues...The scientific virtues are: a) Stupidity, b) Arrogance, c) Laziness, d) Carefreeness, e) Beauty, f) Rebellion, g) Humor...
- Using a data dictionary as your roadmap to quality data
A data dictionary, a rectangular format collection of names, definitions, and attributes about variables in a dataset, is arguably the single most important piece of documentation you will create...While a data dictionary, sometimes also called a codebook or variable information log, is often used as a tool to help you and others interpret your data at the end of your project, it is actually even more powerful if created before you ever collect a single piece of data, serving as a roadmap as you design your data collection tools and clean your data...
- The Farama Foundation: The future of open source reinforcement learning
Today we’re announcing the Farama Foundation – a new nonprofit organization designed in part to house major existing open source reinforcement learning (“RL”) libraries in a neutral nonprofit body. We aim to provide standardization and long term maintenance to these projects, as well as improvements to their reproducibility, performance, and quality of life features. We are also working to develop key pieces of missing software for the open source reinforcement learning ecosystem...This post explains who we are, what we’re working on right now, and what our long term goals and vision are. This post also publicly announces the release of Gymnasium, a library where the future maintenance of OpenAI Gym will be taking place...
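To make the data dictionary pick above more concrete, here is a minimal sketch of a dictionary written before any data is collected and then used to check incoming rows. The variable names, types, allowed values, and the `validate_row` helper are all hypothetical and are not taken from the linked article.

```python
# Hypothetical data dictionary: one entry per variable, in a rectangular
# layout (name, definition, type, allowed values). All entries are illustrative.
DATA_DICTIONARY = [
    {"name": "participant_id", "definition": "Unique study ID",
     "type": int, "allowed": None},
    {"name": "age", "definition": "Age in years at enrollment",
     "type": int, "allowed": range(18, 100)},
    {"name": "condition", "definition": "Assigned study arm",
     "type": str, "allowed": {"control", "treatment"}},
]


def validate_row(row, dictionary=DATA_DICTIONARY):
    """Return a list of human-readable problems found in one data row."""
    problems = []
    expected = {entry["name"] for entry in dictionary}

    # Variables that were collected but never documented.
    for extra in sorted(set(row) - expected):
        problems.append(f"unexpected variable: {extra}")

    for entry in dictionary:
        name = entry["name"]
        if name not in row:
            problems.append(f"missing variable: {name}")
            continue
        value = row[name]
        if not isinstance(value, entry["type"]):
            problems.append(
                f"{name}: expected {entry['type'].__name__}, "
                f"got {type(value).__name__}"
            )
        elif entry["allowed"] is not None and value not in entry["allowed"]:
            problems.append(f"{name}: value {value!r} not allowed")
    return problems


if __name__ == "__main__":
    # A row that violates two of the documented rules.
    print(validate_row({"participant_id": 7, "age": 17, "condition": "placebo"}))
```

Because the checks are driven entirely by the dictionary, documenting a new variable is enough to extend validation to it.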
A Message from this week's Sponsor:
Learn and Practice AI/ML with Global Communities
Join the largest AI/ML/Data developers community globally (180K+ developers in 150+ countries) to learn and practice AI, machine learning, deep learning, and data science technologies. A few upcoming learning events:
- Nov 1st (Austin): Build Image Recognition System with Kafka
- Nov 2nd (Silicon Valley, NYC, Bengaluru): Google Data Stream Processing Night
- Nov 10th (Seattle, Boston, New York): AWS Dev Day on Cloud Data Lakehouse
- Nov 15th (Virtual): MLOps Platform - Notebook to Production (Expert Level Workshop)
- And 20+ more on the website
Data Science Articles & Videos
- Create Data-Rich Presentation from Jupyter Notebook
A presentation is a great way to share your results and findings with a non-technical audience. A data-rich presentation with charts, tables, and code can be tedious to create. The good news is that you can create a presentation directly from Jupyter Notebook!...
- The Russian Roulette: An Unbiased Estimator of the Limit
The Russian Roulette offers a simple way to construct an unbiased estimator for the limit of a sequence. It makes it possible, for example, to construct an unbiased estimator of the pseudoinverse of a matrix, which is otherwise difficult to obtain. We'll first show that the estimator is unbiased. Then we'll discuss one of the original applications of this method: an unbiased estimator of the matrix pseudoinverse. Finally, we'll discuss its limitations and practical issues through a variance analysis...
- The most important recent developments in AI
From solving maths and science problems to translating with astonishing accuracy between hundreds of languages – not to mention generating images and videos based on a natural language prompt – AI is making strides pretty much across the board...In this article, I’ll briefly discuss some of the most recent (and the most exciting!) developments that you should know about...
- A Transformer That Solves Small Tabular Classification Problems in a Second
This may revolutionize data science: we introduce TabPFN, a new tabular data classification method that takes < 1 second & yields SOTA performance (competitive with the best AutoML pipelines in an hour)...So far, it is limited in scale, though: it can only tackle problems up to 1000 training examples, 100 features and 10 classes...TabPFN is radically different from previous ML methods. It is a meta-learned algorithm and it provably approximates Bayesian inference with a prior for principles of causality and simplicity. Qualitatively, its resulting predictions are very intuitive as well, with very smooth uncertainty estimates...
- Math of Gaussian Mixture Model Clustering
The math of Gaussian Mixture Model Clustering can be tough for undergrads to grasp, but it gives a TON of insight into how GMM works!...I made this GMM math worksheet to do with my class...
- Generalizing in the Real World with Representation Learning
As applications of ML, particularly in AI systems, become more pervasive in the real world, we need to critically examine these assumptions, norms, and problem settings, as well as the methods that have become de-facto standards. There is much we still do not understand about how and why deep networks trained with stochastic gradient descent are able to generalize as well as they do, why they fail when they do, and how they will perform on out-of-distribution data. In this thesis I cover some of my work towards better understanding deep net generalization, identify several ways assumptions and problem settings fail to generalize to the real world, and propose ways to address those failures in practice...
- Coding for Economists: Common Plots
In this chapter, we’ll look at some of the most common plots that you might want to make – and how to create them using the most popular data visualisation libraries, including matplotlib, plotnine, seaborn, altair, and plotly...
- LangChain - Building applications with LLMs through composability
Large language models (LLMs) are emerging as a transformative technology, enabling developers to build applications that they previously could not. But using these LLMs in isolation is often not enough to create a truly powerful app - the real power comes when you are able to combine them with other sources of computation or knowledge...This library is aimed at assisting in the development of those types of applications. It aims to create: a) a comprehensive collection of pieces you would ever want to combine, b) a flexible interface for combining pieces into a single comprehensive "chain", and c) a schema for easily saving and sharing those chains...
- Low-Rank Approximation Toolbox: Nyström, Cholesky, and Schur
In this post, we will draw a connection between low-rank approximation by Nyström approximation and solving linear systems of equations by Gaussian elimination. The connection between these two seemingly unrelated areas of matrix computations will pay dividends, leading to effective algorithms to compute Nyström approximations by the (partial) Cholesky factorization of a positive (semi)definite matrix and an elegant description of the residual of the Nyström approximation as the Schur complement....
- Deep Whole-Body Control: Learning a Unified Policy for Manipulation and Locomotion
In this work, we propose to learn a unified policy for whole-body control of a legged manipulator using reinforcement learning. We propose Regularized Online Adaptation to bridge the Sim2Real gap for high-DoF control, and Advantage Mixing exploiting the causal dependency in the action space to overcome local minima during training the whole-body system. We also present a simple design for a low-cost legged manipulator, and find that our unified policy can demonstrate dynamic and agile behaviors across several task setups...
- Optimisation & Generalisation in Networks of Neurons
The goal of this thesis is to develop the optimisation and generalisation theoretic foundations of learning in artificial neural networks. On optimisation, a new theoretical framework is proposed for deriving architecture-dependent first-order optimisation algorithms. The approach works by combining a "functional majorisation" of the loss function with "architectural perturbation bounds" that encode an explicit dependence on neural architecture. The framework yields optimisation methods that transfer hyperparameters across learning problems. On generalisation, a new correspondence is proposed between ensembles of networks and individual networks...
- The Unreasonable Effectiveness of Data Pipeline Smoke Tests
Data practitioners waste time writing unit tests to catch bugs they could have caught with smoke tests...In this post, we’ll discuss a powerful technique for speeding up data pipeline development: the data pipeline smoke test. You write your smoke test just once: you don’t need to write a test for every newly derived data asset. It can complete in a few seconds and exercises every transformation inside your data pipeline...The idea of the data pipeline smoke test is to automatically run all your data transformations on empty or synthetic data...
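The smoke-test idea in the last item can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the two transformations, the `PIPELINE` list, and the synthetic rows are hypothetical, and the linked post may structure its test differently.

```python
def clean_orders(rows):
    """Drop rows without an amount and cast the amount to float."""
    return [
        {**r, "amount": float(r["amount"])}
        for r in rows
        if r.get("amount") is not None
    ]


def total_by_customer(rows):
    """Sum amounts per customer, returning one row per customer."""
    totals = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return [{"customer": c, "total": t} for c, t in sorted(totals.items())]


# Hypothetical pipeline: every transformation is registered here once.
PIPELINE = [clean_orders, total_by_customer]


def run_pipeline(rows, steps=PIPELINE):
    """Apply each step in order to the rows."""
    for step in steps:
        rows = step(rows)
    return rows


def smoke_test(steps=PIPELINE):
    """Run every step on empty and on tiny synthetic input.

    Any exception raised by a step propagates, which fails the test.
    """
    synthetic = [
        {"customer": "a", "amount": "1.5"},
        {"customer": "a", "amount": 2},
        {"customer": "b", "amount": None},
    ]
    return {
        "empty": run_pipeline([], steps),
        "synthetic": run_pipeline(synthetic, steps),
    }


if __name__ == "__main__":
    print(smoke_test())
```

Since `smoke_test` iterates over whatever is registered in `PIPELINE`, a newly added transformation is exercised without writing a new test, which is the property the excerpt highlights.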
Tool*
Jumpstart your data science journey and master the foundations of our data-driven world with Anaconda.
If you're looking to learn essential data science skills, there’s no need to sort through countless tools, guides, and boot camps that overpromise and underdeliver—Anaconda is here! With an Anaconda subscription, you can now access on-demand data science training and cloud-hosted notebooks. Learn from experts in the field and spin up data science projects anytime, anywhere—with all the packages and computing power you need. Whether you’re just getting started or ready to take your data science skills to the next level, Anaconda provides the building blocks you need to make sense of our data-driven world.
Get started at Anaconda.cloud.
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!
Jobs
- Data Scientist - Mount Sinai Data Commons - NYC
A position is available for an individual with skills in data science, bioinformatics and software engineering to play the key role in running and managing the Mount Sinai Data Commons – known as the Data Ark. The Data Ark team brings together all the most important data sets used by Sinai researchers (e.g. 1000G, GTEx, UK Biobank) in a single location on our HPC server (minvera.org), performs QA/QC processing of the data, conducts initial demographics analyses to showcase the different data sets, and will be tasked with expanding the data commons to host a large range of different data sets of different types (genotype, WES, WGS, RNA-seq, EHR-linked, imaging etc.), which will come with their own computational and platform challenges...
Want to post a job here? Email us for details --> team@datascienceweekly.org
Training & Resources
- Understanding ShinyApps
Today, we’ll discover how you can use the power of R (and RStudio) to create, for instance, an interactive visualization with the ShinyApp framework...
- An Introduction to Poisson Flow Generative Models
Poisson Flow Generative Models (PFGMs) are a new type of generative Deep Learning model, taking inspiration from physics much like Diffusion Models. Learn the theory behind PFGMs and how to generate images with them in this easy-to-follow guide...
What you’re up to – notes from DSW readers
- Fill out the form below to appear here :) ...
* To share your projects and updates, share the details here.
** Want to chat with one of the above people? Hit reply and let us know :)
P.S. Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them on board :)
All the best,
Hannah & Sebastian