Data Science Weekly - Issue 466

Curated news, articles and jobs related to Data Science. 
Keep up with all the latest developments

Issue #466

October 27 2022

Editor's Picks

 

  • The Scientific Virtues
    Science education usually starts with teaching students different tools and techniques, methods for conducting research...This is wrong. Science education should begin with the scientific virtues...The scientific virtues are: a) Stupidity, b) Arrogance, c) Laziness, d) Carefreeness, e) Beauty, f) Rebellion, g) Humor...
  • Using a data dictionary as your roadmap to quality data
    A data dictionary, a rectangular format collection of names, definitions, and attributes about variables in a dataset, is arguably the single most important piece of documentation you will create...While a data dictionary, sometimes also called a codebook or variable information log, is often used as a tool to help you and others interpret your data at the end of your project, it is actually even more powerful if created before you ever collect a single piece of data, serving as a roadmap as you design your data collection tools and clean your data...
  • The Farama Foundation: The future of open source reinforcement learning
    Today we’re announcing the Farama Foundation – a new nonprofit organization designed in part to house major existing open source reinforcement learning (“RL”) libraries in a neutral nonprofit body. We aim to provide standardization and long term maintenance to these projects, as well as improvements to their reproducibility, performance, and quality of life features. We are also working to develop key pieces of missing software for the open source reinforcement learning ecosystem...This post explains who we are, what we’re working on right now, and what our long term goals and vision are. This post also publicly announces the release of Gymnasium, a library where the future maintenance of OpenAI Gym will be taking place...
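To make the data dictionary pick above concrete, here is a minimal sketch of a dictionary written before collection and then used to check incoming records. The variable names, types, and allowed values are hypothetical and are not taken from the linked article; real projects often keep the dictionary in a spreadsheet or CSV file instead.

```python
# Hypothetical data dictionary: one entry per variable.
# Names, types, and allowed values are illustrative assumptions.
DATA_DICTIONARY = [
    {"name": "participant_id", "type": int, "required": True, "allowed": None},
    {"name": "age", "type": int, "required": True, "allowed": range(18, 100)},
    {"name": "group", "type": str, "required": True, "allowed": {"control", "treatment"}},
    {"name": "notes", "type": str, "required": False, "allowed": None},
]


def validate_record(record, dictionary=DATA_DICTIONARY):
    """Return a list of human-readable problems found in one record."""
    problems = []
    known = {entry["name"] for entry in dictionary}

    for entry in dictionary:
        name = entry["name"]
        value = record.get(name)
        if value is None:
            # Missing values are only a problem for required variables.
            if entry["required"]:
                problems.append(f"{name}: missing required value")
            continue
        if not isinstance(value, entry["type"]):
            problems.append(
                f"{name}: expected {entry['type'].__name__}, "
                f"got {type(value).__name__}"
            )
            continue
        if entry["allowed"] is not None and value not in entry["allowed"]:
            problems.append(f"{name}: value {value!r} not allowed")

    # Flag variables that were collected but never documented.
    for name in record:
        if name not in known:
            problems.append(f"{name}: not defined in the data dictionary")

    return problems


if __name__ == "__main__":
    print(validate_record({"participant_id": 2, "age": 12, "group": "placebo"}))
```

Because the same structure drives both the collection form and the cleaning checks, a variable that is renamed or recoded only has to change in one place.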
 
 

A Message from this week's Sponsor:

 



Learn and Practice AI/ML with Global Communities

Join the largest AI/ML/Data developers community globally (180K+ developers in 150+ countries) to learn and practice AI, machine learning, deep learning, and data science technologies. A few upcoming learning events:
  • Nov 1st (Austin): Build Image Recognition System with Kafka
  • Nov 2nd (Silicon Valley, NYC, Bengaluru): Google Data Stream Processing Night
  • Nov 10th (Seattle, Boston, New York): AWS Dev Day on Cloud Data Lakehouse
  • Nov 15th (Virtual): MLOps Platform - Notebook to Production (Expert Level Workshop)
  • And 20+ more on the website


 
 

Data Science Articles & Videos

 
  • Create Data-Rich Presentation from Jupyter Notebook
    A presentation is a great way to share your results and findings with a non-technical audience. A data-rich presentation with charts, tables, and code can be tedious to create. The good news is that you can create a presentation directly from Jupyter Notebook!...
  • The Russian Roulette: An Unbiased Estimator of the Limit
    The Russian Roulette offers a simple way to construct an unbiased estimator for the limit of a sequence. It allows for example to construct an unbiased estimator of the pseudoinverse of a matrix, which is otherwise difficult to obtain. We'll first show that the estimator is unbiased. Then we'll discuss one of the original applications of this method: an unbiased estimator of the matrix pseudoinverse. Finally, we'll discuss its limitations and practical issues through a variance analysis...
  • The most important recent developments in AI
    From solving maths and science problems to translating with astonishing accuracy between hundreds of languages – not to mention generating images and videos based on a natural language prompt – AI is making strides pretty much across the board...In this article, I’ll briefly discuss some of the most recent (and the most exciting!) developments that you should know about...
  • A Transformer That Solves Small Tabular Classification Problems in a Second
    This may revolutionize data science: we introduce TabPFN, a new tabular data classification method that takes < 1 second & yields SOTA performance (competitive with the best AutoML pipelines in an hour)...So far, it is limited in scale, though: it can only tackle problems up to 1000 training examples, 100 features and 10 classes...TabPFN is radically different from previous ML methods. It is a meta-learned algorithm and it provably approximates Bayesian inference with a prior for principles of causality and simplicity. Qualitatively, its resulting predictions are very intuitive as well, with very smooth uncertainty estimates...
  • Math of Gaussian Mixture Model Clustering
    The math of Gaussian Mixture Model Clustering can be tough for undergrads to grasp, but it gives a TON of insight into how GMM works!...I made this GMM math worksheet to do with my class...
  • Generalizing in the Real World with Representation Learning
    As applications of ML, particularly in AI systems, become more pervasive in the real world, we need to critically examine these assumptions, norms, and problem settings, as well as the methods that have become de-facto standards. There is much we still do not understand about how and why deep networks trained with stochastic gradient descent are able to generalize as well as they do, why they fail when they do, and how they will perform on out-of-distribution data. In this thesis I cover some of my work towards better understanding deep net generalization, identify several ways assumptions and problem settings fail to generalize to the real world, and propose ways to address those failures in practice...
  • Coding for Economists: Common Plots
    In this chapter, we’ll look at some of the most common plots that you might want to make–and how to create them using the most popular data visualisations libraries, including matplotlib, plotnine, seaborn, altair, and plotly...
  • LangChain - Building applications with LLMs through composability
    Large language models (LLMs) are emerging as a transformative technology, enabling developers to build applications that they previously could not. But using these LLMs in isolation is often not enough to create a truly powerful app - the real power comes when you are able to combine them with other sources of computation or knowledge...This library is aimed at assisting in the development of those types of applications. It aims to create: a) a comprehensive collection of pieces you would ever want to combine, b) a flexible interface for combining pieces into a single comprehensive "chain", and c) a schema for easily saving and sharing those chains...
  • Low-Rank Approximation Toolbox: Nyström, Cholesky, and Schur
    In this post, we will draw a connection between low-rank approximation by Nyström approximation and solving linear systems of equations by Gaussian elimination. The connection between these two seemingly unrelated areas of matrix computations will pay dividends, leading to effective algorithms to compute Nyström approximations by the (partial) Cholesky factorization of a positive (semi)definite matrix and an elegant description of the residual of the Nyström approximation as the Schur complement...
  • Deep Whole-Body Control: Learning a Unified Policy for Manipulation and Locomotion
    In this work, we propose to learn a unified policy for whole-body control of a legged manipulator using reinforcement learning. We propose Regularized Online Adaptation to bridge the Sim2Real gap for high-DoF control, and Advantage Mixing exploiting the causal dependency in the action space to overcome local minima during training the whole-body system. We also present a simple design for a low-cost legged manipulator, and find that our unified policy can demonstrate dynamic and agile behaviors across several task setups...
  • Optimisation & Generalisation in Networks of Neurons
    The goal of this thesis is to develop the optimisation and generalisation theoretic foundations of learning in artificial neural networks. On optimisation, a new theoretical framework is proposed for deriving architecture-dependent first-order optimisation algorithms. The approach works by combining a "functional majorisation" of the loss function with "architectural perturbation bounds" that encode an explicit dependence on neural architecture. The framework yields optimisation methods that transfer hyperparameters across learning problems. On generalisation, a new correspondence is proposed between ensembles of networks and individual networks...
  • The Unreasonable Effectiveness of Data Pipeline Smoke Tests
    Data practitioners waste time writing unit tests to catch bugs they could have caught with smoke tests...In this post, we’ll discuss a powerful technique for speeding up data pipeline development: the data pipeline smoke test. You write your smoke test just once: you don’t need to write a test for every newly derived data asset. It can complete in a few seconds and exercises every transformation inside your data pipeline...The idea of the data pipeline smoke test is to automatically run all your data transformations on empty or synthetic data...
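The smoke-test idea in the last item can be sketched without any framework. The example below is an illustration only, not the linked post's implementation: the three transformations and the `PIPELINE` list are hypothetical stand-ins for a real pipeline, and the test simply runs every step on empty and on small synthetic input.

```python
# Hypothetical transformations standing in for a real pipeline.
def clean_orders(rows):
    """Drop rows without an amount and cast the amount to float."""
    return [
        {**r, "amount": float(r["amount"])}
        for r in rows
        if r.get("amount") is not None
    ]


def revenue_by_customer(rows):
    """Sum order amounts per customer."""
    totals = {}
    for r in rows:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return [{"customer": c, "revenue": v} for c, v in sorted(totals.items())]


def top_customer(rows):
    """Return the highest-revenue customer, or None for empty input."""
    return max(rows, key=lambda r: r["revenue"], default=None)


PIPELINE = [clean_orders, revenue_by_customer, top_customer]


def run_pipeline(rows, steps=PIPELINE):
    """Feed the output of each step into the next one."""
    data = rows
    for step in steps:
        data = step(data)
    return data


def smoke_test(steps=PIPELINE):
    """Run every transformation on empty and on tiny synthetic data.

    The test passes if nothing raises; it checks no business logic.
    """
    synthetic = [
        {"customer": "a", "amount": "10.5"},
        {"customer": "b", "amount": None},
        {"customer": "a", "amount": "2"},
    ]
    return {
        "empty": run_pipeline([], steps),
        "synthetic": run_pipeline(synthetic, steps),
    }


if __name__ == "__main__":
    print(smoke_test())
```

Written this way, a newly added transformation is covered as soon as it is appended to `PIPELINE`, which matches the "write it once" property described in the blurb.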
 
 

Tool*

 



Jumpstart your data science journey and master the foundations of our data-driven world with Anaconda.

If you're looking to learn essential data science skills, there’s no need to sort through countless tools, guides, and boot camps that overpromise and underdeliver—Anaconda is here! With an Anaconda subscription, you can now access on-demand data science training and cloud-hosted notebooks. Learn from experts in the field and spin up data science projects anytime, anywhere—with all the packages and computing power you need. Whether you’re just getting started or ready to take your data science skills to the next level, Anaconda provides the building blocks you need to make sense of our data-driven world.

Get started at Anaconda.cloud.


*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

   
 

Jobs

 
  • Data Scientist - Mount Sinai Data Commons - NYC
    A position is available for an individual with skills in data science, bioinformatics and software engineering to play the key role in running and managing the Mount Sinai Data Commons – known as the Data Ark. The Data Ark team brings together all the most important data sets used by Sinai researchers (e.g. 1000G, GTEx, UK Biobank) in a single location on our HPC server (minvera.org), performs QA/QC processing of the data, conducts initial demographics analyses to showcase the different data sets, and will be tasked with expanding the data commons to host a large range of different data sets of different types (genotype, WES, WGS, RNA-seq, EHR-linked, imaging etc.), which will come with their own computational and platform challenges...
     

        Want to post a job here? Email us for details --> team@datascienceweekly.org

 

 

Training & Resources

 
  • Understanding ShinyApps
    Today, we’ll discover how you can use the power of R (and RStudio) to create, for instance, an interactive visualization with the ShinyApp framework...
  • An Introduction to Poisson Flow Generative Models
    Poisson Flow Generative Models (PFGMs) are a new type of generative Deep Learning model, taking inspiration from physics much like Diffusion Models. Learn the theory behind PFGMs and how to generate images with them in this easy-to-follow guide...
 

What you’re up to – notes from DSW readers

 
  • Fill out the form below to appear here :) ...
 

* To share your projects and updates, share the details here.

** Want to chat with one of the above people? Hit reply and let us know :)

 



P.S. Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :)
All the best, Hannah & Sebastian
Copyright © 2013-2022 DataScienceWeekly.org, All rights reserved.
