Data Science Weekly - Issue 444

Curated news, articles and jobs related to Data Science.
Keep up with all the latest developments

Issue #444

May 26 2022

Editor Picks

Stanford MLSys Seminar Episode 5: Chip Huyen [Video]
This talk covers what it means to operationalize ML models. It starts by analyzing the difference between ML in research vs. in production, ML systems vs. traditional software, as well as myths about ML production...It then goes over the principles of good ML systems design and introduces an iterative framework for ML systems design, from scoping the project, data management, model development, deployment, maintenance, to business analysis...The talk ends with a survey of the ML production ecosystem, the economics of open source, and open-core businesses....

NonCompositional or Why composition is DALL-E’s strength, not its weakness
When we compose meanings, concepts, semantics or any other ‘elements’ of cognition, the outcome is not easily predictable like it is when we compose functions in mathematics or operations in a computer programme....it makes no sense to criticise DALL-E (or neural networks in general) for their poor composition. It is precisely because their composition is surprisingly good that emotions have been stirred and people are enjoying tweeting and sharing these things so much! “Yeah, all good fun, but we can’t learn anything scientific or conceptual from this brute-force approach.” Well, I’m not so sure. Let’s consider a bit more history….

Large Language Models are Zero-Shot Reasoners [Twitter thread + paper]
Simply adding “Let’s think step by step” before each answer increases the accuracy on MultiArith from 17.7% to 78.7% and GSM8K from 10.4% to 40.7% with GPT-3...
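
As a rough illustration of the prompting trick described above, here is a minimal sketch that only builds the prompt text. The helper name `build_zero_shot_cot_prompt` and the `Q:`/`A:` layout are assumptions made for this example, and no model is called; send the resulting string to whichever language model client you use.

```python
# Hypothetical helper for illustration; not taken from the paper's code.
TRIGGER = "Let's think step by step."


def build_zero_shot_cot_prompt(question: str, trigger: str = TRIGGER) -> str:
    """Place the trigger phrase at the start of the answer slot.

    The "Q: ... / A: ..." layout is an assumed template; adjust it to
    match the format your model expects.
    """
    return f"Q: {question.strip()}\nA: {trigger}"


if __name__ == "__main__":
    # Print an example prompt; the model's completion would follow the trigger.
    print(build_zero_shot_cot_prompt(
        "A juggler has 16 balls and half of them are golf balls. "
        "How many golf balls are there?"
    ))
```

The only change from a plain question prompt is the trigger phrase, which is why the reported accuracy gain is notable.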

A Message from this week's Sponsor:

Retool is the fast way to build an interface for any database

With Retool, you don't need to be a developer to quickly build an app or dashboard on top of any data set. Data teams at companies like NBC use Retool to build any interface on top of their data—whether it's a simple read-write visualization or a full-fledged ML workflow.

Drag and drop UI components—like tables and charts—to create apps. At every step, you can jump into the code to define the SQL queries and JavaScript that power how your app acts and connects to data. The result—less time on repetitive work and more time to discover insights.

Data Science Articles & Videos

Imagen - unprecedented photorealism × deep level of language understanding
We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. ...

How the Music Platform Spotify Collects and Uses Your Data
This is a part of our Recess series in which university students from across Canada briefly explain key concepts in AI that young people should know about: specifically, what AI does, how it works, and what it means for you...

A survey on adversarial attacks and defences
In recent times, different types of adversaries based on their threat model leverage vulnerabilities to compromise a deep learning system where adversaries have high incentives...However, there are only a few strong countermeasures which can be used in all types of attack scenarios to design a robust deep learning system. Herein, the authors attempt to provide a detailed discussion on different types of adversarial attacks with various threat models and also elaborate on the efficiency and challenges of recent countermeasures against them...

Let's Continue Bundling into the Database
A very silly blog post came out a couple months ago about The Unbundling of Airflow. I didn’t fully read the article, but I saw its title and skimmed it enough to think that it might’ve been too thin of an argument to hold water...I actually don’t care that much about the bundling argument that I will make in this post. Truthfully, I just want to argue that feature stores, metrics layers, and machine learning monitoring tools are all abstraction layers on the same underlying concepts, and 90% of companies should just implement these “applications” in SQL on top of streaming databases...

You're Relying on Data Too Much
Data can often make decisions worse, not better. This blog post gives an example of one such situation as a metaphor...

Bridging the Resource Divide for Artificial Intelligence Research
White House report from Lynne Parker, Deputy United States Chief Technology Officer and Director of the National Artificial Intelligence Initiative Office...Today, as co-chair of the Task Force and as part of OSTP’s broader work to advance the responsible research, development, and use of AI, I am proud to announce the submission of the interim report of the NAIRR Task Force to the President and Congress. This report lays out a vision for how this national cyberinfrastructure could be structured, designed, operated, and governed to meet the needs of America’s research community...

Introducing PeerXiv - A modern platform for peer-review of preprints
What would a peer review process look like if it was designed today? Peer review is one of the cornerstones of the research community, and yet while our community keeps advancing and growing, the reviewing process remains almost unchanged...We strongly believe that peer review can be so much better for both authors and reviewers and we are excited to share PeerXiv, our proposal to do just that...

Artificial intelligence is breaking patent law
The patent system assumes that inventors are human. Inventions devised by machines require their own intellectual property law and an international treaty...In 2020, a machine-learning algorithm helped researchers to develop a potent antibiotic that works against many pathogens. Artificial intelligence (AI) is also being used to aid vaccine development, drug design, materials discovery, space technology and ship design. Within a few years, numerous inventions could involve AI. This is creating one of the biggest threats patent systems have faced...

On the Impact of Data Augmentation on Downstream Performance in Natural Language Processing
Data augmentation is a common strategy to improve generalization and robustness of machine learning models. While data augmentation has been widely used within computer vision, its use in NLP has been comparatively limited. The reason for this is that within NLP, the impact of proposed data augmentation methods on performance has not been evaluated in a unified manner, and effective data augmentation methods are unclear. In this paper, we look to tackle this by evaluating the impact of 12 data augmentation methods on multiple datasets when finetuning pre-trained language models...

AI reveals unsuspected math underlying search for exoplanets
University of California, Berkeley, astronomers found unsuspected connections hidden in the complex mathematics arising from general relativity—in particular, how that theory is applied to finding new planets around other stars...In a paper appearing this week in the journal Nature Astronomy, the researchers describe how an AI algorithm developed to more quickly detect exoplanets when such planetary systems pass in front of a background star and briefly brighten it—a process called gravitational microlensing—revealed that the decades-old theories now used to explain these observations are woefully incomplete....

Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation
In this joint work with Vikram Voleti and Christopher Pal, we show that a single diffusion model can solve many video tasks: 1) interpolation, 2) forward/reverse prediction, and 3) unconditional generation through a well-designed masking scheme 🧙‍♂️....

Tools*

Check out the new Anaconda Community for all-things data!

Want insights into the newest developments in the world of data, or need help getting “unstuck” on a problem?

Our Community Forums is the place to go! Be the first to engage with other professionals and ask questions to the broader data community. Users can join in conversations around trends, debate new features, post questions to the community, and more. Plus, it’s another avenue for technical help!

Create your free Anaconda Community account now.

*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!

Jobs

Data Scientist - Hungryroot - Remote

Hungryroot is looking for a Data Scientist to join our growing Data Team. As a Data Scientist, you will work closely with other Data Scientists and Data Engineers to develop various Machine Learning models that power Hungryroot and its AI functions. These models include traditional forecasting models, as well as more industry-specific optimization challenges.

As a Data Scientist at Hungryroot, you will work on answering questions like: How do you tell what food someone would like to eat this week? How do you determine whether they enjoyed it or not? Maybe they liked their meals last week but are now looking for different options. Maybe they like the same food on Tuesdays but variety on Fridays. And what about spicy food: is Green Chili as spicy as Green Curry?

Want to post a job here? Email us for details --> team@datascienceweekly.org

Training & Resources

MIT Spring 2022 Machine Learning for Healthcare Class (6.871/HST.956)
Introduces students to machine learning in healthcare, including the nature of clinical data and the use of machine learning for risk stratification, disease progression modeling, precision medicine, diagnosis, subtype discovery, and improving clinical workflows. Topics include causality, interpretability, algorithmic fairness, time-series analysis, graphical models, deep learning and transfer learning. Guest lectures by clinicians from the Boston area and course projects with real clinical data emphasize subtleties of working with clinical data and translating machine learning into clinical practice....

What Is Active Metadata, and Why Does It Matter?
Just like data mesh or the metrics layer, active metadata is the latest hot topic in the data world. As with every other new concept that gains popularity in the data stack, there’s been a sudden explosion of vendors rebranding to “active metadata”, ads following you everywhere and...confusion...With everyone talking about active metadata, it must be pretty easy to understand, right?...I’ve broken down the ideas behind active metadata with as little jargon as possible. Keep reading to learn what active metadata is, what it looks like, how you can actually use it, how it fits into the modern data stack, and why it even matters...

Quick review of The Data Science Course 2022: Complete Data Science Bootcamp on udemy [Reddit Discussion]
This course was hyped among some DS influencers and I thought I would try it. For balance against the hype, I wanted to express my dissatisfaction with it...

What you’re up to – notes from DSW readers

Alex is working on building a predictive model for customer segmentation...

Daniel Czwalinna is working on measuring the effect of false labels on image classification model performance....

Andrew Van Dyke is working on an algorithmic trading system. Algorithms are randomly generated from compositions of basic math functions on input data. These algorithms are then refined via a Genetic Algorithm....

Frank is working on a Master's thesis in logistics and supply chain management...

Frank Corrigan is working on building an NLP-enabled async, voice-first communication platform....

* To share your projects and updates, share the details here.

** Want to chat with one of the above people? Hit reply and let us know :)

Last Week's Newsletter's 3 Most Clicked Links

Software Development Resources for Data Scientists

Why are companies willing to spend so much on hiring new employees but not on retaining them?

Preston’s Paradox

* Based on unique clicks.

** Find last week's newsletter here.

P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian
