Editor Picks
- Machine Learning and Tax Enforcement
Each year, the Internal Revenue Service receives over 3 billion information returns, such as W-2s and 1099-INTs, from employers, banks, and other entities...In 2021, the Biden administration proposed that a portion of its request for a 55 percent boost (after adjusting for inflation) to the IRS budget over the next decade be used for developing machine learning. If successful, machine learning would marshal the vast trove of data currently received by the IRS to achieve more targeted and productive enforcement actions...
- The Annotated Diffusion Model
In this blog post, we'll take a deeper look into Denoising Diffusion Probabilistic Models (also known as DDPMs, diffusion models, score-based generative models or simply autoencoders) as researchers have been able to achieve remarkable results with them for (un)conditional image/audio/video generation. Popular examples (at the time of writing) include GLIDE and DALL-E 2 by OpenAI, Latent Diffusion by the University of Heidelberg and Imagen by Google Brain...
- How fast can we perform a forward pass?
Over the last month, I’ve spent a lot of time trying to answer the following question: How quickly can we perform one forward pass in a transformer model?...By a transformer model, I mean BERT, GPT-3, T5, Chinchilla, or other large language models that use a transformer architecture. By a forward pass, I mean the computation needed to generate the next token given all the tokens so far. By “how quickly”, I mean how much wall-clock time elapses between the call to the forward pass and its completion...
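As a rough illustration of the wall-clock definition in the excerpt above, here is a minimal Python sketch that times a call from start to completion with `time.perf_counter`. The `fake_forward_pass` function is a hypothetical stand-in (a small pure-Python matrix-vector product), not a transformer and not the author's benchmark; `time_call` is likewise an illustrative helper.

```python
import statistics
import time


def fake_forward_pass(weights, x):
    """Hypothetical stand-in for a forward pass: one matrix-vector product."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def time_call(fn, *args, repeats=5):
    """Return the median wall-clock seconds over `repeats` calls of fn(*args)."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn(*args)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


if __name__ == "__main__":
    n = 200
    # Arbitrary illustrative values, not real model weights.
    weights = [[((i + j) % 7) / 7.0 for j in range(n)] for i in range(n)]
    x = [1.0] * n
    latency = time_call(fake_forward_pass, weights, x)
    print(f"median latency: {latency:.6f} s")
```

Taking the median over several repeats is one simple way to reduce the effect of one-off delays; a real benchmark would also need to account for warm-up and hardware-specific behavior.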
A Message from this week's Sponsor:
Online Data Science Programs from Drexel University
Find your algorithm for success with an online data science degree from Drexel University. Gain essential skills in tool creation and development, data and text mining, trend identification, and data manipulation and summarization by using leading industry technology to apply to your career. Learn more.
Data Science Articles & Videos
- Condemning the deployment of GPT-4chan
The deployment of GPT-4chan is a clear example of irresponsible practice. GPT-4chan is a language model that Kilcher trained on over three million 4chan threads from the Politically Incorrect /pol/ board, a community full of racist, sexist, xenophobic, and hateful speech that has been linked to white-supremacist violence such as the Buffalo shooting last month. He then used GPT-4chan to generate and deceptively post over 30,000 posts on 4chan mimicking the hateful comments it was trained on without identifying the model as a bot. Kilcher now claims that the release of “the most horrible model on the internet” was “a prank and light-hearted trolling.”...Kilcher’s decision to deploy this bot does not meet any test of reasonableness. His actions deserve censure. He undermines the responsible practice of AI science. If you agree with this statement, please fill out this form to sign it...
- Mapping Urban Trees Across North America with the Auto Arborist Dataset
Today we introduce the Auto Arborist Dataset, a multiview urban tree classification dataset that, at ~2.6 million trees and >320 genera, is two orders of magnitude larger than those in prior work. To build the dataset, we pulled from public tree censuses from 23 North American cities (shown above) and merged these records with Street View and overhead RGB imagery. As the first urban forest dataset to cover multiple cities, we analyze in detail how forest models can generalize with respect to geographic distribution shifts, crucial to building systems that scale. We are releasing all 2.6M tree records publicly, along with aerial and ground-level imagery for 1M trees...
- Learning to Infer Structures of Network Games
Strategic interactions between a group of individuals or organisations can be modelled as games played on networks, where a player's payoff depends not only on their actions but also on those of their neighbours. Inferring the network structure from observed game outcomes (equilibrium actions) is an important problem with numerous potential applications in economics and social sciences...
- How do you ace your SQL skills? [Reddit Discussion]
I am asking about mastering them, as in queries with varying levels of complexity. Some of the Technical Analysts I've worked with have written the most mind-blowing scripts with ease. I work with databases daily and want to reach that level of proficiency. I am familiar with SQL but I want to take it to the next level. Could you suggest the best places to start exploring, and also the strategies that worked for you to enhance your SQL skills...
- Lyapunov Density Models: Constraining Distribution Shift in Learning-Based Control
When deploying learning-based controllers, we seek a mechanism to constrain the agent to states and actions that resemble those in the training data...However, in order for an agent to remain in-distribution throughout its trajectory, the agent must not only avoid visiting states and actions that are out-of-distribution...We present Lyapunov density models (LDMs): a generalization of control Lyapunov functions and density models that provides guarantees on an agent's ability to stay in-distribution over its entire trajectory...
- Diagram as Code
Diagrams lets you draw the cloud system architecture in Python code. It was born for prototyping a new system architecture design without any design tools. You can also describe or visualize an existing system architecture. Diagrams currently supports the major providers, including AWS, Azure, GCP, Kubernetes, Alibaba Cloud, and Oracle Cloud... It also supports on-premise nodes, SaaS, and major programming frameworks and languages...
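To illustrate the general "diagram as code" idea using only the Python standard library, the sketch below builds Graphviz DOT text from a list of edges. It does not use the Diagrams package or its API; `to_dot` and the node names are hypothetical and chosen purely for illustration.

```python
def to_dot(name, edges):
    """Render a list of (source, target) pairs as Graphviz DOT source text."""
    lines = [f'digraph "{name}" {{']
    for src, dst in edges:
        lines.append(f'    "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical three-tier architecture, described as plain data.
    edges = [
        ("load balancer", "web server"),
        ("web server", "database"),
    ]
    print(to_dot("web service", edges))
```

The printed text can be saved to a file and rendered with the Graphviz `dot` command, so the architecture lives in version control as code rather than in a drawing tool.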
- Parti - Pathways Autoregressive Text-to-Image Model
We introduce the Pathways Autoregressive Text-to-Image model (Parti), an autoregressive text-to-image generation model that achieves high-fidelity photorealistic image generation and supports content-rich synthesis involving complex compositions and world knowledge...
- The State of Data Engineering 2022
A year has passed since we shared the State of Data Engineering 2021...It was another year worthy of its own prime-time drama, and we’re back to share our updated, digestible snapshot of it all!...
- Outsider Oversight: Designing a Third Party Audit Ecosystem for AI Governance
Much attention has focused on algorithmic audits and impact assessments to hold developers and users of algorithmic systems accountable. But existing algorithmic accountability policy approaches have neglected the lessons from non-algorithmic domains: notably, the importance of interventions that allow for the effective participation of third parties. Our paper synthesizes lessons from other fields on how to craft effective systems of external oversight for algorithmic deployments...
Course*
Land Your Dream Job with TDI
One Week Left for Priority Enrollment to Our Data Bootcamps!
Apply by July 1 to earn our coveted priority enrollment package and you’ll get:
- Up to $2k of tuition
- Early access to our 12-day Python bootcamp
- Premier access to our resume review services
- An early chance to join our Discord to chat with peers before the course even starts.
- Did we mention you can also increase your chances of getting a full-tuition scholarship?
What are you waiting for? Early application closes on July 1 so don’t wait!
Apply Now.
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!
Jobs
- Senior Data Scientist, Startup Creation at Redesign Health - US
As our Senior Data Scientist for our Startup Creation team, you will set up and configure the data infrastructure for our startups, work with the startup founding team to define data-driven KPIs, and implement automated statistical analyses of customer behavior. Your goal is to make all of the companies that we launch data-driven from day one.
In this role, you will function as an in-house implementation team for the companies that Redesign Health launches (internally referred to as OpCos). We provide data strategy, data pipeline, data analytics and forecasting services to newly formed companies in a repeatable and scalable manner...
Want to post a job here? Email us for details --> team@datascienceweekly.org
Training & Resources
- OpenFold - Trainable, memory-efficient, and GPU-friendly PyTorch reproduction of AlphaFold 2
OpenFold carefully reproduces (almost) all of the features of the original open source inference code (v2.0.1). The sole exception is model ensembling, which fared poorly in DeepMind's own ablation testing and is being phased out in future DeepMind experiments. It is omitted here for the sake of reducing clutter. In cases where the Nature paper differs from the source, we always defer to the latter...
What you’re up to – notes from DSW readers
- Working on something cool? Let us know here :) ...
* To share your projects and updates, share the details here.
** Want to chat with one of the above people? Hit reply and let us know :)
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's newsletter here.
P.S. Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian