Editor Picks
- Stanford MLSys Seminar Episode 5: Chip Huyen [Video]
This talk covers what it means to operationalize ML models. It starts by analyzing the difference between ML in research vs. in production, ML systems vs. traditional software, as well as myths about ML production...It then goes over the principles of good ML systems design and introduces an iterative framework for ML systems design, from scoping the project, data management, model development, deployment, maintenance, to business analysis...The talk ends with a survey of the ML production ecosystem, the economics of open source, and open-core businesses....
- NonCompositional or Why composition is DALL-E’s strength, not its weakness
When we compose meanings, concepts, semantics or any other ‘elements’ of cognition, the outcome is not easily predictable like it is when we compose functions in mathematics or operations in a computer programme....it makes no sense to criticise DALL-E (or neural networks in general) for their poor composition. It is precisely because their composition is surprisingly good that emotions have been stirred and people are enjoying tweeting and sharing these things so much! Yeah, all good fun, but we can’t learn anything scientific or conceptual from this brute-force approach...Well, I’m not so sure. Let’s consider a bit more history….
A Message from this week's Sponsor:
Retool is the fast way to build an interface for any database
With Retool, you don't need to be a developer to quickly build an app or dashboard on top of any data set. Data teams at companies like NBC use Retool to build any interface on top of their data—whether it's a simple read-write visualization or a full-fledged ML workflow.
Drag and drop UI components—like tables and charts—to create apps. At every step, you can jump into the code to define the SQL queries and JavaScript that power how your app acts and connects to data. The result—less time on repetitive work and more time to discover insights.
Data Science Articles & Videos
- Imagen - unprecedented photorealism × deep level of language understanding
We present Imagen, a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding. Imagen builds on the power of large transformer language models in understanding text and hinges on the strength of diffusion models in high-fidelity image generation. Our key discovery is that generic large language models (e.g. T5), pretrained on text-only corpora, are surprisingly effective at encoding text for image synthesis: increasing the size of the language model in Imagen boosts both sample fidelity and image-text alignment much more than increasing the size of the image diffusion model. ...
- A survey on adversarial attacks and defences
In recent times, different types of adversaries based on their threat model leverage vulnerabilities to compromise a deep learning system where adversaries have high incentives...However, there are only a few strong countermeasures which can be used in all types of attack scenarios to design a robust deep learning system. Herein, the authors attempt to provide a detailed discussion on different types of adversarial attacks with various threat models and also elaborate on the efficiency and challenges of recent countermeasures against them...
- Let's Continue Bundling into the Database
A very silly blog post came out a couple months ago about The Unbundling of Airflow. I didn’t fully read the article, but I saw its title and skimmed it enough to think that it might’ve been too thin of an argument to hold water...I actually don’t care that much about the bundling argument that I will make in this post. Truthfully, I just want to argue that feature stores, metrics layers, and machine learning monitoring tools are all abstraction layers on the same underlying concepts, and 90% of companies should just implement these “applications” in SQL on top of streaming databases...
- Bridging the Resource Divide for Artificial Intelligence Research
White House report from Lynne Parker, Deputy United States Chief Technology Officer and Director of the National Artificial Intelligence Initiative Office...Today, as co-chair of the Task Force and as part of OSTP’s broader work to advance the responsible research, development, and use of AI, I am proud to announce the submission of the interim report of the NAIRR Task Force to the President and Congress. This report lays out a vision for how this national cyberinfrastructure could be structured, designed, operated, and governed to meet the needs of America’s research community...
- Introducing PeerXiv - A modern platform for peer-review of preprints
What would a peer review process look like if it was designed today? Peer review is one of the cornerstones of the research community, and yet while our community keeps advancing and growing, the reviewing process remains almost unchanged...We strongly believe that peer review can be so much better for both authors and reviewers and we are excited to share PeerXiv, our proposal to do just that...
- Artificial intelligence is breaking patent law
The patent system assumes that inventors are human. Inventions devised by machines require their own intellectual property law and an international treaty...In 2020, a machine-learning algorithm helped researchers to develop a potent antibiotic that works against many pathogens. Artificial intelligence (AI) is also being used to aid vaccine development, drug design, materials discovery, space technology and ship design. Within a few years, numerous inventions could involve AI. This is creating one of the biggest threats patent systems have faced...
- On the Impact of Data Augmentation on Downstream Performance in Natural Language Processing
Data augmentation is a common strategy to improve generalization and robustness of machine learning models. While data augmentation has been widely used within computer vision, its use in NLP has been comparably rather limited. The reason for this is that within NLP, the impact of proposed data augmentation methods on performance has not been evaluated in a unified manner, and effective data augmentation methods are unclear. In this paper, we look to tackle this by evaluating the impact of 12 data augmentation methods on multiple datasets when finetuning pre-trained language models...
- AI reveals unsuspected math underlying search for exoplanets
University of California, Berkeley, astronomers found unsuspected connections hidden in the complex mathematics arising from general relativity—in particular, how that theory is applied to finding new planets around other stars...In a paper appearing this week in the journal Nature Astronomy, the researchers describe how an AI algorithm developed to more quickly detect exoplanets when such planetary systems pass in front of a background star and briefly brighten it—a process called gravitational microlensing—revealed that the decades-old theories now used to explain these observations are woefully incomplete....
Tools*
Check out the new Anaconda Community for all-things data!
Want insights into the newest developments in the world of data, or need help getting “unstuck” on a problem?
Our Community Forums is the place to go! Be the first to engage with other professionals and ask questions to the broader data community. Users can join in conversations around trends, debate new features, post questions to the community, and more. Plus, it’s another avenue for technical help!
Create your free Anaconda Community account now.
*Sponsored post. If you want to be featured here, or as our main sponsor, contact us!
Jobs
- Data Scientist - Hungryroot - Remote
Hungryroot is looking for a Data Scientist to join our growing Data Team. As a Data Scientist, you will work closely with other Data Scientists and Data Engineers to develop various Machine Learning models that power Hungryroot and its AI functions. These models include traditional forecasting models, as well as more industry-specific optimization challenges.
As a Data Scientist at Hungryroot, you will work on answering questions like: how do you tell what food someone would like to eat this week, how do you determine whether they enjoyed it or not, maybe they liked their meals last week, but are now looking for different options, maybe they like the same food on Tuesdays, but variety on Fridays, what about spicy food, is Green Chilly as spicy as Green Curry?
Want to post a job here? Email us for details --> team@datascienceweekly.org
Training & Resources
- MIT Spring 2022 Machine Learning for Healthcare Class (6.871/HST.956)
Introduces students to machine learning in healthcare, including the nature of clinical data and the use of machine learning for risk stratification, disease progression modeling, precision medicine, diagnosis, subtype discovery, and improving clinical workflows. Topics include causality, interpretability, algorithmic fairness, time-series analysis, graphical models, deep learning and transfer learning. Guest lectures by clinicians from the Boston area and course projects with real clinical data emphasize subtleties of working with clinical data and translating machine learning into clinical practice....
- What Is Active Metadata, and Why Does It Matter?
Just like data mesh or the metrics layer, active metadata is the latest hot topic in the data world. As with every other new concept that gains popularity in the data stack, there’s been a sudden explosion of vendors rebranding to “active metadata”, ads following you everywhere and...confusion...With everyone talking about active metadata, it must be pretty easy to understand, right?...I’ve broken down the ideas behind active metadata with as little jargon as possible. Keep reading to learn what active metadata is, what it looks like, how you can actually use it, how it fits into the modern data stack, and why it even matters...
What you’re up to – notes from DSW readers
- Alex is working on building a predictive model for customer segmentation...
- Daniel Czwalinna is working on measuring the effect of false labels on image classification model performance....
- Andrew Van Dyke is working on an algorithmic trading system. Algorithms are randomly generated from compositions of basic math functions on input data. These algorithms are then refined via a Genetic Algorithm....
- Frank is working on a Master's thesis in logistics and supply chain management...
- Frank Corrigan is working on building an NLP-enabled async, voice-first communication platform....
* To share your projects and updates, share the details here.
** Want to chat with one of the above people? Hit reply and let us know :)
Last Week's Newsletter's 3 Most Clicked Links
* Based on unique clicks.
** Find last week's newsletter here.
P.S., Enjoy the newsletter? Please forward it to your friends and colleagues - we'd love to have them onboard :) All the best, Hannah & Sebastian