Hello and thank you for tuning in to Issue #504.
Once a week we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
Seeing this for the first time? Subscribe here:
Want to support us? Become a paid subscriber here.
If you don’t find this email useful, please unsubscribe here.
And now, let's dive into some interesting links from this week :)
The Dark Forest of R&D and Capital Deployment in AI
While on the surface we mostly just see large rounds and revenue or usage leaks, in reality AI companies are perhaps some of the most complex businesses we’ve had being built in tech in some time. Doing core AI model R&D necessitates a need to play 4D Chess around research communities, capital accumulation and deployment, talent acquisition, competitive understanding, and commercialization… But to understand whether AI companies are embarking on a rollercoaster of capital deployment that is a Super Cycle, or a Euthanasia Coaster style journey, we must also understand the impending shifts that will drive all strategy over the next decade…
Debriefs: Teams learning from doing in context
Debriefs are a type of work meeting in which teams discuss, interpret, and learn from recent events during which they collaborated. In a variety of forms, debriefs are found across a wide range of organizational types and settings. Well-conducted debriefs can improve team effectiveness by 25% across a variety of organizations and settings…After a discussion of various purposes for which debriefs have been used, we proceed with a historical review of development of the concepts and use in industries and contexts. We then review the psychological factors relevant to debrief effectiveness and the outcomes for individuals, teams, and organizations that deploy debriefs…
Flavor network and the principles of food pairing
The cultural diversity of culinary practice, as illustrated by the variety of regional cuisines, raises the question of whether there are any general patterns that determine the ingredient combinations used in food today or principles that transcend individual tastes and recipes. We introduce a flavor network that captures the flavor compounds shared by culinary ingredients…
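To make the idea of a flavor network concrete, here is a minimal sketch in which ingredients are nodes and two ingredients are linked with a weight equal to the number of flavor compounds they share. The ingredient names, compound labels, and the `build_flavor_network` helper are all hypothetical placeholders for illustration, not data or code from the paper.

```python
from itertools import combinations

# Hypothetical ingredient -> flavor-compound sets. These labels are made-up
# placeholders for illustration, not real chemistry data.
compounds = {
    "ingredient_a": {"c1", "c2", "c3"},
    "ingredient_b": {"c2", "c3", "c4"},
    "ingredient_c": {"c5"},
}


def build_flavor_network(compounds):
    """Return {(ingredient_1, ingredient_2): number_of_shared_compounds}.

    Pairs that share no compounds get no edge.
    """
    edges = {}
    # Sort names so each undirected pair has one canonical key.
    for a, b in combinations(sorted(compounds), 2):
        shared = len(compounds[a] & compounds[b])
        if shared:
            edges[(a, b)] = shared
    return edges


network = build_flavor_network(compounds)
for (a, b), weight in network.items():
    print(f"{a} -- {b}: {weight} shared compounds")
```

With real compound data, the edge weights would be the starting point for asking whether recipes tend to combine strongly or weakly linked ingredients.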
Revolutionize your data labeling process with Label Studio - the ultimate open source data labeling platform. Unleash unparalleled flexibility for all data types and supercharge your labeling with ML-assisted techniques. Whether you're training Large Language Models (LLMs), validating AI models, or fine-tuning existing ones, Label Studio is your must-have tool.
Experience the future of data labeling with seamless cloud storage connectivity and a customizable interface that perfectly fits your needs. Join a vibrant community of over 250,000 data scientists and machine learning professionals who have improved their labeling efficiency with Label Studio.
Download Label Studio for free from LabelStud.io!
Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
How I Created an Animation Of the Embeddings During Fine-Tuning
In a previous article, I used an animation to demonstrate changes in the embeddings during the fine-tuning process. This was achieved by performing Principal Component Analysis (PCA) on the embeddings. These embeddings were generated from models at various stages of fine-tuning and their corresponding checkpoints… In this article, I aim to provide a comprehensive guide on how to create such an animation, detailing the steps involved: fine-tuning, creation of embeddings, outlier detection, PCA, Procrustes, review, and creation of the animation… The complete code for the animation is also available in the accompanying notebook on GitHub…
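As a rough illustration of why PCA and Procrustes appear together in that list of steps, here is a minimal sketch and not the author's notebook code. PCA run separately on each checkpoint can flip or rotate the axes from one frame to the next, and an orthogonal Procrustes step rotates each frame to match the previous one so the animation does not jump. The random "embeddings", the array sizes, and the helper names `pca_2d` and `procrustes_align` are assumptions made for the example.

```python
import numpy as np


def pca_2d(x):
    """Project the rows of x onto their first two principal components."""
    centered = x - x.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


def procrustes_align(reference, points):
    """Rotate/reflect `points` to best match `reference` in the least-squares
    sense (orthogonal Procrustes, solved with an SVD)."""
    u, _, vt = np.linalg.svd(points.T @ reference)
    return points @ (u @ vt)


# Placeholder data: 3 "checkpoints" of 50 embeddings with 8 dimensions each,
# generated as a random base plus growing noise (not real model output).
rng = np.random.default_rng(0)
base = rng.normal(size=(50, 8))
checkpoints = [base + 0.1 * k * rng.normal(size=base.shape) for k in range(3)]

# One independent 2D PCA projection per checkpoint.
frames = [pca_2d(c) for c in checkpoints]

# Align every frame to the previously aligned frame.
aligned = [frames[0]]
for frame in frames[1:]:
    aligned.append(procrustes_align(aligned[-1], frame))
```

Each array in `aligned` could then be drawn as one frame of the animation. The outlier detection and review steps mentioned in the article are not covered by this sketch.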
Plant datasets - 3 new ones in the last week
I've been working on my own little arxiv scraper to find me some interesting papers on new datasets. And once in a while there's a cluster with a common theme. This week, it seems to be plants! … 1) A New Dataset and Comparative Study for Aphid Cluster Detection… 2) TomatoDIFF: On–plant Tomato Segmentation with Denoising Diffusion Models… and 3) TreeFormer: Transformer-based Tree Counting…
Jobs In Data Salary Calculator
The Salary Calculator is based on the compensation prediction model which was built by a Kaggle Competitions Grandmaster narsil for Kaggle 2022 Data Science and Machine Learning Survey - Analytics competition. It received an Honourable Mention from competition hosts and was in the Top 9 among more than 300 submitted solutions…
The shape of AGI: Cartoons and back of envelope
In this blog post, I will not focus on the timelines or risks but rather on the shape we could expect AI systems to take and their economic impact. I will use no fancy models or math and keep everything on the level of cartoons or back-of-envelope calculations. My main contention is that even post-AGI, AI systems will be incomparable to humans and stay this way for an extended period of time, which may well be longer than the time to achieve AGI in the first place…
The Rise of the AI Engineer
Emergent capabilities are creating an emerging title: to wield them, we'll have to go beyond the Prompt Engineer and write *software*….
Rethinking Backdoor Attacks
In our latest paper, we provide a new perspective on data poisoning (backdoor) attacks. We show that without assumptions on the attack, backdoor triggers are indistinguishable from features already present in the dataset. In our work, we assume that backdoors correspond to the strongest feature present in the data, and we leverage datamodels to detect backdoored inputs…
Perspectives on Diffusion Models
Perspectives on diffusion, or how diffusion models are autoencoders, deep latent variable models, score function predictors, reverse SDE solvers, flow-based models, RNNs, and autoregressive models, all at once!…
Synthetic training data with Nathan Kundtz
In this video Robin catches up with Nathan Kundtz to learn about the creation and use of synthetic image data in training machine learning models. Nathan has a PhD in physics, and over 40 peer reviewed papers and 15 patents to his name. As a serial entrepreneur, he has successfully founded multiple companies and raised over $250 million in venture capital funding…
How is ChatGPT's behavior changing over time?
GPT-3.5 and GPT-4 are the two most widely used large language model (LLM) services. However, when and how these models are updated over time is opaque. Here, we evaluate the March 2023 and June 2023 versions of GPT-3.5 and GPT-4 on four diverse tasks: 1) solving math problems, 2) answering sensitive/dangerous questions, 3) generating code and 4) visual reasoning. We find that the performance and behavior of both GPT-3.5 and GPT-4 can vary greatly over time…
LLMs as Workers in Human-Computational Algorithms? Replicating Crowdsourcing Pipelines with LLMs
LLMs have shown promise in replicating human-like behavior in crowdsourcing tasks that were previously thought to be exclusive to human abilities. However, current efforts focus mainly on simple atomic tasks. We explore whether LLMs can replicate more complex crowdsourcing pipelines. We find that modern LLMs can simulate some of crowdworkers' abilities in these "human computation algorithms," but the level of success is variable and influenced by requesters' understanding of LLM capabilities, the specific skills required for sub-tasks, and the optimal interaction modality for performing these sub-tasks…
* Define and deploy new approaches, exploiting the wealth of data and the power of associated technologies, to respond to the problems of teams in areas: Americas, China, Japan, Europe, South Asia and North Asia
* Collaborate on a daily basis with the teams on their needs and build ready-to-use algorithms in order to feed their business challenges and customer experience in particular
* Ensure all stages of data science projects: framing and management, implementation, development and operation, adoption and commitment
* Support the business teams in the use of the tools put in place to serve their challenges and enable them to act more and more independently
* Ensure the governance of Data Science projects according to the defined principles
* Monitor, maintain and improve the models and tools in place
Apply here
Want to post a job here? Email us for details --> team@datascienceweekly.org
Tutorial on amortized optimization
This tutorial presents an introduction to the amortized optimization foundations behind these advancements and overviews their applications in variational inference, sparse coding, gradient-based meta-learning, control, reinforcement learning, convex optimization, optimal transport, and deep equilibrium networks. The source code for this tutorial is available…
"Learning Theory from First Principles" by Francis Bach
PDF of excellent book draft on machine learning (slated for MIT Press)… The goal of the class (and thus of this textbook) is to present old and recent results in learning theory for the most widely-used learning architectures. This class is geared towards theory-oriented students as well as students who want to acquire a basic mathematical understanding of algorithms used throughout machine learning and associated fields that are significant users of learning methods such as computer vision or natural language processing. Moreover, it is well suited to students and researchers coming from other areas of applied mathematics who want to learn about the theory behind machine learning…
Awesome Conformal Prediction
A professionally curated list of awesome Conformal Prediction videos, tutorials, books, papers, Ph.D. and M.Sc theses, articles and open-source libraries…
* Based on unique clicks.
** Find last week's issue #503 here.
Thanks for joining us this week :)
All our best,
Hannah & Sebastian
P.S.,
If you found this newsletter helpful, consider supporting us by becoming a paid subscriber here: https://datascienceweekly.substack.com/subscribe :)
Copyright © 2013-2023 DataScienceWeekly.org, All rights reserved.