Hello! Once a week, we write this email to share the links we thought were worth sharing in the Data Science, ML, AI, Data Visualization, and ML/Data Engineering worlds.
And now…let's dive into some interesting links from this week.
📽️ New 4-hour (lol) video lecture: "Let’s reproduce GPT-2 (124M)" Andrej Karpathy starts with an empty file and ends up with a GPT-2 (124M) model: - first we build the GPT-2 network - then we optimize it to train very fast - then we set up the training run, optimization, and hyperparameters by referencing the GPT-2 and GPT-3 papers - then we bring up model evaluation, and - then cross our fingers and go to sleep…
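As a back-of-the-envelope check on the "124M" in the title: the standard GPT-2 small configuration (12 layers, 768-dim embeddings, 50,257-token vocab, 1,024-token context, LM head tied to the token embedding) works out to exactly 124,439,808 parameters. A quick sketch of that count, assuming the standard GPT-2 sizes:

```python
# Parameter count for the GPT-2 "small" (124M) configuration.
# Assumes standard GPT-2 sizes; lm_head shares weights with wte (tied).
n_layer, n_embd = 12, 768
vocab_size, block_size = 50257, 1024

wte = vocab_size * n_embd   # token embeddings (tied with lm_head)
wpe = block_size * n_embd   # learned positional embeddings
ln = 2 * n_embd             # LayerNorm weight + bias

per_block = (
    ln                                        # ln_1
    + n_embd * 3 * n_embd + 3 * n_embd        # c_attn (fused q,k,v) + bias
    + n_embd * n_embd + n_embd                # attention output projection
    + ln                                      # ln_2
    + n_embd * 4 * n_embd + 4 * n_embd        # MLP up-projection + bias
    + 4 * n_embd * n_embd + n_embd            # MLP down-projection + bias
)

total = wte + wpe + n_layer * per_block + ln  # final ln_f
print(total)  # → 124439808
```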
Home-Cooked Software and Barefoot Developers The emerging golden age of home-cooked software, barefoot developers, and why the local-first community should help build it…
Don’t miss the AI conference of the year! Join 5,000+ attendees, 350+ speakers, and 150+ AI exhibitors at Ai4, North America's largest AI industry conference — taking place in Las Vegas on August 12-14. Enjoy dedicated content & unbeatable networking for both business & technical leaders from every major industry and job function.
* Want to sponsor the newsletter? Email us for details --> team@datascienceweekly.org
🧵 AI-powered Jupyter Notebook built using React 🧵 Thread is a Jupyter Notebook that combines the experience of OpenAI's code interpreter with the familiar development environment of a Python notebook. With Thread, you can use natural language to generate cells, edit code, ask questions, or fix errors, all while being able to edit or re-run code as you would in a regular Jupyter Notebook. Best of all, Thread runs locally and can be used for free with your own API key…

Jason Wei & Hyung Won Chung of OpenAI Two-part talk: Intuitions on Language Models (Jason)…Jason will talk about some basic intuitions on language models, inspired by manual examination of data…Shaping the Future of AI from the History of Transformer (Hyung Won)…I will provide a highly-opinionated view on the early history of Transformer architectures, focusing on what motivated each development and how each became less relevant with more compute…
What mishap have you done because you were good in ML but not the best in statistics? [Reddit Discussion] I feel like there are many people who are good in ML but not necessarily good in statistics. I am curious about the possible trade-offs of not having a good statistics foundation…
Can LLMs invent better ways to train LLMs? Earlier this year, Sakana AI started leveraging evolutionary algorithms to develop better ways to train foundation models like LLMs. In a recent paper, we have also used LLMs to act as better evolutionary algorithms! Given these surprising results, we began to ask ourselves: Can we also use LLMs to come up with a much better algorithm to train LLMs themselves? We playfully term this self-referential improvement process LLM² (‘LLM-squared’) as an homage to previous fundamental work in meta-learning. As a significant step towards this goal, we’re excited to release our report, Discovering Preference Optimization Algorithms with and for Large Language Models…
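For context on the search space the report explores: the starting point is preference-optimization objectives like DPO, which score a chosen/rejected completion pair by how much the policy's log-probability gap exceeds the reference model's. Below is a minimal numpy sketch of the standard DPO loss (this is the well-known baseline objective, not the paper's discovered one; the variable names are ours):

```python
import numpy as np

def dpo_loss(logp_w_policy, logp_l_policy, logp_w_ref, logp_l_ref, beta=0.1):
    """Standard DPO objective: -log sigmoid(beta * margin), where the margin
    compares policy-vs-reference log-prob gaps on the chosen (w) and
    rejected (l) completions."""
    margin = (logp_w_policy - logp_w_ref) - (logp_l_policy - logp_l_ref)
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))

# If the policy already prefers the chosen completion more strongly than
# the reference does, the margin is positive and the loss drops below log(2).
loss = dpo_loss(-10.0, -12.0, -11.0, -11.0)
print(loss)
```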
jax-diffusion-transformer Implementation of Diffusion Transformer (DiT) in JAX…
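If DiT is new to you: its first step turns an image (or latent) into a sequence of patch tokens that the transformer then processes, exactly as in ViT. A numpy sketch of that patchify step, shapes only (this is an illustration, not the repo's actual code):

```python
import numpy as np

def patchify(x, patch_size):
    """Split an (H, W, C) array into a (num_patches, patch_size**2 * C)
    token sequence, as in ViT/DiT."""
    h, w, c = x.shape
    p = patch_size
    assert h % p == 0 and w % p == 0
    x = x.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (h/p, w/p, p, p, c)
    return x.reshape((h // p) * (w // p), p * p * c)

# A 32x32 latent with 4 channels and 2x2 patches -> 256 tokens of dim 16.
tokens = patchify(np.zeros((32, 32, 4)), patch_size=2)
print(tokens.shape)  # → (256, 16)
```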
PostgreSQL and Pgvector: Now Faster Than Pinecone, 75% Cheaper, and 100% Open Source Introducing pgvectorscale, a new open-source extension that makes PostgreSQL an even better database for AI applications. Pgvectorscale builds upon pgvector to unlock large-scale, high-performance AI use cases previously only achievable with specialized vector databases like Pinecone…
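For readers who haven't touched pgvector: it adds a `vector` column type plus distance operators (`<->` for L2, `<=>` for cosine distance), so nearest-neighbor search is plain SQL. A hedged sketch of the statements involved, assembled as strings for clarity (the table and column names here are made up):

```python
# Illustrative pgvector SQL; "items" and "embedding" are hypothetical names.
dim = 3
create = f"CREATE TABLE items (id bigserial PRIMARY KEY, embedding vector({dim}));"
insert = "INSERT INTO items (embedding) VALUES ('[1,2,3]');"
# "<->" is pgvector's L2-distance operator; LIMIT k returns k nearest neighbors.
query = "SELECT id FROM items ORDER BY embedding <-> '[2,3,4]' LIMIT 5;"
print(query)
```

Pgvectorscale layers a new index type and storage optimizations on top of this same interface.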
The Unreasonable Effectiveness of Human Feedback This post presents quantitative results showing how human feedback allows Foyle to assist with building and operating Foyle. In 79% of cases, Foyle provided the correct answer, whereas ChatGPT alone would lack sufficient context to achieve the intent. Furthermore, the LLM API calls cost less than $0.002 per intent, whereas a recursive, agentic approach could easily cost $2-$10…
Can children (4-8 years old) strategically decide what learning activity to practice when they are free to choose? [PDF link downloads] Yes…[interesting research paper]
How Bad Is the Data Environment where you work? [Reddit Discussion] I just want to know if data and its processes are as shocking as they are where I work. I have bridging tables that don't bridge. I have tables with no keys. I have tables with incomprehensible soup of abbreviations as names. I have columns with the same business name in different databases that have different values and both are incorrect. So many corners have been cut that this environment is a circle. Is it this bad everywhere or is it better where you work? Edit: Please share horror stories, the ones I see so far are hilarious and are making me feel better😅…
An Empirical Study on the Energy Usage and Performance of Pandas and Polars Data Analysis Python Libraries [PDF] We aim to assess the energy usage of Pandas, a widely-used Python data manipulation library, and Polars, a Rust-based library known for its performance. The study aims to provide insights for data scientists by identifying scenarios where one library outperforms the other in terms of energy usage, while exploring the possible correlations between energy and performance metrics…
Incorporating time-varying seasonality in forecast models Seasonality is very common in real-world time series. Many series vary in periodic, regular ways. For example, ice cream sales tend to be higher in warmer holiday months, while counts of migratory birds fluctuate strongly around the annual migration cycle. Because of how pervasive seasonality is, many time series and forecasting methods have been developed specifically to deal with this feature…The purpose of this brief post is to highlight one strategy for capturing seasonality, and time-varying seasonal patterns, in Dynamic Generalized Additive Models…
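The post works in the Dynamic GAM setting, but the core encoding idea transfers anywhere: represent the cycle with a few sine/cosine (Fourier) basis terms and let a model fit (or vary) their coefficients. A minimal numpy sketch of Fourier features for an annual cycle on monthly data (period 12); the function name and defaults are ours:

```python
import numpy as np

def fourier_features(t, period, n_harmonics):
    """Return sin/cos basis columns for a seasonal cycle of the given period.
    Output shape: (len(t), 2 * n_harmonics)."""
    t = np.asarray(t, dtype=float)
    cols = []
    for k in range(1, n_harmonics + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(cols)

# 3 years of monthly steps, annual period, 2 harmonics -> 4 basis columns.
X = fourier_features(np.arange(36), period=12, n_harmonics=2)
print(X.shape)  # → (36, 4)
```

Regressing a series on these columns recovers a fixed seasonal shape; letting the coefficients evolve over time (as the post's GAMs do) captures seasonality that drifts.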
Enhancing Code Completion for Rust in Cody Although most LLMs are trained on corpora that include several programming languages, we often observe differential performance across languages, especially languages like Rust that are not well represented in popular training datasets. In this post, we share early results from our efforts to improve the performance of LLMs for code completion in such languages…
The Prompt Report: A Systematic Survey of Prompting Techniques While prompting is a widespread and highly researched concept, there exists conflicting terminology and a poor ontological understanding of what constitutes a prompt due to the area's nascency. This paper establishes a structured understanding of prompts by assembling a taxonomy of prompting techniques and analyzing their use. We present a comprehensive vocabulary of 33 terms, a taxonomy of 58 text-only prompting techniques, and 40 techniques for other modalities. We further present a meta-analysis of the entire literature on natural language prefix-prompting…
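As a concrete instance of one family in that taxonomy, few-shot prompting just concatenates labeled exemplars ahead of the query. A minimal sketch (the exemplars and template here are ours, not the paper's):

```python
def few_shot_prompt(exemplars, query):
    """Build a few-shot classification prompt by prepending labeled examples."""
    lines = [f"Input: {x}\nLabel: {y}" for x, y in exemplars]
    lines.append(f"Input: {query}\nLabel:")  # model completes the last label
    return "\n\n".join(lines)

prompt = few_shot_prompt(
    [("I loved this movie", "positive"), ("Terrible service", "negative")],
    "The food was great",
)
print(prompt)
```

Much of the surveyed literature is about variations on exactly this template: how many exemplars, in what order, with what label wording.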
A User’s Guide to Statistical Inference and Regression Quantitative research involves a host of choices about the model to use, variables to include, tuning parameters to set, assumptions to make, and so on. Without a deep understanding of statistics, you may find these choices bewildering and confusing, and you may simply (and possibly erroneously) yield to the default settings of your statistical software. The goal of this book is to give you the foundation to make methodological choices for your specific application with knowledge and with confidence….
Language models on the command-line Handout for a talk I gave about LLM and CLI tools…Notes for a talk I gave at Mastering LLMs: A Conference For Developers & Data Scientists…
* Based on unique clicks. ** Find last week's issue #550 here.
Looking to get a job? Check out our “Get A Data Science Job” Course It is a comprehensive course that teaches you everything related to getting a data science job based on answers to thousands of emails from readers like you. The course has 3 sections: Section 1 covers how to get started, Section 2 covers how to assemble a portfolio to showcase your experience (even if you don’t have any), and Section 3 covers how to write your resume. Promote yourself/organization to ~62,000 subscribers by sponsoring this newsletter. 35-45% weekly open rate.
Thank you for joining us this week! :) Stay Data Science-y! All our best, Hannah & Sebastian