Channel: Planet Python

EuroPython Society: Board Report for February 2025


In February, our top priority was event logistics and organizational planning. We worked closely with our event manager, Anežka, on important aspects such as the venue, catering, and other logistics. We're happy to announce that the contract with the venue has been signed!

Another priority was budget planning. Our funding comes from ticket sales and sponsors. We reviewed fixed costs and discussed our strategy for this year. We want to keep the event as affordable as possible to allow more people to attend while also attracting sponsors. At the same time, we need to make sure that the event breaks even and remains financially sustainable in the long term. We also worked on defining sponsorship packages.

The third priority was onboarding the remaining co-leads and teams. Some board members are still involved in specific teams to support new co-leads and other newcomers. We're making sure that everyone has the support and tools they need to contribute.

Individual reports:

Artur

  • Budget: Discussing different scenarios and overall plan for the budget.
  • Sponsorship setup: Packages, team meetings and internal sponsorship flow and infrastructure
  • Updates to the internal discord bot and the community voting app
  • Community: Attending FOSDEM & Python Pizza Brno
  • Finaid: onboarding and working out the plan for setup updates for 2025.
  • Event logistics: Working with Anežka and the rest of the team on various items regarding different providers, contracts and payments.

Mia

  • Comms & Design: worked on the design brief and budget proposal. Drafted and scheduled some community voting and reviews posts. Reviewed others. Comms & Design team calls. Website design coordination & calls.
  • Budget: worked on the proposal & spreadsheets.
  • Sponsorship: helped define sponsorship packages and pricing. Reviewed and helped prepare content for the web and other materials. Coordinated communication between multiple people.
  • Infrastructure: code reviews.
  • Community: attended Brno Python Pizza.

Aris

  • Billing: Setup payments for grants and vendors
  • Billing: Looked into the current billing workflow and how it can be optimized.
  • Budget: Onboarding myself to the spreadsheet, looked and discussed the different scenarios
  • Ops: Onboarding team members, capacity planning and kickoff meeting
  • Community: Attending FOSDEM & Python Pizza Brno

Ege

  • Transfer 2025 Discord server ownership to EPS account
  • 22-24 website migration
  • Website PRs
  • Program API setup with the new deployment logic

Shekhar

  • PR for Visa Application process for EuroPython 2025 Conference.
  • Reviewing the budget sheet for the EuroPython 2025 Conference
  • Reviewing the grant programme and the existing proposals from various conferences.
  • Finaid team coordination and helped launch the FINAID programme.

Cyril

Anders



Lucas Cimon: A review of HTML linters


... and how to use them in CI pipelines.

Comparing the W3C v.Nu HTML checker, html-tidy, htmlhint, html-validate, LintHTML, and html-eslint.



EuroPython: Keynote & Ticket Sales Announcement!


Hello, Pythonistas! 🐍 Welcome back to our cosy corner of community & code!

It's absolutely fantastic to have you here again, especially as we count down the days until our reunion in Prague! 🎉 Quite a few things have happened since our last catch-up, so let's dive right into it!

📣 Programme

We're super excited to announce Savannah Ostrowski as a keynote speaker for EuroPython 2025! 🐍✨

Savannah Ostrowski is a Python Core Developer and a product lead for Python Developer Experience and Notebooks at Snowflake. She helps maintain the argparse module in the Python standard library and the new JIT compiler introduced in Python 3.13. A self-taught developer with a background in geospatial computing, she has built a career at the intersection of developer tools and open-source software.

Before Snowflake, Savannah led a product for Docker’s runtime, working on foundational technology for the container ecosystem, including Docker Engine (moby/moby) and Docker CLI. She previously worked at Microsoft on the Azure Developer CLI and was the product manager for the Pylance language server.

If you've seen any of her previous talks, you know she's a fantastic speaker with practical insights to share, and if you haven't checked out her presentations yet, you're in for a treat!

Join us at EuroPython 2025 to hear Savannah's unique insights and vision for Python's future. It's a session you won't want to miss!


🎟️ Ticket Sales

Get ready, Python enthusiasts! EuroPython 2025 ticket sales start this week! Tutorial spaces are limited, so if you plan to purchase a combined or tutorials-only ticket, we recommend registering soon.

Prices stay the same as last year. For a more detailed announcement, follow us on social media for all the latest updates.

💰 Sponsorship

If you're passionate about supporting EuroPython and helping us make the event accessible to all, consider becoming a sponsor or asking your employer to join us in this effort.

By sponsoring EuroPython, you're not just backing an event – you're gaining highly targeted visibility and the chance to present your company or personal brand to one of the largest and most diverse Python communities in Europe and beyond!

We offer a range of sponsorship tiers, some with limited slots available. Along with our main packages, there are optional add-ons and extras.

🐦 We have an Early Bird 10% discount for companies that sign up by March 28th.🐦

👉 More information at: https://ep2025.europython.eu/sponsorship/sponsor/ 

👉 Contact us at sponsors@europython.eu

💶 Financial Aid

We are also pleased to announce our financial aid programme, sponsored by the EuroPython Society. The goal is to make the conference open to everyone, including those in need of financial assistance.

You can apply for three different types of grants:

  • Free Ticket Voucher Grant: Get a voucher for a free standard in-person Conference Ticket, covering the main conference days and sprints (Wed – Sun).
  • Travel/Accommodation Grant: We will reimburse travel and/or accommodation costs up to €400.
  • Visa Application Fee Grant: Get reimbursed for the costs of a short-stay Schengen visa to the Czech Republic (up to €80).

Submissions for the first round of our financial aid programme are open until April 4th 2025. More information on https://ep2025.europython.eu/finaid/

 🎥 YouTube

Our YouTube channel has hit 4.2M+ views! Ever wondered which talks are the most popular?

Here are the top 10 most-watched talks on our channel:

  • Thomas Perl – Developing Android Apps Completely in Python (Part 2)
  • Dan Taylor – Get Productive with Python in Visual Studio Code
  • Daniel Pope – Programming Physics Games with Python and OpenGL
  • Sebastian Witowski – Writing Faster Python
  • Raymond Hettinger – What Makes Python So AWESOME
  • Mark Shannon – How We’re Making Python 3.11 Faster
  • Omar Mendez – Let’s Play with Python and OpenCV
  • Michał Karzyński – Building Beautiful RESTful APIs with Flask
  • Nicolas Tollervey – Music Theory, Genetic Algorithms, and Python
  • Radoslav Georgiev – Django Structure for Scale and Longevity

 A huge thank you to our amazing speakers who share their knowledge with the community!

📺 Watch them here: https://www.youtube.com/channel/UC98CzaYuFNAA_gOINFB0e4Q


📊 EuroPython Society Board Report 

The EuroPython conference wouldn’t be what it is without the incredible volunteers who make it all happen. 💞 Behind the scenes, there’s also the EuroPython Society—a volunteer-led non-profit that manages the fiscal and legal aspects of running the conference, oversees its organization, and works on a few smaller projects like the grants programme. To keep everyone in the loop and promote transparency, the Board is sharing regular updates on what we’re working on.

The February board report is ready: https://www.europython-society.org/board-report-for-february/

📣 Community Outreach

Beyond merely attending Python gatherings, EuroPython takes an active role in fostering their success. We proudly serve as community sponsors, providing support to regional Python meetups. We are committed to nurturing the broader ecosystem and bolstering Python conferences throughout Europe. 🐍💙

Brno Python Pizza

EuroPython proudly sponsored Brno Python Pizza and had some of our members join the event in person. Special thanks to Jake Balaš for delivering an excellent lightning talk about EuroPython! Thank you to the organizers for putting together such an excellent event, and to all the attendees who made it so special.


💞Upcoming Events in the Python Community

🐣 See You All Next Month

Before saying goodbye, thank you so much for reading. We can’t wait to reunite with all you amazing people in beautiful Prague again. It truly is time to make new Python memories together!

In the meantime, follow us on social media: 

With so much joy and excitement,

EuroPython 2025 Team 🤗

PyCon: PyCon US 2025 - Travel Grants Transparency Blog Post


Providing travel grants to community members for PyCon US and witnessing both their growth and contributions to the event is one of the most fulfilling aspects of our work at the PSF, and every year, we only wish we could award more!

PyCon US 2025 received 952 travel grant applications from 87 countries totaling almost $1.7M. We dislike using the phrase “record-breaking” every year, but it’s true! Again, the number and amount requested have broken our 2024 record. The total dollar amount requested was more than six times the available budget. For 2025, the total PyCon US revenue is budgeted at $2.3M and supports conference costs of $2.5M, including $266K in travel grant funds. The Travel Grant Team offered 272 travel grants and 33 ticket-only grants, amounting to $384K, or about 23% of the amount requested.

The PSF is committed to financial transparency, and in line with that commitment, we are happy to share more about how our PyCon US 2025 travel grant process works. 

Travel Grant Funding

PyCon US travel grants are funded from various sources including PyCon US ticket sales and sponsorships, the Python Core Devs, the Packaging Work Group, the PSF, and generous contributions from PyLadies. In addition, we already have a direct sponsorship for 2025 travel grants from OpenEDG Python Institute!

Inflation in event costs, continued economic risk from the tech sector, and global market swings mean that many corporations are funding less travel for both speakers and attendees. PyCon US has received increased travel grant requests over the past two years. We are working with a limited budget and can only accommodate a portion of those who applied.

Travel Grant Award Philosophy

PyCon US Travel Grants are designed to further our non-profit mission of promoting, protecting, and advancing the Python programming language while supporting the growth of a diverse and global community of Python developers. Our goal is to create an event that mirrors this diversity, unites individuals who will gain valuable experiences and bring those benefits back to their communities, and delivers an engaging and impactful experience from the first tutorial to the final sprint.

To achieve these goals, we allocate part of our budget to fund our incredible lineup of speakers, and this year, to reduce costs, we limited travel grants to up to 2 speakers per talk. We prioritized factors such as global representation, welcoming first-time PyCon US attendees, supporting folks looking for new job opportunities, and inviting students, educators, and community organizers from around the world. Despite our best efforts, the bittersweet reality remains that we couldn't award grants to everyone who applied.

Although we are a US-based conference and non-profit organization, PyCon US strives to bring folks from around the world. In fact, about 75% of travel grant funds were offered to non-US Pythonistas covering 58 countries!


Percentage of Travel Grants Awarded by Continent


The PyCon US Travel Grant Team takes a personalized approach to the awards process, and each application is reviewed multiple times (by humans!). The team continues to be impressed with the true community spirit of the Python community! All travel grant awards for PyCon US 2024 were used by awardees.

Award Outcome

As of the acceptance deadline of March 11th, 2025, 226 full travel and 28 ticket-only grants were accepted by grantees. 9 full travel and 2 ticket-only grants were declined, and a handful of grants expired.

We do not have additional funds to award because the Travel Grant Team intentionally over-awards grants; for reference, our budget was $266K, and we offered $384K. Based on several years of historical travel grant offers, acceptance, and use trends, the team has found that many folks aren’t able to use the travel grant due to changes in personal circumstances, illness, changes in work or education commitments, as well as issues securing a US visa.

You can find more information on the PyCon US Travel Grants FAQ. If you have feedback or questions about our process, please contact us at pycon-aid@python.org.

Sincerely,

The PyCon US 2025 Travel Grant Team

 

p.s. As a final note, we’d love to continue expanding our travel grant program in the future. If you’d like to help do that, a great way is to encourage your employer or other companies you are connected with to sponsor the PSF or to let them know you notice and appreciate it if they are already a sponsor.

Real Python: Using Structural Pattern Matching in Python


Structural pattern matching is a powerful control flow construct invented decades ago that’s traditionally used by compiled languages, especially within the functional programming paradigm.

Most mainstream programming languages have since adopted some form of pattern matching, which offers concise and readable syntax while promoting a declarative code style. Although Python was late to join the party, it introduced structural pattern matching in the 3.10 release.

In this video course, you’ll:

  • Master the syntax of the match statement and case clauses
  • Explore various types of patterns supported by Python
  • Learn about guards, unions, aliases, and name binding
  • Extract values from deeply nested hierarchical data structures
  • Customize pattern matching for user-defined classes
  • Identify and avoid common pitfalls in Python’s pattern matching
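
To give a taste of what those topics look like in practice, here is a small sketch of a match statement with a mapping pattern, a guard, a union, an alias, and name binding (the event structure is invented for illustration and is not from the course):

def describe(event: dict) -> str:
    match event:
        # Mapping pattern with name binding and a guard
        case {"type": "click", "pos": (x, y)} if x >= 0 and y >= 0:
            return f"click at ({x}, {y})"
        # Union of literal patterns with an alias for the matched value
        case {"type": "key", "name": ("enter" | "return") as key}:
            return f"pressed {key}"
        # Wildcard pattern: anything else falls through to here
        case _:
            return "unknown event"

print(describe({"type": "click", "pos": (10, 20)}))  # click at (10, 20)
print(describe({"type": "key", "name": "enter"}))    # pressed enter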

[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

PyBites: FastAPI Deployment Made Easy with Docker and Fly.io


For the PDM program I worked on a FastAPI project to track books using the Google Books API and provide AI-powered recommendations using Marvin AI. As the project came closer to deployment, I knew that I wanted to try out containerization for a reliable and repeatable way to deploy. I chose Docker due to its widespread use, open-source nature, and consistent behavior across environments. If you're new to Docker or looking for a straightforward guide to deploying a FastAPI app with Docker and Fly.io, this post is for you.


FastAPI Set Up for Docker

Before deploying the app, we need to containerize it using Docker. To do this, we start by creating a Dockerfile, which defines how the project will be packaged and run inside the container.

Project Structure

A clean project structure improves build efficiency, dependency management, security, maintainability, and readability. Here is the structure I used for my FastAPI project:

$ tree
.
├── Dockerfile
├── LICENSE
├── Procfile
├── README.md
├── alembic.ini
├── app.py
├── auth.py
├── config.py
├── db.py
├── docker-compose.yml
├── fly.toml
├── heroku.yml
├── main.py
├── migrations
│   ├── README
│   ├── env.py
│   ├── script.py.mako
│   └── versions
│       ├── 2ab73586da75_initial_migration.py
│       ├── 2e3f7780d24b_update_user_book_status_models.py
│       ├── 56b69f39cacf_add_book_index.py
│       └── bcc627763cfc_add_rate_limit_table.py
├── models.py
├── pages
│   ├── ai_recommendations.py
│   ├── login.py
│   ├── saved_books.py
│   └── signup.py
├── pyproject.toml
├── requirements.txt
├── screenshots
│   ├── ai_recommendation.png
│   ├── book_search.png
│   └── saved_books.png
├── services
│   ├── __init__.py
│   ├── google_books.py
│   └── marvin_ai.py
├── styles
│   └── global.css
├── tests
│   ├── __init__.py
│   ├── conftest.py
│   └── test_main.py
└── wait-for-it.sh

8 directories, 38 files

Writing the Dockerfile:

The Dockerfile is the key component that tells Docker how to set up the environment for the FastAPI app. 

# Use the full Python image instead of slim to avoid missing system dependencies
FROM python:3.11

# Set environment variables
ENV PYTHONUNBUFFERED=1 \
   UV_NO_INDEX=1 \
   DEBIAN_FRONTEND=noninteractive

# Set the working directory inside the container
WORKDIR /app

# Install uv globally first
RUN pip install --no-cache-dir uv

# Install dependencies globally (avoiding virtual env issues)
COPY pyproject.toml ./
RUN uv pip install --system -r pyproject.toml

# Copy the rest of the application code
COPY . .

# Expose the FastAPI default port
EXPOSE 8000

# Copy wait-for-it.sh into the container (if needed)
COPY wait-for-it.sh /usr/local/bin/wait-for-it
RUN chmod +x /usr/local/bin/wait-for-it

# Run the application
CMD ["sh", "-c", "/usr/local/bin/wait-for-it db:5432 -- uv run uvicorn main:app --host 0.0.0.0 --port 8000"]

I initially tried using python:3.11-slim to keep the container lightweight but ran into some missing system dependencies. After researching my issues, I decided to go with the full image, which solved the problems I was having. Optimizing the image is a future goal of the project.

To ensure the database is ready before starting the FastAPI app, we use a small script called wait-for-it.sh. This utility blocks the container’s startup until a specified host and port (in our case, the database) becomes available. It’s a lightweight, reliable way to avoid race conditions where the app tries to connect to the database before it’s fully up—something that can often happen in Dockerized deployments where services start concurrently.
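
If you prefer to stay in Python, a minimal stand-in for that check could look like the sketch below (a hypothetical wait_for_db.py helper, not part of the original project): it simply retries a TCP connection until the port accepts or a timeout expires.

import socket
import sys
import time

def wait_for(host: str, port: int, timeout: float = 60.0) -> bool:
    """Block until host:port accepts TCP connections or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=2):
                return True  # the port is accepting connections
        except OSError:
            time.sleep(1)  # not ready yet; wait a moment and retry
    return False

if __name__ == "__main__":
    # Usage: python wait_for_db.py db 5432
    host, port = sys.argv[1], int(sys.argv[2])
    sys.exit(0 if wait_for(host, port) else 1)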

Creating a .dockerignore File

After the Dockerfile is done, we need to set up a .dockerignore file. This is used to keep the Docker image small and clean. It works just like a .gitignore file: we want to exclude files that aren't relevant to the container, as well as any secret keys, environment variables, and so on.

__pycache__/
*.pyc
*.pyo
*.sqlite3
.env
migrations/**/__pycache__/
tests/__pycache__/
.vscode/
.idea/
*.swp
*.swo
.DS_Store

With the Dockerfile and .dockerignore files ready, we are set to build the image locally!

Building and Running the Docker Container Locally

Now we need to build the container and test the FastAPI project before deploying it. This will ensure that the project will run smoothly in a Dockerized environment. 

Building the Image

The first step is to build the Docker image. A Docker image is a blueprint for containers: it packages the code, dependencies, and environment configuration. To build the image, run the command below in the root of the project (where the Dockerfile is located):

docker build -t read-radar-api .

  • docker build tells Docker to build an image
  • -t read-radar-api will assign a tag (read-radar-api) to the image for reference.
  • . specifies the current directory as the build context. 

Running the Container

Once the image is successfully built, we can run the container. Use the command below to start it:

docker run -p 8000:8000 read-radar-api

  • docker run will start a new container
  • -p 8000:8000 will map port 8000 on your machine to port 8000 inside of the container
  • read-radar-api is the name of the Docker image just built

After running this command, logs from Uvicorn should be available. This means that your FastAPI app is running inside the container!

You can test running the app by going to http://localhost:8000/docs in your browser. If working correctly, you’ll see the Swagger UI for the FastAPI interactive docs. 

Using Docker Compose for PostgreSQL

For my project I used PostgreSQL as the database, and we will also need to get that set up before deploying to Fly.io.

Docker Compose makes a PostgreSQL database simple to set up. Just create a docker-compose.yml file in the root of the project directory:

version: "3.8"
services:
 db:
   image: postgres:15
   container_name: postgres_db
   restart: always
   environment:
     POSTGRES_USER: ${POSTGRES_USER}
     POSTGRES_PASSWORD: ${POSTGRES_PASSWORD}
     POSTGRES_DB: ${POSTGRES_DB}
   volumes:
     - postgres_data:/var/lib/postgresql/data
   ports:
     - "5432:5432"
 web:
   build: .
   container_name: fastapi_app
   depends_on:
     - db
   environment:
     DATABASE_URL: ${DATABASE_URL}
   ports:
     - "8000:8000"
   volumes:
     - .:/app
   command: ["/usr/local/bin/wait-for-it", "db:5432", "--", "uv", "run", "python", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000", "--reload"]
volumes:
 postgres_data:

Once that file is set up, we can have Docker spin up both services using:

docker-compose up -d

This will start both services: the FastAPI app will be running on localhost, and PostgreSQL will be available at postgresql://myuser:mypassword@localhost:5432/mydatabase (change the user, password, and database name to match yours).

Once both are running successfully, we are ready to move onto deploying to Fly.io.


Deploying to Fly.io

Why Fly.io?

Containerization is great, but without deploying to the cloud, others won't be able to access the app. There are a variety of deployment options, such as Fly.io or Heroku. I chose Fly.io because I was having issues with Heroku that required me to contact support. Fly.io turned out to be pretty simple to use, supports Docker containers out of the box, and is relatively cheap now that it has moved to a pay-as-you-go system. You can also set up credits so that you don't run the risk of getting a surprisingly high bill.

Setting Up Fly.io

The flyctl CLI is needed to deploy, and it can be installed with the following command:

curl -L https://fly.io/install.sh | sh

Then just restart the terminal and login. It will guide you through creating an account if you don’t have one yet:

flyctl auth login

Once logged in, we can initialize the Fly.io app. Once in your FastAPI project folder, run the following command:

flyctl launch

This will detect the Dockerfile and create a fly.toml configuration file. 

Setting Up PostgreSQL on Fly.io

Fly.io provides a managed PostgreSQL service, which makes the choice of database pretty easy if you are planning on using Fly.io like I did. 

To set up the database just use the following command:

flyctl postgres create

Just like initializing the app, it’ll ask you for a region for the database. Fly.io will then provision a PostgreSQL database and then provide a database connection URL. It’s super important to save this URL!

Now that the FastAPI app and PostgreSQL database are set up in Fly.io, we need to connect them! It’s a simple command that will link the PostgreSQL instance to the FastAPI app and then store the database connection URL as a Fly.io secret. Run the following command:

flyctl postgres attach --app read-radar-api

Deploying the App

Now we can deploy the FastAPI container to Fly.io! Just do the following command:

flyctl deploy

This will build the Docker image, push it to Fly.io’s container registry, and deploy it in the chosen region. Once it successfully completes, Fly.io will provide the public URL for the API:

https://read-radar-api.fly.dev/docs

Applying Database Migrations

Once deployed, we will need to apply database migrations in the deployed container. I've used Alembic for this, and it makes the process simple. First, we need to open an SSH session into the Fly app:

flyctl ssh console

Then run:

alembic upgrade head

This will apply the database migrations to the deployed container.

Later I learned that Fly supports release commands that run before a new deployment becomes active, so you can automate this step by adding this to your fly.toml file:

[deploy]
  release_command = "alembic upgrade head"

Debugging & Testing

Here are a few simple commands to check how the app is doing on Fly.io.

To check the logs for any error just run:

flyctl logs

To check the app’s status:

flyctl status

And if the app crashes, you can redeploy by doing:

flyctl deploy --remote-only


Final Thoughts

With that, deploying the Docker container to Fly.io is complete! The next step would be to integrate it with a frontend. For my project I used Streamlit, and it was a fairly easy process since all of the logic was done on the FastAPI backend. Other future enhancements would be CI/CD for automated deployments and using Fly.io's auto-scaling features.

This project was a great introduction to Docker and Fly.io! The combination of clear documentation, guidance from my coach, and help from ChatGPT got me through the tricky parts. As we enter this new age of AI, it’s powerful to ask specific questions to AI — just be sure to understand and apply its suggestions wisely. That said, having a coach to provide tailored feedback and accountability made all the difference in truly grasping the concepts.

I hope this guide will help others implement Docker and Fly.io!

PyCoder’s Weekly: Issue #673: Textual UIs, Tail-Call Performance, Bidirectional Generators, and More (March 18, 2025)


#673 – MARCH 18, 2025
View in Browser »



Python Textual: Build Beautiful UIs in the Terminal

Textual is a Python library for building text-based user interfaces (TUIs) that support rich text, advanced layouts, and event-driven interactivity in the terminal. This tutorial showcases some of the ways you can design an appealing and engaging UI using Textual.
REAL PYTHON

Performance of the Python 3.14 Tail-Call Interpreter

Prior reports of 10% speed-up from the tail-call interpreter coming in Python 3.14 may be overstated. This article breaks down where that number came from and what the reality may be.
NELSON ELHAGE

Postman AI Agent Builder Is Here: The Quickest Way to Build AI Agents. Start Building


Postman AI Agent Builder is a suite of solutions that accelerates agent development. With centralized access to the latest LLMs and APIs from over 18,000 companies, plus no-code workflows, you can quickly connect critical tools and build multi-step agents — all without writing a single line of code →
POSTMAN (sponsor)

Binary Search as a Bidirectional Generator

Python generators support a .send() method, allowing you to receive data within the generator itself. This post talks about how to use this to implement a binary search algorithm.
RODRIGO GIRÃO SERRÃO
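
As a rough illustration of the idea (this is a sketch, not the article's own code), the generator below yields its current guess and receives steering information back through .send():

def bisect_gen(lo: int, hi: int):
    """Yield midpoints; the caller sends -1 (go lower) or +1 (go higher)."""
    while lo < hi:
        mid = (lo + hi) // 2
        feedback = yield mid
        if feedback < 0:
            hi = mid        # the target is below the current guess
        else:
            lo = mid + 1    # the target is above the current guess

target = 37                 # assumes the target lies within the search range
search = bisect_gen(0, 100)
guess = next(search)        # prime the generator to get the first guess
while guess != target:
    guess = search.send(-1 if target < guess else 1)
print(f"found {guess}")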

Articles & Tutorials

Providing Multiple Constructors in Your Python Classes

In this step-by-step tutorial, you’ll learn how to provide multiple constructors in your Python classes. To this end, you’ll learn different techniques, such as checking argument types, using default argument values, writing class methods, and implementing single-dispatch methods.
REAL PYTHON
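
For instance, one of the techniques covered, a class method acting as an alternative constructor, can be sketched like this (the Point class is invented for illustration):

class Point:
    def __init__(self, x: float, y: float) -> None:
        self.x = x
        self.y = y

    @classmethod
    def from_tuple(cls, coords: tuple[float, float]) -> "Point":
        # Alternative constructor: build a Point from an (x, y) pair
        return cls(*coords)

p1 = Point(1.0, 2.0)               # regular constructor
p2 = Point.from_tuple((3.0, 4.0))  # alternative constructor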

Getting to Know Duck Typing in Python

In this video course, you’ll learn about duck typing in Python—a type system based on an object’s behavior rather than inheritance. By taking advantage of duck typing, you can create flexible and decoupled sets of Python classes that work together or independently.
REAL PYTHON (course)

Detect and Localize Anomalies With Intel AI, Powered by OpenVINO

Discover Anomalib, a library of ready-to-use algorithms for efficient anomaly detection. Optimized to run locally and open source, it’s designed to help you spot the odd one out. Get the code on GitHub.
INTEL CORPORATION (sponsor)

Satellogic’s Open Satellite Feed

This post explores the “Satellogic EarthView” data feed, starting with determining where the satellites are, and moving to the corresponding ground imagery. The post uses a combination of Python and DuckDB to achieve its objectives.
MARK LITWINTSCHIK

The Hierarchy of Controls

This article, subtitled “how to stop devs from dropping prod” takes an idea from mechanical engineering used to ensure safety around machinery and brings it to the software world to prevent accidental destruction of data.
HILLEL WAYNE

Build a Dice-Rolling Application With Python

In this step-by-step project, you’ll build a dice-rolling simulator app with a minimal text-based user interface using Python. The app will simulate the rolling of up to six dice. Each individual die will have six sides.
REAL PYTHON

“Rules” That Terminal Programs Follow

The conventions that most terminal programs follow mean that you can more easily know how to control them. Julia’s post talks about “rules” that terminal programs tend to follow, and so should yours.
JULIA EVANS

Sustainable Coding: How Do I Apply It as a Cloud Engineer?

Choices we make as programmers affect the amount of processing power required in production, and ultimately that has a carbon cost. Ed’s post talks about how he thinks about this larger picture.
ED CREWE

Font Ligatures for Your Code Editor and Terminal

A font ligature combines two characters into a single rendering, allowing “>=” to look like a single symbol. This article shows you how you can do this with common terminals and editors.
MIGUEL GRINBERG

Python Discord 2024 Survey Report

The Python Discord server does an annual survey. This page is a giant notebook showing the results for the last four years along with the code that generates the corresponding graphs.
PYTHON DISCORD

The Boolean Trap

Often when using a Boolean in an API you are making the API harder to understand. This post explains why most of the time you should use enums instead.
ENGINEER’S CODEX

Faster Branch Coverage Measurement

After nearly two years, Ned thinks this is finally ready: coverage.py can use sys.monitoring to more efficiently measure branch coverage.
NED BATCHELDER

Projects & Code

Events


Happy Pythoning!
This was PyCoder’s Weekly Issue #673.
View in Browser »


[ Subscribe to 🐍 PyCoder’s Weekly 💌 – Get the best Python news, articles, and tutorials delivered to your inbox once a week >> Click here to learn more ]

Python Morsels: Refactoring long boolean expressions


You can improve the readability of long boolean expressions by splitting your code, naming sub-expressions, or rewriting your Boolean logic.

Breaking up long expressions

Here's a fairly long Boolean expression:

from datetime import datetime

event = {"name": "Intro", "date": datetime(2030, 3, 6), "full": False}
user = {"name": "jill", "verified": True, "role": "admin", "perms": ["edit"]}

if user["verified"] and event["date"] > datetime.now() and not event["full"]:
    print("Here's the event signup form...")

We could make this a little bit more readable by splitting our code up over multiple lines (thanks to implicit line continuations), with each line starting with a Boolean operator:

from datetime import datetime

event = {"name": "Intro", "date": datetime(2030, 3, 6), "full": False}
user = {"name": "jill", "verified": True, "role": "admin", "perms": ["edit"]}

if (
    user["verified"]
    and event["date"] > datetime.now()
    and not event["full"]
):
    print("Here's the event signup form...")

Or we could put the Boolean operators at the end of each line, if we prefer:

from datetime import datetime

event = {"name": "Intro", "date": datetime(2030, 3, 6), "full": False}
user = {"name": "jill", "verified": True, "role": "admin", "perms": ["edit"]}

if (
    user["verified"] and
    event["date"] > datetime.now() and
    not event["full"]
):
    print("Here's the event signup form...")

But PEP8 (the official Python style guide) recommends putting binary operators (operators that go in between two values, like and and or) at the beginning of each line, for the sake of readability.

That way it's a little bit easier to see at a glance how we're joining our sub-expressions together:

from datetime import datetime

event = {"name": "Intro", "date": datetime(2030, 3, 6), "full": False}
user = {"name": "jill", "verified": True, "role": "admin", "perms": ["edit"]}

if (
    user["verified"]
    and event["date"] > datetime.now()
    and not event["full"]
):
    print("Here's the event signup form...")

Naming sub-expressions with variables

We could also try using …

Read the full article: https://www.pythonmorsels.com/refactoring-boolean-expressions/


Django Weblog: Django 5.2 release candidate 1 released


Django 5.2 release candidate 1 is the final opportunity for you to try out a composite of new features before Django 5.2 is released.

The release candidate stage marks the string freeze and the call for translators to submit translations. Provided no major bugs are discovered that can't be solved in the next two weeks, Django 5.2 will be released on or around April 2. Any delays will be communicated on the Django forum.

Please use this opportunity to help find and fix bugs (which should be reported to the issue tracker). You can grab a copy of the release candidate package from our downloads page or on PyPI.

The PGP key ID used for this release is Sarah Boyce: 3955B19851EA96EF

Real Python: Quiz: Python's Bytearray

Real Python: LangGraph: Build Stateful AI Agents in Python


LangGraph is a versatile Python library designed for stateful, cyclic, and multi-actor Large Language Model (LLM) applications. LangGraph builds upon its parent library, LangChain, and allows you to build sophisticated workflows that are capable of handling the complexities of real-world LLM applications.

By the end of this tutorial, you’ll understand that:

  • You can use LangGraph to build LLM workflows by defining state graphs with nodes and edges.
  • LangGraph expands LangChain’s capabilities by providing tools to build complex LLM workflows with state, conditional edges, and cycles.
  • LLM agents in LangGraph autonomously process tasks using state graphs to make decisions and interact with tools or APIs.
  • You can use LangGraph independently of LangChain, although they’re often used together to complement each other.

Explore the full tutorial to gain hands-on experience with LangGraph, including setting up workflows and building a LangGraph agent that can autonomously parse emails, send emails, and interact with API services.
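
To make the state graph idea concrete before you dive in, here is a minimal sketch of a one-node graph (the state shape and node name are invented for illustration, and the exact API surface may differ slightly between LangGraph versions):

from typing import TypedDict

from langgraph.graph import END, StateGraph

class GreetState(TypedDict):
    name: str
    greeting: str

def greet(state: GreetState) -> dict:
    # Each node receives the current state and returns the keys it updates
    return {"greeting": f"Hello, {state['name']}!"}

builder = StateGraph(GreetState)
builder.add_node("greet", greet)
builder.set_entry_point("greet")  # the graph starts at the "greet" node
builder.add_edge("greet", END)    # ...and finishes after it runs

graph = builder.compile()
print(graph.invoke({"name": "Ada", "greeting": ""}))
# {'name': 'Ada', 'greeting': 'Hello, Ada!'}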

While you’ll get a brief primer on LangChain in this tutorial, you’ll benefit from having prior knowledge of LangChain fundamentals. You’ll also want to ensure you have intermediate Python knowledge—specifically in object-oriented programming concepts like classes and methods.

Get Your Code: Click here to download the free sample code that you’ll use to build stateful AI agents with LangGraph in Python.

Take the Quiz: Test your knowledge with our interactive “LangGraph: Build Stateful AI Agents in Python” quiz. You’ll receive a score upon completion to help you track your learning progress:


Interactive Quiz: LangGraph: Build Stateful AI Agents in Python

Take this quiz to test your understanding of LangGraph, a Python library designed for stateful, cyclic, and multi-actor Large Language Model (LLM) applications. By working through this quiz, you'll revisit how to build LLM workflows and agents in LangGraph.

Install LangGraph

LangGraph is available on PyPI, and you can install it with pip. Open a terminal or command prompt, create a new virtual environment, and then run the following command:

Shell
(venv) $ python -m pip install langgraph

This command will install the latest version of LangGraph from PyPI onto your machine. To verify that the installation was successful, start a Python REPL and import LangGraph:

Python
>>> import langgraph

If the import runs without error, then you’ve successfully installed LangGraph. You’ll also need a few more libraries for this tutorial:

Shell
(venv) $ python -m pip install langchain-openai "pydantic[email]"

You’ll use langchain-openai to interact with OpenAI LLMs, but keep in mind that you can use any LLM provider you like with LangGraph and LangChain. You’ll use pydantic to validate the information your agent parses from emails.

Before moving forward, if you choose to use OpenAI, make sure you’re signed up for an OpenAI account and that you have a valid API key. You’ll need to set the following environment variable before running any examples in this tutorial:

.env
OPENAI_API_KEY=<YOUR-OPENAI-API-KEY>

Note that while LangGraph was made by the creators of LangChain, and the two libraries are highly compatible, it’s possible to use LangGraph without LangChain. However, it’s more common to use LangChain and LangGraph together, and you’ll see throughout this tutorial how they complement each other.

With that, you’ve installed all the dependencies you’ll need for this tutorial, and you’re ready to create your LangGraph email processor. Before diving in, you’ll take a brief detour to set up quick sanity tests for your app. Then, you’ll go through an overview of LangChain chains and explore LangGraph’s core concept—the state graph.

Create Test Cases

When developing AI applications, testing and performance tracking is crucial for understanding how your chain, graph, or agent performs in the real world. While performance tracking is out of scope for this tutorial, you’ll use several example emails to test your chains, graphs, and agent, and you’ll empirically inspect whether their outputs are correct.

To avoid redefining these examples each time, create the following Python file with example emails:

Python (example_emails.py)
EMAILS = [
    # Email 0
    """
    Date: October 15, 2024
    From: Occupational Safety and Health Administration (OSHA)
    To: Blue Ridge Construction, project 111232345 - Downtown Office Complex
    Location: Dallas, TX

    During a recent inspection of your construction site at 123 Main Street,
    the following safety violations were identified:

    Lack of fall protection: Workers on scaffolding above 10 feet were without
    required harnesses or other fall protection equipment. Unsafe scaffolding
    setup: Several scaffolding structures were noted as lacking secure base
    plates and bracing, creating potential collapse risks. Inadequate personal
    protective equipment (PPE): Multiple workers were found without proper PPE,
    including hard hats and safety glasses.

    Required Corrective Actions:

    Install guardrails and fall arrest systems on all scaffolding over 10 feet.
    Conduct an inspection of all scaffolding structures and reinforce unstable
    sections. Ensure all workers on-site are provided with necessary PPE and
    conduct safety training on proper usage.

    Deadline for Compliance: All violations must be rectified by November 10,
    2024. Failure to comply may result in fines of up to $25,000 per violation.

    Contact: For questions or to confirm compliance, please reach out to the
    OSHA regional office at (555) 123-4567 or email compliance.osha@osha.gov.
    """,
    # Email 1
    """
    From: debby@stack.com

    Hey Betsy,

    Here's your invoice for $1000 for the cookies you ordered.
    """,
    # Email 2
    """
    From: tdavid@companyxyz.com

    Hi Paul,

    We have an issue with the HVAC system your team installed in apartment 1235.
    We'd like to request maintenance or a refund.

    Thanks,
    Terrance
    """,
    # Email 3
    """
    Date: January 10, 2025
    From: City of Los Angeles Building and Safety Department
    To: West Coast Development, project 345678123 - Sunset Luxury Condominiums
    Location: Los Angeles, CA

    Following an inspection of your site at 456 Sunset Boulevard, we have
    identified the following building code violations:

    Electrical Wiring: Exposed wiring was found in the underground parking
    garage, posing a safety hazard. Fire Safety: Insufficient fire extinguishers
    were available across multiple floors of the structure under construction.
    Structural Integrity: The temporary support beams in the eastern wing do not
    meet the load-bearing standards specified in local building codes.

    Required Corrective Actions:

    Replace or properly secure exposed wiring to meet electrical safety
    standards. Install additional fire extinguishers in compliance with fire
    code requirements. Reinforce or replace temporary support beams to ensure
    structural stability.

    Deadline for Compliance: Violations must be addressed no later than
    February 5, 2025. Failure to comply may result in a stop-work order and
    additional fines.

    Contact: For questions or to schedule a re-inspection, please contact the
    Building and Safety Department at (555) 456-7890 or email
    inspections@lacity.gov.
    """,
]

You can read through these right now if you want, but you’ll get links back to these test emails throughout the tutorial.

Work With State Graphs

Read the full article at https://realpython.com/langgraph-python/ »


[ Improve Your Python With 🐍 Python Tricks 💌 – Get a short & sweet Python Trick delivered to your inbox every couple of days. >> Click here to learn more and see examples ]

Quansight Labs Blog: Quansight Labs Annual Report 2024: Year of focus and execution

Presenting our 2024 annual report! Read about our open source project and community highlights, initiatives, and work culture.

PyCon: Refund Policy for International Attendees


International travel to the United States has become more complex for many in our community. PyCon US welcomes all community members to Pittsburgh and we are committed to running a safe and friendly event for everyone who is joining us for PyCon US in Pittsburgh.

Each nation has its own relationship with the United States, so please contact your country’s State Department, Travel Ministry or Department of Foreign Affairs for travel information specific to traveling from your country to the US. Ultimately, each person must make their own decision based on their personal risk assessment and the travel conditions.

If it feels feasible and safe for you to attend PyCon US this year, then we’d love to see you! It is more important than ever to connect with our fellow community members. In light of current conditions, PyCon US would like to highlight the support we provide for international travelers.

Refund Policy Details

If your PyCon US trip is canceled due to not being able to obtain a visa or you are denied entry at the US border with a valid visa; or if you have COVID, influenza, measles, or other communicable diseases, PyCon US will grant you a refund of your ticket and waive the cancellation fee.

Additionally, if you have a valid visa to travel to the United States and are denied entry upon arrival to the United States, please see the details below and note that you will need to provide documentation that you arrived in the United States and were denied entry.

  • Airfare:
    • Please request a refund of your airfare from your airline carrier or booking agent.
    • PyCon US will reimburse you the portion of the PyCon US airfare the airline will not refund.
  • Hotel:
    • If you used the PyCon US registration system to book a hotel room, the PyCon US registration team will personally work with the hotel to make sure you do not have to pay a cancellation fee.

Please note the above policy only applies to attendees not traveling to or from OFAC sanctioned countries.

In the event that you have a refund request or questions, please contact our registration team.

PyCon US hopes that the expanded refund and travel support policy offers attendees the ability to plan more confidently and will continue to make PyCon US 2025 an option for as many Pythonistas from around the world as possible.

PyBites: Optimizing Python: Understanding Generator Mechanics, Expressions, and Efficiency


Python generators provide an elegant mechanism for handling iteration, particularly for large datasets where traditional approaches may be memory-intensive. Unlike standard functions that compute and return all values at once, generators produce values on demand through the yield statement, enabling efficient memory usage and creating new possibilities for data processing workflows.

Generator Function Mechanics

At their core, generator functions appear similar to regular functions but behave quite differently. The defining characteristic is the yield statement, which fundamentally alters the function’s execution model:

def simple_generator():
    print("First yield")
    yield 1
    print("Second yield")
    yield 2
    print("Third yield")
    yield 3

When you call this function, it doesn’t execute immediately. Instead, it returns a generator object:

gen = simple_generator()
print(gen)
# <generator object simple_generator at 0x000001715CA4B7C0>

This generator object controls the execution of the function, producing values one at a time when requested:

value = next(gen)  # Prints "First yield" and returns 1
value = next(gen)  # Prints "Second yield" and returns 2

State Preservation and Execution Pausing

What makes generators special is their ability to pause execution and preserve state. When a generator reaches a yield statement:

  1. Execution pauses
  2. The yielded value is returned to the caller
  3. All local state (variables, execution position) is preserved
  4. When next() is called again, execution resumes from exactly where it left off

This mechanism creates an efficient way to work with sequences without keeping the entire sequence in memory at once.


Execution Model and Stack Frame Suspension

Generators operate with independent stack frames, meaning their execution context remains intact between successive calls. Unlike standard functions, which discard their execution frames upon return, generators maintain their internal state until exhausted, allowing efficient handling of sequences without redundant recomputation.

When a normal function returns, its stack frame (containing local variables and execution context) is immediately destroyed. In contrast, a generator’s stack frame is suspended when it yields a value and resumed when next() is called again. This suspension and resumption is managed by the Python interpreter, maintaining the exact state of all variables and the instruction pointer.

This unique execution model is what enables generators to act as efficient iterators over sequences that would be impractical to compute all at once, such as infinite sequences or large data transformations.


Generator Control Flow and Multiple yield points

Generators can contain multiple yield statements and complex control flow:

def fibonacci_generator(limit):
    a, b = 0, 1
    while a < limit:
        yield a
        a, b = b, a + b

# Multiple yield points with conditional logic
def conditional_yield(data):
    for item in data:
        if item % 2 == 0:
            yield f"Even: {item}"
        else:
            yield f"Odd: {item}"

This flexibility allows generators to implement sophisticated iteration patterns while maintaining their lazy evaluation benefits.


Memory Efficiency: The Key Advantage

The primary benefit of generators is their memory efficiency. Let’s compare standard functions and generators:

def get_all_numbers(numbers: int):
    """Normal function - allocates memory for the entire list at once"""
    result = []
    for i in range(numbers):
        result.append(i)
    return result

def yield_all_numbers(numbers: int):
    """Generator - produces one value at a time"""
    for i in range(numbers):
        yield i

To quantify the difference:

import sys
regular_list = get_all_numbers(1000000)
generator = yield_all_numbers(1000000)

print(f"List size: {sys.getsizeof(regular_list)} bytes")
print(f"Generator size: {sys.getsizeof(generator)} bytes")
# List size: 8448728 bytes
# Generator size: 208 bytes

This dramatic difference in memory usage makes generators invaluable when working with large datasets that would otherwise consume excessive memory.


Generator Expressions

Python offers a concise syntax for creating generators called generator expressions. These are similar to list comprehensions but use parentheses and produce values lazily:

# List comprehension - creates the entire list in memory
squares_list = [x * x for x in range(10)]


# Generator expression - creates values on demand
squares_gen = (x * x for x in range(10))

The performance difference becomes significant with large datasets:

import sys
import time

# Compare memory usage and creation time for large dataset
start = time.time()
list_comp = [x for x in range(100_000_000)]
list_time = time.time() - start
list_size = sys.getsizeof(list_comp)

start_gen = time.time()
gen_exp = (x for x in range(100_000_000))
gen_time = time.time() - start_gen
gen_size = sys.getsizeof(gen_exp)

print(f"List comprehension: {list_size:,} bytes, created in {list_time:.4f} seconds")
# List comprehension: 835,128,600 bytes, created in 4.9007 seconds

print(f"Generator expression: {gen_size:,} bytes, created in {gen_time:.4f} seconds")
# Generator expression: 200 bytes, created in 0.0000 seconds

Minimal Memory, Maximum Speed

The generator expression is so fast (effectively zero seconds) because the Python interpreter doesn’t actually compute or store any of those 100 million numbers yet. Instead, the generator expression simply creates an iterator object that remembers:

  • How to produce the numbers (x for x in range(100_000_000)).
  • The current state (initially, the start point).

The size reported (200 bytes) is the memory footprint of the generator object itself, which includes a pointer to the generator’s code object and the internal state required to track iteration, but none of the actual values yet.


Chaining and Composing Generators

One of the elegant aspects of generators is how easily they can be composed. Python’s itertools module provides utilities that enhance this capability:

from itertools import chain, filterfalse

# Chain multiple generator expressions together
result = chain((x * x for x in range(10)), (y + 10 for y in range(5)))

# Filter values from a generator
odd_squares = filterfalse(lambda x: x % 2 == 0, (x * x for x in range(10)))

# Transform values from a generator
doubled_values = map(lambda x: x * 2, range(10))

Final Thoughts: When to Use Generators

Python generators offer an elegant, memory-efficient approach to iteration. By yielding values one at a time as they’re needed, generators allow you to handle datasets that would otherwise overwhelm available memory. Their distinct execution model, combining state preservation with lazy evaluation, makes them exceptionally effective for various data processing scenarios.

Generators particularly shine in these use cases:

  • Large Dataset Processing: Manage extensive datasets that would otherwise exceed memory constraints if loaded entirely.
  • Streaming Data Handling: Effectively process data that continuously arrives in real-time.
  • Composable Pipelines: Create data transformation pipelines that benefit from modular and readable design.
  • Infinite Sequences: Generate sequences indefinitely, processing elements until a specific condition is met.
  • File Processing: Handle files line-by-line without needing to load them fully into memory.
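
As a concrete illustration of the pipeline and file-processing use cases above, here is a small sketch (the log file name and the filtering rule are hypothetical):

def read_lines(path: str):
    """Lazily yield lines from a file, one at a time."""
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

def errors_only(lines):
    """Filter stage: keep only lines that mention an error."""
    return (line for line in lines if "ERROR" in line)

def extract_messages(lines):
    """Transform stage: keep only the text after the first colon."""
    return (line.split(":", 1)[-1].strip() for line in lines)

# Compose the stages; no line is read until we start iterating
pipeline = extract_messages(errors_only(read_lines("app.log")))

for message in pipeline:
    print(message)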

For smaller datasets (typically fewer than a few thousand items), the memory advantages of generators may not be significant, and standard lists could provide better readability and simplicity.

In an upcoming companion article, I’ll delve deeper into how these fundamental generator concepts support sophisticated techniques to tackle real-world challenges, such as managing continuous data streams.

PyBites: Optimizing Python: Understanding Generator Mechanics, Expressions, and Efficiency

$
0
0

Python generators provide an elegant mechanism for handling iteration, particularly for large datasets where traditional approaches may be memory-intensive. Unlike standard functions that compute and return all values at once, generators produce values on demand through the yield statement, enabling efficient memory usage and creating new possibilities for data processing workflows.

Generator Function Mechanics

At their core, generator functions appear similar to regular functions but behave quite differently. The defining characteristic is the yield statement, which fundamentally alters the function’s execution model:

def simple_generator():
    print("First yield")
    yield 1
    print("Second yield")
    yield 2
    print("Third yield")
    yield 3

When you call this function, it doesn’t execute immediately. Instead, it returns a generator object:

gen = simple_generator()
print(gen)
# <generator object simple_generator at 0x000001715CA4B7C0>

This generator object controls the execution of the function, producing values one at a time when requested:

value = next(gen)  # Prints "First yield" and returns 1
value = next(gen)  # Prints "Second yield" and returns 2

State Preservation and Execution Pausing

What makes generators special is their ability to pause execution and preserve state. When a generator reaches a yield statement:

  1. Execution pauses
  2. The yielded value is returned to the caller
  3. All local state (variables, execution position) is preserved
  4. When next()is called again, execution resumes from exactly where it left off

This mechanism creates an efficient way to work with sequences without keeping the entire sequence in memory at once.


Execution Model and Stack Frame Suspension

Generators operate with independent stack frames, meaning their execution context remains intact between successive calls. Unlike standard functions, which discard their execution frames upon return, generators maintain their internal state until exhausted, allowing efficient handling of sequences without redundant recomputation.

When a normal function returns, its stack frame (containing local variables and execution context) is immediately destroyed. In contrast, a generator’s stack frame is suspended when it yields a value and resumed when next() is called again. This suspension and resumption is managed by the Python interpreter, maintaining the exact state of all variables and the instruction pointer.

This unique execution model is what enables generators to act as efficient iterators over sequences that would be impractical to compute all at once, such as infinite sequences or large data transformations.


Generator Control Flow and Multiple yield points

Generators can contain multiple yield statements and complex control flow:

def fibonacci_generator(limit):
    a, b = 0, 1
    while a < limit:
        yield a
        a, b = b, a + b

# Multiple yield points with conditional logic
def conditional_yield(data):
    for item in data:
        if item % 2 == 0:
            yield f"Even: {item}"
        else:
            yield f"Odd: {item}"

This flexibility allows generators to implement sophisticated iteration patterns while maintaining their lazy evaluation benefits.


Memory Efficiency: The Key Advantage

The primary benefit of generators is their memory efficiency. Let’s compare standard functions and generators:

def get_all_numbers(numbers: list):
    """Normal function - allocates memory for entire list at once"""
    result = []
    for i in range(numbers):
        result.append(i)
    return result

def yield_all_numbers(numbers: list):
    """Generator - produces one value at a time"""
    for i in range(numbers):
        yield i

To quantify the difference:

import sys
regular_list = get_all_numbers(1000000)
generator = yield_all_numbers(1000000)

print(f"List size: {sys.getsizeof(regular_list)} bytes")
print(f"Generator size: {sys.getsizeof(generator)} bytes")
# List size: 8448728 bytes
# Generator size: 208 bytes

This dramatic difference in memory usage makes generators invaluable when working with large datasets that would otherwise consume excessive memory.


Generator Expressions

Python offers a concise syntax for creating generators called generator expressions. These are similar to list comprehensions but use parentheses and produce values lazily:

# List comprehension - creates the entire list in memory
squares_list = [x * x for x in range(10)]


# Generator expression - creates values on demand
squares_gen = (x * x for x in range(10))

The performance difference becomes significant with large datasets:

import sys
import time

# Compare memory usage and creation time for large dataset
start = time.time()
list_comp = [x for x in range(100_000_000)]
list_time = time.time() - start
list_size = sys.getsizeof(list_comp)

start_gen = time.time()
gen_exp = (x for x in range(100_000_000))
gen_time = time.time() - start_gen
gen_size = sys.getsizeof(gen_exp)

print(f"List comprehension: {list_size:,} bytes, created in {list_time:.4f} seconds")
# List comprehension: 835,128,600 bytes, created in 4.9007 seconds

print(f"Generator expression: {gen_size:,} bytes, created in {gen_time:.4f} seconds")
# Generator expression: 200 bytes, created in 0.0000 seconds

Minimal Memory, Maximum Speed

The generator expression is so fast (effectively zero seconds) because the Python interpreter doesn’t actually compute or store any of those 100 million numbers yet. Instead, the generator expression simply creates an iterator object that remembers:

  • How to produce the numbers (x for x in range(100_000_000)).
  • The current state (initially, the start point).

The size reported (200 bytes) is the memory footprint of the generator object itself, which includes a pointer to the generator’s code object and the internal state required to track iteration, but none of the actual values yet.
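
We can peek at that suspended state through the generator’s gi_frame attribute. This is a minimal, CPython-specific sketch with a hypothetical running_total generator:

def running_total(values):
    total = 0
    for v in values:
        total += v
        yield total

gen = running_total([5, 10, 15])
next(gen)                      # advance to the first yield
print(gen.gi_frame.f_locals)   # {'values': [5, 10, 15], 'total': 5, 'v': 5}
print(gen.gi_frame.f_lasti)    # bytecode offset where execution is paused
list(gen)                      # exhaust the generator
print(gen.gi_frame)            # None - the frame is released once the generator finishes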


Chaining and Composing Generators

One of the elegant aspects of generators is how easily they can be composed. Python’s itertools module provides utilities that enhance this capability:

from itertools import chain, filterfalse

# Chain multiple generator expressions together
result = chain((x * x for x in range(10)), (y + 10 for y in range(5)))

# Filter values from a generator
odd_squares = filterfalse(lambda x: x % 2 == 0, (x * x for x in range(10)))

# Transform values from a generator
doubled_values = map(lambda x: x * 2, range(10))
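
Generator functions compose in the same way as these itertools helpers. Here is a minimal sketch of a three-stage lazy pipeline; the stage names are illustrative:

# Each stage consumes the previous one lazily, so no intermediate list
# is ever materialized.
def read_numbers(limit):
    for i in range(limit):
        yield i

def square(numbers):
    for n in numbers:
        yield n * n

def keep_multiples_of(k, numbers):
    for n in numbers:
        if n % k == 0:
            yield n

pipeline = keep_multiples_of(3, square(read_numbers(20)))
print(list(pipeline))
# [0, 9, 36, 81, 144, 225, 324]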

Final Thoughts: When to Use Generators

Python generators offer an elegant, memory-efficient approach to iteration. By yielding values one at a time as they’re needed, generators allow you to handle datasets that would otherwise overwhelm available memory. Their distinct execution model, combining state preservation with lazy evaluation, makes them exceptionally effective for various data processing scenarios.

Generators particularly shine in these use cases:

  • Large Dataset Processing: Manage extensive datasets that would otherwise exceed memory constraints if loaded entirely.
  • Streaming Data Handling: Effectively process data that continuously arrives in real-time.
  • Composable Pipelines: Create data transformation pipelines that benefit from modular and readable design.
  • Infinite Sequences: Generate sequences indefinitely, processing elements until a specific condition is met.
  • File Processing: Handle files line-by-line without needing to load them fully into memory (see the sketch after this list).
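
As a concrete illustration of that last point, here is a minimal sketch of line-by-line file processing; the file name and the "ERROR" filter are hypothetical:

def error_lines(path):
    """Yield only the lines containing 'ERROR', one at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if "ERROR" in line:
                yield line.rstrip("\n")

# The full log is never loaded into memory, no matter how large it is.
for line in error_lines("app.log"):
    print(line)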

For smaller datasets (typically fewer than a few thousand items), the memory advantages of generators may not be significant, and standard lists could provide better readability and simplicity.

In an upcoming companion article, I’ll delve deeper into how these fundamental generator concepts support sophisticated techniques to tackle real-world challenges, such as managing continuous data streams.


Seth Michael Larson: I fear for the unauthenticated web


LLM and AI companies seem to all be in a race to breathe the last breath of air in every room they stumble into. This practice started with larger websites, ones that already had protection from malicious usage like denial-of-service and abuse in the form of services like Cloudflare or Fastly.

But the list of targets has been getting longer. At this point we're seeing LLM and AI scrapers targeting small project forges like the GNOME GitLab server.

How long until scrapers start hammering Mastodon servers? Individual websites? Are we going to have to require authentication or JavaScript challenges on every web page from here on out?

All this for what, shitty chat bots? What an awful thing that these companies are doing to the web.

I suggest that everyone who uses cloud infrastructure for hosting set up a billing limit to avoid an unexpected bill in case they're caught in the cross-hairs of a negligent company. All the abusers anonymize their usage at this point, so good luck trying to get compensated for damages.

PyCharm: What Is Data Cleaning? Key Steps and Best Practices in Data Science


In this blog series on data science, we&aposve covered where to get data and how to explore that data using pandas. That kind of data is great for learning, but it&aposs quite different from real-world data. Learning datasets are often already cleaned and curated, which lets you jump straight into learning without ever having to visit the world of data cleaning. Real-world data, by contrast, is messy and full of problems. It needs to be cleaned before it can yield useful insights, and that&aposs the topic of this blog post.

Problems with data can arise from the behavior of the data itself, from the way it was collected, or even from the way it was entered. Mistakes and oversights can happen at any of these stages.

This post is specifically about data cleaning, not data transformation. Data cleaning ensures that the conclusions you draw from your data can be generalized to the population you have defined. Data transformation, by contrast, involves tasks such as converting data formats, normalizing data, and aggregating data.

Why does data cleaning matter? How it improves the accuracy of your analysis

The first thing you need to understand about your dataset is what it represents. Most datasets are a sample of a wider population, and the results of working with that sample can then be extrapolated (or generalized) to that population. For example, we used a dataset in the previous two blog posts. That dataset is broadly about house sales, but it covers only a small geographic area and a short time period, and it may not even include every house in that area and period; it is a sample of a larger population.

Your data needs to be a representative sample of a wider population, for example, all house sales in your target area over a defined time period. To make sure the data you use is a representative sample of that wider population, you first need to define the boundaries of the population.

As you might expect, working with an entire population is rarely practical, with the possible exception of census data, so you need to decide where to draw the boundaries. Those boundaries might be geographic, demographic, time-based, behavioral or activity-based (such as transactions), or industry-specific. There are many ways to define a population, but to generalize your data reliably, you need to define it before you clean the data.

In short, if you&aposre going to use your data for analysis, machine learning, or anything else, you need to spend time cleaning it so that you can trust your insights and generalize them to the real world. Cleaning your data makes your analysis more accurate and, in the case of machine learning, improves performance as well.

Without cleaning, you can&apost reliably generalize your findings to the wider population, your summary statistics may be inaccurate, and your visualizations may be misleading. If you&aposre using the data to train a machine learning model, it can also lead to errors and inaccurate predictions.


Try PyCharm Professional for free

Examples of data cleaning – five key steps

Let&aposs look at five tasks you can use to clean your data. This list isn&apost exhaustive, but it&aposs a good starting point when you first tackle real-world data.

Deduplicating data

Duplicates are a problem because they can distort your data. Imagine you&aposre creating a histogram of the frequency of sale prices. If the same value is duplicated, you end up with a histogram whose pattern is skewed by the duplicated prices.

As a side note, when I say duplicates in a dataset are a problem, I mean duplicated rows, where each row is a single observation. Columns will contain repeated values, and that&aposs expected; we&aposre only concerned with duplicated observations here.

Fortunately, there&aposs a pandas method that helps you detect duplicates in your data. If you need a reminder of how to use it, you can ask JetBrains AI chat with a prompt like this:

Code to identify duplicate rows

It returns the following:

duplicate_rows = df[df.duplicated()]
duplicate_rows

This code assumes your DataFrame is called "df", so adjust it to match the name of your DataFrame if necessary.

The Ames Housing dataset we&aposve been using doesn&apost contain any duplicates, but if you&aposd like to try the pandas method above, use it on the CITES Wildlife Trade Database dataset and check whether there are any.

If you find duplicates in your dataset, you need to remove them so they don&apost skew your results. You can get the code for that from JetBrains AI as well, with the following prompt:

Code to drop duplicates from my dataframe

The generated code drops the duplicates, resets the DataFrame index, and then displays the dataset as a new DataFrame called df_cleaned:

df_cleaned = df.drop_duplicates()
df_cleaned.reset_index(drop=True, inplace=True)
df_cleaned

There are other pandas functions for more advanced duplicate management, but this is enough to get you started with deduplicating a dataset.
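
For example, drop_duplicates can also target a subset of columns and control which copy is kept. A minimal sketch, with hypothetical column names:

# Keep only the most recent sale per address (hypothetical columns):
# sort by date, then drop earlier duplicates of the same address.
df_cleaned = (
    df.sort_values('SaleDate')
      .drop_duplicates(subset=['Address'], keep='last')
      .reset_index(drop=True)
)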

Dealing with implausible values

Implausible values can occur when data is entered incorrectly or when something goes wrong in the data collection process. In the Ames Housing dataset, an implausible value might be a negative SalePrice or a number in the Roof Style column.

There are several ways to spot implausible values in a dataset: looking at the summary statistics, checking the data validation rules the collectors defined for each column and noting any data points that fall outside them, and using visualizations to spot patterns or other features that look wrong.

Implausible values need to be dealt with because they introduce noise and cause problems in your analysis, but how you deal with them is open to interpretation. If the number of implausible values is small relative to the size of your dataset, you may want to drop the records that contain them. For example, if you find an implausible value at row 214 of your dataset, you can remove that row with the pandas drop function.

Again, you can use JetBrains AI to generate the code you need with a prompt like this:

Code that drops index 214 from #df_cleaned

In PyCharm&aposs Jupyter notebooks, prefixing a word with the # symbol tells JetBrains AI Assistant that you&aposre providing additional context; here, it indicates that the DataFrame is called "df_cleaned".

The generated code drops the observation from the DataFrame, resets the index, and displays it:

df_cleaned = df_cleaned.drop(index=214)
df_cleaned.reset_index(drop=True, inplace=True)
df_cleaned

Another common strategy for dealing with implausible values is imputation, where you replace the implausible value with a plausible one based on a defined strategy. One of the most common strategies is to use the median in place of the implausible value. Because the median isn&apost affected by outliers, data scientists often choose it for this purpose, but in some cases the mean or the mode of the data may be more appropriate.
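
A minimal sketch of that median strategy, hypothetically treating negative sale prices as the implausible values:

import numpy as np

# Flag the implausible values (negative sale prices) as missing,
# then fill them with the column median.
df_cleaned.loc[df_cleaned['SalePrice'] < 0, 'SalePrice'] = np.nan
df_cleaned['SalePrice'] = df_cleaned['SalePrice'].fillna(df_cleaned['SalePrice'].median())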

Alternatively, if you have domain knowledge about the dataset and how its data was collected, you can replace the implausible value with something more meaningful. This may be a better option if you&aposre involved in, or understand, the data collection process.

How you choose to handle implausible values depends on how often they occur in your dataset, how the data was collected, how you&aposve defined your population, and other factors such as your domain expertise.

Formatting data

Formatting problems are often revealed by the summary statistics or by early visualizations that give you an overview of the shape of your data. Examples of inconsistent formatting include numbers that aren&apost defined to the same number of decimal places and spelling variations such as "first" and "1st". Incorrectly formatted data can also have implications for memory usage.

If you find formatting problems in your dataset, you need to standardize the values. Depending on the problem, this usually means defining your own standard and applying the change. Here, too, the pandas library provides useful functions such as round. If you wanted to round the SalePrice column to two decimal places, you could ask JetBrains AI for the code:

Code to round #SalePrice to two decimal places

The generated code performs the rounding and prints the first few rows so you can check it:

df_cleaned['SalePrice'] = df_cleaned['SalePrice'].round(2)
df_cleaned.head()

Let&aposs also look at an example of inconsistent spelling. Suppose the HouseStyle column contains both "1Story" and "OneStory", and we know they mean the same thing. The following prompt gets us code to fix the inconsistency:

Code to change all instances of #OneStory to #1Story in #HouseStyle

The generated code does exactly that, replacing all instances of "OneStory" with "1Story":

df_cleaned['HouseStyle'] = df_cleaned['HouseStyle'].replace('OneStory', '1Story')

Resolving outliers

Outliers are very common in datasets, but how you deal with them depends heavily on context. One of the easiest ways to spot outliers is with a box plot, using the seaborn and matplotlib libraries. I covered box plots in the previous blog post on exploring data with pandas, so take a look there if you need a quick refresher.

Let&aposs look at SalePrice in the Ames Housing dataset with a box plot. Again, we&aposll use JetBrains AI to generate the code with a prompt like this:

Code to create a box plot of #SalePrice

It generates the code we need to run:

import seaborn as sns
import matplotlib.pyplot as plt

# Create a box plot for SalePrice
plt.figure(figsize=(10, 6))
sns.boxplot(x=df_cleaned['SalePrice'])
plt.title('Box Plot of SalePrice')
plt.xlabel('SalePrice')
plt.show()

The box plot tells us we have a positive skew, because the vertical median line inside the blue box is to the left of center. A positive skew indicates that most house prices sit at the lower end, which isn&apost surprising. The box plot also shows, visually, that there are a lot of outliers on the right-hand side, meaning a small number of houses sold for much more than the median price.

It&aposs quite common to have a small number of houses that are more expensive than the majority, so you might be happy to accept these outliers. It all depends on the population you want to generalize to and the conclusions you want to draw from your data. Drawing clear boundaries around what is and isn&apost part of your population lets you make an informed decision about whether the outliers in your data are a problem.

For example, if your population consists of people who wouldn&apost buy an expensive mansion, you could probably remove those outliers. But if your population reasonably includes people who might buy such expensive houses, you should probably keep them, because the outliers are relevant to that population.

I&aposve used a box plot here to identify outliers, but other methods such as scatter plots and histograms also let you quickly see whether your data contains outliers, so you can make an informed decision about whether to deal with them.

Resolving outliers generally falls into two categories: removing them, or using summary statistics that are less affected by them. In the first case, you need to know exactly which rows the outliers are in.

So far we&aposve only identified outliers visually. There are various ways of determining which observations count as outliers; one common approach is the modified Z-score. Before we look at how and why it is modified, here is the basic definition of the Z-score:

Z-score = (data point value − mean) / standard deviation

The reason the Z-score is modified for outlier detection is that both the mean and the standard deviation are themselves susceptible to outliers, because of the way they are calculated. The modified Z-score is defined as follows:

Modified Z-score = (data point value − median) / median absolute deviation

As we learned when we looked at summary statistics, the median is not affected by outliers. The median absolute deviation is the median of the dataset&aposs absolute deviations from the median. For example, suppose a dataset contains the following values:

1, 2, 2, 2, 3, 3, 3, 5, 9

The median is 3, the middle value of the dataset. The absolute deviations from the median of 3, in the same order as above, are:

2, 1, 1, 1, 0, 0, 0, 2, 6

Sorting these values in ascending order gives:

0, 0, 0, 1, 1, 1, 2, 2, 6

So the median absolute deviation for this example is 1.
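
Here is the same toy calculation written out in NumPy (a minimal sketch; note that the scipy median_abs_deviation call used below applies a normal-consistency scale factor, so its values will differ slightly):

import numpy as np

values = np.array([1, 2, 2, 2, 3, 3, 3, 5, 9])
median = np.median(values)                  # 3.0
mad = np.median(np.abs(values - median))    # 1.0
modified_z = (values - median) / mad
print(modified_z)
# [-2. -1. -1. -1.  0.  0.  0.  2.  6.]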

How can we calculate the modified Z-score of SalePrice for the dataset we&aposre using in this post? Let&aposs give JetBrains AI the following prompt:

Given:

Modified Z-score = (data point value – median) / median absolute deviation

Calculate the Modified Z-score for #SalePrice 

It generates this code:

import pandas as pd
from scipy.stats import median_abs_deviation


# Assuming 'data' is your DataFrame and it contains the 'SalePrice' column


# Calculate the median for the SalePrice column
median_sale_price = data['SalePrice'].median()


# Calculate the Median Absolute Deviation (MAD) for the SalePrice column
mad_sale_price = median_abs_deviation(data['SalePrice'], scale='normal')


# Calculate the Modified Z-scores for the SalePrice column
data['Modified_Z_Score'] = (data['SalePrice'] - median_sale_price) / mad_sale_price


# Display the first few rows with the Modified Z-scores
print(data[['SalePrice', 'Modified_Z_Score']].head())

Once we&aposve installed the required packages and run this code, we&aposre partway there, but next we need to decide, based on the modified Z-score, which SalePrice values count as outliers. Here I&aposll follow the general convention of treating values >= 3 or <= -3 as outliers, but it&aposs worth noting that, like most statistical decisions, this threshold can and should be tailored to your dataset. To customize the output further, let&aposs pass JetBrains AI another prompt:

Just list those that have a #Modified_Z_Score of 3 or above or -3 or below

I took this code snippet and substituted it for the relevant lines above:

# Filter the rows where the Modified Z-score is 3 or above, or -3 or below
outliers = data[(data['Modified_Z_Score'] >= 3) | (data['Modified_Z_Score'] <= -3)]


# Print all the filtered rows, showing their index and SalePrice
outliers = (outliers[['SalePrice', 'Modified_Z_Score']])
outliers

I modified the code so that the outliers are stored in a new DataFrame called "outliers" and printed, so we can see what they are.

The next step is to remove these outliers from the DataFrame. Again, we can generate the code with a JetBrains AI prompt like this:

Create a new dataframe without the outliers 

data_without_outliers = data.drop(index=outliers.index)


# Display the new DataFrame without outliers
print(data_without_outliers)

The new DataFrame, "data_without_outliers", excludes the rows whose SalePrice value is considered an outlier.

Let&aposs update the box plot code to look at the new DataFrame. As expected, it still shows a positive skew, but the values considered outliers have been removed:

import seaborn as sns
import matplotlib.pyplot as plt


# Create a box plot for SalePrice
plt.figure(figsize=(10, 6))
sns.boxplot(x=data_without_outliers['SalePrice'])
plt.title('Box Plot of SalePrice')
plt.xlabel('SalePrice')
plt.show()

Before we move on, let&aposs see what percentage of observations we removed from the DataFrame by treating them as outliers based on SalePrice.

We can use a prompt like this:

Calculate the percentage of observations removed between #data and #data_without_outliers

# Calculate the number of observations in the original and filtered DataFrames
original_count = len(data)
filtered_count = len(data_without_outliers)

# Calculate the number of removed observations
removed_count = original_count - filtered_count

# Calculate the percentage of observations removed
percentage_removed = (removed_count / original_count) * 100

# Display the percentage
print(f"Percentage of observations removed: {percentage_removed:.2f}%")

PyCharm tells us that 5.67% of the observations were removed.

As mentioned above, if you&aposre keeping your outliers, consider using summary values that are less affected by them, such as the median and the interquartile range. When you know your dataset contains outliers that you haven&apost removed because they&aposre relevant to your defined population and the conclusions you want to draw, it&aposs worth using those measures when drawing your conclusions.
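
A minimal sketch of such outlier-resistant summaries for SalePrice:

# Median and interquartile range are barely affected by extreme sale prices.
median_price = data['SalePrice'].median()
q1, q3 = data['SalePrice'].quantile([0.25, 0.75])
iqr = q3 - q1
print(f"Median: {median_price:,.0f}, IQR: {iqr:,.0f}")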

Missing values

The quickest way to spot missing values in your dataset is with the summary statistics. As a reminder, click Show Column Statistics on the right-hand side of your DataFrame and then select Compact. Missing values in a column are shown in red, as you can see for Lot Frontage in the Ames Housing dataset.

There are three types of missingness to consider for your data:

  • Missing completely at random
  • Missing at random
  • Missing not at random

Missing completely at random

Missing completely at random means the data is missing purely by chance, and the reason it&aposs missing is unrelated to the other variables in the dataset. This can happen, for example, when a survey question is accidentally skipped.

Data that is missing completely at random is rare, but it&aposs also the easiest to deal with. If the number of observations missing completely at random is relatively small, the most common approach is simply to drop them, because removing them won&apost affect the integrity of your dataset or the conclusions you&aposre trying to draw.
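
A minimal sketch of that approach, using Lot Frontage as the example column:

# Drop only the rows where Lot Frontage is missing, keeping everything else.
data_complete = data.dropna(subset=['Lot Frontage'])
print(f"Rows before: {len(data)}, after: {len(data_complete)}")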

Missing at random

Missing at random means there appears to be no pattern to the missingness, but the pattern can be explained by other variables we have measured. For example, a survey question might be left unanswered because of how the data was collected.

Looking at the Ames Housing dataset again, imagine that the Lot Frontage variable is missing more often for houses sold by particular real estate agencies. In that case, the missingness is likely caused by inconsistencies in the data those agencies entered. If that&aposs true, the fact that Lot Frontage is missing is related not to the lot frontage itself but to how the data was collected by the agency that sold the property (an observed characteristic).

When data is missing at random, it&aposs a good idea to understand why it&aposs missing, which often involves looking into how the data was collected. Once you understand the reason, you can choose how to deal with it. One relatively common option for data missing at random is imputation, which we already touched on for implausible values and which works for missingness too. In this example there are various options, depending on your defined population and the conclusions you want to draw, including using correlated variables such as the size of the house, the year it was built, and the sale price. If you understand the pattern behind the missing data, you can often use contextual information to impute the values, which preserves the relationships between the data in your dataset.
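
A minimal sketch of such context-aware imputation; grouping by Neighborhood here is a hypothetical choice of correlated variable:

# Fill missing Lot Frontage values with the median frontage of houses
# in the same neighborhood, preserving local relationships in the data.
data['Lot Frontage'] = data.groupby('Neighborhood')['Lot Frontage'].transform(
    lambda s: s.fillna(s.median())
)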

Missing not at random

The last type, missing not at random, occurs when the likelihood of data being missing is related to unobserved data. In other words, the missingness depends on data we haven&apost observed.

Going back to the Ames Housing dataset one more time, consider again the missing data in Lot Frontage. One scenario where the data would be missing not at random is if sellers deliberately didn&apost report small lot frontages because they feared it would lower the sale price. If the likelihood of Lot Frontage being missing depends on the size of the frontage itself (which we don&apost observe), then smaller frontages are less likely to be reported, and the missingness is directly related to the missing value.

Visualizing missingness

When you have missing data, you need to check whether there&aposs a pattern to it. If there is, you may have a problem you need to address before you can generalize your data.

One of the easiest ways to look for patterns is with a heatmap visualization. Before we get to that code, let&aposs exclude the variables that have no missingness. We can get the code we need by sending JetBrains AI this prompt:

Code to create a new dataframe that contains only columns with missingness

It produces the following code:

# Identify columns with any missing values
columns_with_missing = data.columns[data.isnull().any()]

# Create a new DataFrame with only columns that have missing values
data_with_missingness = data[columns_with_missing]

# Display the new DataFrame
print(data_with_missingness)

Before running this code, I changed the last line so we can take advantage of PyCharm&aposs nice DataFrame layout:

data_with_missingness

Now let&aposs create the heatmap. Again, we send JetBrains AI a prompt to get the code:

Create a heatmap of #data_with_missingness that is transposed

It returns the following:

import seaborn as sns
import matplotlib.pyplot as plt


# Transpose the data_with_missingness DataFrame
transposed_data = data_with_missingness.T


# Create a heatmap to visualize missingness
plt.figure(figsize=(12, 8))
sns.heatmap(transposed_data.isnull(), cbar=False, yticklabels=True)
plt.title('Missing Data Heatmap (Transposed)')
plt.xlabel('Instances')
plt.ylabel('Features')
plt.tight_layout()
plt.show()

Note that I removed cmap='viridis' from the heatmap arguments because I found it harder to read with that colormap.

The heatmap suggests there is a pattern to the missingness, because the same variables are missing across multiple rows. In one group, Bsmt Qual, Bsmt Cond, Bsmt Exposure, BsmtFin Type 1, and BsmtFin Type 2 are all missing from the same observations. In another group, Garage Type, Garage Yr Blt, Garage Finish, Garage Qual, and Garage Cond are all missing from the same observations.

These variables all relate to basements and garages, yet there are other basement- and garage-related variables that aren&apost missing. One possible explanation is that different real estate agencies asked different questions about garages and basements when the data was collected, and some of them didn&apost record as much detail as we have in this dataset. Situations like this are common when you&aposre working with data you didn&apost collect yourself, so if you need to understand the missingness in your dataset, it&aposs worth looking into how the data was collected.

Data cleaning best practices – tips for efficient preprocessing

As I&aposve already mentioned, defining your population is one of the most important best practices in data cleaning. Before you start cleaning, know what you want to achieve and how you want to generalize your data.

You also need to make sure all of your methods are reproducible, because reproducibility goes hand in hand with clean data. Situations that can&apost be reproduced can have a big impact on everything downstream. For that reason, keep your Jupyter notebooks tidy and sequential, and use the Markdown features to document your decisions at every step, especially when cleaning.

When cleaning your data, work incrementally, modify the DataFrame rather than the original CSV file or database, and do everything with reproducible, well-documented code.

Summary – why data cleaning matters in data science

Data cleaning is a big topic, and it comes with many challenges. The larger the dataset, the harder the cleaning process becomes. You need to keep your population in mind, weigh removing missing values against imputing them, and understand why the data is missing in the first place, all while keeping your conclusions generalizable.

Think of yourself as the voice of your data. You&aposre the one who knows the journey the data has been on and how you&aposve maintained its integrity at every stage, so you&aposre the best person to document that journey and share it with others.


Try PyCharm Professional for free

Original (English) blog post by:

Helen Scott

Real Python: The Real Python Podcast – Episode #244: A Decade of Automating the Boring Stuff With Python


What goes into updating one of the most popular books about working with Python? After a decade of changes in the Python landscape, what projects, libraries, and skills are relevant to an office worker? This week on the show, we speak with previous guest Al Sweigart about the third edition of "Automate the Boring Stuff With Python."



Daniel Roy Greenfeld: Using pyinstrument to profile FastHTML apps


FastHTML is built on Starlette, so we use Starlette's middleware tooling and then pass in the result. Just make sure you install pyinstrument.

WARNING: NOT FOR PRODUCTION ENVIRONMENTS Including a profiler like this in a production environment is dangerous. As it exposes infrastructure it is highly risky to include in any location where end users can access it.

"""WARNING: NOT FOR PRODUCTION ENVIRONMENTS"""fromfasthtml.commonimport*fromstarlette.middleware.baseimportBaseHTTPMiddlewarefromstarlette.middlewareimportMiddlewaretry:frompyinstrumentimportProfilerexceptImportError:raiseImportError('Please install pyinstrument')classProfileMiddleware(BaseHTTPMiddleware):asyncdefdispatch(self,request,call_next):profiling=request.query_params.get("profile",False)ifprofiling:profiler=Profiler()profiler.start()response=awaitcall_next(request)profiler.stop()returnHTMLResponse(profiler.output_html())returnawaitcall_next(request)app,rt=fast_app(middleware=(Middleware(ProfileMiddleware)))@rt("/")defget():returnTitled("FastHTML",P("Hello, world!"))serve()

To invoke, make any request to your application with the GET parameter profile=1 and it will print the HTML result from pyinstrument.

Talk Python to Me: #497: Outlier Detection with Python

Have you ever wondered why certain data points stand out so dramatically? They might hold the key to everything from fraud detection to groundbreaking discoveries. This week on Talk Python to Me, we dive into the world of outlier detection with Python with Brett Kennedy. You'll learn how outliers can signal errors, highlight novel insights, or even reveal hidden patterns lurking in the data you thought you understood. We'll explore fresh research developments, practical use cases, and how outlier detection compares to other core data science tasks like prediction and clustering. If you're ready to spot those game-changing anomalies in your own projects, stay tuned.

Episode sponsors

  • Posit: https://talkpython.fm/connect-cloud
  • Python in Production: https://talkpython.fm/devopsbook
  • Talk Python Courses: https://talkpython.fm/training

Links from the show

  • Data-morph: https://github.com/stefmolin/data-morph
  • PyOD: https://github.com/yzhao062/pyod
  • Prophet anomaly detection: https://github.com/paullo0106/prophet_anomaly_detection
  • Episode transcripts: https://talkpython.fm/episodes/transcript/497/outlier-detection-with-python

Stay in touch

  • Subscribe to Talk Python on YouTube: https://talkpython.fm/youtube
  • Talk Python on Bluesky: @talkpython.fm
  • Talk Python on Mastodon: @talkpython
  • Michael on Bluesky: @mkennedy.codes
  • Michael on Mastodon: @mkennedy