Case Study: Chicago 311 Graffiti Data#

Note

Source: Adapted from scalaworkshop by George K. Thiruvathukal and Konstantin Läufer. Original Scala implementation at introds-scala-examples/311-case-study-scala. Data published by the City of Chicago under the Chicago Data Portal open data licence.

Problem. The City of Chicago publishes every graffiti-removal request submitted through its 311 system as open data. Can we download that dataset, explore its structure, and find patterns — which ZIP codes generate the most requests? How does volume change by season?

This case study builds a complete data pipeline in Python using pandas for data loading and analysis, and matplotlib for visualisation:

Fetch → Load → Aggregate → Filter → Visualize

The dataset URL is:

https://data.cityofchicago.org/api/views/hec5-y4x5/rows.csv?accessType=DOWNLOAD

Step 1 — Fetch#

urllib.request.urlretrieve streams the file directly to disk without loading the entire response into memory — important for a dataset that can exceed 100 MB:

def fetch_graffiti(output: str = "311_graffiti.csv") -> None:
    """Download the 311 Graffiti Removal dataset from the Chicago Data Portal."""
    print(f"Downloading to {output} ...")
    urllib.request.urlretrieve(DATASET_URL, output)
    print("Done.")
fetch_graffiti("311_graffiti.csv")

Output:

Downloading to 311_graffiti.csv ...
Done.

Step 2 — Load and Inspect#

pd.read_csv reads the file into a DataFrame in one call. The optional nrows parameter limits how many rows are read, which is useful for previewing a large file without loading all of it into memory. We immediately parse the Creation Date column as proper datetime objects so that filtering and grouping by date work naturally later:

def load_graffiti(filename: str, limit: int | None = None) -> pd.DataFrame:
    """Load the dataset into a DataFrame; parse Creation Date as datetime."""
    df = pd.read_csv(filename, nrows=limit)
    df["Creation Date"] = pd.to_datetime(
        df["Creation Date"], format="%m/%d/%Y", errors="coerce"
    )
    return df
df = load_graffiti("311_graffiti.csv", limit=3)
print(df[["Creation Date", "Status", "ZIP Code"]].to_string(index=False))

Output (representative):

Creation Date     Status ZIP Code
   2024-01-02  Completed    60614
   2024-01-02  Completed    60647
   2024-01-03       Open    60618

The dataset includes columns for creation and completion dates, street address, ZIP code, latitude/longitude, ward, and police district.

Step 3 — Aggregate#

Series.value_counts() counts how many rows contain each distinct value in a column and returns a Series sorted from most to least frequent. Chaining .head(top) limits the result to the top N entries:

def aggregate_graffiti(filename: str, group_by: str = "ZIP Code",
                        top: int = 10) -> pd.Series:
    """Return the top-N value counts for the given column."""
    df = load_graffiti(filename)
    return df[group_by].value_counts().head(top)
for zip_code, count in aggregate_graffiti("311_graffiti.csv", top=5).items():
    print(f"{zip_code:10s}  {count:6,}")

Output (representative):

60614       4,821
60647       4,203
60618       3,977
60622       3,840
60625       3,512

Step 4 — Filter#

Real datasets need filtering before analysis. Because load_graffiti already parsed Creation Date as datetime, we can compare it directly against date strings using pandas boolean indexing. The & operator combines conditions row-wise; wrapping each condition in parentheses is required because of Python’s operator precedence:

def filter_graffiti(filename: str, status: str = "Completed",
                     start: str = "2015-01-01",
                     end: str = "2015-12-31") -> pd.DataFrame:
    """Return rows whose Status matches and Creation Date falls in [start, end]."""
    df = load_graffiti(filename)
    mask = (
        (df["Status"] == status) &
        (df["Creation Date"] >= start) &
        (df["Creation Date"] <= end)
    )
    return df[mask]

The result is a new DataFrame containing only the matching rows — no manual date parsing or try/except loops needed.

matches = filter_graffiti("311_graffiti.csv",
                           status="Completed",
                           start="2015-01-01",
                           end="2015-01-31")
print(f"{len(matches):,} completed requests in January 2015")

Output:

9,480 completed requests in January 2015

Step 5 — Visualize#

Grouping by year-month and plotting a bar chart reveals the seasonal pattern in graffiti removal activity. Requests dip in winter (cold weather means fewer outdoor surfaces are tagged) and peak in late spring and summer:

def visualize_graffiti(filename: str,
                        output: str = "graffiti_trend.png",
                        year_start: int | None = None,
                        year_end: int | None = None) -> None:
    """Save a bar chart of monthly graffiti request counts."""
    df = load_graffiti(filename).dropna(subset=["Creation Date"])
    if year_start is not None:
        df = df[df["Creation Date"].dt.year >= year_start]
    if year_end is not None:
        df = df[df["Creation Date"].dt.year <= year_end]

    monthly = df.groupby(df["Creation Date"].dt.to_period("M")).size()
    months = [str(m) for m in monthly.index]
    counts = monthly.to_list()

    fig, ax = plt.subplots(figsize=(12, 5))
    ax.bar(months, counts, color='steelblue')
    ax.set_title("311 Graffiti Removal Requests Per Month")
    ax.set_xlabel("Month")
    ax.set_ylabel("Requests")
    plt.xticks(rotation=45, ha='right')
    plt.tight_layout()
    plt.savefig(output, dpi=100)
    plt.close()
    print(f"Saved {output}")

Note

The Chicago 311 graffiti dataset on the Data Portal covers 2011–2018. The portal stopped updating this particular view after that period; more recent 311 data is published under a different endpoint. When working with open datasets, always check the date range before drawing conclusions.

visualize_graffiti("311_graffiti.csv", "graffiti_trend.png",
                   year_start=2015, year_end=2018)

Output:

Saved graffiti_trend.png

Install matplotlib first if needed:

pip install matplotlib
Bar chart of 311 graffiti removal requests per month, 2015–2018

Monthly graffiti removal requests from the Chicago 311 open dataset (2015–2018).#

To see the full span of the dataset at a glance, plot annual totals across all years:

def visualize_by_year(filename: str,
                       output: str = "graffiti_by_year.png",
                       min_year: int = 2010) -> None:
    """Save a bar chart of total graffiti removal requests per year."""
    df = load_graffiti(filename).dropna(subset=["Creation Date"])
    df = df[df["Creation Date"].dt.year >= min_year]

    yearly = df.groupby(df["Creation Date"].dt.year).size()
    years = list(yearly.index)
    counts = list(yearly.values)

    fig, ax = plt.subplots(figsize=(10, 5))
    ax.bar(years, counts, color='steelblue', width=0.6)
    ax.set_title("311 Graffiti Removal Requests Per Year")
    ax.set_xlabel("Year")
    ax.set_ylabel("Requests")
    ax.set_xticks(years)
    ax.yaxis.set_major_formatter(
        plt.FuncFormatter(lambda x, _: f"{int(x):,}")
    )
    plt.tight_layout()
    plt.savefig(output, dpi=100)
    plt.close()
    print(f"Saved {output}")
visualize_by_year("311_graffiti.csv", "graffiti_by_year.png")

Output:

Saved graffiti_by_year.png
Bar chart of total 311 graffiti removal requests per year, 2010–2018

Total graffiti removal requests per year (2010–2018). The 2010 bar is partial (data begins mid-year); 2011–2018 show full annual volumes.#

What the Data Reveals#

311 reports are not merely complaints — they are a form of civic participation. Analysing this data over time and across ZIP codes surfaces questions about equity: are removal requests serviced equally across neighbourhoods? Are response times consistent? Which wards have the highest volume, and why?

This dataset is a starting point. The same pipeline — fetch, load, aggregate, filter, visualise — applies to any of the hundreds of other datasets published on the Chicago Data Portal.

Challenges#

  1. Call aggregate_graffiti with group_by="Ward" instead of "ZIP Code". Which ward has the most graffiti removal requests?

  2. Write a function average_completion_days(filename) that loads the DataFrame, parses both "Creation Date" and "Completion Date" as datetime, and returns the mean of df["Completion Date"] - df["Creation Date"] (in days) for completed requests. Use pd.to_datetime(..., errors="coerce") to handle missing values — they become NaT and are excluded automatically by .mean().

  3. The dataset includes "Latitude" and "Longitude" columns. Write a function that filters the DataFrame to rows within a bounding box (min/max lat/lon) using boolean indexing, and returns the count. Use it to count requests in a neighbourhood of your choice.

  4. Extend visualize_graffiti to overlay a 12-month rolling average line on top of the monthly bars. After computing monthly, use pd.Series(counts).rolling(12, min_periods=1).mean() to generate the smoothed values.

  5. Download the 311 Pothole Reports dataset (view ID 7as2-ds3y) and compare monthly volumes for graffiti and potholes side-by-side on a single chart using two sets of bars.