As data architectures become more distributed, we’ve found ourselves needing a smarter way to move data around. Everything is simple when you’re pulling data to one spot, but when you have multiple data marts or customers with complex org structures, you can really tie yourself in data-engineering knots.
The same problems arise with customers that want to send us data to work on. After weeks of coordination we might be able to get started, and by the end we’ve created yet another copy of their data to manage.
The solution we stumbled into initially felt like a hack, but in hindsight it maps directly to the structure of the organizations we were working with, dusting off that neuron that remembers Conway’s Law.
The Pattern
Like a number of shops, we standardize on Airflow for pipeline orchestration, but the concept applies to any orchestration tool whose workers you can distribute.
The basic idea:
- Deploy lightweight Airflow workers within each customer or subsidiary’s network perimeter. Establish secure outbound connections from these workers to your central Airflow infrastructure (using Tailscale, SSH tunnels, VPNs, or whatever the customer prefers).
- The workers process data within the remote environment (cleaning, aggregating, applying structure and governance) and, only then, distribute it across your network or out to external destinations.
The Rationale
Security friction is the biggest barrier. Customers are understandably reluctant to expose their internal databases or APIs to external networks. Security reviews can take months, and the answer is often “no” regardless of how robust your security posture is.
Network complexity compounds quickly. Managing VPN configurations, firewall rules, IP whitelists, and NAT traversal across dozens or hundreds of customer environments becomes an operational nightmare. Each customer has unique network architecture, policies, and constraints.
Data residency and compliance requirements are increasingly stringent. Some customers cannot allow their data to traverse certain networks or geographies, even temporarily.
Customer onboarding velocity matters. When each new customer integration requires weeks of network engineering and security review, your business can’t scale efficiently.
The distributed worker pattern addresses these challenges by fundamentally changing the security model: instead of asking customers to expose their infrastructure to you, you deploy a controlled agent within their network that allows you to do the work there, and only expose what is needed.
Pros and Cons
Acknowledging that there’s never a silver bullet, here are some pros/cons to consider when weighing this approach against others.
Approach 1: Customer-Pushed Data
How it works: Customers run jobs on their infrastructure to extract data and push it to your API endpoints, SFTP servers, or cloud storage buckets.
Pros:
- Minimal infrastructure requirements on your side
- Customers maintain full control over timing and implementation
- No need to deploy anything on customer networks
Cons:
- Loss of orchestration control. You’re dependent on customer reliability
- Difficult to standardize data quality, formats, and delivery schedules
- Troubleshooting becomes a support nightmare when deliveries fail
- No visibility into extraction process or intermediate failures
- Each customer implements differently, creating integration chaos
When to use it: For low-value integrations where data quality and timeliness are flexible, or when customers have strong technical teams and prefer self-service.
Approach 2: Inbound VPN/Firewall Access
How it works: Customers configure VPNs or firewall rules to allow your central infrastructure to connect directly to their internal resources.
Pros:
- Centralized execution. All your logic runs in one place
- No need to manage distributed workers
- Simpler architecture from your operational perspective
Cons:
- Security approval is extremely difficult; most enterprises flatly refuse inbound access
- Requires coordination with customer network teams for every change
- Creates ongoing security audit burden for customers
- IP whitelisting breaks when your infrastructure changes
- VPN tunnels are brittle and require maintenance on both ends
- Customers must expose internal resources to external networks
When to use it: For small numbers of trusted partners or within a single corporate umbrella where network teams are aligned.
Approach 3: Distributed Airflow Workers (This Pattern)
How it works: Deploy lightweight Airflow workers on customer networks that connect outbound to your central Airflow infrastructure.
Pros:
- Security approval is straightforward: outbound-only connections are routinely permitted
- Unified orchestration: single pane of glass for all customer pipelines
- Standardized implementation: same DAG code across all customers with queue-based routing
- Full visibility: logs, metrics, and monitoring for all executions
- Local data access: workers run where the data lives, no exposure needed
- Incremental rollout: deploy and validate customer-by-customer
- Version control: update DAGs centrally, execution happens everywhere
- Network simplicity: mesh VPNs like Tailscale eliminate configuration complexity
Cons:
- Worker lifecycle management: you must maintain software on customer infrastructure
- Distributed troubleshooting: failures may require investigating customer environments
- Resource coordination: need to plan worker sizing with each customer
- Connection resilience: must handle network interruptions gracefully
- Initial setup overhead: requires coordination to deploy workers initially
- Security responsibility: you’re running code on customer infrastructure
- Worker-to-worker data movement is problematic: this architecture excels at hub-and-spoke (worker → central), but moving data directly between workers requires additional network complexity
When to use it: When you need reliable, standardized data integration across many customers (or one customer with many subsidiaries), with strong visibility and control, and when security and compliance are major concerns.
Implementation Details
Worker Deployment
The Airflow worker deployment is typically containerized or defined in Infrastructure as Code for consistency and ease of updates:
# docker-compose.yml for customer-side worker
version: '3.8'
services:
  airflow-worker:
    image: your-registry/airflow-worker:latest
    # Assumes the image entrypoint wraps the Airflow CLI (as the official image does);
    # the worker subscribes only to this customer's queue
    command: celery worker --queues customer-acme-corp
    environment:
      - AIRFLOW__CELERY__BROKER_URL=redis://your-central-redis
      - AIRFLOW__CORE__EXECUTOR=CeleryExecutor
      - AIRFLOW__CORE__SQL_ALCHEMY_CONN=postgresql://your-central-db
      - AIRFLOW__CELERY__DEFAULT_QUEUE=customer-acme-corp
      # The Celery result backend must also point at the central database
      - AIRFLOW__CELERY__RESULT_BACKEND=db+postgresql://your-central-db
    volumes:
      - ./dags:/opt/airflow/dags
      - ./logs:/opt/airflow/logs
    restart: unless-stopped
The worker connects to your central infrastructure for orchestration but executes tasks locally, with access to customer resources.
Queue-based Routing
DAGs specify target queues to ensure they execute on the appropriate customer worker:
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime


def extract_customer_data():
    """Placeholder callable; a fuller version is sketched under Data Movement Patterns below."""


default_args = {
    'owner': 'data-engineering',
    'queue': 'customer-acme-corp',  # Routes to ACME Corp's worker
}

with DAG(
    'acme_daily_extract',
    default_args=default_args,
    schedule_interval='0 2 * * *',
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    extract_task = PythonOperator(
        task_id='extract_from_local_db',
        python_callable=extract_customer_data,
        queue='customer-acme-corp',  # Explicit queue assignment (overrides default_args)
    )
Secure Connectivity Options
Tailscale has emerged as a particularly elegant solution for this pattern:
- Zero-configuration mesh VPN
- Built-in NAT traversal
- ACLs for fine-grained access control
- Easy to deploy alongside workers
SSH Tunnels provide a lightweight alternative:
# From the customer worker, open an outbound SSH connection to the central server,
# forwarding the central Redis broker (6379) and metadata database (5432) to local
# ports so the worker can reach them at localhost
ssh -L 6379:localhost:6379 -L 5432:localhost:5432 \
    -N -f central-airflow-server
Traditional VPNs (WireGuard, OpenVPN, IPSec) work well for organizations with existing VPN infrastructure.
In the end, you’ll probably get to choose whatever the customer says you choose.
Data Movement Patterns
Once workers are in place, data movement becomes straightforward:
def extract_and_load():
    # Extract from the local customer database
    local_conn = get_local_connection('customer_postgres')
    data = local_conn.execute("SELECT * FROM sales_data WHERE date = CURRENT_DATE")

    # Transform as needed
    df = transform_data(data)

    # Load to the central data warehouse
    warehouse_conn = get_warehouse_connection('central_snowflake')
    df.to_sql('customer_sales', warehouse_conn, if_exists='append')
The worker has native access to local resources while maintaining secure connectivity to central systems.
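The get_local_connection and get_warehouse_connection helpers above are placeholders. One way to back them with standard Airflow hooks, as a sketch, assuming 'customer_postgres' and 'central_snowflake' are defined as Airflow connections on the worker and the Postgres and Snowflake provider packages are installed:

from airflow.providers.postgres.hooks.postgres import PostgresHook
from airflow.providers.snowflake.hooks.snowflake import SnowflakeHook


def get_local_connection(conn_id):
    # SQLAlchemy engine for the customer-side Postgres database
    return PostgresHook(postgres_conn_id=conn_id).get_sqlalchemy_engine()


def get_warehouse_connection(conn_id):
    # SQLAlchemy engine for the central Snowflake warehouse
    return SnowflakeHook(snowflake_conn_id=conn_id).get_sqlalchemy_engine()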
One Caveat: The Worker-to-Worker Data Movement Challenge
This architecture has a significant limitation that’s worth understanding upfront: it’s optimized for hub-and-spoke topology, not peer-to-peer.
The problem: Each worker connects outbound to your central infrastructure, but workers don’t inherently have network connectivity to each other. If you need to move data from Customer A’s worker directly to Customer B’s worker, you face a networking challenge.
Solutions:
- Route through central infrastructure (most common): Worker A pushes to central storage, Worker B pulls from there. Simple and leverages existing connectivity, but doubles network transfer and adds latency.
- Establish mesh VPN between workers (ideal): Use Tailscale or similar to create direct worker-to-worker connections. Direct transfer with lower latency, but network complexity multiplies (N² connections for N workers) and security approvals get harder.
- Shared cloud storage as intermediary: Worker A writes to S3/GCS/Azure Blob, Worker B reads from the same bucket. Scalable and auditable, but still involves double transfer and cloud egress costs (see the sketch after this list).
- Hybrid approach with regional hubs: Deploy regional Airflow infrastructure where workers in the same region connect to a regional hub, with regional hubs coordinating via the central orchestrator. Reduces cross-region transfers and improves performance, but a significantly more complex architecture.
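To make the routing concrete, here is a minimal sketch of the shared-storage option: a single DAG whose first task runs on Customer A’s worker and pushes a file to an intermediary bucket, and whose second task runs on Customer B’s worker and pulls it down. The connection ID, bucket, key, and file paths are illustrative, and it assumes the Amazon provider package and an 'intermediary_s3' Airflow connection are available on both workers:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

BUCKET = 'your-intermediary-bucket'  # hypothetical bucket you control
KEY = 'transfers/acme-to-globex/daily.parquet'


def push_from_worker_a():
    # Runs on Customer A's worker: upload a locally produced file to shared storage
    S3Hook(aws_conn_id='intermediary_s3').load_file(
        filename='/tmp/daily.parquet', key=KEY, bucket_name=BUCKET, replace=True
    )


def pull_on_worker_b():
    # Runs on Customer B's worker: download the same object from shared storage
    S3Hook(aws_conn_id='intermediary_s3').download_file(
        key=KEY, bucket_name=BUCKET, local_path='/tmp'
    )


with DAG(
    'acme_to_globex_transfer',
    schedule_interval=None,
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    push = PythonOperator(
        task_id='push_from_acme',
        python_callable=push_from_worker_a,
        queue='customer-acme-corp',  # executes on Customer A's worker
    )
    pull = PythonOperator(
        task_id='pull_on_globex',
        python_callable=pull_on_worker_b,
        queue='customer-globex',  # executes on Customer B's worker
    )
    push >> pull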
Best Practices
Standardize worker deployments using Infrastructure as Code. You’ll thank yourself when you need to troubleshoot issues across dozens of customer environments.
Monitor everything: worker health, task execution, data quality. If a worker goes down at 3am, you want to know about it before the customer does.
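For example, a per-task failure callback can page you with the queue (and therefore the customer) involved. A minimal sketch, where send_alert stands in for whichever alerting tool you use (Slack, PagerDuty, email):

def notify_on_failure(context):
    ti = context['task_instance']
    # send_alert is hypothetical; wire this to your alerting tool of choice
    send_alert(f"Task {ti.task_id} in DAG {ti.dag_id} failed on queue {ti.queue}")


default_args = {
    'owner': 'data-engineering',
    'on_failure_callback': notify_on_failure,
}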
Version your DAGs carefully. A bug in your code doesn’t just affect one environment anymore; it affects every customer running that DAG. Test thoroughly, deploy incrementally.
Document your queue naming conventions early. “customer-acme” vs “acme-prod” vs “acme_production” might seem trivial until you have 50 customers and no one remembers which is which.
Plan for disaster recovery. Workers will fail, connections will drop, tasks will need to be back-filled. Have runbooks ready.
Test in staging environments that mirror customer configurations before deploying to production. Each customer’s network has its quirks.
Automate worker provisioning. The faster you can spin up a new worker, the faster you can onboard new customers.
Conclusion
Deploying distributed Airflow workers across customer networks changes how you think about data integration. Instead of fighting security teams to get inbound access, you work with the grain of their policies. Instead of managing a dozen different VPN configurations, you deploy workers that connect outbound. Instead of wrestling with each customer’s unique network setup, you standardize on a pattern that works everywhere.
The trade-off is clear: you gain security approval speed, operational consistency, and full visibility in exchange for managing distributed infrastructure. For many use cases, especially those involving multiple customers or strict compliance requirements, that’s a deal worth making.
Start small. Deploy to a couple of friendly customers, iron out the operational kinks, and build confidence in the pattern before scaling it out.