35  REST APIs for Data Science

35.1 Introduction

REST (Representational State Transfer) APIs have become the backbone of modern web applications and data science workflows. As a data scientist, understanding how to consume and create REST APIs is essential for accessing external data sources, deploying machine learning models, and building scalable data applications.

In this chapter we explore why REST exists and its fundamental principles, how REST builds on HTTP and JSON, practical examples using real APIs, and how to build and deploy a REST API for model scoring.

The Problem REST Solves

Before REST, different systems had various ways of communicating over the internet, often complex and proprietary. REST emerged to solve several key problems (R. T. Fielding 2000):

  1. Standardization: Need for a common, predictable way for systems to communicate
  2. Scalability: Ability to handle millions of requests efficiently
  3. Simplicity: Easy to understand and implement
  4. Platform Independence: Works across different programming languages and systems

REST is built on six key principles (R. T. Fielding 2000; Richardson and Ruby 2007):

  1. Stateless: Each request contains all information needed to process it
  2. Client-Server Architecture: Clear separation between data consumer and provider
  3. Cacheable: Responses can be cached to improve performance
  4. Uniform Interface: Consistent way to interact with resources
  5. Layered System: Architecture can have multiple layers (proxies, gateways, etc.)
  6. Code on Demand (optional): Server can send executable code to client
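
To make the stateless principle concrete, the sketch below sends two independent requests that each carry everything the server needs (here an illustrative bearer token); the server keeps no session state between the calls, which is what lets REST services scale behind load balancers. The token value is a placeholder.

import requests

# Each request is self-contained: it carries its own credentials and
# parameters, so any server instance can handle it in isolation.
headers = {"Authorization": "Bearer your-api-token"}

r1 = requests.get("https://httpbin.org/get", headers=headers, params={"page": 1})
r2 = requests.get("https://httpbin.org/get", headers=headers, params={"page": 2})
print(r1.status_code, r2.status_code)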

35.2 REST is Built on HTTP and JSON

REST leverages two foundational web technologies: HTTP (HyperText Transfer Protocol) for communication and JSON (JavaScript Object Notation) for data exchange (Richardson and Ruby 2007).

HTTP: The Communication Protocol

HTTP is the protocol that powers the web. It defines how messages are formatted and transmitted between clients and servers.

Key HTTP Components:

  1. URL (Uniform Resource Locator): Identifies the resource
   https://api.openweathermap.org/data/2.5/weather?q=London
  2. HTTP Methods: Define the action to perform (R. Fielding and Reschke 2014b)
    • GET: Retrieve data
    • POST: Create new data
    • PUT: Update existing data
    • DELETE: Remove data
  3. Headers: Provide metadata about the request/response
   Content-Type: application/json
   Authorization: Bearer your-api-key
  4. Status Codes: Indicate the result of the request (R. Fielding and Reschke 2014b)
    • 200: Success
    • 404: Not Found
    • 500: Server Error
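
A brief sketch tying these components together: it sends a GET request with a query parameter and custom headers, then inspects the status code and response metadata. The httpbin.org test service is used so the request actually succeeds; the bearer token is a placeholder.

import requests

# URL identifies the resource; headers carry metadata about the request.
url = "https://httpbin.org/get"
headers = {"Accept": "application/json",
           "Authorization": "Bearer your-api-key"}

response = requests.get(url, headers=headers, params={"q": "London"})

print(response.status_code)               # e.g. 200 on success
print(response.headers["Content-Type"])   # metadata about the response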

JSON: The Data Format

JSON is a lightweight, human-readable data format that has become the standard for REST APIs. We encountered JSON as a data format in Section 10.1.

{
  "name": "John Doe",
  "age": 30,
  "city": "New York",
  "skills": ["Python", "Data Science", "Machine Learning"]
}

JSON's advantages include support in virtually every programming language, ease of reading and writing, and a compact, efficient representation.
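
In Python, the built-in json module converts between JSON text and native data structures; a minimal sketch:

import json

record = {
    "name": "John Doe",
    "age": 30,
    "skills": ["Python", "Data Science", "Machine Learning"]
}

text = json.dumps(record)        # Python dict -> JSON string
parsed = json.loads(text)        # JSON string -> Python dict
print(parsed["skills"][0])       # -> Python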

35.3 Understanding HTTP Methods: GET and POST

HTTP GET Method

GET is used to retrieve data from a server. It is safe and idempotent (multiple identical requests have the same effect) (R. Fielding and Reschke 2014b).

Characteristics

  • Data are sent in URL parameters
  • Limited data size (URL length limits)
  • Cacheable
  • Should not modify server state

Example GET Request

In this example we use the popular OpenWeatherMap API to retrieve real-world, real-time weather data via a REST API (OpenWeather Ltd 2025). Before you can retrieve weather data through this service, sign up at OpenWeatherMap and obtain a free API key. You will need to submit the key as part of each API request; that is how OpenWeatherMap keeps track of your usage. The first 1,000 calls per day are free.

The API key is a string, and it should not be hard-coded directly in your program. When code is shared you do not want API keys to leak along with it. The recommended approach is to store the key in an environment variable and read it at run time. This can sometimes be tricky: an environment variable set in one shell might not be visible to the Python process, depending on how that process was started.

To work around this, you can store the variables in a file named .env in your working directory and call the load_dotenv function from the dotenv library (installed as python-dotenv) to load them into the environment. The following code does exactly that.

If the .env file lives inside a repository that is managed with git, add .env to your .gitignore file to prevent accidental inclusion in a remote repository.
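
For illustration, the .env file contains one KEY=value pair per line (the key name matches the code below; the value is a placeholder), and the matching .gitignore entry is just the file name:

# .env
OPENWEATHER_API_KEY=your-api-key-here

# .gitignore (add this line)
.env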

import requests
import os
from dotenv import load_dotenv

load_dotenv() 
api_key = os.environ.get('OPENWEATHER_API_KEY')

# GET request to retrieve weather data
url = "https://api.openweathermap.org/data/2.5/weather"
params = {
    'q': 'London',
    'appid': api_key,
    'units': 'metric'
}
response = requests.get(url, params=params)
print(response.json())
{'coord': {'lon': -0.1257, 'lat': 51.5085}, 'weather': [{'id': 803, 'main': 'Clouds', 'description': 'broken clouds', 'icon': '04d'}], 'base': 'stations', 'main': {'temp': 23.7, 'feels_like': 23.25, 'temp_min': 22.95, 'temp_max': 24.51, 'pressure': 1013, 'humidity': 43, 'sea_level': 1013, 'grnd_level': 1009}, 'visibility': 10000, 'wind': {'speed': 8.75, 'deg': 260}, 'clouds': {'all': 80}, 'dt': 1750953550, 'sys': {'type': 2, 'id': 2075535, 'country': 'GB', 'sunrise': 1750909484, 'sunset': 1750969309}, 'timezone': 3600, 'id': 2643743, 'name': 'London', 'cod': 200}

HTTP POST Method

POST is used to send data to a server, typically to create new resources or submit data for processing.

Characteristics

  • Data are sent in request body
  • No size limitations
  • Not cacheable
  • Can modify server state

Example POST Request

In this example we send a request to httpbin.org, a free online HTTP request & response service. It provides a variety of endpoints for testing and debugging HTTP clients and libraries; it is essentially a "meta API" that lets you send requests and inspect the responses.

import requests
import json

# POST request to submit data
url = "https://httpbin.org/post"
data = {
    "name": "John Doe",
    "email": "john@example.com",
    "message": "Hello from Python!"
}
response = requests.post(url, json=data)
print(response.json())
{'args': {}, 'data': '{"name": "John Doe", "email": "john@example.com", "message": "Hello from Python!"}', 'files': {}, 'form': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Content-Length': '82', 'Content-Type': 'application/json', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.32.4', 'X-Amzn-Trace-Id': 'Root=1-685d6f80-3f49a9623b9e836e229f5df1'}, 'json': {'email': 'john@example.com', 'message': 'Hello from Python!', 'name': 'John Doe'}, 'origin': '73.152.103.167', 'url': 'https://httpbin.org/post'}

GET vs POST Comparison

Aspect          GET              POST
Purpose         Retrieve data    Send/submit data
Data location   URL parameters   Request body
Data size       Limited          Unlimited
Cacheable       Yes              No
Safe            Yes              No
Idempotent      Yes              No

35.4 Basic Weather Data Retrieval

Here is a more complete example retrieving weather data for multiple cities using the GET method.

import pandas as pd
import requests
import os
from dotenv import load_dotenv
from datetime import datetime

class WeatherAPI:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://api.openweathermap.org/data/2.5"
    
    def get_current_weather(self, city):
        """Get current weather for a city"""
        url = f"{self.base_url}/weather"
        params = {
            'q': city,
            'appid': self.api_key,
            'units': 'metric'
        }
        
        response = requests.get(url, params=params)
        
        if response.status_code == 200:
            return response.json()
        else:
            print(f"Error: {response.status_code}")
            return None

# Usage example
load_dotenv() 
api_key = os.environ.get('OPENWEATHER_API_KEY')
weather = WeatherAPI(api_key)

# Get weather for multiple cities
cities = ['London', 'New York', 'Tokyo', 'Sydney']
weather_data = []

for city in cities:
    data = weather.get_current_weather(city)
    if data:
        weather_info = {
            'city': city,
            'temperature': data['main']['temp'],
            'humidity': data['main']['humidity'],
            'description': data['weather'][0]['description'],
            'timestamp': datetime.now()
        }
        weather_data.append(weather_info)

# Convert to DataFrame for analysis
df = pd.DataFrame(weather_data)
print(df)
       city  temperature  humidity      description                  timestamp
0    London        23.70        43    broken clouds 2025-06-26 12:04:18.359999
1  New York        21.74        64  overcast clouds 2025-06-26 12:04:18.485026
2     Tokyo        25.41        79    broken clouds 2025-06-26 12:04:18.620959
3    Sydney         8.64        66  overcast clouds 2025-06-26 12:04:18.748444

Advanced Weather Data Analysis

import matplotlib.pyplot as plt

def get_forecast_data(api_key, city, days=5):
    """Get weather forecast data"""
    url = f"https://api.openweathermap.org/data/2.5/forecast"
    params = {
        'q': city,
        'appid': api_key,
        'units': 'metric'
    }
    
    response = requests.get(url, params=params)
    
    if response.status_code == 200:
        data = response.json()
        forecasts = []
        
        for item in data['list'][:days*8]:  # 8 forecasts per day (3-hour intervals)
            forecasts.append({
                'datetime': datetime.fromtimestamp(item['dt']),
                'temperature': item['main']['temp'],
                'humidity': item['main']['humidity'],
                'description': item['weather'][0]['description']
            })
        
        return pd.DataFrame(forecasts)
    else:
        return None

# Get and visualize forecast data
forecast_df = get_forecast_data(api_key, 'London')

if forecast_df is not None:
    plt.figure(figsize=(12, 6));
    plt.plot(forecast_df['datetime'], forecast_df['temperature'], marker='o');
    plt.title('London Temperature Forecast');
    plt.xlabel('Date/Time');
    plt.ylabel('Temperature (°C)');
    plt.xticks(rotation=45);
    plt.grid(True);
    plt.tight_layout();
    plt.show();

Error Handling and Best Practices

import time
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

class RobustWeatherAPI:
    def __init__(self, api_key):
        self.api_key = api_key
        self.base_url = "https://api.openweathermap.org/data/2.5"
        self.session = self._create_session()
    
    def _create_session(self):
        """Create a session with retry strategy"""
        session = requests.Session()
        retry_strategy = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504],
        )
        adapter = HTTPAdapter(max_retries=retry_strategy)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        return session
    
    def get_weather_with_error_handling(self, city):
        """Get weather with comprehensive error handling"""
        url = f"{self.base_url}/weather"
        params = {
            'q': city,
            'appid': self.api_key,
            'units': 'metric'
        }
        
        try:
            response = self.session.get(url, params=params, timeout=10)
            response.raise_for_status()  # Raises HTTPError for bad responses
            
            return {
                'success': True,
                'data': response.json()
            }
            
        except requests.exceptions.HTTPError as e:
            return {
                'success': False,
                'error': f'HTTP Error: {e.response.status_code}'
            }
        except requests.exceptions.ConnectionError:
            return {
                'success': False,
                'error': 'Connection Error: Unable to connect to API'
            }
        except requests.exceptions.Timeout:
            return {
                'success': False,
                'error': 'Timeout Error: Request timed out'
            }
        except requests.exceptions.RequestException as e:
            return {
                'success': False,
                'error': f'Request Error: {str(e)}'
            }
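
A short usage sketch for this class, assuming the API key has already been loaded from the environment as in the earlier examples:

robust_weather = RobustWeatherAPI(api_key)
result = robust_weather.get_weather_with_error_handling('London')

if result['success']:
    # Successful call: the full JSON payload is under 'data'
    print(result['data']['main']['temp'])
else:
    # Failed call: a human-readable error message is returned instead
    print(result['error'])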

35.5 Building Your Own REST API for Model Scoring

Now let’s create a REST API that serves a random forest model for predictions. This is a common pattern in machine learning deployment.

Step 1: Train and Save a Model

First, let’s create and train a random forest model:

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
import joblib

# Generate sample data (in practice, use your real dataset)
X, y = make_classification(
    n_samples=1000,
    n_features=10,
    n_informative=5,
    n_redundant=2,
    n_classes=2,
    random_state=42
)

# Create feature names
feature_names = [f'feature_{i}' for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=feature_names)
df['target'] = y

# Split the data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Train the model
rf_model = RandomForestClassifier(
    n_estimators=100,
    random_state=42,
    max_depth=10
)
rf_model.fit(X_train, y_train)

# Save the model and feature names
joblib.dump(rf_model, 'random_forest_model.pkl')
joblib.dump(feature_names, 'feature_names.pkl')

print(f"Model trained! Accuracy: {rf_model.score(X_test, y_test):.3f}")
print(f"Feature names saved: {feature_names}")
Model trained! Accuracy: 0.955
Feature names saved: ['feature_0', 'feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_6', 'feature_7', 'feature_8', 'feature_9']

Step 2: Create the Flask API

# app.py
from flask import Flask, request, jsonify
import joblib
import numpy as np
import pandas as pd
from datetime import datetime
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = Flask(__name__)

# Load the trained model and feature names
try:
    model = joblib.load('random_forest_model.pkl')
    feature_names = joblib.load('feature_names.pkl')
    logger.info("Model and features loaded successfully")
except Exception as e:
    logger.error(f"Error loading model: {str(e)}")
    model = None
    feature_names = None

@app.route('/')
def home():
    """Health check endpoint"""
    return jsonify({
        'message': 'Random Forest Model API',
        'status': 'healthy',
        'timestamp': datetime.now().isoformat(),
        'model_loaded': model is not None
    })

@app.route('/predict', methods=['POST'])
def predict():
    """Make predictions using the Random Forest model"""
    
    if model is None:
        return jsonify({
            'error': 'Model not loaded',
            'status': 'error'
        }), 500
    
    try:
        # Get JSON data from request
        data = request.get_json()
        
        if not data:
            return jsonify({
                'error': 'No data provided',
                'status': 'error'
            }), 400
        
        # Handle single prediction or batch predictions
        if isinstance(data, dict) and 'features' in data:
            # Single prediction
            features = data['features']
            predictions = make_single_prediction(features)
        elif isinstance(data, list):
            # Batch predictions
            predictions = make_batch_predictions(data)
        else:
            return jsonify({
                'error': 'Invalid data format. Expected {"features": [...]} or [{"features": [...]}, ...]',
                'status': 'error'
            }), 400
        
        return jsonify({
            'predictions': predictions,
            'status': 'success',
            'timestamp': datetime.now().isoformat()
        })
        
    except Exception as e:
        logger.error(f"Prediction error: {str(e)}")
        return jsonify({
            'error': str(e),
            'status': 'error'
        }), 500

def make_single_prediction(features):
    """Make a single prediction"""
    # Validate features
    if len(features) != len(feature_names):
        raise ValueError(f"Expected {len(feature_names)} features, got {len(features)}")
    
    # Convert to numpy array and reshape
    X = np.array(features).reshape(1, -1)
    
    # Make prediction
    prediction = model.predict(X)[0]
    probability = model.predict_proba(X)[0].tolist()
    
    return {
        'prediction': int(prediction),
        'probability': {
            'class_0': probability[0],
            'class_1': probability[1]
        },
        'confidence': max(probability)
    }

def make_batch_predictions(data_list):
    """Make batch predictions"""
    predictions = []
    
    for item in data_list:
        if 'features' not in item:
            raise ValueError("Each item must have 'features' key")
        
        features = item['features']
        pred_result = make_single_prediction(features)
        predictions.append(pred_result)
    
    return predictions

@app.route('/model-info', methods=['GET'])
def model_info():
    """Get information about the model"""
    if model is None:
        return jsonify({
            'error': 'Model not loaded',
            'status': 'error'
        }), 500
    
    return jsonify({
        'model_type': 'RandomForestClassifier',
        'n_estimators': model.n_estimators,
        'max_depth': model.max_depth,
        'n_features': len(feature_names),
        'feature_names': feature_names,
        'classes': model.classes_.tolist(),
        'status': 'success'
    })

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=5000)
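
Once the server is running with python app.py, you can do a quick check of the endpoints from the command line with curl before writing a full test script (the feature values below are arbitrary, but there must be ten of them to match the model):

# Health check
curl http://localhost:5000/

# Single prediction
curl -X POST http://localhost:5000/predict \
     -H "Content-Type: application/json" \
     -d '{"features": [0.1, -1.2, 0.5, 0.0, 2.3, -0.7, 1.1, 0.4, -0.9, 0.8]}'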

Step 3: Test the API

Create a test script to validate your API (Reitz and Python Software Foundation 2024):

# test_api.py
import requests
import json
import numpy as np

# API base URL
BASE_URL = "http://localhost:5000"

def test_health_check():
    """Test the health check endpoint"""
    response = requests.get(f"{BASE_URL}/")
    print("Health Check:")
    print(json.dumps(response.json(), indent=2))
    print()

def test_model_info():
    """Test the model info endpoint"""
    response = requests.get(f"{BASE_URL}/model-info")
    print("Model Info:")
    print(json.dumps(response.json(), indent=2))
    print()

def test_single_prediction():
    """Test single prediction"""
    # Generate random features (in practice, use real data)
    features = np.random.randn(10).tolist()
    
    data = {
        "features": features
    }
    
    response = requests.post(
        f"{BASE_URL}/predict",
        json=data,
        headers={'Content-Type': 'application/json'}
    )
    
    print("Single Prediction:")
    print(json.dumps(response.json(), indent=2))
    print()

def test_batch_prediction():
    """Test batch prediction"""
    # Generate multiple random feature sets
    batch_data = []
    for i in range(3):
        features = np.random.randn(10).tolist()
        batch_data.append({"features": features})
    
    response = requests.post(
        f"{BASE_URL}/predict",
        json=batch_data,
        headers={'Content-Type': 'application/json'}
    )
    
    print("Batch Prediction:")
    print(json.dumps(response.json(), indent=2))
    print()

if __name__ == "__main__":
    print("Testing Random Forest API...\n")
    
    test_health_check()
    test_model_info()
    test_single_prediction()
    test_batch_prediction()

Step 4: Enhanced API with Input Validation

# enhanced_app.py
from flask import Flask, request, jsonify
from marshmallow import Schema, fields, ValidationError
import joblib
import numpy as np
import pandas as pd
from datetime import datetime
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Input validation schemas
class PredictionSchema(Schema):
    features = fields.List(fields.Float(), required=True)

class BatchPredictionSchema(Schema):
    predictions = fields.List(fields.Nested(PredictionSchema), required=True)

app = Flask(__name__)
app.config['JSON_SORT_KEYS'] = False

# Load model
try:
    model = joblib.load('random_forest_model.pkl')
    feature_names = joblib.load('feature_names.pkl')
    logger.info("Model loaded successfully")
except Exception as e:
    logger.error(f"Error loading model: {str(e)}")
    model = None
    feature_names = None

@app.route('/predict', methods=['POST'])
def predict_enhanced():
    """Enhanced prediction endpoint with validation"""
    
    if model is None:
        return jsonify({
            'error': 'Model not available',
            'status': 'error'
        }), 503
    
    try:
        data = request.get_json(force=True)
        
        # Validate input data
        if 'features' in data:
            # Single prediction
            schema = PredictionSchema()
            validated_data = schema.load(data)
            predictions = [make_prediction_with_validation(validated_data['features'])]
        else:
            # Batch prediction
            batch_schema = BatchPredictionSchema()
            validated_data = batch_schema.load({'predictions': data})
            predictions = [
                make_prediction_with_validation(item['features']) 
                for item in validated_data['predictions']
            ]
        
        return jsonify({
            'predictions': predictions,
            'status': 'success',
            'timestamp': datetime.now().isoformat(),
            'count': len(predictions)
        })
        
    except ValidationError as e:
        return jsonify({
            'error': 'Validation failed',
            'details': e.messages,
            'status': 'error'
        }), 400
    except Exception as e:
        return jsonify({
            'error': str(e),
            'status': 'error'
        }), 500

def make_prediction_with_validation(features):
    """Make prediction with input validation"""
    if len(features) != len(feature_names):
        raise ValueError(f"Expected {len(feature_names)} features, received {len(features)}")
    
    # Check for invalid values
    if any(not isinstance(f, (int, float)) or np.isnan(f) or np.isinf(f) for f in features):
        raise ValueError("Features must be finite numeric values")
    
    X = np.array(features).reshape(1, -1)
    prediction = model.predict(X)[0]
    probabilities = model.predict_proba(X)[0]
    
    return {
        'prediction': int(prediction),
        'probabilities': {
            f'class_{i}': float(prob) 
            for i, prob in enumerate(probabilities)
        },
        'confidence': float(max(probabilities))
    }

if __name__ == '__main__':
    app.run(debug=True, host='0.0.0.0', port=5000)
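
To see the validation in action, a request with the wrong number of features comes back with an error payload describing the problem instead of a prediction. A small sketch of such a call against the running enhanced app:

import requests

# Only three features instead of the ten the model expects
bad_payload = {"features": [0.1, 0.2, 0.3]}

response = requests.post("http://localhost:5000/predict", json=bad_payload)
print(response.status_code)   # an error status, not 200
print(response.json())        # contains the error message and status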

35.6 Deployment Considerations

Local Development

For local development and testing (Pallets Projects 2025):

# Install dependencies
pip install flask scikit-learn joblib marshmallow numpy pandas

# Run the application
python app.py

# Test in another terminal
python test_api.py

Production Deployment Options

1. Docker Containerization

# Dockerfile
FROM python:3.9-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 5000

CMD ["gunicorn", "--bind", "0.0.0.0:5000", "--workers", "4", "app:app"]

# Build and run Docker container
docker build -t rf-api .
docker run -p 5000:5000 rf-api
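
The Dockerfile copies a requirements.txt that is not shown above. A plausible version, assuming the packages used in this chapter plus gunicorn for the production server (pin versions as appropriate for your environment):

# requirements.txt (illustrative)
flask
scikit-learn
joblib
numpy
pandas
marshmallow
gunicorn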

2. Cloud Deployment

AWS Deployment (Elastic Beanstalk)

  1. Install AWS CLI and EB CLI
  2. Create application.py (AWS expects this name)
  3. Deploy with eb init and eb deploy
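
A sketch of the corresponding commands, assuming the EB CLI is installed and configured (the environment name is arbitrary):

# Initialize the Elastic Beanstalk application (choose the Python platform)
eb init

# Create an environment and deploy the current code; redeploy later with eb deploy
eb create rf-api-env
eb deploy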

Google Cloud Platform (Cloud Run)

  1. Build Docker image
  2. Push to Google Container Registry
  3. Deploy to Cloud Run
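
A sketch with the gcloud CLI, where PROJECT_ID stands in for your own project; note that Cloud Run sets a PORT environment variable (8080 by default), so the container should bind to that port rather than a fixed one:

# Build the image and push it to the project's registry
gcloud builds submit --tag gcr.io/PROJECT_ID/rf-api

# Deploy the image to Cloud Run
gcloud run deploy rf-api --image gcr.io/PROJECT_ID/rf-api \
    --region us-central1 --allow-unauthenticated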

Heroku Deployment

  1. Create Procfile: web: gunicorn app:app
  2. Push to Heroku Git repository
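
A sketch of the corresponding commands, assuming the Heroku CLI is installed and the code (including the Procfile) is committed to git:

heroku create rf-api-demo      # app name is arbitrary
git push heroku main           # deploys the code; Heroku reads the Procfile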

Production Best Practices

# production_config.py
import os
from flask import Flask
import logging
from logging.handlers import RotatingFileHandler

def create_production_app():
    app = Flask(__name__)
    
    # Configuration
    app.config['DEBUG'] = False
    app.config['TESTING'] = False
    app.config['SECRET_KEY'] = os.environ.get('SECRET_KEY', 'dev-key-change-in-prod')
    
    # Logging
    if not app.debug:
        file_handler = RotatingFileHandler(
            'logs/api.log', 
            maxBytes=10240000, 
            backupCount=10
        )
        file_handler.setFormatter(logging.Formatter(
            '%(asctime)s %(levelname)s: %(message)s [in %(pathname)s:%(lineno)d]'
        ))
        file_handler.setLevel(logging.INFO)
        app.logger.addHandler(file_handler)
        app.logger.setLevel(logging.INFO)
        app.logger.info('API startup')
    
    return app
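
To serve this in production, one option (a sketch, not the only way) is to expose the factory through a small WSGI entry point and run it with gunicorn; note that the logs/ directory referenced by the RotatingFileHandler must exist before startup.

# wsgi.py (assumed entry-point module name)
from production_config import create_production_app

app = create_production_app()

# Start with a production WSGI server instead of app.run(), for example:
#   gunicorn --bind 0.0.0.0:5000 --workers 4 wsgi:app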

35.7 Conclusion

In this chapter we have covered:

  1. Why REST exists: Understanding the problems REST solves and its core principles (R. T. Fielding 2000)
  2. HTTP and JSON foundations: How REST builds on these web standards (R. Fielding and Reschke 2014a, 2014b)
  3. HTTP methods: Practical differences between GET and POST requests
  4. Real-world API usage: Working with OpenWeatherMap API for data science applications (OpenWeather Ltd 2025)
  5. Building your own API: Creating a production-ready random forest scoring service (Pedregosa et al. 2025; Pallets Projects 2025)
  6. Deployment strategies: From local development to cloud deployment

Key Takeaways

  • REST APIs provide a standardized way for systems to communicate (Richardson and Ruby 2007)
  • HTTP methods have specific purposes: GET for retrieval, POST for data submission (R. Fielding and Reschke 2014b)
  • JSON is the preferred data format for modern APIs
  • Error handling and input validation are crucial for production APIs
  • Proper deployment strategies ensure scalability and reliability

Next Steps

  1. Practice: Build APIs for your own machine learning models
  2. Security: Learn about API authentication and rate limiting
  3. Monitoring: Implement logging and performance monitoring
  4. Documentation: Use tools like Swagger/OpenAPI for API documentation
  5. Testing: Write comprehensive unit and integration tests

Additional Resources