Fraud detection using Machine Learning
Nathan Ranchin
The problem
Financial fraud is a growing concern for banks and financial institutions worldwide. As part of a Credit Suisse hackathon, we were tasked with developing a fraud detection system that could identify salary payment fraud patterns from transaction data.
Our approach
Instead of treating each transaction individually, we grouped payments between the same sender and receiver to identify patterns that might not be apparent in isolated transactions. This approach allowed us to leverage collective behavior rather than individual payment characteristics.
Data preparation
We started by creating a new column called is_salary that categorized transactions into three distinct classes:
- 0: Definitely not a salary payment
- 1: Uncertain if it's a salary payment
- 2: Definitely a salary payment
# Known salary payment reasons get 2; everything else starts as 1 (uncertain).
# isin() yields booleans, so adding 1 maps True to 2 and False to 1.
data["is_salary"] = data["payment_reason"].isin(["salarzahlung", "Salary", "Salario"]) + 1
# Transactions where we're sure they are not salary payments
data.loc[data["payment_reason"].isin(["Rent", "Mieten"]), "is_salary"] = 0
We applied our domain knowledge to further refine the categorization:
# Salary payments can only be from Organizations to Individuals
data.loc[data["bene_client_type"] == "ORG", "is_salary"] = 0
data.loc[data["org_client_type"] == "IND", "is_salary"] = 0
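To see how these rules interact, here is a minimal sketch that wraps them in one function and runs them on a toy table. The column names and payment reasons come from the snippets above; the function name `label_salary`, the "Invoice" reason, and the toy rows are invented for illustration.

```python
import pandas as pd

def label_salary(data: pd.DataFrame) -> pd.Series:
    """Return 0 (not salary), 1 (uncertain) or 2 (salary) per transaction."""
    is_salary_reason = data["payment_reason"].isin(["salarzahlung", "Salary", "Salario"])
    labels = is_salary_reason.astype(int) + 1  # True -> 2, False -> 1
    labels[data["payment_reason"].isin(["Rent", "Mieten"])] = 0
    # Domain rule: only ORG -> IND transfers can be salaries.
    not_org_to_ind = (data["bene_client_type"] == "ORG") | (data["org_client_type"] == "IND")
    labels[not_org_to_ind] = 0
    return labels

# Hypothetical toy rows, one per interesting case.
toy = pd.DataFrame({
    "payment_reason":   ["Salary", "Rent", "Invoice", "Salary"],
    "org_client_type":  ["ORG", "IND", "ORG", "IND"],
    "bene_client_type": ["IND", "IND", "IND", "IND"],
})
toy["is_salary"] = label_salary(toy)
print(toy)
```

Note the last row: a payment marked "Salary" but sent by an individual ends up as 0, because the domain rule is applied after the payment-reason rule and overrides it.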
Feature engineering
For our model, we created group-level features based on transaction patterns between specific senders and receivers:
- Quality: Number of transactions between the same sender and receiver
- Amount Repetition: How often the same amount is transferred (max frequency / total transactions)
- Amount Variance: Variance in the transaction amounts (0 for single transactions)
X = []
y = []
for v in train.groupby(["org_account_id", "bene_account_id"]).groups.values():
    quality = len(v)
    rows = train.loc[v]
    if quality == 1:
        X.append([1, quality, rows["amount_in_USD"].value_counts().max() / quality, 0])
    else:
        X.append([1, quality, rows["amount_in_USD"].value_counts().max() / quality, rows["amount_in_USD"].var()])
    y.append(int(rows["is_salary"].mean()))
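The same three features can also be computed without an explicit Python loop. The sketch below is one possible vectorized variant, not the code used in the hackathon. It assumes there are no missing amounts (so "count" equals the group size), leaves out the constant leading 1, and uses invented toy data. The helper name `amount_repetition` is hypothetical.

```python
import pandas as pd

# Toy transactions; the values are invented for illustration only.
train = pd.DataFrame({
    "org_account_id":  ["A", "A", "A", "B", "C", "C"],
    "bene_account_id": ["X", "X", "X", "Y", "Z", "Z"],
    "amount_in_USD":   [3000.0, 3000.0, 3000.0, 120.0, 50.0, 70.0],
    "is_salary":       [2, 2, 2, 0, 0, 0],
})

def amount_repetition(amounts: pd.Series) -> float:
    """Share of the group's transactions that use its most common amount."""
    return amounts.value_counts().max() / len(amounts)

grouped = train.groupby(["org_account_id", "bene_account_id"])
features = grouped["amount_in_USD"].agg(
    quality="count",                      # transactions per sender/receiver pair
    amount_repetition=amount_repetition,  # max frequency / total transactions
    amount_variance="var",                # sample variance of the amounts
)
# The variance of a single transaction is NaN; the loop version uses 0 instead.
features["amount_variance"] = features["amount_variance"].fillna(0.0)
# Same label rule as the loop: truncated mean of is_salary within the group.
labels = grouped["is_salary"].mean().astype(int)
print(features)
```

On the toy data, the pair (A, X) repeats the same amount three times, giving a repetition of 1.0 and a variance of 0, which is the kind of regularity one would expect from a monthly salary.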
Model training
We used XGBoost, a powerful gradient boosting framework, to build our classification model:
from sklearn.metrics import f1_score
from xgboost import XGBClassifier

clf = XGBClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
f1_score(y_test, y_pred)
The model achieved strong F1 scores on our validation data, indicating good performance in classifying salary payments.
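For readers unfamiliar with the metric: the F1 score is the harmonic mean of precision and recall for the positive class. The sketch below computes it by hand on invented labels, so the number it produces says nothing about our actual results. `f1_binary` is a hypothetical helper written for illustration; it follows the same definition that `f1_score` applies by default to binary labels.

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented labels: 1 = salary group, 0 = non-salary group.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
score = f1_binary(y_true, y_pred)
print(round(score, 3))
```

Here two of the three salary groups are found (recall 2/3) and two of the three predicted salary groups are correct (precision 2/3), so the F1 score is also 2/3.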
Making predictions
Finally, we applied our model to the uncertain transactions (those labeled as 1 in our initial classification) to predict whether they were genuine salary payments:
pred_for_test = clf.predict(test_df)
# Create submission file
# Copy first so the assignments below modify submission, not a view of data
submission = data[["transaction_id", "is_salary"]].copy()
for i, v in enumerate(test_ids):
    submission.loc[v, "is_salary"] = pred_for_test[i]
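The write-back step can be sketched without a loop. This assumes, as the loop above implies, that `test_ids` holds index labels of `data` and that `pred_for_test` contains one prediction per id in the same order. The toy table and the predictions are invented for illustration.

```python
import pandas as pd

# Toy data: two transactions are still labeled 1 (uncertain).
data = pd.DataFrame({
    "transaction_id": [101, 102, 103, 104],
    "is_salary":      [2, 1, 0, 1],
})
test_ids = data.index[data["is_salary"] == 1]  # index labels of the uncertain rows
pred_for_test = [2, 0]                         # hypothetical model output, same order

# Work on a copy so the original labels in data stay untouched.
submission = data[["transaction_id", "is_salary"]].copy()
submission.loc[test_ids, "is_salary"] = pred_for_test
print(submission)
```

After this step no row in the submission carries the "uncertain" label anymore, while `data` still holds the original three-class labels for inspection.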
Conclusion
Our approach demonstrates how group-level analysis can be more effective than transaction-level analysis for certain fraud detection tasks. By focusing on the relationship between senders and receivers rather than individual transactions, we were able to identify patterns that might otherwise be missed.
The simplicity of our features (just three group-level metrics), combined with the power of XGBoost, allowed us to build an effective fraud detection system that performed well in the hackathon environment.
This simple yet effective approach won us first place in the hackathon, outperforming even Credit Suisse's own internal solution. It demonstrates how sometimes elegant, feature-engineered solutions can be more powerful than complex models when domain knowledge is properly applied.