Creating a simple Copy Data activity in Azure Data Factory can sometimes give you headaches. In this case the pipeline failed with an exception of type Microsoft.DataTransfer.Common.Shared.PluginRuntimeException, and dataConsistencyVerification reported a VerificationResult of Unsupported. There was nothing fancy in the setup: a Copy Data activity with an anonymous API endpoint as the source (with an x-api-key header attached), sinking into a data lake landing zone with one-to-one mapping and settings that enable staging to the data lake storage account.
"Message": "ErrorCode=AzureDatabricksCommandError,Hit an error when running the command in Azure Databricks. Error details: org.apache.spark.SparkException: Job aborted.\r\nCaused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 120749.0 failed 1 times, most recent failure: Lost task 0.0 in stage 120749.0 (TID 121214) (... executor driver):
org.apache.spark.SparkException: Task failed while writing rows.\r\n
Caused by: com.databricks.sql.io.FileReadException: Error while reading file abfss:REDACTED_LOCAL_PART@...dfs.core.windows.net/staging/.../AzureDatabricksDeltaLakeImportCommand/data_...txt.\r\nCaused by: org.apache.spark.SparkException: Malformed records are detected in record parsing.
Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.\r\n
Caused by: org.apache.spark.sql.catalyst.util.BadRecordException: org.apache.spark.sql.catalyst.csv.MalformedCSVException: Malformed CSV record\r\nCaused by: org.apache.spark.sql.catalyst.csv.MalformedCSVException: Malformed CSV record\r\nCaused by: org.apache.spark.SparkException: Task failed while writing rows.\r\nCaused by: com.databricks.sql.io.FileReadException: Error while reading file abfss:REDACTED_LOCAL_PART@...dfs.core.windows.net/staging/.../AzureDatabricksDeltaLakeImportCommand/data_...txt.\r\n
Caused by: org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.\r\n
Caused by: org.apache.spark.sql.catalyst.util.BadRecordException: org.apache.spark.sql.catalyst.csv.MalformedCSVException: Malformed CSV record\r\nCaused by: org.apache.spark.sql.catalyst.csv.MalformedCSVException: Malformed CSV record."
Going through the error message, it is not directly clear at first glance what is going wrong, beyond the fact that it detected some malformed CSV record.
But the line that points towards the cause of the error is:
Task failed while writing rows.\r\n
It turned out that one of the fields could contain a newline escape character, which Data Factory unfortunately can't handle.
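To make the failure concrete, here is a minimal PySpark sketch that reproduces the same FAILFAST error with a field containing an embedded newline. The file path and column names are made up for illustration, and it assumes a Databricks notebook (for dbutils):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A quoted field that spans two physical lines, like the field coming from the API.
csv_text = 'id,comment\n1,"first line\nsecond line"\n'
dbutils.fs.put("/tmp/newline_demo.csv", csv_text, True)  # Databricks-only helper

# Default line-based CSV parsing in FAILFAST mode trips over the record
# that was split mid-field, with the same "Malformed CSV record" error.
broken = (spark.read
          .option("header", True)
          .option("mode", "FAILFAST")
          .csv("/tmp/newline_demo.csv"))
# broken.show()  # -> org.apache.spark.SparkException: Malformed records are detected ...

# multiLine tells the parser to keep reading until the closing quote,
# so the embedded newline stays inside the field.
fixed = (spark.read
         .option("header", True)
         .option("multiLine", True)
         .csv("/tmp/newline_demo.csv"))
fixed.show(truncate=False)
```

The error message shows the import command reads the staged file in FAILFAST mode, which is why a single record with an embedded newline fails the whole copy activity.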
There are two ways to go about this. Either do the whole request and mapping in Databricks instead, or change the sink settings of your Copy Data pipeline to use JSON (or whatever format your data is in) instead.
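If you go the first route, a rough sketch of doing the request and load directly in a Databricks notebook could look like this; the endpoint URL, secret scope, and target table are placeholders, not values from the original pipeline:

```python
import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder endpoint and key; in practice the key would live in a secret scope.
api_url = "https://example.com/api/v1/items"
api_key = dbutils.secrets.get(scope="my-scope", key="x-api-key")

response = requests.get(api_url, headers={"x-api-key": api_key}, timeout=30)
response.raise_for_status()

# Build the DataFrame straight from the JSON payload, so the embedded
# newlines never pass through a line-based CSV staging file.
rows = response.json()  # assumes the API returns a JSON array of objects
df = spark.createDataFrame(rows)

df.write.format("delta").mode("append").saveAsTable("landing.items")
```

This skips the staged text file entirely, which is where the newline problem gets introduced.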
The second option doesn't get rid of the problem, but at least the data is now stored in our staging storage account. From here on out we can handle these newline escape characters in Databricks or in a Data Flow pipeline.
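As a sketch of that follow-up step, assuming the sink now writes JSON files to the landing zone and the offending column is called comment (both placeholders for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder path for the landing zone the Copy Data activity writes to.
landing_path = "abfss://landing@<storageaccount>.dfs.core.windows.net/api/items/*.json"

df = (spark.read
      .option("multiLine", True)  # the JSON documents may span multiple lines
      .json(landing_path))

# Replace the embedded newlines that broke the CSV-based staging step.
cleaned = df.withColumn("comment", F.regexp_replace("comment", r"[\r\n]+", " "))

cleaned.write.format("delta").mode("append").saveAsTable("curated.items")  # placeholder target
```

Because JSON quotes its string values, the embedded newlines survive the copy intact and can be cleaned up, or kept, on the Databricks side.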