CodingPandas

Merge json files using Pandas

Quick demo for merging multiple json files using Pandas –

import pandas as pd
import glob
import json
file_list = glob.glob("*.json")
>>> file_list
['b.json', 'c.json', 'a.json']

Use enumerate to assign counter to files.

allFilesDict = {v:k for v, k in enumerate(file_list, 1)}
>>> allFilesDict
{1: 'b.json', 2: 'c.json', 3: 'a.json'}

Append the data into list –

>>> data = []
for k,v in allFilesDict.items():
    if 1 <= k <= 2:
        with open(v, 'r') as d:
            jdata = json.load(d)
            if jdata:
                data.append(jdata)
...
data
[{'creator': 'NickFury', 'last_modifier': 'NickFury', 'title': 'Avengers', 'view_count': 300000}, {'creator': 'TonyStark', 'last_modifier': 'IronMan', 'title': 'Iron Man 3', 'view_count': 1899000}]

Create pandas dataframe –

df = pd.DataFrame(data)

Write the dataframe to json

df.to_json(r'/tmp/data_output_default.json')
dev-dsk% cat /tmp/data_output_default.json | jq
{
  "creator": {
    "0": "NickFury",
    "1": "TonyStark"
  },
  "last_modifier": {
    "0": "NickFury",
    "1": "IronMan"
  },
  "title": {
    "0": "Avengers",
    "1": "Iron Man 3"
  },
  "view_count": {
    "0": 300000,
    "1": 1899000
  }
}
df.to_json(r'/tmp/data_output_records.json', orient='records')
dev-dsk % cat /tmp/data_output_records.json|jq
[
  {
    "creator": "NickFury",
    "last_modifier": "NickFury",
    "title": "Avengers",
    "view_count": 300000
  },
  {
    "creator": "TonyStark",
    "last_modifier": "IronMan",
    "title": "Iron Man 3",
    "view_count": 1899000
  }
]

One caveat of using dataframe is that its memory intensive. So if you have alot of files to merge, it can consume good amount of memory on host.
More about jq and to_json
One issue with above merged file is it doesn’t read the data as individual records –

dev-dsk % cat /tmp/data_output_records.json | wc -l

Using jq you can rewite it as –

dev-dsk % cat /tmp/data_output_records.json| jq -c '.[]' > /tmp/data_output_records_new.json
dev-dsk % cat /tmp/data_output_records_new.json|jq
{
  "creator": "NickFury",
  "last_modifier": "NickFury",
  "title": "Avengers",
  "view_count": 300000
}
{
  "creator": "TonyStark",
  "last_modifier": "IronMan",
  "title": "Iron Man 3",
  "view_count": 1899000
}
dev-dsk % cat /tmp/data_output_records_new.json | wc -l
2

Leave a Reply